Standard Setting Methods for Pass/Fail Decisions on High-Stakes Objective Structured Clinical Examinations: A Validity Study

Naveed Yousuf, Claudio Violato & Rukhsana W. Zuberi

Claudio Violato: Department of Medical Education, University Ambrosiana, Milan, Italy
Rukhsana W. Zuberi: Department for Educational Development, Aga Khan University, Karachi, Pakistan

Teaching and Learning in Medicine, 27(3), 280–291
Copyright © 2015, Taylor & Francis Group, LLC
ISSN: 1040-1334 print / 1532-8015 online
DOI: 10.1080/10401334.2015.1044749
TABLE 1
Comparison of different standard setting methods

A. Norm-Referenced Methods
Underpinning methodology: Identify cutoff scores at a defined point on the range of student scores.
Strengths: Easy to calculate.
Limitations: Cutoff scores are dependent on the cohort's performance or mean scores; they do not take into account the examination content or the expected competency of students; cutoff scores can only be identified after the examination is administered and scored; a fixed number of students may fail and pass the examination, depending upon the method, regardless of students' competence.

1. Mean–1SD
Underpinning methodology: Identify the cutoff score one standard deviation below the cohort mean score.

2. Mean–1.5SD
Underpinning methodology: Identify the cutoff score 1.5 standard deviations below the cohort mean score.
TABLE 1
Comparison of different standard setting methods (Continued)

B. Criterion-Referenced Methods

1. Angoff
Underpinning methodology: Cutoff scores are identified by experts reviewing the test items based on the probability of borderline students performing on the test.
Strengths: The cutoff scores are determined and known to both the faculty and the students before the administration of the examination.
Limitations: Resource intensive in terms of the required number of experts, their time, and their expertise in the method.

2. Borderline Group
Underpinning methodology: Experts review the students' performances on the test and grade these performances as unsatisfactory, borderline, or satisfactory. The mean score of the borderline students is used as the cutoff score.
Strengths: Cutoff scores are based on students' actual performance on the test; easy to use.
Limitations: Utilizes scores of students graded as borderline only; requires large cohort sizes to have a reliable number of borderline students; the cutoff score is susceptible to moving toward the extreme scores of the borderline students on either side.

3. Borderline Regression
Underpinning methodology: Experts review the students' performances on the test and grade these performances as unsatisfactory, borderline, or satisfactory. Performance scores of the cohort are regressed on the grades, and the score corresponding to the borderline grade is identified as the cutoff score.
Strengths: Cutoff scores are based on students' actual performance on the test; scores and grades of the entire cohort are taken into account.
Limitations: Assumption of a linear association between performance scores and global grades; requires some expertise in statistics; the cutoff score is influenced by extreme scores in a global grade.

C. Cluster Analysis
Underpinning methodology: Classify students into groups/clusters of similar performances using the mathematical concepts of distance (how far apart two performances are) and similarity (how close two performances are).
Strengths: Does not require expert judgments directly in cutoff score setting, as it is a statistical technique for classifying students into groups of varying abilities.
Limitations: Requires substantial expertise in statistics; requires large cohort sizes for appropriate classification; subjective decisions regarding the number of clusters and the method of analysis may influence the cutoff scores.
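The borderline group and borderline regression computations described in Table 1 are small enough to state exactly. The following is a minimal Python sketch under assumed inputs (synthetic scores and three-level global grades coded 1 = unsatisfactory, 2 = borderline, 3 = satisfactory); it illustrates the table's formulae and is not the study's code or data:

```python
import numpy as np

# Synthetic station data: percentage scores and examiner global grades
# coded 1 = unsatisfactory, 2 = borderline, 3 = satisfactory (assumed coding).
rng = np.random.default_rng(0)
grades = rng.integers(1, 4, size=200)
scores = 20 * grades + rng.normal(10, 8, size=200)  # scores rise with grade

# Borderline group (BL-G): the cutoff is the mean score of the
# examinees graded as borderline.
blg_cutoff = scores[grades == 2].mean()

# Borderline regression (BL-R): regress the cohort's scores on the global
# grades and take the predicted score at the borderline grade as the cutoff.
slope, intercept = np.polyfit(grades, scores, deg=1)
blr_cutoff = intercept + slope * 2  # predicted score at grade == 2

print(f"BL-G cutoff: {blg_cutoff:.1f}  BL-R cutoff: {blr_cutoff:.1f}")
```

Because BL-R uses every examinee's score and grade, a handful of atypical borderline examinees moves its cutoff far less than it moves the BL-G mean, which is the trade-off Table 1 summarizes.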
Criterion-based or absolute methods identify cutoff scores based on the level of competence expected of students on the content being examined and are thus preferred for competence-based assessments like OSCEs.30,44 These can be further categorized into examination-centered methods (e.g., Angoff) and examinee-centered methods (e.g., borderline group [BL-G] and borderline regression [BL-R]; Table 1).

More recently, cluster analysis has been proposed as a standard setting method. It is mainly of two types: the clustering or hierarchical model, and the partitioning or k-means model.46,47 The k-means cluster analysis categorizes data into a required number of homogeneous groups, where k refers to the number of groups. Because standard setting involves categorizing students based on their abilities or competence, k-means cluster analysis has been proposed for standard setting by researchers in the field.28,46–48 Cluster analysis identifies groups of similar performances in a cohort using the mathematical concepts of distance and similarity among performances. Cluster analysis is thus considered objective in that human judgments are not used in setting the cutoff score, and it has been proposed for the evaluation and validation of standard setting methods that require expert judgments, such as the Angoff and borderline methods.46,47

Despite the differences among standard setting methods, the outcomes of authentic methods (i.e., cutoff scores and pass/fail decisions) should be similar for the same examination, if not the same.

Purpose
The purpose of the present study was to investigate various standard setting methods for OSCEs based on convergent validity evidence by comparing the commonly used methods against each other and against cluster analysis as a prospective standard setting method.

METHODS

Participants
We included 30 OSCE stations that were administered at least twice in undergraduate medical education (UGME) Years 2 and 3 high-stakes certifying examinations in 2010, 2011, and 2012 at the Aga Khan University Medical College (AKU-MC) in Karachi, Pakistan. A total of 393 medical students (183 men [46.56%] and 210 women [53.44%]) participated in these examinations.

Procedures
OSCE station development and assessment. OSCE stations were developed by clinical faculty according to a table of specifications. Each OSCE station was then reviewed by a group of four to seven faculty members facilitated by an educational expert. The faculty members included content experts in the discipline/specialty and, frequently, a member from another specialty for a multidisciplinary review.

During OSCEs, students' performance on each station was marked by clinical faculty using a 7-point rating scale ranging from 0 to 6 for each item, with a global rating (GR) at the end. The 7-point rating scale was adopted to obtain reliable scores of students' performance on high-stakes OSCEs.49 Items on each station are specific and relevant to the task being assessed. The GR is based on seven performance grades: poor, borderline fail, borderline pass, good, very good, excellent, and outstanding. The GR did not contribute toward the station score but was used for quality assurance and standard setting procedures.

Clinical faculty members were invited as examiners for the OSCE examinations. Each examiner scored a specific station, and hence there were as many examiners as the number of stations (14–16) in the examination. These faculty members were familiar with the OSCE format, as they were involved in the development, review, and/or standard setting of the OSCEs, and most had been OSCE examiners before. Examiners were briefed about the purpose of the examination, its format, and the rating of OSCE stations, and any other issues or concerns were addressed in a preexamination meeting held prior to the examination. Examiners were told to make independent judgments of students' performance using the GR, irrespective of the ratings on station items. All stations assessing history taking, communication, and/or counseling skills involved standardized patients (SPs). Stations assessing physical examination or procedural skills involved SPs or mannequins, as appropriate. SPs did not rate the students.

All stations were weighted equally. The total examinee score was calculated by averaging the individual station scores. Students were required to pass the OSCE examination to progress to the subsequent year of study. Unsuccessful students have an opportunity for a second attempt after remediation. If they fail the re-sit OSCE, they have to repeat the year of study.

Standard setting methods. We used four norm-based methods (Wijnen, Cohen, Mean–1SD, and Mean–1.5SD), three criterion-based methods (Angoff, BL-G, and BL-R), and three variants of cluster analysis.

Norm-referenced methods. We determined cutoff scores using the Mean–1SD, Mean–1.5SD, Wijnen, and Cohen methods, applying the formulae stated in Table 1.
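As a concrete reading of the mean-based formulae in Table 1, the sketch below computes the Mean–1SD and Mean–1.5SD cutoffs on synthetic cohort scores. The Wijnen formula falls in a truncated portion of Table 1 and is not reproduced; the Cohen variant shown is the commonly cited 60%-of-the-95th-percentile form, an assumption here rather than the paper's stated formula:

```python
import numpy as np

# Synthetic cohort of percentage scores (illustrative only).
scores = np.random.default_rng(1).normal(65, 8, size=300)

mean, sd = scores.mean(), scores.std(ddof=1)
cutoff_1sd = mean - 1.0 * sd   # Mean-1SD: one SD below the cohort mean
cutoff_15sd = mean - 1.5 * sd  # Mean-1.5SD: 1.5 SDs below the cohort mean

# Cohen method in its commonly cited form (assumed, not from Table 1):
# the cutoff is 60% of the 95th-percentile score.
cutoff_cohen = 0.60 * np.percentile(scores, 95)

print(f"Mean-1SD: {cutoff_1sd:.1f}  Mean-1.5SD: {cutoff_15sd:.1f}  "
      f"Cohen: {cutoff_cohen:.1f}")
```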
Criterion-referenced methods. For the three criterion-referenced methods, we selected content experts as judges based on the following criteria:

1. Have sufficient experience of teaching the particular level of students.
2. Have been involved with the assessment of students at the level, and for the content, under consideration.
Cutoff scores using the modified Angoff method were identified by a group of four to seven judges facilitated by an educational expert prior to the administration of the OSCE stations in the examination. The educational expert briefed the judges about the purpose and steps of the standard setting process, followed by a short discussion on the qualities of a borderline student. We followed the reported modified Angoff procedure with slight variation.34 The Angoff method gives judges the freedom to estimate any probability, from 0 to 100%, that a borderline student will perform an item correctly.34 For the present study, we projected the items of each OSCE station along with the percentages corresponding to each point on the rating scale, rounded off to the nearest 5th percentage value. We requested the judges to select probabilities from those identified on the rating scale. The question posed was, "What is the probability of a borderline student performing this item correctly?"
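With items scored on the 0-to-6 scale, the projected percentage anchors follow directly: each scale point expressed as a percentage of 6 and rounded to the nearest 5 gives 0, 15, 35, 50, 65, 85, and 100. The sketch below shows this mapping together with a standard Angoff aggregation (mean over items, then over judges); the aggregation rule and the judge ratings are assumptions for illustration:

```python
import numpy as np

# Percentage anchors for the 0-6 rating scale, rounded to the nearest 5:
# yields [0, 15, 35, 50, 65, 85, 100].
anchors = [round(100 * point / 6 / 5) * 5 for point in range(7)]

# Hypothetical judgments: 5 judges pick an anchor per item on a 10-item
# station (synthetic values, not study data).
rng = np.random.default_rng(2)
judgments = rng.choice(anchors, size=(5, 10))

# Standard Angoff aggregation (assumed): average over items for each judge,
# then average over judges, giving the station cutoff in percent.
station_cutoff = judgments.mean(axis=1).mean()
print(anchors, f"station cutoff = {station_cutoff:.1f}%")
```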
Cluster analysis methods. We classified students into three clusters of low, average, and high scorers and assumed that the valid cutoff score would lie at some point between the low and average scorers. We employed three different cluster analysis variants for standard setting: the three-cluster mean (TC-M), three-cluster regression (TC-R), and three-cluster contrast (TC-C) methods. The mean and contrast methods using cluster analysis are reported in the literature with some variations.28,46–48 We studied the TC-R method, based on the principle of the BL-R method, for the first time.

For TC-M, the mean scores of the low and average scorers were calculated for each station. The midpoint of these two mean values was then identified as the cutoff score for that OSCE station.

For TC-R, OSCE scores were regressed on the three cluster groups using linear regression for each station, and the cutoff score was identified from the resulting regression equation.
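The exact formula falls on a truncated page, but the TC-M and TC-R descriptions above admit a compact sketch. The version below assumes k-means clustering (k = 3) on station scores, cluster ranks coded 1 = low, 2 = average, 3 = high, and, for TC-R, evaluation of the fitted line at 1.5, the point midway between the low and average clusters; the coding and evaluation point are assumptions consistent with the BL-R principle, not the paper's stated formula:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
scores = rng.normal(65, 10, size=250)  # synthetic station scores (%)

# Partition the cohort into three clusters of similar performance.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    scores.reshape(-1, 1))

# Rank the clusters by mean score: 1 = low, 2 = average, 3 = high scorers.
order = np.argsort([scores[labels == k].mean() for k in range(3)])
rank = np.empty(3, dtype=int)
rank[order] = [1, 2, 3]
groups = rank[labels]

# TC-M: midpoint of the low- and average-cluster mean scores.
tcm_cutoff = (scores[groups == 1].mean() + scores[groups == 2].mean()) / 2

# TC-R: regress scores on cluster rank and read the fitted line midway
# between the low (1) and average (2) clusters (assumed evaluation point).
slope, intercept = np.polyfit(groups, scores, deg=1)
tcr_cutoff = intercept + slope * 1.5

print(f"TC-M: {tcm_cutoff:.1f}  TC-R: {tcr_cutoff:.1f}")
```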
To average Cohen’s kappa values for each method, we con- Number of students on each station ranged from 193 to 294.
verted kappa values into z score using Fisher’s z-transforma- MS ranged from 52.48% to 77.87%.
tion, and then transformed averaged z values back into kappa The mean psychometric indices of 30 OSCE stations were
using hyperbolic tangent (TanH) function. as follows: reliability coefficient D 0.76 (SD D 0.12); SEM D
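This averaging is a two-line computation; numpy's arctanh and tanh implement the Fisher transformation and its inverse (the kappa values below are illustrative):

```python
import numpy as np

kappas = np.array([0.62, 0.71, 0.55, 0.68])  # illustrative kappa values

z = np.arctanh(kappas)          # Fisher z-transformation of each kappa
mean_kappa = np.tanh(z.mean())  # average in z space, then back-transform

print(f"averaged kappa = {mean_kappa:.3f}")
```

Working in z space keeps the averaging on an approximately unbounded, normal scale before returning to the kappa metric.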
RESULTS

Analysis of OSCE Scores
Stations' tasks, numbers of students, years of administration, mean scores (MS), and standard deviations are shown in Table 2. Seven stations assessed history taking, 17 physical examination, two communication and counseling, and four procedural skills. The number of students on each station ranged from 193 to 294. MS ranged from 52.48% to 77.87%. The mean psychometric indices of the 30 OSCE stations were as follows: reliability coefficient = 0.76 (SD = 0.12); SEM = 5.66 (SD = 1.38); coefficient of determination = 0.47 (SD = 0.19); and intergrade discrimination = 7.19 (SD = 1.89).
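The station-level indices just reported can be reproduced from scores and global ratings; the sketch below follows formulations commonly used in the OSCE metrics literature (reference 50): SEM = SD·sqrt(1 − reliability), the coefficient of determination as R² from regressing station scores on global grades, and intergrade discrimination as that regression's slope. The exact computations used in the study fall on truncated pages, so these formulations are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
grades = rng.integers(1, 8, size=250)              # 7-point global ratings
scores = 8 * grades + rng.normal(20, 7, size=250)  # synthetic station scores

# SEM from reliability: SEM = SD * sqrt(1 - R), a standard formulation.
reliability = 0.76
sem = scores.std(ddof=1) * np.sqrt(1 - reliability)

# Regress scores on global grades: R^2 is the coefficient of determination
# and the slope is the intergrade discrimination (score gain per grade).
slope, intercept = np.polyfit(grades, scores, deg=1)
predicted = intercept + slope * grades
r_squared = 1 - (((scores - predicted) ** 2).sum()
                 / ((scores - scores.mean()) ** 2).sum())

print(f"SEM = {sem:.2f}  R^2 = {r_squared:.2f}  "
      f"intergrade discrimination = {slope:.2f}")
```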
Cutoff Scores and Pass/Fail Rates by Different Standard Setting Methods
The range, quartiles, median, and distribution of the cutoff scores and fail rates identified by the standard setting methods are shown in Figures 1 and 2.
TABLE 2
Objective structured clinical examination (OSCE) stations
TABLE 3
Cohen's kappa among different standard setting methods
referenced methods with strong underpinning for competency-based assessments. To overcome the latter, we repeated the complete analysis excluding the four norm-based methods and again found that the BL-R method showed the maximum convergent validity evidence among the three criterion-referenced and three cluster analysis methods. Because student entry criteria and the curriculum did not change during these years, we do not expect the combining of scores from different years to affect our results. Other limitations include the conduct of the study at only one medical school, the inclusion of only UGME Year 2 and 3 OSCEs, and the restriction to stations that were repeated at least twice. Furthermore, the choices made in the present study for the cluster analysis may have influenced the results.

The present study focused on convergent validity evidence only; however, the choice of standard setting methods by institutions also depends on other factors.
REFERENCES

5. Schoonheim-Klein M, Walmsley AD, Habets L, van der Velden U, Manogue M. An implementation strategy for introducing an OSCE into a dental school. European Journal of Dental Education 2005;9:143–9.
6. Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, van der Vleuten C, Hoogstraten J, et al. On the reliability of a dental OSCE, using SEM: Effect of different days. European Journal of Dental Education 2008;12:131–7.
7. Newble DI. Eight years' experience with a structured clinical examination. Medical Education 1988;22:200–4.
8. Boursicot KA, Roberts TE, Pell G. Standard setting for clinical competence at graduation from medical school: A comparison of passing scores across five medical schools. Advances in Health Sciences Education 2006;11:173–83.
9. Carraccio C, Englander R. The objective structured clinical examination: A step in the direction of competency-based evaluation. Archives of Pediatrics & Adolescent Medicine 2000;154:736–41.
10. McKnight J, Rideout E, Brown B, Ciliska D, Patton D, Rankin J, et al. The objective structured clinical examination: An alternative approach to assessing student clinical performance. Journal of Nursing Education.
25. Norcini J, Boulet J. Methodological issues in the use of standardized patients for assessment. Teaching and Learning in Medicine 2003;15:293–7.
26. Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Medical Education 2011;45:1181–9.
27. Walsh M, Bailey PH, Koren I. Objective structured clinical evaluation of clinical competence: An integrative review. Journal of Advanced Nursing 2009;65:1584–95.
28. Sireci SG. Using cluster analysis to solve the problem of standard setting. Paper presented at: Annual Meeting of the American Psychological Association, Standard Setting: A Mixed Bag of Judgement, Psychometrics, and Policy symposium; August 14, 1995; New York, NY.
29. Cusimano MD. Standard setting in medical education. Academic Medicine 1996;71:S112–20.
30. Norcini JJ. Standards and reliability in evaluation: When rules of thumb don't apply. Academic Medicine 1999;74:1088–90.
31. Norcini JJ. Setting standards on educational tests. Medical Education 2003;37:464–9.
32. Wayne DB, Fudala MJ, Butter J, Siddall VJ, Feinglass J, Wade LD, et al. Comparison of two standard-setting methods for advanced cardiac life support training. Academic Medicine 2005;80:S63–6.
33. Cizek GJ, Bunch MB. Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage, 2007.
34. Downing SM, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teaching and Learning in Medicine 2006;18:50–7.
35. Jalili M, Hejri SM, Norcini JJ. Comparison of two methods of standard setting: The performance of the three-level Angoff method. Medical Education 2011;45:1199–208.
36. Shulruf B, Turner R, Poole P, Wilkinson T. The Objective Borderline Method (OBM): A probability-based model for setting up an objective pass/fail cut-off score in medical programme assessments. Advances in Health Sciences Education 2013;18:231–44.
37. Norcini JJ, Shea JA. The credibility and comparability of standards. Applied Measurement in Education 1997;10:39–59.
38. Kilminster S, Roberts T. Standard setting for OSCEs: Trial of borderline approach. Advances in Health Sciences Education 2004;9:201–9.
39. Boulet JR, De Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized patient examinations. Medical Teacher 2003;25:245–9.
40. Kaufman DM, Mann KV, Muijtjens AM, van der Vleuten CP. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Academic Medicine 2000;75:267–71.
41. Humphrey-Murto S, MacFadyen JC. Standard setting: A comparison of case-author and modified borderline-group methods in a small-scale OSCE. Academic Medicine 2002;77:729–32.
42. Kramer A, Muijtjens A, Jansen K, Dusman H, Tan L, van der Vleuten C. Comparison of a rational and an empirical standard setting procedure for an OSCE. Objective structured clinical examinations. Medical Education 2003;37:132–9.
43. Wood TJ, Humphrey-Murto SM, Norman GR. Standard setting in a small scale OSCE: A comparison of the modified borderline-group method and the borderline regression method. Advances in Health Sciences Education 2006;11:115–22.
44. Turnbull JM. What is ... normative versus criterion-referenced assessment. Medical Teacher 1989;11:145–50.
45. Zieky MJ, Perie M, Livingston SA. Cutscores: A manual for setting standards of performance on educational and occupational tests. Lexington, KY: Educational Testing Service, 2008.
46. Sireci SG, Robin F. Using cluster analysis to facilitate standard setting. Applied Measurement in Education 1999;12:301–25.
47. Violato C, Marini A, Lee C. A validity study of expert judgement procedures for setting cutoff scores on high stakes credentialing examinations using cluster analysis. Evaluation & the Health Professions 2003;26:59–72.
48. Hess B, Subhiyah RG, Giordano C. Convergence between cluster analysis and the Angoff method for setting minimum passing scores on credentialing examinations. Evaluation & the Health Professions 2007;30:362–75.
49. Streiner DL, Norman GR. Health measurement scales: A practical guide to their development and use. New York, NY: Oxford University Press, 2008.
50. Pell G, Fuller R, Homer M, Roberts T. How to measure the quality of the OSCE: Review of metrics—AMEE guide no. 49. Medical Teacher 2010;32:802–11.
51. Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, van der Vleuten C, van der Velden U. Who will pass the dental OSCE? Comparison of the Angoff and the borderline regression standard setting methods. European Journal of Dental Education 2009;13:162–71.
52. Hobma SO, Ram PM, Muijtjens AM, Grol RP, van der Vleuten CP. Setting a standard for performance assessment of doctor–patient communication in general practice. Medical Education 2004;38:1244–52.