
Teaching and Learning in Medicine: An International Journal
ISSN: 1040-1334 (Print) / 1532-8015 (Online)
Journal homepage: http://www.tandfonline.com/loi/htlm20

Teaching and Learning in Medicine, 27(3), 280–291
Copyright © 2015, Taylor & Francis Group, LLC
DOI: 10.1080/10401334.2015.1044749
Published online: 09 Jul 2015.

To cite this article: Naveed Yousuf, Claudio Violato & Rukhsana W. Zuberi (2015) Standard Setting Methods for Pass/Fail Decisions on High-Stakes Objective Structured Clinical Examinations: A Validity Study, Teaching and Learning in Medicine, 27:3, 280–291, DOI: 10.1080/10401334.2015.1044749

To link to this article: http://dx.doi.org/10.1080/10401334.2015.1044749

Standard Setting Methods for Pass/Fail Decisions on High-Stakes Objective Structured Clinical Examinations: A Validity Study

Naveed Yousuf
Department for Educational Development, Aga Khan University, Karachi, Pakistan

Claudio Violato
Department of Medical Education, University Ambrosiana, Milan, Italy

Rukhsana W. Zuberi
Department for Educational Development, Aga Khan University, Karachi, Pakistan

Construct: Authentic standard setting methods will demonstrate high convergent validity evidence of their outcomes, that is, cutoff scores and pass/fail decisions, with most other methods when compared with each other. Background: The objective structured clinical examination (OSCE) was established for valid, reliable, and objective assessment of clinical skills in health professions education. Various standard setting methods have been proposed to identify objective, reliable, and valid cutoff scores on OSCEs. These methods may identify different cutoff scores for the same examinations. Identification of valid and reliable cutoff scores for OSCEs remains an important issue and a challenge. Approach: Thirty OSCE stations administered at least twice in the years 2010–2012 to 393 medical students in Years 2 and 3 at Aga Khan University are included. Psychometric properties of the scores are determined. Cutoff scores and pass/fail decisions of the Wijnen, Cohen, Mean–1.5SD, Mean–1SD, Angoff, borderline group (BL-G), and borderline regression (BL-R) methods are compared with each other and with three variants of cluster analysis using repeated measures analysis of variance and Cohen's kappa. Results: The mean psychometric indices on the 30 OSCE stations are reliability coefficient = 0.76 (SD = 0.12); standard error of measurement = 5.66 (SD = 1.38); coefficient of determination = 0.47 (SD = 0.19); and intergrade discrimination = 7.19 (SD = 1.89). The BL-R and Wijnen methods show the highest convergent validity evidence with the other methods on the defined criteria. Angoff and Mean–1.5SD demonstrate the least convergent validity evidence. The three cluster variants show substantial convergent validity with the borderline methods. Conclusions: Although the Wijnen method showed a high level of convergent validity, it lacks the theoretical strength to be used for competency-based assessments. The BL-R method shows the highest convergent validity evidence for OSCEs with the other standard setting methods used in the present study. We also found that cluster analysis using the mean method can be used for quality assurance of the borderline methods. These findings should be further confirmed by studies in other settings.

Keywords: standard setting, OSCE, validity, psychometrics, clinical skills assessment

BACKGROUND

Clinical skills acquisition is central to health professions education (HPE). Ensuring competence in clinical skills among graduating health professionals is one of the primary responsibilities of certifying or licensure awarding institutions. Objective structured clinical examinations (OSCEs) were introduced to make assessment of clinical skills objective, structured, and standardized.1,2 Subsequently, the OSCE has become widely practiced for clinical skills assessment in both undergraduate and postgraduate HPE.2–18 Over the past three decades, the role of the OSCE has been established as performance-based assessment to enhance the validity and reliability of competence decisions.19–26 The quality of decisions based on OSCEs, however, depends on the number and content of stations, faculty development, standardized patient training, scoring method, total examination time, and the standards for passing the examination.27 Identification of valid and reliable cutoff scores, specifically for OSCEs, remains an important issue and a major challenge.28–35 An invalid cutoff score may lead to wrong decisions with negative consequences for both candidates and society.36

Various standard setting methods have been proposed to identify objective, reliable, and valid cutoff scores based on strong methodological rigor and rationale.6,29–31,33,34,37–39 These methods have their own strengths and limitations and may identify different cutoff scores for the same examinations.6,29,40–43 Standard setting methods are broadly categorized into norm-referenced and criterion-referenced methods (Table 1).6,29,31,40–44

Norm-referenced or relative methods identify the cutoff score relative to the performance of the group taking the examination and are thus dependent upon the performance of the examinees.6,33,40,45 They are more suited for selection purposes, where a defined number of top-scoring examinees are selected.6 Examples include the Mean–1.5SD, Wijnen, and Cohen methods (Table 1).

Correspondence may be sent to Naveed Yousuf, Department for Educational Development, Faculty of Health Sciences, Aga Khan University, P.O. Box 3500, Stadium Road, Karachi 74800, Pakistan. E-mail: [email protected]


TABLE 1
Comparison of different standard setting methods

A. Norm-Referenced Methods
Underpinning methodology: Identify the cutoff score at a defined point on the range of student scores.
Strengths:
- Easy to calculate.
Limitations:
- Cutoff scores are dependent on the cohort's performance or mean scores.
- Do not take into account the examination content or the expected competency of students.
- Cutoff scores can only be identified after the examination is administered and scored.
- A fixed number of students may fail and pass the examination depending upon the method, regardless of students' competence.

1. Mean–1SD: The cutoff score is identified one standard deviation below the cohort mean score.
2. Mean–1.5SD: The cutoff score is identified 1.5 standard deviations below the cohort mean score.
3. Wijnen: Cutoff score = Mean score − (1.96 × SEM). Lowers the cutoff score when the standard error of measurement (SEM) is high, that is, when reliability is low.
4. Cohen: Cutoff score = (Mean score + (1.65 × SD)) × 0.6. Takes into account the best students' performance as a reference point rather than the mean score; identifies the cutoff score at 60% of the 95th percentile rank score of the cohort.

B. Criterion-Referenced Methods
Underpinning methodology: The cutoff scores are based on the expected competence of students on the content being examined.
Strengths:
- Involving experts for judgments enhances face validity.
- Preferred for competence-based examinations.
- Take into account the expected competence of the students on the content being examined.
- All students may pass or fail the examination based on their ability/performance.
Limitations:
- Resource intensive.
- Dependent on expert judgments, which may be considered subjective/biased.

1. Angoff: Cutoff scores are identified by experts reviewing the test items, based on the probability of a borderline student performing correctly on each item. Strength: the cutoff score is determined, and known to both faculty and students, before the examination is administered. Limitation: resource intensive in terms of the required number of experts, their time, and their expertise in the method.
2. Borderline Group: Experts review the students' performances on the test and grade them as unsatisfactory, borderline, or satisfactory. The mean score of the borderline students is used as the cutoff score. Strengths: cutoff scores are based on students' actual performance on the test; easy to use. Limitations: utilizes scores of students graded as borderline only; requires large cohort sizes to obtain a reliable number of borderline students; the cutoff score is susceptible to moving toward the extreme scores of the borderline students on either side.
3. Borderline Regression: Experts review the students' performances on the test and grade them as unsatisfactory, borderline, or satisfactory. The performance scores of the cohort are regressed on the grades, and the score corresponding to the borderline grade is identified as the cutoff score. Strengths: cutoff scores are based on students' actual performance on the test; the scores and grades of the entire cohort are taken into account. Limitations: assumes a linear association between performance scores and global grades; requires some expertise in statistics; the cutoff score is influenced by extreme scores within a global grade.

C. Cluster Analysis
Underpinning methodology: Classify students into groups/clusters of similar performances using the mathematical concepts of distance (how far apart two performances are) and similarity (how close two performances are).
Strengths:
- Does not require expert judgments in setting the cutoff score directly, as it is a statistical technique for classifying students into groups of varying abilities.
Limitations:
- Requires substantial expertise in statistics.
- Requires large cohort sizes for appropriate classification.
- Subjective decisions regarding the number of clusters and the method of analysis may influence the cutoff scores.

Note: SEM = standard error of measurement.
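To make the norm-referenced formulas concrete, the following minimal sketch (ours, not part of the study) computes all four cutoffs for one station; it assumes percentage scores and derives the SEM from classical test theory as SD × sqrt(1 − reliability):

```python
import numpy as np

def norm_referenced_cutoffs(scores, reliability):
    """Cutoff scores for the four norm-referenced methods in Table 1.

    Assumes `scores` are station scores in percentages and that
    SEM = SD * sqrt(1 - reliability) (classical test theory).
    """
    mean, sd = scores.mean(), scores.std(ddof=1)
    sem = sd * np.sqrt(1.0 - reliability)
    return {
        "Mean-1SD": mean - sd,
        "Mean-1.5SD": mean - 1.5 * sd,
        "Wijnen": mean - 1.96 * sem,
        "Cohen": (mean + 1.65 * sd) * 0.6,
    }

# Synthetic percentage scores for one station (illustrative only)
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(68, 11, size=200), 0, 100)
print(norm_referenced_cutoffs(scores, reliability=0.76))
```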


Criterion-based or absolute methods identify cutoff scores based on the level of competence expected of students on the content being examined and are thus preferred for competence-based assessments like OSCEs.30,44 These can be further categorized into examination-centered methods (e.g., Angoff) and examinee-centered methods (e.g., borderline group [BL-G] and borderline regression [BL-R]; Table 1).

More recently, cluster analysis has been proposed as a standard setting method; it is mainly of two types: the clustering or hierarchical model, and the partitioning or k-means model.46,47 K-means cluster analysis categorizes data into a required number of homogeneous groups, where k refers to the number of groups. Because standard setting involves categorizing students based on their abilities or competence, k-means cluster analysis has been proposed for standard setting by researchers in the field.28,46–48 Cluster analysis identifies groups of similar performances in a cohort using the mathematical concepts of distance and similarity among performances. Cluster analysis is thus considered objective in that human judgments are not used in setting the cutoff score. It has therefore been proposed for the evaluation and validation of other standard setting methods that require expert judgments, such as the Angoff and borderline methods.46,47

Despite the differences among standard setting methods, the outcomes of authentic methods (i.e., cutoff scores and pass/fail decisions) should be similar for the same examination, if not the same.

Purpose
The purpose of the present study was to investigate various standard setting methods for OSCEs based on convergent validity evidence, by comparing the commonly used methods against each other and against cluster analysis as a prospective standard setting method.

METHODS

Participants
We included 30 OSCE stations that were administered at least twice in undergraduate medical education (UGME) Years 2 and 3 high-stakes certifying examinations in 2010, 2011, and 2012 at the Aga Khan University Medical College (AKU-MC) in Karachi, Pakistan. A total of 393 medical students (183 men [46.56%] and 210 women [53.44%]) participated in these examinations.

Procedures
OSCE station development and assessment. OSCE stations were developed by clinical faculty according to a table of specifications. Each OSCE station was then reviewed by a group of four to seven faculty members facilitated by an educational expert. The faculty members included content experts in the discipline/specialty, and frequently a member from another specialty for a multidisciplinary review.

During OSCEs, students' performance on each station was marked by clinical faculty using a 7-point rating scale ranging from 0 to 6 for each item, with a global rating (GR) at the end. The 7-point rating scale was developed to obtain reliable scores of students' performance on high-stakes OSCEs.49 Items on each station are specific and relevant to the task being assessed. The GR is based on seven performance grades: poor, borderline fail, borderline pass, good, very good, excellent, and outstanding. The GR did not contribute toward the station score but was used for quality assurance and standard setting procedures.

Clinical faculty members were invited as examiners for the OSCE examinations. Each examiner scored a specific station, and hence there were as many examiners as the number of stations (14–16) in the examination. These faculty members were familiar with the OSCE format, as they were involved in the development, review, and/or standard setting of the OSCEs, and most had been OSCE examiners before. Examiners were briefed about the purpose of the examination and the format and rating of OSCE stations, and any other issues or concerns were addressed in a preexamination meeting held prior to the examination. Examiners were told to make independent judgments of students' performance using the GR, irrespective of the ratings on station items. All stations assessing history taking, communication, and/or counseling skills involved standardized patients (SPs). Stations assessing physical examination or procedural skills involved SPs or mannequins, as appropriate. SPs did not rate the students.

All stations were weighted equally. The total examinee score was calculated by averaging the individual station scores. Students were required to pass the OSCE examination to progress to the subsequent year of study. Unsuccessful students have an opportunity for a second attempt after remediation. If they fail the re-sit OSCE, they have to repeat the year of study.

Standard setting methods. We used four norm-based methods (Wijnen, Cohen, Mean–1SD, and Mean–1.5SD), three criterion-based methods (Angoff, BL-G, and BL-R), and three variants of cluster analysis.

Norm-referenced methods. We determined cutoff scores using the Mean–1SD, Mean–1.5SD, Wijnen, and Cohen methods by applying the formulae stated in Table 1.

Criterion-referenced methods. For the three criterion-referenced methods, we selected content experts as judges based on the following criteria:

1. Have sufficient experience of teaching the particular level of students.
2. Have been involved with assessment of students at the level, and for the content, under consideration.
Cutoff scores using the modified Angoff method were identified by a group of four to seven judges facilitated by an educational expert, prior to administration of the OSCE stations in the examination. The educational expert briefed the judges about the purpose and steps of the standard setting process, followed by a short discussion on the qualities of a borderline student. We followed the reported modified Angoff procedure with slight variation.34 The Angoff method gives judges the freedom to estimate any probability, from 0 to 100%, of a borderline student performing an item correctly.34 For the present study, we projected the items of the OSCE station along with percentages corresponding to each point on the rating scale, rounded off to the nearest 5th percentage value. We requested the judges to select probabilities from those identified on the rating scale. The question posed was, "What is the probability of a borderline student performing this item correctly?" Each judge then independently identified the probability from the percentages shown with the rating scale. The judges then shared their selected probabilities. For each item, if one or more judges differed by more than 20% from the others, they were asked to discuss this with the other judges and reconsider their judgment. Any remaining outlier judgments were not included for the item. Judgments for each item were averaged to get a mean passing score for that item. The item mean passing scores were averaged to obtain the station cutoff score as a percentage.
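The aggregation arithmetic can be sketched as follows. This is an illustration under a simplifying assumption of ours: an outlier is treated as a judgment lying more than 20 percentage points from the mean of the other judges for that item, whereas the study first resolved such differences by discussion:

```python
import numpy as np

def angoff_station_cutoff(judgments):
    """Aggregate modified-Angoff judgments into a station cutoff (percentage).

    judgments[i] holds each judge's probability (0-100) that a borderline
    student performs item i correctly. Remaining outliers are dropped, item
    judgments are averaged, and the item means are averaged in turn.
    """
    item_means = []
    for item in judgments:
        item = np.asarray(item, dtype=float)
        kept = [v for j, v in enumerate(item)
                if abs(v - np.delete(item, j).mean()) <= 20]  # outlier rule (ours)
        item_means.append(np.mean(kept) if kept else item.mean())
    return float(np.mean(item_means))  # station cutoff = mean of item means

# Five judges rating three items of one station (illustrative values)
print(angoff_station_cutoff([[60, 55, 60, 65, 90],
                             [50, 45, 50, 55, 50],
                             [70, 70, 65, 75, 70]]))
```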
To determine the cutoff score for each station using the BL-G method, we calculated the mean score of all students rated as borderline fail or borderline pass on the GR for that station.

For the BL-R method, we regressed the OSCE station scores on the GR using linear regression. The cutoff score was obtained as follows:

Passing score = Intergrade discrimination × x + Constant,

where x is the midpoint of the borderline fail and borderline pass groups on the GR scale.
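A minimal sketch of this computation (assuming, for illustration, that the seven global grades are coded 1 = poor through 7 = outstanding, so the borderline fail/borderline pass midpoint is x = 2.5):

```python
import numpy as np
from scipy.stats import linregress

def blr_cutoff(station_scores, global_ratings):
    """Borderline regression cutoff for one station.

    Regresses station scores (%) on the global rating (coded 1-7); the slope
    is the intergrade discrimination, and the cutoff is the predicted score
    at the midpoint between borderline fail (2) and borderline pass (3).
    """
    fit = linregress(global_ratings, station_scores)
    x_mid = 2.5  # midpoint of borderline fail (2) and borderline pass (3)
    return fit.slope * x_mid + fit.intercept

# Illustrative data: eight students' scores and global ratings on one station
scores = np.array([41.0, 48.0, 55.0, 58.0, 66.0, 71.0, 80.0, 88.0])
grades = np.array([1, 2, 2, 3, 4, 4, 5, 6])
print(round(blr_cutoff(scores, grades), 1))
```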
Data Analyses
Statistical Package for the Social Sciences version 19 (IBM Corp., Armonk, NY, USA) was used for data analyses.

Cluster analysis. To explore the natural number of distinct groups among our students based on their performance, we categorized the OSCE scores of each station into two to seven clusters using k-means cluster analysis. We followed an iterative procedure until convergence was achieved at p < .02, which ensured minimum performance variance within each cluster and maximum performance variance between different clusters. We observed that for all stations three clusters were identified within two iterations, confirming a natural fit of the data into three groups. Based on the performance patterns of the three clusters on the OSCE stations, we named them low, average, and high scorers and assumed that the valid cutoff score would lie at some point between the low and average scorers. We employed three different cluster analysis variants for standard setting: the three-cluster mean (TC-M), three-cluster regression (TC-R), and three-cluster contrast (TC-C) methods. The mean and contrast methods using cluster analysis are reported in the literature with some variations.28,46–48 We studied the TC-R method, based on the principle of the BL-R method, for the first time.

For TC-M, the mean of both the low and average scorers was calculated for each station. The midpoint of these two mean score values, that is, of the low and average scorers, was then identified as the cutoff score for that OSCE station.

For TC-R, OSCE scores were regressed on the three cluster groups using linear regression. The cutoff score was identified using the formula:

Passing score = Intergrade discrimination × x + Constant,

where x is the midpoint between the low and average scorers.

For the TC-C method, the three cluster groups are regressed on the OSCE station scores using linear regression. Using the low and average scoring groups as contrast groups, the cutoff score is identified using the formula:

Passing score = (y − Constant) / Intergrade discrimination,

where y is the midpoint between the low and average scorers.
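The three variants can be sketched as follows. The study ran k-means in SPSS; here we substitute scikit-learn's KMeans and assume the cluster groups are coded 1 = low, 2 = average, 3 = high, so the low/average midpoint on the group scale is 1.5:

```python
import numpy as np
from scipy.stats import linregress
from sklearn.cluster import KMeans

def three_cluster_cutoffs(scores):
    """TC-M, TC-R, and TC-C cutoffs for one station (illustrative sketch)."""
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    raw = km.fit_predict(scores.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())   # rank clusters low -> high
    group = np.empty_like(raw)
    for rank, cluster_id in enumerate(order):
        group[raw == cluster_id] = rank + 1           # 1 = low, 2 = average, 3 = high

    # TC-M: midpoint of the low and average group mean scores
    tc_m = (scores[group == 1].mean() + scores[group == 2].mean()) / 2

    # TC-R: regress scores on groups; cutoff at the low/average midpoint (1.5)
    fit = linregress(group, scores)
    tc_r = fit.slope * 1.5 + fit.intercept

    # TC-C: regress groups on scores; invert at the low/average midpoint (1.5)
    fit_c = linregress(scores, group)
    tc_c = (1.5 - fit_c.intercept) / fit_c.slope

    return {"TC-M": tc_m, "TC-R": tc_r, "TC-C": tc_c}

rng = np.random.default_rng(1)
print(three_cluster_cutoffs(np.clip(rng.normal(68, 11, 200), 0, 100)))
```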
Analysis of OSCE Scores
The mean score (MS), standard deviation, reliability coefficient using Cronbach's alpha, and standard error of measurement (SEM) of the scores on each OSCE station were determined. Intergrade discrimination and the coefficient of determination were calculated as described by Pell, Fuller, Homer, and Roberts.50 Intergrade discrimination is the average increase in OSCE score with an increase of one grade on the GR scale. The coefficient of determination indicates the proportion of variance in OSCE scores explained by the GR on that station. Both measures indicate discrimination among students on the GR based on OSCE scores.
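Both indices come from a single regression of station scores on the GR: the slope is the intergrade discrimination, and the squared correlation is the coefficient of determination. A brief sketch with illustrative data (GR coded 1-7 as assumed earlier):

```python
import numpy as np
from scipy.stats import linregress

def station_gr_metrics(scores, global_ratings):
    """Intergrade discrimination and coefficient of determination (Pell et al.).

    The slope of station score on global rating is the average score gain
    per grade; r-squared is the share of score variance explained by the GR.
    """
    fit = linregress(global_ratings, scores)
    return {"intergrade_discrimination": fit.slope,
            "coefficient_of_determination": fit.rvalue ** 2}

scores = np.array([52.0, 57.0, 61.0, 64.0, 70.0, 74.0, 79.0, 86.0])
grades = np.array([2, 2, 3, 3, 4, 5, 5, 6])
print(station_gr_metrics(scores, grades))
```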
Comparison of Standard Setting Methods for Evidence of Convergent Validity
The cutoff scores and pass/fail decisions of the standard setting methods were compared with each other. For evidence of convergent validity with other methods, a method had to

- fail to show a statistically different cutoff score on repeated measures analysis of variance (ANOVA) at the 95% confidence level, and
- have a Cohen's kappa value of at least 0.60.
To average the Cohen's kappa values for each method, we converted the kappa values into z scores using Fisher's z-transformation, and then transformed the averaged z values back into kappa using the hyperbolic tangent (TanH) function.
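This averaging is a short computation; the example below reproduces the mean kappa of the Wijnen method with the other norm-referenced methods from Table 3:

```python
import numpy as np

def mean_kappa(kappas):
    """Average kappa values via Fisher's z-transformation.

    Each kappa is mapped to z = arctanh(kappa), the z values are averaged,
    and the mean is mapped back to kappa with the hyperbolic tangent.
    """
    z = np.arctanh(np.asarray(kappas, dtype=float))
    return float(np.tanh(z.mean()))

# Wijnen's pairwise kappas with Cohen, Mean-1.5SD, and Mean-1SD (Table 3)
print(round(mean_kappa([0.65, 0.48, 0.82]), 2))  # 0.67, matching the table
```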
RESULTS

Analysis of OSCE Scores
Stations' tasks, number of students, years of administration, MS, and standard deviation are shown in Table 2. Seven stations assessed history taking, 17 physical examination, two communication and counseling, and four procedural skills. The number of students on each station ranged from 193 to 294. MS ranged from 52.48% to 77.87%.

The mean psychometric indices of the 30 OSCE stations were as follows: reliability coefficient = 0.76 (SD = 0.12); SEM = 5.66 (SD = 1.38); coefficient of determination = 0.47 (SD = 0.19); and intergrade discrimination = 7.19 (SD = 1.89).

Cutoff Scores and Pass/Fail Rates by Different Standard Setting Methods
Range, quartiles, median, and distribution of the cutoff scores and fail rates identified by the standard setting methods are shown in Figures 1 and 2.

TABLE 2
Objective structured clinical examination (OSCE) stations

OSCE Station | Years of Administration | No. of Students | % Mean Score (SD)
Examination of Nasal Polyp (Mass) [a] | 2010, 2011 | 196 | 72.79 (16.33)
Examination of Ear for Otosclerosis (Abnormal Bone Growth) [a] | 2010, 2012 | 195 | 79.32 (13.48)
History of Quinsy (Peritonsillar Abscess) [a] | 2011, 2012 | 199 | 72.48 (15.96)
History of Sleep Apnea (Pauses in Breathing) [a] | 2010, 2011 | 196 | 62.17 (12.99)
History of Laryngeal Carcinoma [a] | 2010, 2012 | 195 | 63.39 (11.95)
Basic Life Support [b] | 2010–2012 | 294 | 70.95 (12.92)
Examination of Blood Pressure and Radial Pulse [b] | 2011, 2012 | 195 | 65.10 (10.49)
Breaking Bad News [b] | 2010, 2012 | 196 | 68.73 (8.84)
Examination of Breast [b] | 2010–2012 | 294 | 64.15 (9.04)
Examination of Cerebellar Function (a) [b] | 2011, 2012 | 195 | 66.11 (14.96)
Digital Rectal Examination [b] | 2010–2012 | 294 | 67.64 (11.57)
Examination of Posterior Chest [b] | 2010–2012 | 294 | 66.04 (11.86)
History of Abdominal Pain [b] | 2010, 2011 | 197 | 76.29 (14.17)
History of Fever [b] | 2010–2012 | 294 | 69.25 (9.29)
History of Joint Pain [b] | 2011, 2012 | 195 | 62.21 (10.87)
Intramuscular Injection [b] | 2011, 2012 | 195 | 72.45 (13.49)
Examination of Jugular Venous Pressure and Perform Peak Flow [b] | 2011, 2012 | 195 | 72.59 (9.56)
Examination of Liver and Spleen [b] | 2011, 2012 | 195 | 63.94 (8.97)
Counselling for Myths and Misconceptions (About Maintenance of Health, Drug Effects) [b] | 2010, 2011 | 196 | 69.82 (14.93)
Examination of Anterior Segment of Eye [a] | 2010, 2012 | 194 | 57.93 (12.57)
Examination of Eye for Corneal Opacity [a] | 2011, 2012 | 199 | 63.43 (9.08)
Examination of Eye for Ocular Motility [a] | 2010, 2011 | 195 | 52.48 (8.46)
Fundoscopy of the Eye [a] | 2010–2012 | 294 | 69.40 (14.17)
History of Painful Red Eye in Glaucoma [a] | 2010, 2012 | 194 | 69.02 (16.02)
Visual Field Testing [a] | 2010–2012 | 294 | 77.87 (11.99)
Examination of Precordium [c] | 2010–2012 | 293 | 76.40 (11.22)
Examination of Cerebellar Function (b) [c] | 2010, 2011 | 193 | 72.17 (9.65)
Examination of Thorax in Chronic Obstructive Pulmonary Disease [c] | 2010, 2012 | 193 | 66.88 (8.45)
Examination of Inguino-Scrotal Swelling [c] | 2010, 2011 | 193 | 66.73 (14.46)
Examination of a Lump [c] | 2010–2012 | 210 | 70.49 (15.29)

[a] Year 3 Otolaryngology and Ophthalmology OSCE.
[b] Year 2 End-of-Year OSCE (assesses history taking, physical examination, procedures, and a longitudinal theme).
[c] Year 3 Grand OSCE (integrated OSCE: family medicine, internal medicine, surgery, and a longitudinal theme).
FIG. 1. Distribution of cutoff scores identified by different standard setting methods. Note: OSCE = objective structured clinical examination.

FIG. 2. Distribution of failure rates identified by different standard setting methods.

The most to least stringent among the norm-referenced standard setting methods were Wijnen, Mean–1SD, Cohen, and Mean–1.5SD, whereas among the criterion-referenced methods they were BL-G, BL-R, and Angoff. Among the three-cluster variants, the most to least stringent were TC-R, TC-M, and TC-C. Overall, TC-R had the highest, whereas the Mean–1.5SD method had the lowest, median and mean cutoff scores.

Comparison of Cutoff Scores on Repeated Measures ANOVA at 95% Confidence Level
Cutoff scores identified by the Angoff method did not show a significant difference with eight of the nine other methods using ANOVA at the 95% confidence level (see Figure 3). These were followed by the Wijnen, BL-R, and TC-C methods, whose cutoff scores did not differ significantly from those of seven methods. In contrast, the Mean–1.5SD and TC-R methods showed nonsignificantly different cutoff scores with only three other methods each.

FIG. 3. Similarities of cutoff scores identified by different standard setting methods as identified by repeated measures analysis of variance. Note: BL-R = borderline regression; TC-C = three-cluster contrast; TC-M = three-cluster mean; BL-G = borderline group.

The borderline and three-cluster variants identified statistically different cutoff scores within their variants; however, they did not show any difference when compared across variants.

Cohen's Kappa Among Standard Setting Methods
Cohen's kappa values among the standard setting methods, based on 6,622 decisions for the 30 OSCE stations, are shown in Table 3. Cohen's kappa values ranged from 0.32, between Mean–1.5SD and TC-R, to 0.85, between the BL-G and BL-R methods. The Mean–1SD method had the highest mean kappa value with the other methods (0.72), whereas the Mean–1.5SD method had the lowest (0.49; Table 3). BL-G and BL-R had the highest mean kappa values among the criterion-referenced methods, whereas TC-M had the highest mean kappa value among the three-cluster variants.
TABLE 3
Cohen's kappa among different standard setting methods

For each method, the pairwise Cohen's kappa [a] and its Fisher z-transformation (in parentheses) are listed against every other method, followed by the mean z and mean kappa within each category of methods [b]; the method's header line gives the overall mean z and mean kappa across all other methods [c].

Wijnen (overall M z = 0.82, M kappa = 0.68)
  Norm-referenced (M z = 0.81, M kappa = 0.67): Cohen 0.65 (0.77); Mean–1.5SD 0.48 (0.52); Mean–1SD 0.82 (1.15)
  Criterion-referenced (M z = 0.84, M kappa = 0.69): Angoff 0.55 (0.62); BL-Group 0.75 (0.98); BL-Regression 0.73 (0.92)
  Cluster analysis (M z = 0.81, M kappa = 0.67): TC-Mean 0.73 (0.94); TC-Regression 0.62 (0.73); TC-Contrast 0.64 (0.76)

Cohen (overall M z = 0.74, M kappa = 0.63)
  Norm-referenced (M z = 0.77, M kappa = 0.65): Wijnen 0.65 (0.77); Mean–1.5SD 0.60 (0.70); Mean–1SD 0.69 (0.86)
  Criterion-referenced (M z = 0.68, M kappa = 0.59): Angoff 0.56 (0.64); BL-Group 0.61 (0.71); BL-Regression 0.61 (0.70)
  Cluster analysis (M z = 0.76, M kappa = 0.64): TC-Mean 0.61 (0.71); TC-Regression 0.50 (0.55); TC-Contrast 0.78 (1.03)

Mean–1.5SD (overall M z = 0.53, M kappa = 0.49)
  Norm-referenced (M z = 0.62, M kappa = 0.55): Wijnen 0.48 (0.52); Cohen 0.60 (0.70); Mean–1SD 0.56 (0.64)
  Criterion-referenced (M z = 0.51, M kappa = 0.47): Angoff 0.44 (0.47); BL-Group 0.44 (0.48); BL-Regression 0.53 (0.59)
  Cluster analysis (M z = 0.47, M kappa = 0.43): TC-Mean 0.43 (0.46); TC-Regression 0.32 (0.34); TC-Contrast 0.54 (0.60)

Mean–1SD (overall M z = 0.90, M kappa = 0.72)
  Norm-referenced (M z = 0.88, M kappa = 0.71): Wijnen 0.82 (1.15); Cohen 0.69 (0.86); Mean–1.5SD 0.56 (0.64)
  Criterion-referenced (M z = 0.95, M kappa = 0.74): Angoff 0.68 (0.83); BL-Group 0.77 (1.03); BL-Regression 0.76 (0.99)
  Cluster analysis (M z = 0.87, M kappa = 0.70): TC-Mean 0.75 (0.97); TC-Regression 0.63 (0.74); TC-Contrast 0.72 (0.90)

Angoff (overall M z = 0.60, M kappa = 0.53)
  Norm-referenced (M z = 0.64, M kappa = 0.56): Wijnen 0.55 (0.62); Cohen 0.56 (0.64); Mean–1.5SD 0.44 (0.47); Mean–1SD 0.68 (0.83)
  Criterion-referenced (M z = 0.58, M kappa = 0.52): BL-Group 0.50 (0.55); BL-Regression 0.54 (0.60)
  Cluster analysis (M z = 0.55, M kappa = 0.50): TC-Mean 0.58 (0.66); TC-Regression 0.46 (0.49); TC-Contrast 0.46 (0.50)

BL-Group (overall M z = 0.83, M kappa = 0.68)
  Norm-referenced (M z = 0.80, M kappa = 0.66): Wijnen 0.75 (0.98); Cohen 0.61 (0.71); Mean–1.5SD 0.44 (0.48); Mean–1SD 0.77 (1.03)
  Criterion-referenced (M z = 0.90, M kappa = 0.72): Angoff 0.50 (0.55); BL-Regression 0.85 (1.25)
  Cluster analysis (M z = 0.84, M kappa = 0.68): TC-Mean 0.75 (0.98); TC-Regression 0.68 (0.83); TC-Contrast 0.61 (0.70)

BL-Regression (overall M z = 0.82, M kappa = 0.68)
  Norm-referenced (M z = 0.80, M kappa = 0.67): Wijnen 0.73 (0.92); Cohen 0.61 (0.70); Mean–1.5SD 0.53 (0.59); Mean–1SD 0.76 (0.99)
  Criterion-referenced (M z = 0.93, M kappa = 0.73): Angoff 0.54 (0.60); BL-Group 0.85 (1.25)
  Cluster analysis (M z = 0.77, M kappa = 0.65): TC-Mean 0.70 (0.87); TC-Regression 0.61 (0.70); TC-Contrast 0.63 (0.74)

TC-Mean (overall M z = 0.84, M kappa = 0.69)
  Norm-referenced (M z = 0.77, M kappa = 0.65): Wijnen 0.73 (0.94); Cohen 0.61 (0.71); Mean–1.5SD 0.43 (0.46); Mean–1SD 0.75 (0.97)
  Criterion-referenced (M z = 0.84, M kappa = 0.68): Angoff 0.58 (0.66); BL-Group 0.75 (0.98); BL-Regression 0.70 (0.87)
  Cluster analysis (M z = 0.99, M kappa = 0.76): TC-Regression 0.81 (1.12); TC-Contrast 0.69 (0.85)

TC-Regression (overall M z = 0.69, M kappa = 0.60)
  Norm-referenced (M z = 0.59, M kappa = 0.53): Wijnen 0.62 (0.73); Cohen 0.50 (0.55); Mean–1.5SD 0.32 (0.34); Mean–1SD 0.63 (0.74)
  Criterion-referenced (M z = 0.68, M kappa = 0.59): Angoff 0.46 (0.49); BL-Group 0.68 (0.83); BL-Regression 0.61 (0.70)
  Cluster analysis (M z = 0.92, M kappa = 0.73): TC-Mean 0.81 (1.12); TC-Contrast 0.61 (0.71)

TC-Contrast (overall M z = 0.75, M kappa = 0.64)
  Norm-referenced (M z = 0.81, M kappa = 0.67): Wijnen 0.64 (0.76); Cohen 0.76 (0.98); Mean–1.5SD 0.54 (0.60); Mean–1SD 0.72 (0.90)
  Criterion-referenced (M z = 0.66, M kappa = 0.58): Angoff 0.46 (0.50); BL-Group 0.63 (0.75); BL-Regression 0.63 (0.74)
  Cluster analysis (M z = 0.78, M kappa = 0.65): TC-Mean 0.69 (0.85); TC-Regression 0.61 (0.71)

Note: BL = borderline; TC = three-cluster.
[a] Cohen's kappa on 6,622 decisions on the 30 objective structured clinical examination stations used in the present study.
[b] Mean kappa value separately with norm-referenced, criterion-referenced, and cluster analysis methods.
[c] Mean kappa value with all other standard setting methods used in the present study.
Combining Convergent Validity Evidences Together
The BL-R and Wijnen methods showed convergent validity evidence with six other methods on the defined criteria (see Figure 4). Mean–1SD, BL-G, TC-M, and TC-C showed convergent validity evidence with five other methods, whereas Cohen and TC-R did so with three. The Angoff and Mean–1.5SD methods demonstrated evidence of convergent validity with only the Mean–1SD and Cohen methods, respectively.

FIG. 4. Convergent validity evidences of standard setting methods. Note: The circles/parentheses and values represent convergent validity on repeated measures analysis of variance using cutoff scores and Cohen's kappa using pass/fail agreements, respectively. Superscript 2 indicates that the values on the extreme left are the kappa values between TC-C and the other methods indicated. Superscript 3 indicates that the values on the extreme right are the kappa values between TC-M and the other methods. BL-R = borderline regression; TC-C = three-cluster contrast; TC-M = three-cluster mean; BL-G = borderline group.

DISCUSSION
The 30 stations represented well the content and skills taught and assessed during UGME Years 2 and 3. The mean psychometric indices of the station scores were satisfactory. The mean reliability coefficient of 0.76 was comparable with the mean of 0.78 reported by a comprehensive systematic literature review on the reliability of OSCEs.26 Also, a Cronbach's alpha of at least 0.70 is recommended for high-stakes OSCEs.50

The rank order of the norm-referenced methods on cutoff scores was expected and is inherent in their calculations.33,42,45 Among the criterion-referenced methods, the most to least stringent were BL-G, BL-R, and Angoff. There are reports that the BL-G and BL-R methods produce realistic cutoff scores as compared to the Angoff method.42,51,52

The TC-R and Mean–1.5SD methods showed the least evidence of convergent validity among all methods using repeated measures ANOVA at the 95% confidence level. The reason could be that TC-R and Mean–1.5SD were the most and least stringent of all the methods used, respectively.

The fact that the borderline and three-cluster variants did not show any difference in cutoff scores against each other using repeated measures ANOVA at the 95% confidence level was indicative of the convergent validity of both approaches. The borderline methods depend on content experts' judgments, whereas the three-cluster variants are mathematically derived. Previous work has shown convergence of cluster analyses with the Ebel method for OSCEs.47

On combining the validity evidences together using the defined criteria, the Wijnen and BL-R methods showed convergent validity with most other methods. The high level of convergent validity demonstrated by the Wijnen method may be an incidental finding, as it is based on the station MS and SEM. The theoretical basis of the Wijnen method, being norm-referenced, is not strong enough for it to be recommended for competency-based assessments like OSCEs. BL-R has a strong theoretical underpinning, has significant face validity, and has been proposed as the preferred method for OSCEs in many other studies.42,43,51

The TC-M method showed high convergence with the borderline methods. Hence, cluster analysis using the mean method can be used for quality assurance of the borderline methods. Cluster analysis should be investigated further as a prospective standard setting method.

Cluster analysis, however, has its own limitations, as it requires a large cohort size, expertise in statistics, and adequate exploration of the performance data to identify the natural number of clusters. Cluster analysis does not require expert judgments in cutoff score setting directly; however, subjective decisions regarding the number of clusters and the method of analysis may influence the cutoff scores.
Limitations of the present study included the use of convergent validity evidence based on scores of stations repeated over 2 or 3 years treated as a single cohort, and the comparison of norm-referenced methods, which have a weak theoretical basis, against criterion-referenced methods, which have a strong underpinning for competency-based assessments. To address the latter, we repeated the complete analysis excluding the four norm-based methods and again found that the BL-R method showed the maximum convergent validity evidence among the three criterion-referenced and three cluster analysis methods. Because student entry criteria and the curriculum did not change over these years, we do not expect combining scores from different years to affect our results. Other limitations include the conduct of the study in only one medical school and the inclusion of only UGME Year 2 and 3 OSCEs, and only those repeated at least twice. Furthermore, the choices made in the present study for the cluster analysis may have influenced the results.

The present study focused on convergent validity evidence only; however, the choice of standard setting methods by institutions also depends on other factors, for example, faculty expertise, resources, number of students, and policies. It is important to note that standard setting is a process to enhance the defensibility of pass/fail decisions, and as such there is no single valid gold standard method. The convergent validity evidence sought in the present study is only one of the many evidences required to determine the appropriateness of standard setting methods.

A strength of the present study was the comparison of many commonly used standard setting methods for OSCEs. The results were based on a large sample size; the Cohen's kappa, for example, was based on 6,622 decisions over 30 stations. In addition, we investigated the convergent validity of standard setting methods for OSCEs using the stringent criteria of repeated measures ANOVA at the 95% confidence level and a Cohen's kappa value of at least 0.60.

CONCLUSION
The BL-R method was found to show the highest convergent validity evidence for OSCEs with the other methods used in the present study. We also found that mean method cluster analysis can be used for quality assurance of the borderline methods. These findings need to be further confirmed by replicating the present study in other settings and by investigating other validity and feasibility evidence of the different standard setting methods.

REFERENCES
1. Harden RM, Stevenson M, Downie WW, Wilson GM. Assessment of clinical competence using objective structured examination. British Medical Journal 1975;1:447.
2. Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination (OSCE). Medical Education 1979;13:41–54.
3. Newble D. Techniques for measuring clinical competence: Objective structured clinical examinations. Medical Education 2004;38:199–203.
4. Patricio M, Juliao M, Faraleira F, Young M, Norman G, Carneiro AV. A comprehensive checklist for reporting the use of OSCEs. Medical Teacher 2009;31:112–24.
5. Schoonheim-Klein M, Walmsley AD, Habets L, van der Velden U, Manogue M. An implementation strategy for introducing an OSCE into a dental school. European Journal of Dental Education 2005;9:143–9.
6. Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, Van der Vleuten C, Hoogstraten J, et al. On the reliability of a dental OSCE, using SEM: Effect of different days. European Journal of Dental Education 2008;12:131–7.
7. Newble DI. Eight years' experience with a structured clinical examination. Medical Education 1988;22:200–4.
8. Boursicot KA, Roberts TE, Pell G. Standard setting for clinical competence at graduation from medical school: A comparison of passing scores across five medical schools. Advances in Health Sciences Education 2006;11:173–83.
9. Carraccio C, Englander R. The objective structured clinical examination: A step in the direction of competency-based evaluation. Archives of Pediatrics & Adolescent Medicine 2000;154:736–41.
10. McKnight J, Rideout E, Brown B, Ciliska D, Patton D, Rankin J, et al. The objective structured clinical examination: An alternative approach to assessing student clinical performance. Journal of Nursing Education 1987;26:39–41.
11. Ross M, Carroll G, Knight J, Chamberlain M, Fothergill-Bourbonnais F, Linton J. Using the OSCE to measure clinical skills performance in nursing. Journal of Advanced Nursing 1988;13:45–56.
12. Bujack L, McMillan M, Dwyer J, Hazelton M. Assessing comprehensive nursing performance: the Objective Structured Clinical Assessment (OSCA) Part 1—Development of the assessment strategy. Nurse Education Today 1991;11:179–84.
13. Bujack L, McMillan M, Dwyer J, Hazelton M. Assessing comprehensive nursing performance: the Objective Structured Clinical Assessment (OSCA) Part 2—Report of the evaluation project. Nurse Education Today 1991;11:248–55.
14. Bramble K. Nurse practitioner education: Enhancing performance through the use of the objective structured clinical assessment. The Journal of Nursing Education 1994;33:59–65.
15. O'Neill A, McCall JM. Objectively assessing nursing practices: A curricular development. Nurse Education Today 1996;16:121–6.
16. Nicol M, Freeth D. Assessment of clinical skills: A new approach to an old problem. Nurse Education Today 1998;18:601–9.
17. Khattab AD, Rawlings B. Assessing nurse practitioner students using a modified objective structured clinical examination (OSCE). Nurse Education Today 2001;21:541–50.
18. Rushton P, Eggett D. Comparison of written and oral examinations in a baccalaureate medical-surgical nursing course. Journal of Professional Nursing 2003;19:142–8.
19. Cohen R, Reznick RK, Taylor BR, Provan J, Rothman A. Reliability and validity of the objective structured clinical examination in assessing surgical residents. The American Journal of Surgery 1990;160:302–5.
20. Sloan DA, Donnelly MB, Schwartz RW, Felts JL, Blue AV, Strodel WE. The use of the objective structured clinical examination (OSCE) for evaluation and instruction in graduate medical education. Journal of Surgical Research 1996;63:225–30.
21. Sloan DA, Donnelly MB, Schwartz RW, Vasconez HC, Plymale M, Kenady DE. Critical assessment of the head and neck clinical skills of general surgery residents. World Journal of Surgery 1998;22:229–35.
22. Wilkinson TJ, Newble DI, Frampton CM. Standard setting in an objective structured clinical examination: Use of global ratings of borderline performance to determine the passing score. Medical Education 2001;35:1043–9.
23. Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Medical Education 1988;22:325–34.
24. Swanson DB, Clauser BE, Case SM. Clinical skills assessment with standardized patients in high-stakes tests: A framework for thinking about score precision, equating, and security. Advances in Health Sciences Education 1999;4:67–106.
25. Norcini J, Boulet J. Methodological issues in the use of standardized patients for assessment. Teaching and Learning in Medicine 2003;15:293–7.
26. Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Medical Education 2011;45:1181–9.
27. Walsh M, Bailey PH, Koren I. Objective structured clinical evaluation of clinical competence: An integrative review. Journal of Advanced Nursing 2009;65:1584–95.
28. Sireci SG. Using cluster analysis to solve the problem of standard setting. Paper presented at: Annual Meeting of the American Psychological Association, Standard Setting: A Mixed Bag of Judgement, Psychometrics, and Policy symposium; August 14, 1995; New York, NY.
29. Cusimano MD. Standard setting in medical education. Academic Medicine 1996;71:S112–20.
30. Norcini JJ. Standards and reliability in evaluation: When rules of thumb don't apply. Academic Medicine 1999;74:1088–90.
31. Norcini JJ. Setting standards on educational tests. Medical Education 2003;37:464–9.
32. Wayne DB, Fudala MJ, Butter J, Siddall VJ, Feinglass J, Wade LD, et al. Comparison of two standard-setting methods for advanced cardiac life support training. Academic Medicine 2005;80:S63–6.
33. Cizek GJ, Bunch MB. Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage, 2007.
34. Downing SM, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teaching and Learning in Medicine 2006;18:50–7.
35. Jalili M, Hejri SM, Norcini JJ. Comparison of two methods of standard setting: The performance of the three-level Angoff method. Medical Education 2011;45:1199–208.
36. Shulruf B, Turner R, Poole P, Wilkinson T. The Objective Borderline Method (OBM): A probability-based model for setting up an objective pass/fail cut-off score in medical programme assessments. Advances in Health Sciences Education 2013;18:231–44.
37. Norcini JJ, Shea JA. The credibility and comparability of standards. Applied Measurement in Education 1997;10:39–59.
38. Kilminster S, Roberts T. Standard setting for OSCEs: Trial of borderline approach. Advances in Health Sciences Education 2004;9:201–9.
39. Boulet JR, De Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized patient examinations. Medical Teacher 2003;25:245–9.
40. Kaufman DM, Mann KV, Muijtjens AM, van der Vleuten CP. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Academic Medicine 2000;75:267–71.
41. Humphrey-Murto S, MacFadyen JC. Standard setting: A comparison of case-author and modified borderline-group methods in a small-scale OSCE. Academic Medicine 2002;77:729–32.
42. Kramer A, Muijtjens A, Jansen K, Dusman H, Tan L, van der Vleuten C. Comparison of a rational and an empirical standard setting procedure for an OSCE. Objective structured clinical examinations. Medical Education 2003;37:132–9.
43. Wood TJ, Humphrey-Murto SM, Norman GR. Standard setting in a small scale OSCE: A comparison of the modified borderline-group method and the borderline regression method. Advances in Health Sciences Education 2006;11:115–22.
44. Turnbull JM. What is … normative versus criterion-referenced assessment. Medical Teacher 1989;11:145–50.
45. Zieky MJ, Perie M, Livingston SA. Cutscores: A manual for setting standards of performance on educational and occupational tests. Lexington, KY: Educational Testing Service, 2008.
46. Sireci SG, Robin F. Using cluster analysis to facilitate standard setting. Applied Measurement in Education 1999;12:301–25.
47. Violato C, Marini A, Lee C. A validity study of expert judgement procedures for setting cutoff scores on high stakes credentialing examinations using cluster analysis. Evaluation & the Health Professions 2003;26:59–72.
48. Hess B, Subhiyah RG, Giordano C. Convergence between cluster analysis and the Angoff method for setting minimum passing scores on credentialing examinations. Evaluation & the Health Professions 2007;30:362–75.
49. Streiner DL, Norman GR. Health measurement scales: A practical guide to their development and use. New York, NY: Oxford University Press, 2008.
50. Pell G, Fuller R, Homer M, Roberts T. How to measure the quality of the OSCE: Review of metrics—AMEE guide no. 49. Medical Teacher 2010;32:802–11.
51. Schoonheim-Klein M, Muijtjens A, Habets L, Manogue M, van der Vleuten C, van der Velden U. Who will pass the dental OSCE? Comparison of the Angoff and the borderline regression standard setting methods. European Journal of Dental Education 2009;13:162–71.
52. Hobma SO, Ram PM, Muijtjens AM, Grol RP, van der Vleuten CP. Setting a standard for performance assessment of doctor–patient communication in general practice. Medical Education 2004;38:1244–52.
