

Use of Statistical Analysis of Cytologic Interpretation to Determine the Causes of Interobserver Disagreement and in Quality Improvement

Andrew A. Renshaw, M.D.
Kenneth R. Lee, M.D.
Scott R. Granter, M.D.

Division of Cytology, Departments of Pathology, Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts.

BACKGROUND. Disagreements in cytologic interpretation can have several causes, including differences in diagnostic threshold and diagnostic accuracy. These can be distinguished by a combination of statistical analyses.

METHODS. For demonstration purposes, a nonrandom collection of 80 cervicovaginal smears, the majority of which (74) were originally diagnosed as atypical cells of undetermined significance (ASCUS), were reviewed by 3 separate observers and classified as either negative, negative and reactive, ASCUS favor reactive, ASCUS not otherwise specified, ASCUS suggestive of a squamous intraepithelial lesion (SIL), low grade SIL, or high grade SIL. The results were compared with corresponding biopsies and analyzed with distribution analysis, the kappa statistic, threshold analysis, and receiver operating characteristic (ROC) curve analysis.

RESULTS. Distribution analysis of diagnoses from the three observers demonstrated statistically significant differences in how cases were classified and a low level of agreement. Kappa analysis confirmed very poor interobserver agreement. Threshold analysis revealed that one observer used a threshold between negative and ASCUS that was statistically more specific but less sensitive than the other observers. ROC curve analysis showed that another observer was more accurate than this observer.

CONCLUSIONS. Variation in cytologic interpretation may have several causes. Distribution, threshold, and ROC analysis allow distinction between differences in diagnostic accuracy and diagnostic thresholds. This approach to analyzing cytologic interpretation may be useful for quality improvement efforts. Cancer (Cancer Cytopathol) 1997;81:212-9. © 1997 American Cancer Society.

KEYWORDS: cytopathology, Papanicolaou smear, cervicovaginal, kappa statistic, receiver operating characteristic curves, statistics.

Address for reprints: Andrew Renshaw, M.D., Department of Pathology, Brigham & Women's Hospital, 75 Francis St., Boston, MA 02115.

Received April 17, 1997; revision received June 10, 1997; accepted June 12, 1997.

© 1997 American Cancer Society

The interpretation of cytologic specimens may vary between laboratories and between observers. For example, in cervicovaginal (Papanicolaou [Pap]) smears, the incidence of atypical squamous cells of undetermined significance (ASCUS) varies considerably between laboratories,1 as does the rate of squamous intraepithelial lesions (SIL) on subsequent biopsies.2-4 Although the interobserver agreement for Pap smears is good over the entire range of diagnoses,5 the interobserver agreement for ASCUS is poor.3,5-7 Although the causes of this variability currently are unknown, there are several different statistical methods available to analyze this problem. These include distribution analysis, kappa statistics, threshold analysis, and receiver operating characteristic (ROC) curve analysis.



TABLE 1
The Distribution of Diagnoses

                          Observer 1          Observer 2          Observer 3
Cytologic diagnosis(b)    Negative(a)  SIL    Negative(a)  SIL    Negative(a)  SIL
Negative                  3            1      4            0      9            4
N-rct                     8            0      3            1      11           10
A-rct                     14           10     5            3      6            7
Atypical                  16           12     19           16     10           3
A-sugsil                  4            4      15           8      9            5
LSIL                      4            3      2            3      3            3
HSIL                      0            1      0            1      0            0

SIL: squamous intraepithelial lesion; N-rct: negative with reactive changes; A-rct: atypical squamous cells of undetermined significance favor reactive changes; Atypical: atypical squamous cells of undetermined significance not otherwise specified; A-sugsil: atypical squamous cells of undetermined significance suggestive of squamous intraepithelial lesion; LSIL: low grade squamous intraepithelial lesion; HSIL: high grade squamous intraepithelial lesion.
(a) Biopsy diagnosis.
(b) Cytologic diagnosis.

Distribution analysis is performed by assigning each diagnostic category a numeric value. The resulting distribution of diagnoses usually will not be normal, and analysis of the distribution usually will require nonparametric methods. The Mann-Whitney U test is a commonly used method to compare medians of nonparametric distributions. Using this test, one can determine whether the median diagnosis of one observer is different from another, and whether one observer is more likely to diagnose a lesion with a greater degree of atypia compared with another observer.
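To make the distribution step concrete, a minimal sketch follows, assuming SciPy is available; the ordinal scale mirrors the 1 to 7 coding used later in the Methods, but the observer arrays are invented placeholders rather than the study data.

```python
# Distribution analysis of ordinal cytologic diagnoses with a Mann-Whitney U test.
# The diagnosis lists below are invented placeholders, not the study data.
from scipy.stats import mannwhitneyu

# Ordinal coding used in the paper: negative = 1 ... high grade SIL = 7.
SCALE = {"negative": 1, "negative-reactive": 2, "ascus-favor-reactive": 3,
         "ascus-nos": 4, "ascus-favor-sil": 5, "lsil": 6, "hsil": 7}

observer_1 = [SCALE[d] for d in ("ascus-nos", "lsil", "ascus-favor-sil",
                                 "ascus-nos", "ascus-favor-reactive")]
observer_3 = [SCALE[d] for d in ("negative", "ascus-favor-reactive",
                                 "negative-reactive", "ascus-nos", "negative")]

# Two-sided test of whether one observer tends to assign a higher degree of atypia.
stat, p_value = mannwhitneyu(observer_1, observer_3, alternative="two-sided")
print(f"U = {stat}, two-sided P = {p_value:.3f}")
```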
The most common method for examining the degree of variability between observers is with the kappa statistic.5,8,9 This statistic measures the amount of diagnostic agreement between several observers or between repeated observations by the same observer. Although the kappa statistic and a consensus diagnosis as a gold standard have been used as a measure of "accuracy,"5 this is not correct. Accuracy is a measure of how often a particular diagnosis agrees with a standard. That standard must be measured independently of the test that is being performed. Using a consensus diagnosis to compare diagnoses made by the same method as the consensus is a measure of the spread or variability of diagnoses, not accuracy. To illustrate this further, it is possible for most observers to agree on an answer (high kappa value) and still be wrong (low accuracy). However, when a consensus diagnosis is the gold standard and kappa analysis is the test, by definition, the consensus diagnosis cannot have a low accuracy.
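A minimal sketch of the kappa calculation follows, assuming scikit-learn; the ratings are invented, and only the pairwise (Cohen) form is shown, which is enough to illustrate that kappa quantifies agreement rather than accuracy.

```python
# Pairwise Cohen's kappa between two observers' categorical diagnoses.
# Ratings are hypothetical placeholders; assumes scikit-learn is installed.
from sklearn.metrics import cohen_kappa_score

observer_1 = ["ascus", "negative", "sil", "ascus", "ascus", "negative"]
observer_2 = ["ascus", "ascus",    "sil", "sil",   "ascus", "negative"]

# Kappa corrects raw percent agreement for the agreement expected by chance alone.
kappa = cohen_kappa_score(observer_1, observer_2)
print(f"Cohen's kappa = {kappa:.2f}")  # < 0.40 poor, > 0.60 good (the paper's cutoffs)
```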
Examining the sensitivity and specificity of diagnoses at particular thresholds (threshold analysis) is one of the most common methods of evaluating cytologic interpretations.10 The sensitivity and specificity will be affected directly by the point at which the low and high threshold for a diagnosis is set. A lower threshold will have a higher sensitivity, and a higher threshold will have a higher specificity. For example, assuming two observers are equally accurate and that one observer's criteria for ASCUS are more sensitive and less specific than the other, it can be concluded that this observer has a lower threshold for ASCUS.
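The dichotomization that underlies threshold analysis can be written out directly. The sketch below, with invented scores and biopsy outcomes, collapses the 1 to 7 ordinal diagnoses at a chosen cutoff and computes sensitivity and specificity against the biopsy result; sens_spec is a hypothetical helper, not part of the study's software.

```python
# Sensitivity and specificity of cytologic calls at a chosen threshold,
# using the binary biopsy outcome as the reference standard.
# Scores follow the paper's 1-7 ordinal scale; the data here are invented.

def sens_spec(scores, biopsy_sil, threshold):
    """Count a case as test-positive when its ordinal score >= threshold."""
    tp = sum(1 for s, d in zip(scores, biopsy_sil) if s >= threshold and d)
    fn = sum(1 for s, d in zip(scores, biopsy_sil) if s < threshold and d)
    tn = sum(1 for s, d in zip(scores, biopsy_sil) if s < threshold and not d)
    fp = sum(1 for s, d in zip(scores, biopsy_sil) if s >= threshold and not d)
    return tp / (tp + fn), tn / (tn + fp)

observer_scores = [1, 3, 4, 5, 6, 2, 4, 7]                          # hypothetical cytology scores
biopsy_sil = [False, False, True, True, True, False, False, True]   # SIL on biopsy?

# Threshold 3 = "ASCUS favor reactive or worse"; threshold 6 = "LSIL or worse".
for cutoff in (3, 6):
    sens, spec = sens_spec(observer_scores, biopsy_sil, cutoff)
    print(f"threshold >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is exactly the trade-off described above.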
In cytology practice there are often multiple thresholds, or diagnoses. Criteria that result in a higher sensitivity often result in a lower specificity. In this situation, with multiple thresholds and variable sensitivities and specificities, it can be difficult to determine whether one set of criteria is more accurate than another. To assess the overall accuracy of cytologic interpretation, ROC curve analysis is the method of choice.11-18 In brief, ROC curves are constructed by calculating the true-positive rate (sensitivity) and false-positive rate (1 - specificity) of different observers or tests at multiple thresholds. These different thresholds are plotted, and the points connected to give a continuous display of how sensitivity and specificity interact. The area under such a curve is a measure of the overall accuracy of an observer or test. ROC analysis has been used in nongynecologic19-21 and gynecologic cytology,22,23 histology,24 and interlaboratory settings.25 A limitation of ROC analysis is that although this method may be an excellent measure of overall accuracy, it does not determine the accuracy for an individual point or threshold along that curve. For example, it is possible for one observer to have a higher overall accuracy whereas another observer is more accurate at a particular threshold.
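For comparison, an empirical (trapezoidal) ROC sketch is shown below, assuming scikit-learn and invented data; the study itself fits smooth binormal curves with the CORROC2 program described in the Methods, so this is only a simplified stand-in.

```python
# Empirical ROC curve and area under the curve (AUC) from ordinal cytology
# scores, using the binary biopsy outcome as the gold standard.
# Data are hypothetical; the paper uses a maximum-likelihood binormal fit instead.
from sklearn.metrics import roc_auc_score, roc_curve

biopsy_sil      = [0, 0, 1, 1, 1, 0, 0, 1]   # 1 = SIL on biopsy
observer_scores = [1, 3, 4, 5, 6, 2, 4, 7]   # 1-7 ordinal cytologic scores

# Each distinct score acts as a candidate threshold along the curve.
fpr, tpr, thresholds = roc_curve(biopsy_sil, observer_scores)
auc = roc_auc_score(biopsy_sil, observer_scores)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th}: sensitivity {t:.2f}, 1 - specificity {f:.2f}")
print(f"area under the ROC curve (trapezoidal) = {auc:.2f}")  # 0.5 = chance
```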
To assess and illustrate the advantages of this combined statistical method to distinguish causes of diagnostic disagreement, the authors determined the sources of interobserver variability and differences in diagnostic accuracy using the diagnosis of ASCUS on Pap smears as a model.


FIGURE 1. Distribution of diagnoses for the three observers.

TABLE 2
Diagnoses of the Three Observers(a)

              Observer 1   Observer 2   Observer 3
Mean          3.7          4.1          3.1
Median        4.0          4.0          3.0
Variance      1.7          1.4          2.5

(a) negative = 1; negative with reactive changes = 2; atypical squamous cells of undetermined significance favor reactive changes = 3; atypical squamous cells of undetermined significance not otherwise specified = 4; atypical squamous cells of undetermined significance suggestive of a squamous intraepithelial lesion = 5; low grade squamous intraepithelial lesion = 6; high grade squamous intraepithelial lesion = 7.

METHODS
Case Selection and Review
Cases were retrieved from the files of the Cytology Division of the Department of Pathology, Brigham & Women's Hospital in Boston, Massachusetts. Eighty cervicovaginal smears with biopsy follow-up within 90 days of the smear were selected. Seventy-four were originally diagnosed as ASCUS, 2 as negative, 2 as low grade SIL (LSIL), and 2 as high grade SIL (HSIL). Forty-eight smears were found to be SIL on subsequent biopsy (31 LSIL and 17 HSIL) and 32 were negative. Cases were selected to be a mixture of those in which the original favored cytologic diagnosis (i.e., favor SIL or favor reactive) correlated with the corresponding biopsy and those in which it did not. This selection bias was used to magnify any differences in interobserver variability, diagnostic threshold, or diagnostic accuracy. When reviewing the cases, each author knew that the original diagnosis for most cases was ASCUS, that each had a subsequent biopsy that was either negative or SIL, and that the cases were selected to be difficult. However, the authors did not know the actual distribution of diagnoses, what was favored, or the biopsy results. All cases were diagnosed according to the Bethesda system.26 Biopsy reports were used as the gold standard; individual biopsies were not reviewed for this study.

All cases were examined by each author without clinical information or knowledge of the biopsy result. Cases were classified as negative, negative with reactive changes, ASCUS favor reactive changes, ASCUS not otherwise specified (NOS) (atypical), ASCUS suggestive of a SIL, LSIL, or HSIL. For statistical analysis, these categories were assigned a number on a scale from 1 to 7, respectively. In each case the original diagnosis from the biopsy was used as the gold standard. Patients with a biopsy diagnosis of either LSIL or HSIL were interpreted as having disease, whereas patients with other diagnoses were interpreted as not having disease.

Statistical Analysis
Distribution analysis was performed by assigning a numeric value (1 to 7) to each diagnosis (negative through HSIL, respectively) and performing a Mann-Whitney U test to determine whether the median diagnosis was different. Kappa analysis was performed using a one-tailed test model. Kappa values < 0.40 represented poor agreement, and values > 0.60 represented good agreement. Statistical analysis of diagnostic thresholds was performed using a two-tailed chi-square test. ROC analysis was performed using the CORROC2 program available from the Department of Radiology at the University of Chicago, written in part by Dr. Charles E. Metz and Helen B. Kronman.27 The program calculates a maximum likelihood estimate based on an effective pair of underlying bivariate-normal decision variable distributions. Statistical differences in accuracy are calculated using a two-tailed univariate Z score test of the differences between the areas under the two ROC curves.
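As a rough illustration of the threshold comparison step (not the CORROC2 workflow), the sketch below applies a two-tailed chi-square test to a 2 x 2 table of detected versus missed biopsy-proven SIL cases for two observers; the counts are invented, not taken from the study.

```python
# Two-tailed chi-square comparison of two observers' sensitivity at one
# diagnostic threshold (e.g., negative vs. ASCUS). Counts are hypothetical.
from scipy.stats import chi2_contingency

# Rows: observers; columns: SIL-on-biopsy cases called positive vs. missed.
table = [[45, 5],    # Observer A: 45 SIL cases called ASCUS or worse, 5 missed
         [30, 20]]   # Observer B: 30 SIL cases called ASCUS or worse, 20 missed

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, two-tailed P = {p_value:.4f}")
```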


FIGURE 2. Level of agreement between the three observers. SIL: squamous intraepithelial lesion.

TABLE 3
Comparison between the Three Observers' Diagnoses

              Observer 1 vs. 2   Observer 1 vs. 3   Observer 2 vs. 3
P value(a)    0.009              0.018              0.00008
Kappa value   0.37               0.21               0.15

(a) Two-tailed Mann-Whitney U test for differences in medians between the three observers' diagnoses.

TABLE 4
Sensitivity and Specificity for the Prediction of SIL on Biopsy of the Three Observers at Different Thresholds

              Observer 1     Observer 2     Observer 3
              Sens   Spec    Sens   Spec    Sens   Spec
ASCUS         97     23      97     15      56     42
SIL           16     92      13     96      9      94

SIL: squamous intraepithelial lesion; ASCUS: atypical squamous cells of undetermined significance; Sens: sensitivity (%); Spec: specificity (%).

RESULTS
The distribution of diagnoses for the three observers is shown in Figure 1 and Table 1. Statistical analysis of these responses is shown in Table 2. Because none of the distributions in Figure 1 was normal (normal distribution test or Z test), variance rather than standard deviations were reported. A two-tailed Mann-Whitney U test for the three observers (Table 3) showed that the median diagnosis of each was significantly different from the others. That is, Observer 3 was statistically more likely to diagnose a case as a lesser degree of atypia than either Observer 1 or 2, and Observer 2 was statistically more likely to diagnose a case with a higher degree of atypia than either Observer 1 or 3.

The percentage of cases in which all three, two of three, or none of the observers had the same cytologic diagnosis is shown in Figure 2. Results were stratified according to biopsy diagnoses. Complete agreement was obtained in only 11% of cases, and in only 2 cases that were originally diagnosed as ASCUS. Using all 7 categories and all 3 observers, the kappa value was 0.19. To increase the level of interobserver agreement, the diagnostic categories were reduced to only negative, ASCUS, and SIL; the resulting kappa values are shown in Table 3. The highest level of agreement (between Observers 1 and 2) had a kappa value of only 0.37, which was poor (Table 3).


FIGURE 3. Receiver operator characteristic curves for the three observers.

TABLE 5
Statistical Significance of the Differences in Sensitivity and Specificity at Two Thresholds for the Three Observers(a)

                        Observer 1 vs. 2   Observer 1 vs. 3   Observer 2 vs. 3
Sensitivity
  Negative vs. ASCUS    0.43               0.08               0.006
  ASCUS vs. SIL         0.67               0.67               0.67
Specificity
  Negative vs. ASCUS    1.0                0.0004             0.0004
  ASCUS vs. SIL         0.72               0.71               0.69

ASCUS: atypical squamous cells of undetermined significance; SIL: squamous intraepithelial lesion.
(a) Two-tailed P values.

Using only the three cytologic categories (negative, ASCUS, and SIL), the sensitivity and specificity for a diagnosis of SIL on biopsy for each observer with a diagnosis of ASCUS or SIL is shown in Table 4. These thresholds represent different points along the ROC curve as demonstrated in Figure 3. Specifically, when the threshold was set at ASCUS, all cytologic cases diagnosed as negative were interpreted as negative and all cytologic cases diagnosed as either ASCUS or SIL were interpreted as positive. In contrast, when the threshold was set at SIL, all negative and ASCUS cytology cases were interpreted as negative and only cytology cases interpreted as SIL were interpreted as positive. Analysis of the differences in these thresholds is shown in Table 5. Observer 2 was significantly (P < 0.05) more sensitive in predicting SIL on biopsy at the negative/ASCUS threshold than Observer 3, and Observer 1 was almost statistically more sensitive than Observer 3. Analysis of the differences in specificity at different thresholds showed that Observer 3 was significantly more specific for SIL on biopsy at the negative/ASCUS threshold than either Observer 1 or 2.

The ROC curves for the three observers are shown in Figure 3. A summary of the observers' overall accuracy is shown in Table 6. The significance of the differences in accuracy between each pair of curves is shown in Table 7. The accuracy of Observer 1 was statistically significantly higher than that of Observer 3.

DISCUSSION
This study illustrates the use of several different methods to examine interobserver variability and accuracy in the interpretation of cervicovaginal smears. Although cases were not randomly selected, they provided an excellent group of cases to demonstrate the usefulness of these various methods in determining the causes of interobserver variability. A valid interpretation of ASCUS variability awaits application of these methods of analysis to a larger, unbiased, and random sample,28-32 which the authors currently are trying to assemble.


TABLE 6
Accuracy(a) for the Three Observers

                     Observer 1   Observer 2   Observer 3
Accuracy             0.66         0.59         0.51
Standard deviation   0.06         0.07         0.07

(a) Accuracy as determined by area under the curve.

TABLE 7
Statistical Significance of the Differences Between the ROC Curves(a)

           Observer 1 vs. 2   Observer 1 vs. 3   Observer 2 vs. 3
P value    0.24               0.017              0.12

ROC: receiver operating characteristic.
(a) Two-tailed P value.
The current data illustrate the fact that poor interobserver agreement may have several causes. The distribution analysis data demonstrate that each of the three observers categorized these cases differently. Observer 3 was statistically more likely to categorize a case as less atypical than the other observers, and Observer 2 was more likely to categorize a case as more atypical than the other observers. The percentage of cases with complete agreement among all three observers was low, and kappa analysis confirmed that the interobserver agreement was poor, even after reducing the number of categories from seven to three. Thus, as shown by others,5,7 the level of interobserver agreement for the diagnosis of ASCUS in cervicovaginal smears was poor. However, kappa analysis could not determine the cause of the variability. This requires threshold and ROC curve analysis.

Despite the poor interobserver agreement, the diagnostic thresholds set by Observers 1 and 2 (as observed in Tables 4, 5, and 6) were very similar. In contrast, Observer 3 was using a statistically different threshold than the other two observers. This clearly was a source of some of the interobserver variability in this study. However, it was not possible from the threshold data alone to determine whether there were differences in diagnostic accuracy. In this situation, ROC analysis may provide an answer. In this study, ROC analysis confirmed that all three observers were more accurate at classifying the cases than chance alone (area = 0.50). However, Observer 1 (and not Observer 2) was significantly more accurate at classifying the cases than Observer 3. ROC analysis combined with threshold analysis demonstrates that the thresholds used by Observers 2 and 3 represent a trade-off between sensitivity and specificity, whereas the thresholds established by Observer 1 were more accurate than those of Observer 3. Thus, in this study, poor interobserver agreement was the result of differences in both diagnostic thresholds as well as in diagnostic accuracy.
Several additional observations can be reached from this analysis. First, although interobserver agreement is desirable, it is not necessary for the interpretations of different observers to be relatively accurate. In this study all three observers were more accurate than chance (admittedly a very low level of overall accuracy), but interobserver agreement was very low. Similarly, others have shown that it is possible, although admittedly uncommon, to have good diagnostic reproducibility and accuracy without as good interobserver agreement.33 For example, consider ten observers examining ten cases and each diagnosing nine correctly and one incorrectly, but the case that is diagnosed incorrectly is different for each observer. The test is repeated and the exact same results are achieved. In this scenario, there is poor interobserver agreement (no case has concordance of all 10 observers; overall concordance, 80%), yet each observer was 90% accurate and 100% reproducible. Although this scenario is somewhat unlikely, it serves to illustrate that accuracy, reproducibility, and interobserver agreement are not synonymous. The results of kappa analysis may be misleading if the causes of the variability are not investigated. Of course, interobserver agreement is necessary to establish criteria for diagnosis, may have important medicolegal implications, and is desirable for a laboratory in light of the many quality control efforts that rely on review of cases by more than one observer. However, interobserver agreement is not necessary or sufficient for diagnostic accuracy.
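The arithmetic in the ten-observer scenario above can be checked with a short script. The sketch below simply encodes the described pattern (observer i misses only case i, identically on a repeat round) and reports the resulting accuracy, reproducibility, and pairwise concordance; it is an illustration of that thought experiment, not an analysis of study data.

```python
# Illustration of the ten-observer thought experiment described above:
# each observer misses exactly one case, and a different case for each observer.
from itertools import combinations

n = 10
# answers[i][j] is True when observer i calls case j correctly.
answers = [[j != i for j in range(n)] for i in range(n)]
repeat = [row[:] for row in answers]  # a second, identical round of review

accuracy = [sum(row) / n for row in answers]                   # 0.9 for every observer
reproducible = all(answers[i] == repeat[i] for i in range(n))  # True: 100% reproducible

# Pairwise concordance per case: fraction of observer pairs giving the same call.
pairs = list(combinations(range(n), 2))                        # 45 pairs
concordance = [sum(answers[a][j] == answers[b][j] for a, b in pairs) / len(pairs)
               for j in range(n)]

print(f"per-observer accuracy: {accuracy[0]:.0%}")              # 90%
print(f"fully reproducible: {reproducible}")                    # True
print(f"pairwise concordance per case: {concordance[0]:.0%}")   # 80%, never unanimous
```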


In addition, simple analysis of the distribution of diagnoses can aid in understanding the source of disagreement determined by kappa analysis. The distribution of diagnoses clearly showed that Observer 2 was more likely to call a smear more atypical, and Observer 3 was more likely to diagnose a smear as less atypical than the other two observers. The results of this type of analysis delineate in what way an observer should change his or her interpretation to more closely agree with others. If attaining a higher level of interobserver reproducibility is part of a laboratory's goals, these results may be very useful in quality improvement efforts.

Although one is able to determine the thresholds and accuracy of different observers from these data, one is not able to determine which set of diagnostic criteria is optimum or most clinically useful. The diagnostic usefulness of a particular diagnosis depends on the situation. For example, although having a high accuracy is desirable, it may not always be clinically useful. An observer may be more accurate overall because he or she is very good at distinguishing ASCUS favor reactive from ASCUS NOS, but this higher accuracy may not be useful to the clinician. Similarly, whether a threshold with a higher sensitivity is more useful than one with a higher specificity generally depends on whether the test is being used as a screening or a diagnostic test. In addition, it is possible that an observer with a lower overall accuracy may be more diagnostically useful than an observer with a higher accuracy due to differences in diagnostic thresholds. For example, because of the uncertainties in patient management with a diagnosis of ASCUS, some clinicians may find an observer who is more willing to diagnose borderline specimens as either negative or SIL more useful than one who is more accurate but diagnoses many cases as ASCUS. Nevertheless, the authors believe selecting the most diagnostically useful criteria is much easier when the differences between diagnostic thresholds and accuracy are clearly defined. This allows one to determine the value of several diagnostic criteria more effectively and objectively and select those that are most appropriate.

Finally, although ROC curve analysis is clearly useful, the main disadvantage is that there must be a gold standard. Fortunately, biopsy provides a reasonable gold standard for most cytologic studies and often is available. Although biopsy has limitations related to both sampling and interpretation, especially in cervicovaginal smears,34-39 inherent errors using it as the gold standard should be random and thus should not affect comparisons of accuracy between observers examining large samples.

In conclusion, this study demonstrated that a complete understanding of the results of cytologic interpretation can come only from a comprehensive statistical analysis. Kappa analysis can determine the level of interobserver agreement, but distribution analysis allows one to determine in what way different observers disagree. Threshold analysis is independent of interobserver agreement, but provides only a limited measure of accuracy. Overall accuracy can best be determined using ROC analysis. Poor interobserver agreement can be the result of either differences in diagnostic thresholds or diagnostic accuracy. If one wishes to improve one's performance, then it is important to know in what way it is deficient; otherwise, efforts may be misdirected. These different statistical methods can be used for quality improvement in cytologic interpretation, and provide an objective measurement in determining overall diagnostic usefulness.

REFERENCES
1. Davey DD, Naryshkin S, Nielsen ML, Kline TS. Atypical squamous cells of undetermined significance: interlaboratory comparison and quality assurance monitors. Diagn Cytopathol 1994;11:390-6.
2. Williams ML, Rimm DL, Pedigo MA, Frable WJ. Atypical squamous cells of undetermined significance: correlative histologic and follow-up studies from an academic center. Diagn Cytopathol 1997;16:1-7.
3. Howell LP, Davis RL. Follow-up of Papanicolaou smears diagnosed as atypical squamous cells of undetermined significance. Diagn Cytopathol 1996;14:20-4.
4. Kaye KS, Dhurandhar NR. Atypical cells of undetermined significance: follow-up biopsy and Pap smear findings. Am J Clin Pathol 1993;99:332.
5. Cocchi V, Carretti D, Fanti S, Baldazzi P, Casotti MT, Piazza R, et al. Intralaboratory quality assurance in cervical/vaginal cytology: evaluation of intercytologist diagnostic reproducibility. Diagn Cytopathol 1997;16:87-92.
6. Sidaway MK, Tabbara SO. Reactive change and atypical squamous cells of undetermined significance in Papanicolaou smears: a cytohistologic correlation. Diagn Cytopathol 1993;9:423-9.
7. Sherman ME, Schiffman MH, Lorincz AT, Manos MM, Scott DR, Kurman RJ, et al. Toward objective quality assurance in cervical cytopathology. Correlation of cytopathologic diagnoses with detection of high-risk human papillomavirus types. Am J Clin Pathol 1994;102:182-7.
8. Epstein JI, Grignon DJ, Humphrey PA, McNeal JE, Sesterhenn IA, Troncoso P, et al. Interobserver reproducibility in the diagnosis of prostatic intraepithelial neoplasia. Am J Surg Pathol 1995;19:873-86.
9. Allam CK, Bostwick DG, Hayes JA, Upton MP, Wade GG, Domanowski GF, et al. Interobserver variability in the diagnosis of high grade prostatic intraepithelial neoplasia and adenocarcinoma. Mod Pathol 1996;9:742-51.
10. Raab SS, Isacson C, Layfield LJ, Lenel JC, Slagel DD, Thomas PA. Atypical glandular cells of undetermined significance. Am J Clin Pathol 1995;104:574-82.
11. Beck JR, Shultz EK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.
12. Raab SS. Diagnostic accuracy in cytopathology. Diagn Cytopathol 1994;10:68-75.
13. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
14. McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making 1984;4:137-50.
15. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989;29:307-35.
16. Metz CE, Kronman HB. Statistical significance tests for binormal ROC curves. J Math Psych 1980;22:218-43.
17. Dorfman DD, Alf E. Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals: rating-method data. J Math Psych 1969;6:487-96.
18. Giard RWM, Hermans J. Interpretation of diagnostic cytology with likelihood ratios. Arch Pathol Lab Med 1990;114:852-4.
19. Raab SS, Slagel DD, Jensen CS, Teague MW, Savell VH, Ozkutlu D, et al. Transitional cell carcinoma: cytologic criteria to improve diagnostic accuracy. Mod Pathol 1996;9:225-31.
20. Cohen MB, Rodgers C, Hales MS, Gonzales JM, Ljung BME, Beckstead JH, et al. Influence of training and experience in fine needle aspiration biopsy of the breast. Arch Pathol Lab Med 1987;111:518-20.


21. Raab SS, Thomas PA, Lenel JC, Bottles K, Fitzsimmons KM, Zaleski MS, et al. Pathology and probability. Am J Clin Pathol 1995;103:588-93.
22. Bacus JW, Wiley EL, Galbraith W, Marshall PN, Wilbanks GD, Weinstein RS. Malignant cell detection and cervical cancer screening. Anal Quant Cytol Histol 1984;6:121-30.
23. Raab SS, Snider TE, Potts SA, McDaniel HL, Robinson RA, Nelson DL, et al. Atypical glandular cells of undetermined significance. Diagnostic accuracy and interobserver variability using select cytologic criteria. Am J Clin Pathol 1997;107:299-307.
24. Langley FA, Buckley CH, Tasker M. The use of ROC curves in histopathologic decision making. Anal Quant Cytol Histol 1985;7:167-73.
25. Hermann GA, Herrera N, Sugiura HT. Comparison of interlaboratory survey data in terms of receiver operating characteristic (ROC) indices. J Nucl Med 1982;23:525-31.
26. The Bethesda Committee. The Bethesda system for reporting cervical/vaginal cytologic diagnoses. Acta Cytol 1993;37:115-24.
27. Metz CE, Wang PL, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. In: DeConinck F, editor. Information processing in medical imaging. The Hague: Martinus Nijhoff, 1984:432-45.
28. Chhieng DC, Taylor J, Schmee J, McKenna BJ. Cytologic criteria for subclassification of ASCUS improve correlation with biopsy outcome [abstract]. Mod Pathol 1997;10:32A.
29. Ettler HC, Downing P, Wright VC, Joseph MG. Atypical squamous cells of undetermined significance: a cytohistologic study in a colposcopy unit [abstract]. Mod Pathol 1997;10:33A.
30. Flynn C, Pitman M. Cytohistological correlation of subclassified ASCUS Pap smears [abstract]. Mod Pathol 1997;10:33A.
31. Collins LC, Wang HH, Abu-Jawdeh GM. Qualifiers of atypical squamous cells of undetermined significance help in patient management. Mod Pathol 1996;9:677-81.
32. Kline MJ, Davey DD. Atypical squamous cells of undetermined significance qualified: a follow-up study. Diagn Cytopathol 1996;14:380-4.
33. Cramer SF. Interobserver variability in surgical pathology. In: Weinstein RS, editor. Advances in pathology and laboratory medicine. Vol. 9. St. Louis: C.V. Mosby, 1996:3-82.
34. Rohr R. Quality assurance in gynecologic cytology. Am J Clin Pathol 1990;94:754-8.
35. Dodd LG, Sneige N, Villarreal Y, Fanning CV, Staerkel GA, Caraway NP, et al. Quality-assurance study of simultaneously sampled, non-correlating cervical cytology and biopsies. Diagn Cytopathol 1993;9:138-44.
36. Cramer H, Schlenk E. An analysis of discrepancies between the cervical cytologic diagnosis and subsequent histopathologic diagnosis in 1260 cases [abstract]. Acta Cytol 1994;38:812.
37. Tritz DM, Weeks JA, Spires SE, Sattich M, Banks H, Cibull ML, et al. Etiologies for non-correlating cervical cytologies and biopsies. Am J Clin Pathol 1995;103:594-7.
38. Joste NE, Crum CP, Cibas ES. Cytologic/histologic correlation for quality control in cervicovaginal cytology. Am J Clin Pathol 1995;103:32-4.
39. Jones BA, Novis DA. Cervical biopsy-cytology correlation. Arch Pathol Lab Med 1996;120:523-31.
