
Chemometrics and Intelligent Laboratory Systems 80 (2006) 24 – 38

www.elsevier.com/locate/chemolab

Receiver operating characteristics curves and related decision measures: A tutorial

Christopher D. Brown a,*, Herbert T. Davis b

a Ahura Corporation, 46 Jonspin Road, Wilmington, MA, 01887, United States
b 223 Mission Ridge, Corrales, NM, 87048, United States

Received 5 February 2005; received in revised form 12 May 2005; accepted 20 May 2005
Available online 12 July 2005

Abstract

Chemometric and statistical tools for data reduction and analysis abound, but the end objective of most analytical undertakings is to make
informed decisions based on the data. Decision theory provides some highly instructive and intuitive tools to bridge the gap between data and
optimal decisions. This tutorial provides a user-centric introduction to receiver operator characteristic curves, and related measures such as
predictive values, likelihood ratios, and cost curves. Important considerations for choosing between these tools are discussed, as well as the
primary methods for determining confidence intervals on the various measures. Numerous worked examples illustrate the calculations, their
interpretation and potential drawbacks.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Receiver operator characteristic; ROC curve; Classification; Likelihood ratio; Decision theory; Bayesian; Cost; Limit of detection

1. Introduction

The mathematical and statistical methods of chemometrics are rather like sterile surgical instruments: they are often shiny, sometimes oddly shaped, and as a collection, can be very valuable in the gathering of information, but in the end the cutting, diagnosis and treatment is left to the experts. This is because existing chemometric tools are focused on efficiently distilling information, but the decisions made based on this information are highly domain-specific and context-dependent; they depend on the available choices, the quality of the information guiding the decision, and the possible outcomes. There are, however, several perspicuous techniques from decision theory that can guide the chemometric surgeon by quantifying and graphically representing decision-dependencies and probable outcomes. The methods reviewed in this tutorial, then, share at least this attribute: they are ultimately concerned with the context and consequences of decisions from data, rather than the data itself.

A great many chemical decisions pertain to dichotomous conditions. A protein marker is present or not, the structure of a molecule is X or Y, the reaction obeys first-order kinetics or second, one should terminate the reaction or let it progress. The information available to aid in these decisions is rarely perfect in chemical problems, so no matter how assiduous the chemist, the ultimate result is she may be right, or she may be wrong.

Data themselves are impotent. It is not until data are interpreted against a set of rules that a decision is made, and action will be taken, usually under the assumption that the decision is correct. When the data are ambiguous, decisions may well be erroneous, and the ramifications of this error define the "loss" incurred. Decision theory is specifically concerned with how choices are, or should be, made under conditions of uncertainty such that loss is minimized. Pure parameter estimation, in contrast, is detached from decision-based loss, as an erroneously estimated value may not actually result in an incorrect decision.

* Corresponding author. Tel.: +1 505 797 7106. E-mail address: [email protected] (C.D. Brown).

0169-7439/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2005.05.004

How does one decide on a suitable rule for a decision process? In published accounts of dichotomous decisions (yes/no, stop/start) based on continuous variables (e.g., temperature, concentration, density, probability), one commonly sees the classification rate – the proportion of correct decisions to total decisions – as a measure of the goodness of the rule. There are two critical flaws with this goodness measure; namely, it considers all incorrect decisions to be equally hazardous, and treats all outcomes as equally likely. These assumptions are oftentimes inappropriate in practical applications, and even if they are, there are a number of auxiliary measures of decision performance that can enhance insight.

In this tutorial we have aspired to provide an overview of the practical aspects of some decision theory measures, including receiver operator characteristic (ROC) curves, area under the ROC curve, and related measures such as positive/negative predictive values, likelihood ratios, and cost function analysis. Where possible, the theoretical motivation of these measures is briefly discussed, but theoretical depth was inevitably sacrificed in favor of our primary goal, which was to provide the interpretation of the measures and critical instructions for practical use. There are several other excellent introductory references on this topic (many of them in the medical literature, e.g., [1–3]), and the monographs by Green and Swets [4] and Pepe [5] cover many aspects in greater depth and breadth. We have also aspired in the text to provide readable references for the topics as they are introduced herein.

2. Classification rate, correct positive and correct negative fractions

2.1. Introduction to the measures

The root of a dichotomous decision process is a threshold (t)-based rule on a continuous variable, y, that will drive the decision, D, as positive or negative according to

D = + if y ≥ t, − if y < t.   (1)

For instance y might be a scalar instrumental measurement (e.g., pH, counts), a concentration estimate from a multivariate calibration model, or a probability from a logistic or discriminant analysis model. In the decision vernacular the "positive" label is conventionally assigned to the decision that results in the most drastic action (e.g., critical process fault), but the decision could represent any dichotomy: YES/NO, A/B, etc. Nevertheless, for clarity and simplicity of terminology we will continue to use the positive/negative demarcation. For convenience, we will also assume that large values of y are more indicative of positive decisions.

Assume that y does indeed have some ability to adequately discriminate between positive and negative events; the distributions cartooned in Fig. 1 illustrate one such scenario. There are an infinite number of possible decision thresholds, and we have labeled three possibilities in the figure. At threshold t1, calling positive all events with y ≥ t1 would correctly identify nearly every positive event, although a large proportion of negative events would inappropriately be called positive. At threshold t2 more of a balance is struck, as both positive and negative events are missed, and finally at t3 most negative events are correctly identified, but a large proportion of the positive events are erroneously deemed negative as well.

At a candidate threshold, ti, the outcomes of the decision process (D) over n trials can be evaluated against a reference (R), which for the time being we take to be "truth" (determined by hindsight, design, or other means). Four possible outcomes can result for each trial: the decision can be correctly positive (CP), correctly negative (CN), incorrectly positive (IP) or incorrectly negative (IN). (These are also often called true positive, true negative, false positive and false negative, respectively.) A contingency table (sometimes called a confusion matrix) like that in Table 1 is often used to tabulate the outcomes. The cells CN, CP, IN, IP represent the number of trials that resulted in a particular outcome, so CN + CP + IN + IP = n. If the decision procedure was flawless, all n trials would be correctly categorized as positive or negative and IP = IN = 0. More realistically, some trials will result in incorrect decisions IP or IN. The classification rate, CR, discussed briefly above, is simply

CR = (CN + CP)/n,   (2)

the number of correct trial outcomes out of the total number of trials. We will continue to use the term classification rate because it is so widely used in the literature, but as time is not involved, some fields of research think it is more appropriately termed a fraction. The new terms that follow will therefore be referred to as fractions.

Fig. 1. An illustration of the occurrence of positive and negative events when ordered by the continuous variable y. [Figure: overlapping distributions of negative and positive events along y, with candidate decision thresholds t1, t2, t3 marked.]
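The decision rule of Eq. (1) and the classification rate of Eq. (2) can be expressed in a few lines of code. The paper's appendix program is in MATLAB; the following Python sketch uses hypothetical trial data, invented purely for illustration:

```python
def decide(y, t):
    """Eq. (1): call the event positive if y >= t, negative otherwise."""
    return '+' if y >= t else '-'

def classification_rate(decisions, references):
    """Eq. (2): CR = (CN + CP)/n, the fraction of decisions agreeing with the reference."""
    return sum(d == r for d, r in zip(decisions, references)) / len(decisions)

# Hypothetical trials: (observed y, true reference)
trials = [(0.9, '+'), (0.7, '+'), (0.4, '-'), (0.6, '-'), (0.2, '-')]
t = 0.5
decisions = [decide(y, t) for y, _ in trials]
references = [r for _, r in trials]
print(classification_rate(decisions, references))  # 4 of 5 correct -> 0.8
```

Note that the single misclassified trial here (y = 0.6 for a truly negative event) is an IP outcome; CR alone does not distinguish it from an IN.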

Table 1
A 2 × 2 contingency table for threshold t

                     Reference
                     −            +
Decision   −         CN           IN
           +         IP           CP
                     CN+IP = n−   IN+CP = n+    n

A deeper assessment of the decision process is given by the calculation of the correct negative fraction (CNF) and correct positive fraction (CPF):¹

CNF = CN/(CN + IP) = CN/n−   (3)

CPF = CP/(CP + IN) = CP/n+.   (4)

¹ Synonyms for CNF and CPF abound. CPF is variously called sensitivity or true positive rate (clinical use), hit rate and recall (signal detection theory, machine learning), while synonyms for CNF include specificity and true negative rate (clinical use), and (1 − CNF) is often termed false-positive rate or false-alarm rate (signal detection theory, machine learning). In statistical hypothesis testing, CNF corresponds to 1 − α (one minus the significance level), and CPF is the statistical power, 1 − β.

Individually, the CNF and CPF values express the fraction of truly negative events that were correctly deemed negative, and the fraction of truly positive events that were correctly deemed positive. Consequently, CPF and CNF do not depend on the actual numbers of positive or negative events in the trials. The CR, in contrast, is dependent on these quantities, as some simple substitutions reveal:

CR = p+ CPF + p− CNF.   (5)

Here p+ and p− are the fractions of positive and negative events observed in the trials:

p+ = n+/n = (CP + IN)/n
p− = n−/n = (CN + IP)/n.   (6)

(And it is implied that p+ = 1 − p−, and vice versa.) Provided the positive and negative trials used to determine CNF and CPF were a representative sampling of the respective event distributions (that is, the trials didn't involve only unusually easy or unusually difficult cases), then CNF and CPF also have a probabilistic interpretation: CPF estimates the probability that the decision will be positive if the reference is truly positive, and CNF estimates the probability that the decision will be negative if the reference is truly negative.

E(CNF) = Pr(D = − | R = −)   (7)

E(CPF) = Pr(D = + | R = +).   (8)

Due to its dependence on the proportion of positive/negative events, the CR only has a probabilistic interpretation for the population if p+ and p− are representative of the population parameters p+ = Pr(R = +) and p− = Pr(R = −). If this holds, CR is an estimate of the probability that any given decision will be correct:

E(CR) = Pr(D = R).   (9)

This places a rather obvious limitation on classification rate. It is only meaningful as an estimator of the probability of a correct decision if the proportions of positive/negative events are reasonable, a contingency that often is not noted in literature reports of classification rates. Indeed, in many focused studies the positive/negative proportions are intentionally skewed towards 0.5 so that a reasonable number of both positive and negative cases can be investigated. If the population parameters, p, are known from other sources (e.g., literature, other studies) they should be substituted for the sample estimates in Eq. (5).

Two scenarios shown in Fig. 2 illustrate calculations of Eqs. (2)–(6).

Fig. 2. Two scenarios illustrating the calculation of various decision measures.

Scenario A (n = 100):
                     Reference
                     −     +
Decision   −         72    4
           +         16    8

CR = (72+8)/100 = 0.80; CPF = 8/(4+8) = 0.67; CNF = 72/(72+16) = 0.82; p+ = (4+8)/100 = 0.12; p− = (72+16)/100 = 0.88

Scenario B (n = 100):
                     Reference
                     −     +
Decision   −         86    12
           +         2     0

CR = (86+0)/100 = 0.86; CPF = 0/(0+12) = 0; CNF = 86/(86+2) = 0.98; p+ = (12+0)/100 = 0.12; p− = (86+2)/100 = 0.88

2.2. Univariate confidence intervals

The CR, CPF, CNF, and p are all proportions, and hence all have binomial sampling distributions, and several standard sources such as Numerical Recipes [6] provide the necessary details for exact interval calculation. Fig. 3 graphically presents exact confidence intervals for proportions of 0.75 and 0.5 at a variety of sample sizes.

Fig. 3. Illustration of exact binomial 95% confidence intervals for true proportions A) 0.5, and B) 0.75 at a variety of sample sizes (n = 5 to 100). The true proportion is indicated by the diamond, and the extent of the interval is conveyed by the double-headed arrow.

Otherwise, we provide two well-known, but approximate, expressions for confidence intervals on proportions.

If the number of events is reasonably large, the binomial distribution can be approximated by the normal distribution, and the following large-sample formula for the upper (sU) and lower (sL) confidence bounds on an estimated proportion (ŝ) can be applied:

sU, sL = ŝ ± [ z_{α/2} √(ŝ(1 − ŝ)/n) + 1/(2n) ]   (10)

where z_{α/2} is the value of the normal deviate encompassing 1 − α of the normal curve (1.645, 1.960, 2.241, 2.576 for 90, 95, 97.5, and 99% confidence, respectively), and n is the number in the denominator of the proportion. Snedecor and Cochran [7] advocate for the continuity correction term 1/(2n), which generally makes the normal approximation more accurate. This formula can be extremely erratic and inaccurate if nŝ(1 − ŝ) is small (less than 10).

The Wilson interval [8], though more daunting formulaically, has much better statistical properties, and is reasonable for any nŝ(1 − ŝ):

sU, sL = [ n/(n + z²_{α/2}) ] · [ ŝ + z²_{α/2}/(2n) ± z_{α/2} √( ŝ(1 − ŝ)/n + z²_{α/2}/(4n²) ) ].   (11)

For the example in Fig. 2A, 95% confidence intervals using these two approaches are

Measure   Point estimate   Eq. (10)       Eq. (11)
CR        0.80             [0.72, 0.88]   [0.71, 0.87]
CPF       0.67             [0.36, 0.98]   [0.39, 0.86]
CNF       0.82             [0.73, 0.91]   [0.73, 0.89]
p+        0.12             [0.05, 0.19]   [0.07, 0.20]

The approach in Eq. (10) tends to provide overly wide interval estimates, and in some cases (e.g., CPF) dramatically so. We refer the reader to a recent and very comprehensive reference [9] for a detailed overview and discussion of the various binomial confidence interval estimation approximations.

2.3. Joint confidence intervals

As the accuracy of the decision is fully described by the pair CPF and CNF, it is often more appropriate to report their joint confidence interval. As CPF and CNF are calculated from independent trials (positive and negative, respectively), they are statistically independent. Elliptical joint confidence regions could certainly be derived conditional on distributional assumptions, but these are excessively complex for the scope of this article. Instead, we report a simple method discussed in Pepe [5], and originally proposed by Hilgers [10], for the construction of distribution-free rectangular coverage regions.

Since each univariate interval has (1 − α) coverage, the rectangular confidence region only covers (1 − α) × (1 − α). For 95% univariate intervals, for example, the rectangular area would only provide 90% coverage. To define a rectangular region which truly covers (1 − α), therefore, the individual univariate confidence intervals are simply determined with √(1 − α) confidence. Returning to the example in Fig. 2A, the rectangular confidence interval for the CPF, CNF pair suggests that with 95% confidence the CPF is between 0.36 and 0.88 and the CNF is between 0.71 and 0.89.

3. ROC curves, and area under the curve

3.1. Receiver operator characteristic curves

The receiver operating characteristic (ROC) curve was introduced in World War II military radar operations as a means to characterize the operators' ability to correctly identify friendly or hostile aircraft based on a radar signal. The loss incurred if a hostile aircraft is deemed friendly by mistake could be catastrophic, but at the same time military aircraft could not be sent to intercept an overwhelming number of benign vessels. The ROC curve was devised as a graphical means to explore the trade-offs between these competing losses at various decision thresholds when a particular quantitative variable, y, is used to guide the decision.

The simplicity and usefulness of the ROC approach was recognized shortly thereafter for signal detection studies in psychophysics [11], and it was a major focus of the 1964 monograph by Green and Swets [4] and Egan's text of 1975 [12]. Swets and Pickett's 1982 publication on the evaluation of diagnostic systems [13] precipitated a flood of applications and theoretical advances in medical decision theory that has not yet subsided. The method seems to have slowly diffused into clinical chemistry applications [14–16] and, somewhat surprisingly, only recently into the machine learning community [17]. The ROC curve, by all indications, remains essentially unused in non-clinical analytical and chemometric applications. Whether the under-utilization is due to a lack of familiarity, a lack of readily available software, or both is presently unclear.

The CPF and CNF we discussed above are characteristics of the threshold used to make the decision, as well as of the intrinsic confusion in the data itself at that threshold. Clearly a different threshold could be chosen, and CPF and CNF can again be determined. Over many candidate thresholds, a table of CPF's and CNF's is assembled:

t1      t2      t3      ...    tk
CPF1    CPF2    CPF3    ...    CPFk
CNF1    CNF2    CNF3    ...    CNFk

The classic ROC curve is generated by plotting the CPFi on the vertical axis, and 1 − CNFi on the horizontal axis, leading to a summary graph like that shown in Fig. 4.

Fig. 4. Generation of the ROC curve by evaluating the CPF and CNF at various decision thresholds on y. [Figure: left panel, the ROC curve (CPF vs. 1 − CNF) through thresholds t1–t5, with the diagonal of a useless decision process shown dotted; right panel, the positive and negative event distributions on the continuous variable y with the five thresholds marked.]
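The threshold sweep that generates the points of Fig. 4 can be sketched directly; the y-values below are hypothetical, invented for illustration. The final assertion checks that the points depend only on the ordering of y, so a strictly increasing transformation (here, exponentiation) leaves them unchanged:

```python
from math import exp

def roc_points(y_neg, y_pos, thresholds):
    """(1 - CNF, CPF) pairs for each candidate threshold, per Eqs. (3)-(4)."""
    pts = []
    for t in thresholds:
        cpf = sum(y >= t for y in y_pos) / len(y_pos)   # correct positive fraction
        cnf = sum(y < t for y in y_neg) / len(y_neg)    # correct negative fraction
        pts.append((round(1 - cnf, 6), round(cpf, 6)))
    return pts

y_neg = [0.1, 0.2, 0.35, 0.4, 0.6]   # hypothetical negative-event y values
y_pos = [0.3, 0.55, 0.7, 0.8, 0.9]   # hypothetical positive-event y values
ts = sorted(y_neg + y_pos)           # a threshold at each observed value

curve = roc_points(y_neg, y_pos, ts)
print(curve[0], curve[-1])  # (1.0, 1.0) (0.0, 0.2): all-positive to nearly all-negative

# Invariance to a strictly increasing transformation of y:
transformed = roc_points([exp(y) for y in y_neg], [exp(y) for y in y_pos],
                         [exp(t) for t in ts])
assert transformed == curve
```

Sweeping the threshold from below the smallest y to above the largest traverses the curve from the unilateral all-positive vertex (1, 1) toward the all-negative vertex (0, 0), mirroring the t1 to t5 progression in Fig. 4.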

An ideal decision variable would have a ROC curve that passed through CPF = 1 and CNF = 1, which would correspond to a point in the top left corner of the ROC axes (indicated by a star in the figure). The top right (t1: CPF = 1, CNF = 0) and bottom left (t5: CPF = 0, CNF = 1) vertices represent extreme decision thresholds under which every trial is unilaterally deemed positive, or negative; that is, there is really no y-based decision to speak of. If the table above is ordered by threshold value (either in increasing or decreasing order) then CPF and 1 − CNF will both be monotonic increasing or monotonic decreasing, regardless of the shape of the marginal distributions of events. Therefore the derivative at any point on the ROC curve must be greater than or equal to zero, which highly restricts the form the ROC curve can take.

The ROC is also invariant to strictly increasing transformations of y, which makes it a convenient representation insensitive to scale. For example, since a positive decision implies y ≥ t, CPF can also be written as

E(CPF) = Pr(y ≥ t | R = +).   (12)

This statement still holds true under any monotonically increasing function f:

E(CPF) = Pr(f(y) ≥ f(t) | R = +).   (13)

It also results in another advantage to be discussed further below.

A dotted line is also drawn through the diagonal of the ROC axes in Fig. 4, traversing between the unilateral decision thresholds at t1 and t5. This is the line of chance. Decision processes that result in a CPF/CNF pair that falls on this line are no better than a proverbial coin-flip. To illustrate, a decision process with CPF = 0.2 and 1 − CNF = 0.2 implies that out of 100 positive events we will correctly detect 20, but also incorrectly call 20 out of 100 negative events positive, so a positive decision is no more suggestive of a truly positive event than of a truly negative event.

The ROC curve, then, provides a very simple graphical view of the trade-space that is possible for a given decision variable, and of the general discriminating power of the variable for detecting positive/negative events independent of the event rates. The classification rate at a certain decision threshold, by contrast, can often appear exceptionally good or exceptionally bad, but if viewed as a CPF, CNF pair on the ROC curve it will be immediately apparent whether it corresponds to chance detection, or is capable of satisfying an objective. The ROC curve is also free of parametric assumptions. The marginal distributions of y for the positive and negative events can take any form (symmetric/asymmetric, unimodal/multimodal) and the ROC curve obediently follows as a simple descriptor of the decision space.

In applications where detecting positive events with a specified probability is critical (such as screening processes which are intended to trigger follow-on investigations) the analyst can very rapidly examine the ROC curve for a decision process and assess the minimum incorrect positive fraction that will satisfy the required CPF threshold. Frequently the exact choice of where to operate the decision process depends on secondary factors, such as the cost or risk associated with missing positive events or incorrectly identifying negative events as positive, a matter better suited for the application of cost functions, which are discussed in Section 5.1.

Although we have used the term ROC curve above, the tabulation of CPF's and CNF's merely provides a series of points on the ROC axes. More points can be generated by evaluating the CPF and CNF at more finely sampled decision thresholds, but if the number of positive or negative events is small in a data set under study, there will only be a small number of "allowable" CPF's or CNF's. For example, with only 5 positive events in a study, the points on the y-axis of the ROC curve will be limited to the discrete values 0/5, 1/5, 2/5, 3/5, 4/5, 5/5. A smooth approximation to these points is sometimes desirable.

The literature reflects a host of both formal (parametric) and informal (non-parametric) approaches for fitting the ROC curve. It is particularly challenging to fit an ROC curve with a one-size-fits-all function because the marginal distributions of the positive and negative events on y dictate the shape of the ROC curve, and hence the appropriate model.

In some instances the y distributions of the positive and negative events are near-normal, or normalizable by a monotonic transformation (e.g., logarithm), and the so-called binormal ROC model [5] can be used. As noted above, the ROC is invariant to monotonic transformation, so the choice of normalizing transformation will not affect the shape of the ROC curve. Although the binormal fitting method is not particularly complicated, it can appear somewhat obtuse. In short, the critical parameters for fitting the binormal model are the (optionally transformed) distribution means (μ+, μ−) and standard deviations (σ+, σ−), or estimators thereof, and the fitted ROC curve as a function of threshold value t is obtained from manipulations of the standard normal cumulative distribution using these parameters:

ROC(t) = Φ(a + b Φ⁻¹(t)).   (14)

Φ is the standard normal cumulative distribution function, and a and b are defined as

a = (μ̂+ − μ̂−)/σ̂+      b = σ̂−/σ̂+.   (15)

We provide a short MATLAB (The MathWorks, Natick, MA) program in the appendix for estimating the binormal ROC curve by this method, and two examples of fitting using this approach are given in Fig. 5.

The reader should proceed cautiously with parametric ROC curve fitting methods, for while the points on the ROC curve are free of distributional assumptions, fitting methods can be quite reliant on distributional assumptions. In general, smooth approximations to the measured CPF, CNF pairs are of aesthetic or interpretational value only, so there is often little impetus to expend much effort in the venture.

Fig. 5. Two examples of a smooth fit to the ROC curve estimated using the binormal model. AUC is the area under the fitted curve, discussed later in Section 3.2. [Figure: two fitted curves, A (AUC = 0.84) and B (AUC = 0.74), plotted as CPF against 1 − CNF.]

3.2. The area under the ROC curve

A single point on the ROC curve represents the characteristics of a decision process at a fixed threshold, and the ROC curve defines the operating characteristics that are achievable based on y, but what can be said about the discriminating power of the continuous variable in general? The area under the ROC curve (AUC) is frequently used as just such a summary measure. The perfect ROC curve, which traverses the point CPF = CNF = 1, has an AUC of 1, while a test with an ROC curve along the line of "uselessness" has an AUC of 0.5. Hanley and McNeil [18] showed that, although the AUC seems like a crude summary measure, it actually has a probabilistic interpretation. It is equal to the probability that the y values for a randomly selected pair of positive and negative events will be correctly ordered, which is rather fantastically termed the probability of stochastic domination in non-parametric statistics. The AUC, then, is the probability of stochastic domination of the y-values for positive events over negative events, and due to this condensation of information, the reader should note that very differently shaped ROC curves can have the same AUC. Pepe [5] prefers a more general interpretation of the AUC as the average CPF over the entire range of possible CNF's.

An interesting parallel meaning for the AUC arises if we, for the time being, ignore the ROC curve entirely, and think just about the distributions of y for positive and negative events. At the most elementary level, one might want to know whether the distributions of positive and negative events had means that were statistically distinguishable. If the distributions were normal, a two-sample t-test would be a logical starting point to answer this question. However, if the normal assumption is tenuous, non-parametric methods are preferred, and the non-parametric analog to the t-test is the Wilcoxon rank-sum test, or synonymously the Mann–Whitney U test [7], which is precisely a non-parametric test for stochastic domination.

Assume the following y-values were recorded for a series of trials with known reference:

y values (R = −): 0.213  0.153  1.21  −0.110  0.001  0.524  0.847  0.046
y values (R = +): 1.42  0.775  0.966  0.412  1.22  0.856  0.210  0.735  1.18

To perform the U test we first order the y values jointly from lowest to highest:

rank:    1       2      3      4      5      6      7      8
value:  −0.110  0.001  0.046  0.153  0.210  0.213  0.412  0.524
group:   (−)    (−)    (−)    (−)    (+)    (−)    (+)    (−)

rank:    9      10     11     12     13     14     15     16     17
value:   0.735  0.775  0.847  0.856  0.966  1.18   1.21   1.22   1.42
group:   (+)    (+)    (−)    (+)    (+)    (+)    (−)    (+)    (+)

and sum the ranks assigned to the values in each group:

W− = (−) rank sum: 1 + 2 + 3 + 4 + 6 + 8 + 11 + 15 = 50

W+ = (+) rank sum: 5 + 7 + 9 + 10 + 12 + 13 + 14 + 16 + 17 = 103.

The Mann–Whitney U statistic is calculated from the group with the smallest rank-sum, in this case the negative group:

U = n+ n− + n−(n− + 1)/2 − W−
  = 9 · 8 + 8(8 + 1)/2 − 50 = 58   (16)

where nx is the number of positive or negative events. What is the U statistic telling us? Had all y values been perfectly ordered in correspondence with negative and positive events, the negative rank-sum would have been 36, and U would have been 72. This would also, as discussed above, correspond to a perfect ROC curve with an AUC of 1. In fact, if the Mann–Whitney U is normalized to the maximum U that can be observed under "perfect ranking", it is exactly the AUC, as first noted by Bamber [19]:

AUC = U/(n+ n−).   (17)

Alternatively, if the data conform to the binormal model of Eq. (14), the AUC is estimable directly as

AUC = Φ( (μ̂+ − μ̂−) / √(σ̂+² + σ̂−²) ).   (18)

This method of approximating the AUC is also included in the aforementioned MATLAB program in the appendix, and the AUC's for the binormal-estimated ROC's in Fig. 5 are inset in the figure.

Simple trapezoidal integration under the CPF, CNF pairs is also an option. The trapezoid rule uniformly underestimates the true definite integral, but with reasonably small segment widths it is often quite sufficient, and it is certainly very easy to implement. As we proceed to confidence interval estimation, however, it will be evident that the rank-sum approach offers some convenient advantages.

3.3. Confidence intervals

3.3.1. Confidence intervals on a fitted ROC curve

As one might have guessed from the expressed difficulties of ROC curve fitting, confidence intervals on such curves are even less definable. In general the practitioner is faced with a smorgasbord of literature approximations too varied to enumerate here. In linear regression the confidence interval on the regression line is much narrower than the standard error on the individual points used to construct it (assuming the number of observations is well in excess of the number of parameters); accordingly, the confidence interval on the ROC curve must be narrower than the interval estimates on the individual CPF, CNF pairs. One can simply plot the points on the ROC curve with their associated joint confidence intervals (as we have done in an example in Section 6.1 below) and take this as a bounding condition. Interval expressions do exist for ROC curves generated from the binormal model, with some further complicating assumptions, but we refer the reader to more complete discussions of this approach in Ref. [5].

Beyond this, the only generally applicable course of action is to determine the confidence intervals on the ROC curve empirically by bootstrap resampling [20], although it is not even transparent what should be bounded by the bootstrap samples, since the ROC lacks a general parametric form. Some researchers have merely presented bounds within which (1 − α) of the bootstrapped ROC curves fell. Due to the described difficulties in ROC curve fitting, however, it is quite rare to see confidence intervals presented for a fitted ROC curve for anything but a binormal scenario. It is much more typical to see the uncertainty in the ROC curve cast in terms of the uncertainty in the AUC, which we discuss next.

3.3.2. Confidence intervals on the AUC

The correspondence between the AUC and the Mann–Whitney U statistic is a tremendous advantage for confidence interval estimation. Hanley and McNeil [18] suggested that the standard error of the AUC can be conservatively approximated as

σ²AUC ≈ [ AUC(1 − AUC) + (n+ − 1)(q1 − AUC²) + (n− − 1)(q2 − AUC²) ] / (n+ n−)   (19)

where

q1 = AUC/(2 − AUC)
q2 = 2AUC²/(1 + AUC).   (20)

Therefore, under the normal assumption, the 1 − α confidence interval for the AUC is

AUCL, AUCU = AUC ± z_{α/2} √(σ²AUC)   (21)

where z_{α/2} is the typical standard normal deviate, and σ²AUC is determined from Eq. (19).

To reiterate, the Hanley/McNeil method is conservative, providing confidence intervals that are excessively wide. If the binormal model has been used to estimate the AUC, expressions for its standard error are again available [5]. Bootstrap resampling remains an option for any of these quantities.
C.D. Brown, H.T. Davis / Chemometrics and Intelligent Laboratory Systems 80 (2006) 24 – 38 31
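For readers who wish to script these calculations, the pair-counting route to the AUC and the Hanley–McNeil interval is compact. The sketch below is in Python rather than the MATLAB of the appendix, and the function names are ours; it implements Eqs. (17) and (19)–(21) as written above.

```python
import math

def mann_whitney_auc(neg, pos):
    """AUC as the normalized Mann-Whitney U (Eq. 17): the fraction of
    (positive, negative) response pairs in which the positive response
    ranks higher; ties contribute 1/2."""
    u = sum(1.0 if yp > yn else 0.5 if yp == yn else 0.0
            for yp in pos for yn in neg)
    return u / (len(pos) * len(neg))

def hanley_mcneil_interval(auc, n_pos, n_neg, z=1.96):
    """Conservative 1-alpha confidence interval for the AUC per
    Eqs. (19)-(21), clipped to the attainable [0, 1] range."""
    q1 = auc / (2.0 - auc)                   # Eq. (20)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1.0) * (q1 - auc ** 2)
           + (n_neg - 1.0) * (q2 - auc ** 2)) / (n_pos * n_neg)  # Eq. (19)
    half = z * math.sqrt(var)
    return max(0.0, auc - half), min(1.0, auc + half)            # Eq. (21)
```

With perfectly separated groups the pair count equals n+·n− and the AUC is exactly 1, mirroring the "perfect ranking" discussion above.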

4. Predictive measures: positive and negative predictive values; likelihood ratios

4.1. Positive and negative predictive values

To this point the discussion has focused on methods derived from correct positive fractions and correct negative fractions, which are estimators of probabilities of decisions conditioned on events: if a positive event has occurred, what is the probability that I will make the correct (positive) decision? If a negative event has occurred, what is the probability that I will make the correct (negative) decision? Although these probabilities are informative, they are often not the probabilities that are truly of interest to the investigator. In practice, one only has the data, and not the truth, and the more relevant questions are: if I've made a positive decision, what is the probability that the event is truly positive? If I've made a negative decision, what is the probability that the event is truly negative? In short, they are questions of trepidation—I'm about to make a certain decision, how likely am I to be right/wrong? Probabilities of this form are often termed positive and negative predictive values in the literature, and are contrasted to CPF and CNF below:

       Pr(D|R)                      Pr(R|D)
  CPF: Pr(D = + | R = +)    PPV: Pr(R = + | D = +)
  CNF: Pr(D = − | R = −)    NPV: Pr(R = − | D = −)

The derivation of the expressions for PPV and NPV is a straightforward application of Bayes theorem to the CPF and CNF using the prior probabilities of positive and negative events (p+, p−), which results in the formulae

  PPV = p+·CPF / (p+·CPF + p−·(1 − CNF))    (22)

  NPV = p−·CNF / (p−·CNF + p+·(1 − CPF)).    (23)

Rearrangement provides alternate expressions in terms of the quantities in the contingency Table 1:

  PPV = CP/(CP + IP)    (24)

  NPV = CN/(CN + IN).    (25)

Naturally the PPV and NPV also provide the probabilities of erroneous decisions, since 1 − PPV is the probability that a positive decision will be incorrect, and 1 − NPV is the probability that a negative decision will be incorrect.

It is critical for the reader to recognize that the PPV and NPV depend on the proportion of positive and negative events, while the CPF and CNF do not, which is both their advantage and detriment. To illustrate the advantage, assume a decision process yields CPF = CNF = 0.75, a characteristic which does not depend on the proportion of positive/negative events. The PPV and NPV at various different proportions of positive events are

  p+    0.05   0.20   0.35   0.50   0.65   0.80   0.95
  PPV   0.136  0.429  0.618  0.750  0.848  0.923  0.983
  NPV   0.983  0.923  0.848  0.750  0.618  0.429  0.136

Even if the CPF of a decision process is 0.75, if positive events occur relatively rarely (1 in 20, p+ = 0.05) then a positive decision is really only correct 13.6% of the time. Interpreted differently, a positive decision means there is a 13.6% chance that a positive event has really occurred (assuming p+ = 0.05). A negative decision, however, implies that a negative event is extremely likely (NPV = 98.3%). The PPV and NPV, therefore, reflect the intrinsic power of a positive and negative decision—a negative decision, in this case, is extremely emphatic, while a positive decision should obviously be regarded with some skepticism. Consequently, the PPV and NPV are often much more useful to the end-user of a decision process than CPF and CNF. For instance, in medical screening/diagnosis the PPV and NPV are the critical parameters to the clinician.

To the detriment of the PPV and NPV, like the classification rate, the dependence of the PPV and NPV on positive/negative event rates makes them rather meaningless study-specific values unless the event rates p+, p− are very similar to those expected in real-world deployment of the decision process. As noted above, focus-studies, for example, are often designed such that a sufficient number of both positive and negative events can be examined in a manageably small number of trials, so the event rates tend to be balanced near 0.5; Mother Nature is rarely so even-handed. However, exploiting the fact that CPF and CNF are independent of event-rates, Eqs. (22) and (23) can still be used to estimate the actual PPV and NPV in various general event-rate situations. All that is required is substitution of p+/p− with the event-rates of interest, for example, literature, expert or population values. This procedure is sometimes used in case-control clinical studies where the study disease rate is greatly elevated relative to the prevalence in the population at large. The CPF and CNF as estimated from the case-control study can be used with population statistics to estimate the PPV and NPV of the procedure if it were to be broadly deployed.

Therefore, the utility of the PPV and NPV depends heavily on the study and application context. When the prior probabilities of positive and negative events are known to be accurate, the PPV and NPV are much more relevant for prediction than the CPF and CNF. If the event rates are unknown, or cannot be defined—for example, it is difficult to estimate the prior probability of a terrorist strike, or the
event rates in a previously unstudied population—they have very limited utility.

Although not an ROC curve in the literature vernacular, PPV can be plotted against (1 − NPV) at various decision thresholds, which, provided the event-rates are appropriate, can be a more informative prediction view of the decision space than the ROC curve alone can provide.

4.2. Likelihood ratios (Bayes factors)

Over the past two decades, likelihood ratios (synonymous with Bayes factors) have gained in popularity in the medical literature for describing the value of a clinical decision process [21,22]. A likelihood ratio is simply the ratio of probabilities of a specific occurrence under different hypotheses. For a dichotomous decision process the alternative hypotheses are that a positive event has truly occurred, or a negative event has truly occurred. The likelihood ratio for a positive decision is therefore

  LR+ = Pr(D = + | R = +)/Pr(D = + | R = −) = CPF/(1 − CNF)    (26)

and similarly, the likelihood ratio for a negative decision is

  LR− = Pr(D = − | R = +)/Pr(D = − | R = −) = (1 − CPF)/CNF.    (27)

The scale for likelihood ratios is (0, ∞), and an LR = 1 indicates that the decision is powerless. A perfect decision process would have CPF = 1 and CNF = 1, and hence LR+ = ∞ and LR− = 0.

The increasing attention paid to likelihood ratios is mostly due to their simple interpretation: if a decision is positive, a positive event is LR+ times more likely than a negative event. Therefore, LR's of 1 indicate uninformative decision processes (CPF = 1 − CNF, which is always a point on the line of uselessness in the ROC plot). Mathematically, LR's apply to odds (O), rather than probabilities (Pr), but the two are easily inter-converted as

  O = Pr/(1 − Pr),    Pr = O/(1 + O).    (28)

The pre-decision odds of a positive event is

  O+_pre = p+/(1 − p+).    (29)

After the data, y, arrives, a positive or negative decision will be made according to the decision rule. If this decision rule has characteristics LR+, the post-decision odds of a positive event given a positive decision is

  O+_post = LR+ · O+_pre.    (30)

Another advantage of likelihood ratios is that they do not depend on the event-rates, and yet they are very similar to the PPV and NPV in terms of providing evidence for prediction. As an example, consider a test with CPF = 0.60 and CNF = 0.95. The positive likelihood ratio is 12.00, which means that, given a positive decision, the true event is 12.00 times more likely to be positive than negative. If the prior odds of a positive event were 0.25 (p+ = 0.2), the posterior odds of a positive event given a positive decision are 3, which is a posterior probability of 0.75 (3/(1 + 3)). Note that if instead the prior odds of a positive event were 2 (p+ = 0.667), the posterior odds given a positive decision would now be 24 with the same positive likelihood ratio of 12.00.

As an interesting aside, if the distributional data follow a binormal model, the LR+ can be interpreted as the slope of the line tangential to the ROC curve at a given CPF–CNF pair.

4.3. Confidence intervals

PPV and NPV are proportions, and therefore their confidence intervals can be determined exactly from the binomial distribution, or approximately using Eqs. (10) or (11). Likelihood ratios, being the ratio of two proportions, are not so simple. The most common approach is to work with the logarithm of the likelihood ratio, and assume asymptotic normality [23]. The procedure is as follows. First estimate the variance of the log-likelihood ratios as

  σ²_ln(LR+) = (1 − CPF)/CP + CNF/IP    (31)

  σ²_ln(LR−) = (1 − CNF)/CN + CPF/IN.    (32)

(Quantities in the denominators are again those from Table 1). The confidence interval for the likelihood ratio is then given by

  [LR·exp(−z_(α/2)·√(σ²_ln(LR))),  LR·exp(+z_(α/2)·√(σ²_ln(LR)))].    (33)

(The same expression holds for both LR+ and LR−, with the appropriate replacement of terms).

5. Other considerations

5.1. Cost functions

In the introduction we discussed the notion that different decision errors, i.e., incorrect negatives, incorrect positives, could result in very different losses. The ROC curve and related measures do not however quantify the losses; they merely characterize the probability of the various potential errors of the decision process. In many instances the loss incurred can be reduced to monetary terms, and the most instructive characterization of the available trade-offs in a decision operating space is achieved by considering monetary characteristics of decision processes. Such cost functions allow one to assess the decision space relative to expected cost—the average expense per trial—or expected
margin—the expected loss or savings per trial when a decision process is used, relative to the expected cost if no decision process is used.

The expected cost function depends on the rate of positive/negative events (p+) and as many as 5 cost terms:

  N     cost of making the decision
  N_CP  cost of a correct positive decision
  N_CN  cost of a correct negative decision
  N_IN  cost of an incorrect negative decision
  N_IP  cost of an incorrect positive decision

Normally these costs are expressed on a per trial basis. The expected cost of a decision process operating with characteristics CPF and CNF is

  E(Cost)_decision = N + [N_CP·CPF·p+ + N_CN·CNF·p−] + [N_IN·(1 − CPF)·p+ + N_IP·(1 − CNF)·p−]    (34)

The square brackets above simply collect cost terms that result from correct versus incorrect decisions. If the decision process was not employed at all (N = 0) and all events were assumed negative (CNF = 1, CPF = 0), the expected cost would simply be

  E(Cost)_no decision = N_IN·p+ + N_CN·p−.    (35)

The expected margin is

  M = (E(Cost)_no decision / E(Cost)_decision − 1)·100%    (36)

where positive margins indicate that the decision is providing a cost-benefit, and negative margins indicate that the decision is actually losing money.

Oftentimes in cost function analysis the terms associated with correct decisions are ignored (the cost is assumed to be zero), which in effect presumes that the cost of the decision (N) and the costs of the correct decisions (N_CP, N_CN) have no role in defining the operating characteristics of the decision process. This is certainly true for some circumstances, but not all. For example, few automobile owners would invest in a highly accurate oil-dipstick that costs $500 per use, as any purported improvement in accuracy would likely not justify the cost. Additionally, our presentation here suggests that the costs of certain outcomes are fixed regardless of CPF, and CNF. In practice this may not be the case, as overhead and infrastructural costs may increase or decrease depending on the number of positive/negative decisions being made.

An example of cost function analysis is presented in Fig. 6, and the ROC curve for the candidate decision process is indicated by the black squares in Fig. 6A. Assume the question regards the implementation of a rapid spectroscopic analysis method that could be run on each lot of manufactured material. It has been estimated that the amortized cost of the QC method per test will be $10 (N). If the method correctly identifies faulty batches, the added cost in downtime is $15 (N_CP), while correct negatives (no fault) incur no added cost (N_CN = 0). If batches are incorrectly deemed faulty, the incremental cost of identifying the false-positive is $50 (N_IP). Critically, if faulty batches are deemed normal, the flaw is often not caught until the customer has received shipment, and the cost in this situation is estimated to be $250 (N_IN). As it is a rather finicky manufacturing process, the expected abundance of faulty batches is quite high: about 0.2 (p+). The question is, is it worth implementing this spectroscopic method? If so, at which decision characteristics?

In Fig. 6B we have plotted the classification rate as a function of CPF, which suggests the maximum classification rate is achieved around CPF = 0.43, although one could apparently operate anywhere between CPF = 0 and CPF = 0.7 and achieve roughly a CR of 0.8 owing to the relatively infrequent occurrence of faulty batches. Fig. 6C presents quite a different view of the decision space, where we have plotted expected margin (%) versus the same CPF axis for direct comparison to Fig. 6B. The same information could be discerned from the isomargin contours we have overlaid

Fig. 6. Cost analysis of candidate decision processes. A. The ROC curve for the method, overlaid on iso-margin contours. B. The classification rate of the
method versus CPF, showing a maximum at CPF = 0.44, and C. The margin versus CPF, suggesting an optimum at CPF = 0.89.
on the ROC curve, but this is a somewhat simpler way to view it. Fig. 6C indicates that it is possible to achieve an estimated 37.7% cost savings by using the spectroscopic method, and that the decision process should operate at a CPF of approximately 0.9 (CNF of about 0.68) to achieve optimal savings. If one had chosen the maximum CR as the threshold at which to operate the decision process, it would operate at a margin of only 11%. Worse still, it might be (incorrectly) inferred that since the optimum CR = 0.8, and the proportion of faulty batches is about 0.2, the decision process is barely improving on chance detection, and therefore can provide no cost-benefit whatsoever.

Cost curves, as we have portrayed them in the example above, are most relevant in large-scale applications of decision processes where the critical objective is to reduce cost. They are, however, equally relevant in many other applications in which the various outcomes are viewed to have very different consequences. For example, comparative risk values can be substituted for the monetary cost terms in Eq. (34) to stress, for example, the criticality of positive event detection for chemical/biological weapons or protein markers for immediately life-threatening disease. Conversely, comparative risk can be chosen to reflect the severe repercussions of incorrect positive results in forensic applications.

5.2. Imperfect references

It is sometimes difficult to perfectly define whether an event has or has not actually occurred. This situation often presents itself because of cost limitations, time constraints, or other practical limitations on experimentation. For example, in medical decision making, new screening/diagnostic procedures are often compared to the existing "gold-standard" methodology, which is often an imperfect "silver-standard" rather than gold. For example, the current standard for the diagnosis of diabetes is the oral glucose tolerance test, which has an estimated ROC AUC of only 0.82. The perfect reference would be longitudinal outcome (e.g., development of diabetic retinopathy, neuropathy, etc.) but in the early phases of evaluating a new methodology longitudinal studies, which often take many years to execute, are prohibitively expensive. Therefore, it is a practical reality that some decision processes must be evaluated against a reference that may be less than ideal. Nevertheless, the objective of the analyst remains the same: assess the true performance characteristics of the candidate decision process. Fortunately, methods exist for such objective determinations even if the reference is imperfect. If the characteristics (CPF, CNF) of the reference method are known, and the reference and candidate decision processes are independent, a solution is tractable and exact [24–26]. We give a synopsis of this solution below. Beyond this scenario, latent class analysis methods have found increasing favor [27,28]. Discrepant resolution is also commonly applied, but this approach produces biased estimates [29] of CPF and CNF and is increasingly recommended against [30,31].

Assume that a number of trials have been run, and a contingency table has been generated for the decision versus the (imperfect) reference condition at a particular decision threshold t. The observed CNF, CPF and p+ can be calculated directly from the contingency table (quantities which we denote by the subscript obs). If the reference method has known CPF_ref and CNF_ref characteristics, the following chain of formulae afford unbiased estimates of the true decision CPF_true and CNF_true, based on the observed CPF_obs and CNF_obs, and p+_obs. First, the true fraction of positive events must be estimated:

  p̂+ = (p+_obs − 1 + CNF_ref)/(CPF_ref − 1 + CNF_ref)    (37)

With this information, determine the following four quantities A–D:

  A = p̂+·CPF_ref / (p̂+·CPF_ref + (1 − p̂+)·(1 − CNF_ref))

  B = (1 − p̂+)·(1 − CNF_ref) / (p̂+·CPF_ref + (1 − p̂+)·(1 − CNF_ref))

  C = p̂+·(1 − CPF_ref) / (p̂+·(1 − CPF_ref) + (1 − p̂+)·CNF_ref)

  D = (1 − p̂+)·CNF_ref / (p̂+·(1 − CPF_ref) + (1 − p̂+)·CNF_ref).    (38)

Finally, using A–D and the CPF_obs and CNF_obs, estimates of the true performance of the decision process can be determined:

  CP̂F_true = (CPF_obs − B + B·CNF_obs/D − B·C/D) / (A − B·C/D)    (39)

  CN̂F_true = (CNF_obs − C·(1 − CP̂F_true)) / D.    (40)

The most common problem encountered with this approach is that the estimated p+, CPF_true, CNF_true can be less than 0, or greater than one. This is an unfortunate consequence of the independent binomial sampling variance in the reference and decision CPF, CNF.

5.3. Decision comparisons and combinations

We have omitted the important matter of statistical procedures for comparing the attributes of two candidate decision processes: comparisons of proportions, both paired and unpaired (e.g., CPF, CNF, PPV, NPV), comparisons of AUC's, and comparisons of LR's. There are also many situations in which independent decision processes can be combined to yield performance that exceeds that of the individual processes (commonly termed ensemble prediction or aggregation). The complexities and nuances of these comparisons are treated in Pepe [5] and Zhou et al. [32] with far more erudition than we could aspire to in this introduction.
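The chain of corrections in Eqs. (37)–(40) of Section 5.2 is mechanical enough to script directly. The sketch below is in Python rather than the MATLAB of the appendix, and the function and variable names are ours; a convenient self-check is to forward-simulate "observed" characteristics from known true ones and confirm the round trip recovers them.

```python
def true_performance(cpf_obs, cnf_obs, p_obs, cpf_ref, cnf_ref):
    """Correct observed decision characteristics for a known-imperfect
    reference (Eqs. 37-40). Assumes the reference and candidate decision
    processes are independent; with real (finite-sample) inputs, results
    can fall outside [0, 1], as noted in the text."""
    # Eq. (37): estimated true fraction of positive events
    p = (p_obs - 1.0 + cnf_ref) / (cpf_ref - 1.0 + cnf_ref)
    # Eq. (38): probabilities of the true state given the reference call
    denom_pos = p * cpf_ref + (1.0 - p) * (1.0 - cnf_ref)
    denom_neg = p * (1.0 - cpf_ref) + (1.0 - p) * cnf_ref
    a = p * cpf_ref / denom_pos
    b = (1.0 - p) * (1.0 - cnf_ref) / denom_pos
    c = p * (1.0 - cpf_ref) / denom_neg
    d = (1.0 - p) * cnf_ref / denom_neg
    # Eqs. (39)-(40): corrected decision characteristics
    cpf_true = (cpf_obs - b + b * cnf_obs / d - b * c / d) / (a - b * c / d)
    cnf_true = (cnf_obs - c * (1.0 - cpf_true)) / d
    return p, cpf_true, cnf_true
```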
6. Examples

6.1. Classification

A recent study assessed the performance of a prototype NIR spectroscopic classifier for indications of type-2 diabetes in a 145 subject case-control study, which is described in detail elsewhere [33]. The system had been calibrated in a separate study some months previous, so the objective of the case-control study was to validate the performance of the device. In typical case-control fashion, approximately equal numbers of diabetic and non-diabetic subjects participated (69 and 76, respectively), and repeat spectroscopic measurements were acquired for each subject. The output of the multivariate classifier was designed to be a quantitative indicator of the probability of type-2 diabetes.

The histograms of the classifier output for the diabetic (+) and non-diabetic (−) subject measurements are shown in the left of Fig. 7. The ROC curve shown in the right of Fig. 7 was determined from the subject-average measurements, and we have shown Wilson 95% joint coverage intervals for the individual CPF, CNF pairs using the method discussed in Section 2.2. (The ROC curve for the individual measurements is virtually identical, but has artificially narrow intervals because of the repeat measurements.) Again, we remind the reader that these coverage intervals cannot be construed as representing the confidence interval for the ROC curve, as it will be considerably narrower. The AUC and the confidence interval for the AUC (inset in the ROC figure) were calculated using the Mann–Whitney U statistic, again on the subject-average measurements.

The classification rate, if calculated from this study, would suggest an optimum of 0.77, at a CPF = 0.87 and CNF = 0.68. However the proportion of positive cases (0.48) is significantly elevated in this study relative to the population at large (~0.1), so the stated classification rate of 0.77 is inappropriate. If the population proportions are substituted for the study proportions in Eq. (5), using CPF = 0.87, CNF = 0.68, a realistic estimate of the classification rate in the population at large is 0.67. The designation of the threshold for the decision process does not end here, however. For the decision process to prove advantageous in large-scale testing for diabetes, a cost function analysis (see Section 5.1) is necessary to determine the operating point with optimum cost-benefit, and as with all candidate medical devices, it must prove efficacious in a realistic clinical setting.

6.2. Limit of detection

The limit of detection is an analytical figure of merit that has also historically been tethered to the assumptions of asymptotic normality. It is commonly defined as the analytical concentration which is three times greater than the standard deviation of concentration readings from a blank:

  LOD = 3σ_blank.    (41)

Three standard deviations on the standard normal variate corresponds to a false-positive probability of approximately 0.0013. More appropriately, analytical detection decisions should not only relate to false-positives, but also to incorrect negatives—failures to detect the substance when it is truly present—which has been discussed by several researchers in recent publications [34–36]. The ROC curve and the AUC are quite attractive non-parametric options for describing the detection characteristics of an analytical system when parametric assumptions such as normality and homogeneity of variance are not likely applicable. An ROC-type approach was very recently employed by Christesen for characterizing biological detection schemes [37], and a recent DARPA report [38] strongly advocates for ROC curve analysis of chemical devices for chemical/biological agent detection. Decision theoretic alternatives to the ROC for limit-of-detection applications have also recently been discussed [39].

A combined ROC and AUC method for characterizing detection decisions using non-parametric means is illustrated in Fig. 8 using simulated data. Behavior like that shown in Fig. 8 is common for enzymatic assays, and often does not meet the requirements of either normality or homogeneity of variance over the concentration range. For

Fig. 7. Illustration of ROC use for assessment of a NIR spectroscopy-based classifier for indications of diabetes mellitus (DM). Bars on the ROC points reflect
95% coverage for each CPF, CNF pair using Wilson intervals for the proportions.
this simulated protocol, 10 samples were prepared at each of the following 6 discrete concentration levels: 0, 2, 4, 6, 8, 10 µg/dL. The reference is therefore defined according to whether the analyte is present at non-zero concentration:

  R = − if c = 0;  R = + if c > 0    (42)

and the analytically reported concentration (y) at each threshold t was converted to a positive/negative decision in standard fashion as

  D = − if y < t;  D = + if y ≥ t.    (43)

Each of the ROC traces (which we have fit using the binormal model, with a logarithm transformation of y to account for the non-normality in the data) in Fig. 8 allows examination of the positive and negative detection characteristics at a fixed reference concentration level, the object being to determine the reference concentration level at which the new analytical method can reliably detect non-zero concentrations with reasonable power (CPF) and false-positives (1 − CNF). Manual inspection of these traces suggests that if "reasonable power" is determined to be 80% (CPF = 0.8), then 6 µg/dL is likely the minimum concentration that can be detected without excessive false-positives (5%, 1 − CNF = 0.05). The ROC points are quite choppy for such small sample sizes, however, so alternatively one can examine the AUC and its estimated confidence intervals, as we show in Fig. 9.

As an alternative to defining LOD in terms of type I and II errors, one could strive for a "superiority specification", for which the AUC is very well suited. For instance in Fig. 9 we have drawn a horizontal line at an AUC of 0.95, which, by virtue of its correspondence to the U test, implies that reference concentration levels with AUC's above this line ensure greater than 95% stochastic dominance—the method reading at a true reference concentration of "X" will exceed the reading from a blank greater than 95% of the time.

Fig. 9. AUC determined using the Mann–Whitney U procedure and its 95% confidence intervals, when using the ordinate concentration as a threshold of detection. The estimated AUC's using a binormal assumption are also shown.

Although the point estimate for the 6 µg/dL reference concentration exceeds 0.95, based on the interval estimates there is roughly a 50% chance that the true AUC is lower than 0.95. The 8 µg/dL level is a safer bet. This procedure isn't likely to be considered for standard adoption, but it is simply another illustration of how these tools can be employed to inform practical chemical decisions.

7. Summary

Receiver operating characteristic curves, and related decision theory measures, are rapidly becoming the standard approach for describing the merits of proposed dichotomous decision processes in many scientific fields; journals dealing with medical and clinical screening and diagnosis, for example, now strongly recommend ROC curve analysis where appropriate in their instructions to authors. As one reviewer noted, the fact that ROC curves have not been used in chemometrics may be simply due to the absence of ROC analysis options in commercial chemometric software. The simplicity of the

A. B.
12 1
reported concentration (µg/dL)

10
0.8
8

0.6
6
CPF

4
0.4

2 2 µg/dL
4 µg/dL
0.2
0 6 µg/dL

-2 0
0 2 4 6 8 10 0 0.2 0.4 0.6 0.8 1
reference concentration (µg/dL) 1-CNF

Fig. 8. Data from a hypothetical LOD experiment to evaluate operating characteristics of various candidate detection concentration thresholds. A. Predicted
versus reference concentrations for the LOD experiment. B. ROC curves for detection of various concentrations in the experiment.
analyses means that it can be done in Excel (or even by hand) with very little effort, so optimistically this tutorial will inspire some readers to consider decision theoretic measures even if packaged software is lacking. There are also many free or low-cost programs available on the internet for ROC analysis.

The value of the ROC curve is primarily its independence from the proportion of positive/negative events (which is not the case for the classification rate), and its transparent, graphical summary of the entire space of possible operating characteristics of a decision process in terms of the correct positive and negative fractions. The area under the ROC curve also makes for an extremely convenient non-parametric summary measure that is independent of a decision threshold. In predictive applications where the prior probabilities of positive/negative events can be accurately specified, users may find more value in measures such as the positive and negative predictive values or likelihood ratios, which are better suited to characterizing the probability of events given a specific decision. Correct and incorrect decisions often come with associated costs or losses, and the ROC can easily be converted into expected cost curves, which describe the total expected cost of operating at the various possible decision thresholds and minimize at the most cost-effective decision operating point. We provided several examples to illustrate the use of these measures, although there are certainly many more potential chemometric applications; characterizing the efficiency of outlier detection methods and model selection are two that come immediately to mind. Scientific intuition is usually relied on to bridge the gap between data and decision, and will always be indispensable for doing so, but hopefully this tutorial provides a practical introduction to some simple decision theoretic tools that can lend objective perspective to the data-to-decision process.

Acknowledgements

The authors would like to thank Veralight, Inc. (Albuquerque, NM) for consenting to publication of the non-invasive NIR diabetes classification data.
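The conversion from an ROC curve to an expected cost curve mentioned in the conclusion can be sketched numerically. The following Python example (not part of the original article; the operating points, prevalence, and cost values are hypothetical) weights each operating point's error fractions by their costs and the class priors, and locates the minimum-cost threshold:

```python
import numpy as np

def expected_cost(fpf, cpf, prevalence, cost_fp, cost_fn):
    """Expected cost per decision at each ROC operating point.

    Cost = C_FP * FPF * (1 - p) + C_FN * (1 - CPF) * p,
    where p is the prior probability (prevalence) of a positive event.
    """
    fpf = np.asarray(fpf, dtype=float)
    cpf = np.asarray(cpf, dtype=float)
    return cost_fp * fpf * (1.0 - prevalence) + cost_fn * (1.0 - cpf) * prevalence

# Hypothetical ROC operating points (FPF, CPF) and costs:
fpf = np.array([0.0, 0.1, 0.2, 0.5, 1.0])
cpf = np.array([0.0, 0.6, 0.8, 0.95, 1.0])
cost = expected_cost(fpf, cpf, prevalence=0.1, cost_fp=1.0, cost_fn=10.0)
best = np.argmin(cost)  # index of the most cost-effective operating point
```

With these illustrative numbers the curve minimizes at the third operating point: a high false-negative cost pushes the optimum toward more liberal thresholds, exactly the trade-off the expected cost curve makes visible.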

Appendix A. Matlab code for binormal receiver operator characteristic curve and AUC estimation

function [FPF,CPF,AUC] = binormROC(y,class)
% Function to compute binormal receiver operating characteristic curves.
% Requires the Matlab Statistics toolbox for the normal CDF and inverse CDF.
%
% Usage: [FPF,CPF,AUC] = binormROC(y,class);
%
% Inputs:
%   y     - nx1 vector of responses
%   class - nx1 membership vector of 0/1 values:
%           negatives taken to be 0's
%           positives taken to be 1's
%
% Outputs:
%   FPF - vector of fitted false positive fraction values (1-CNF)
%   CPF - vector of fitted correct positive fraction values
%   AUC - estimated area under the ROC curve

k0 = find(class == 0);           % extract negative events
k1 = find(class == 1);           % extract positive events

mu0 = mean(y(k0));               % calculate moments of (-)
s0  = std(y(k0));
mu1 = mean(y(k1));               % calculate moments of (+)
s1  = std(y(k1));

a = (mu1-mu0)/s1;                % parameters for binormal curve
b = s0/s1;                       % as per equation 15

FPF = linspace(0.01,0.99);       % ordinate axis for the fit
tmp = norminv(FPF,0,1);
CPF = normcdf(a+b*tmp);          % estimated CPFs from the cumulative
                                 % normal density

AUC = normcdf(a/sqrt(1+b*b));    % binormal AUC estimate as per equation 18
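For readers working outside Matlab, the same binormal computation can be sketched in Python. This is a translation of the appendix function, not part of the original article; it assumes NumPy and SciPy (scipy.stats.norm supplies the normal CDF and inverse CDF), and uses ddof=1 so the standard deviations match Matlab's std:

```python
import numpy as np
from scipy.stats import norm

def binorm_roc(y, cls):
    """Binormal ROC curve and AUC, mirroring the Matlab binormROC above.

    y   : responses (length n)
    cls : 0/1 class membership (0 = negative, 1 = positive)
    Returns (FPF, CPF, AUC).
    """
    y = np.asarray(y, dtype=float)
    cls = np.asarray(cls)
    y0, y1 = y[cls == 0], y[cls == 1]         # negative / positive events
    mu0, s0 = y0.mean(), y0.std(ddof=1)       # moments of (-)
    mu1, s1 = y1.mean(), y1.std(ddof=1)       # moments of (+)
    a = (mu1 - mu0) / s1                      # binormal parameters (equation 15)
    b = s0 / s1
    fpf = np.linspace(0.01, 0.99, 100)        # ordinate axis for the fit
    cpf = norm.cdf(a + b * norm.ppf(fpf))     # fitted correct positive fractions
    auc = norm.cdf(a / np.sqrt(1.0 + b * b))  # binormal AUC (equation 18)
    return fpf, cpf, auc
```

As with the Matlab version, the fitted curve is parametric in the two moments of each class, so it remains smooth even for small samples, at the price of assuming approximate normality of the responses within each class.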

References

[1] J.A. Swets, Science 240 (1988) 1285–1293.
[2] R.M. Centor, Medical Decision Making 11 (1991) 102–106.
[3] N.A. Obuchowski, Radiology 229 (2003) 3–8.
[4] D.M. Green, J.A. Swets, Signal Detection Theory and Psychophysics, John Wiley & Sons, New York, NY, 1974.
[5] M.S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, 2003.
[6] W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, New York, NY, 1992.
[7] G.W. Snedecor, W.G. Cochran, Statistical Methods, 8th ed., Iowa State University Press, Ames, Iowa, 1989.
[8] E.B. Wilson, Journal of the American Statistical Association 22 (1927) 209–212.
[9] L.L. Brown, T.T. Cai, A. DasGupta, Statistical Science 16 (2001) 101–133.
[10] R.A. Hilgers, Methods of Information in Medicine 30 (1991) 96–101.
[11] W.P. Tanner, J.A. Swets, Psychological Review 61 (1954) 401–409.
[12] J.P. Egan, Signal Detection Theory and ROC Analysis, Academic Press, New York, 1975.
[13] J.A. Swets, R.M. Pickett, Evaluation of Diagnostic Systems: Methods from Signal Detection Theory, Academic Press, New York, NY, 1982.
[14] A.R. Henderson, Annals of Clinical Biochemistry 30 (1993) 521–539.
[15] K. Linnet, Clinical Chemistry 34 (1988) 1379–1386.
[16] M.H. Zweig, G. Campbell, Clinical Chemistry 39 (1993) 561–577.
[17] F. Provost, T. Fawcett, R. Kohavi, Proceedings of the Fifteenth International Conference on Machine Learning, San Francisco, CA, Morgan Kaufmann, San Mateo, CA, 1998, pp. 445–453.
[18] J.A. Hanley, B.J. McNeil, Radiology 143 (1982) 29–36.
[19] D. Bamber, Journal of Mathematical Psychology 12 (1975) 387–415.
[20] B. Efron, R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall/CRC Press, New York, NY, 1993.
[21] E.J. Boyko, Medical Decision Making 14 (1994) 175–179.
[22] B. Dujardin, J. van den Ende, A. van Gompel, J.P. Unger, P. van der Stuyft, European Journal of Epidemiology 10 (1994) 29–36.
[23] D.L. Simel, G.P. Samsa, D.B. Matchar, Journal of Clinical Epidemiology 44 (1991) 763–770.
[24] S. Baker, Communications in Statistics. Theory and Methods 20 (1991) 2739–2752.
[25] P.N. Valenstein, American Journal of Clinical Pathology 93 (1990) 252–258.
[26] J. Gart, A. Buck, American Journal of Epidemiology 83 (1966) 593–602.
[27] S.L. Hui, X.H. Zhou, Statistical Methods in Medical Research 7 (1998) 354–370.
[28] J.S. Uebersax, W.M. Grove, Statistics in Medicine 9 (1990) 559–572.
[29] H.B. Lipman, J.R. Astles, Clinical Chemistry 44 (1998) 108–115.
[30] A. Hagdu, Lancet 348 (1996) 592–593.
[31] W.C. Miller, Journal of Clinical Epidemiology 51 (1998) 219–231.
[32] X.-H. Zhou, D.K. McClish, N.A. Obuchowski, Statistical Methods in Diagnostic Medicine, J. Wiley & Sons, New York, NY, 2002.
[33] C.D. Brown, H.T. Davis, M.N. Ediger, C.M. Fleming, E.L. Hull, M. Rohrscheib, Diabetes Technology and Therapeutics 7 (2005) 456–466.
[34] L.A. Currie, Chemometrics and Intelligent Laboratory Systems 37 (1997) 151–181.
[35] R. Boqué, F.X. Rius, Chemometrics and Intelligent Laboratory Systems 32 (1996) 11–23.
[36] H. van der Voet, Encyclopedia of Environmetrics, vol. 1, Wiley, Chichester, 2002, pp. 504–515.
[37] S. Christesen, K. Spencer, J. Sylvia, K. Gonser, "SERS of chemical agents in water: determining limits of detection," presentation at the Federation of Analytical Chemistry and Spectroscopy Societies (FACSS) Conference, Portland, OR, 2004.
[38] http://www.darpa.mil/mto/people/pms/pdfs/CBS3FinalReport.pdf.
[39] H.T. Davis, E. Merrill, A proposed method to estimate receiver operating characteristic curves for chemical and biological standards, Proceedings of the SPIE Defense & Security Symposium, March 28–April 1, 2005, Orlando, Florida, 2005.
