AI351 Lecture 2 - Common Evaluation Metrics
Goals for the lecture
You should understand the following concepts:
• test sets
• learning curves
• validation (tuning) sets
• stratified sampling
• cross validation
• internal cross validation
• confusion matrices
• TP, FP, TN, FN
• ROC curves
• confidence intervals for error
• pairwise t-tests for comparing learning systems
• scatter plots for comparing learning systems
• lesion studies
Test sets revisited
How can we get an unbiased estimate of the accuracy of a learned model?
[figure: a learning method produces a learned model, which is evaluated on a held-aside test set to give an accuracy estimate]
Learning curves
How does the accuracy of a learning method change as a function of
the training-set size?
[figure: learning curves; from Perlich et al., Journal of Machine Learning Research, 2003]
Learning curves
Given a training/test set partition:
• for each sample size s on the learning curve
  • (optionally) repeat n times
    • randomly select s instances from the training set
    • learn a model
    • evaluate the model on the test set to determine its accuracy a
  • plot (s, a) or (s, avg. accuracy with error bars)
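A minimal Python sketch of this procedure, assuming a fixed train/test split and a scikit-learn-style learner (the decision tree, sample sizes, and data arrays are illustrative placeholders, not part of the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def learning_curve_points(X_train, y_train, X_test, y_test,
                          sizes, n_repeats=10, seed=0):
    """For each training-set size s, repeatedly subsample s instances,
    learn a model, and record its test-set accuracy."""
    rng = np.random.default_rng(seed)
    points = []
    for s in sizes:
        accs = []
        for _ in range(n_repeats):
            idx = rng.choice(len(X_train), size=s, replace=False)
            model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
            accs.append(accuracy_score(y_test, model.predict(X_test)))
        # (s, avg. accuracy, spread for error bars)
        points.append((s, np.mean(accs), np.std(accs)))
    return points
```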
Validation (tuning) sets revisited
What if we want unbiased estimates of accuracy during the learning process (e.g., to choose the best level of decision-tree pruning)?
[figure: within the learning process, candidate models are learned and a held-aside validation (tuning) set is used to select among them]
Random resampling
We can obtain multiple accuracy estimates by repeatedly and randomly partitioning the available data into training and test sets.
[figure: several random partitions of the data, each split into a training set and a test set]
Stratified sampling
When randomly selecting training or validation sets, we may want to ensure that class proportions are maintained in each selected set.
[figure: partitions s1 … s5, each preserving the original class proportions]
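A small sketch of stratified selection, assuming numpy label arrays; the helper name is hypothetical, and scikit-learn's train_test_split(..., stratify=y) provides the same behavior off the shelf:

```python
import numpy as np

def stratified_sample(y, fraction, seed=0):
    """Return indices for a random sample that preserves the class
    proportions found in the label array y."""
    rng = np.random.default_rng(seed)
    chosen = []
    for label in np.unique(y):
        class_idx = np.flatnonzero(y == label)
        n_take = int(round(fraction * len(class_idx)))
        chosen.extend(rng.choice(class_idx, size=n_take, replace=False))
    return np.array(chosen)

# e.g. indices for a validation set holding ~20% of each class:
# val_idx = stratified_sample(y, fraction=0.2)
```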
Cross validation example
Suppose we have 100 instances, and we want to estimate accuracy with cross validation: partition the instances into n disjoint folds (e.g., 10 folds of 10 instances each); in each iteration, train on all but one fold, test on the held-out fold, and average the resulting accuracy estimates.
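As a sketch (not from the lecture), estimating accuracy with n-fold cross validation over such a data set might look like this in Python, using scikit-learn and a k-NN classifier as a stand-in learner:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def cross_validation_accuracy(X, y, n_folds=10, seed=0):
    """Partition the data into n stratified folds, train on n-1 folds,
    test on the held-out fold, and average the fold accuracies."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_accs = []
    for train_idx, test_idx in skf.split(X, y):
        model = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        fold_accs.append(accuracy_score(y[test_idx], preds))
    return np.mean(fold_accs)
```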
Cross validation
• 10-fold cross validation is common, but smaller values of
n are often used when learning takes a lot of time
[figure: folds s1 … s5 used within the learning process to learn candidate models and select among them, yielding the learned model]
Example: using internal cross validation to select k in k-NN
Given a training set:
1. partition the training set into n folds, s1 … sn
2. for each value of k considered
     for i = 1 to n
       learn a k-NN model using all folds but si
       evaluate its accuracy on si
3. select the k that resulted in the best accuracy over s1 … sn
4. learn the final model using the entire training set and the selected k

Steps 1-4 are run independently for each training set (i.e., if we're using 10-fold CV to measure the overall accuracy of our k-NN approach, then these steps are executed 10 times, once per outer training set).
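A hedged Python sketch of steps 1-4, using scikit-learn's k-NN implementation; the candidate k values and the fold count are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def select_k_by_internal_cv(X_train, y_train, k_values=(1, 3, 5, 7, 9),
                            n_folds=10, seed=0):
    """Use cross validation *within the training set* to pick k,
    then fit the final k-NN model on the entire training set."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    mean_acc = {}
    for k in k_values:
        accs = []
        for fit_idx, val_idx in skf.split(X_train, y_train):
            model = KNeighborsClassifier(n_neighbors=k)
            model.fit(X_train[fit_idx], y_train[fit_idx])
            accs.append(accuracy_score(y_train[val_idx],
                                       model.predict(X_train[val_idx])))
        mean_acc[k] = np.mean(accs)
    best_k = max(mean_acc, key=mean_acc.get)
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return best_k, final_model
```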
Confusion matrices
How can we understand what types of mistakes a learned model makes?
[figure: an example confusion matrix, with predicted class along one axis and actual class along the other; figure from vision.jhu.edu]
Confusion matrix for 2-class problems
                          actual class
                      positive    negative
predicted   positive     TP          FP
class       negative     FN          TN

accuracy = (TP + TN) / (TP + FP + FN + TN)
Is accuracy an adequate measure of predictive performance?
• accuracy may not be a useful measure in cases where
  • there is a large class skew
    • Is 98% accuracy good if 97% of the instances are negative?

Two rates that look at the classes separately:

true positive rate (recall) = TP / actual pos = TP / (TP + FN)

false positive rate = FP / actual neg = FP / (TN + FP)
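For concreteness, a small sketch (mine, not the lecture's) computing these quantities from predicted and actual labels:

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)   # true positive rate (recall)
    fpr = fp / (tn + fp)   # false positive rate
    return accuracy, tpr, fpr
```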
ROC curves
A Receiver Operating Characteristic (ROC) curve plots the TP-rate vs. the
FP-rate as a threshold on the confidence of an instance being positive is
varied
ROC curve example
[figure: an example ROC curve]
Plotting an ROC curve
Instances are sorted by decreasing confidence that they are positive; each threshold between consecutive confidence values yields one (FPR, TPR) point on the curve.

instance   confidence positive   correct class
Ex 9       .99                   +
Ex 7       .98                   +     TPR = 2/5, FPR = 0/5
Ex 1       .72                   -
Ex 2       .70                   +
Ex 6       .65                   +     TPR = 4/5, FPR = 1/5
Ex 10      .51                   -
Ex 3       .39                   -     TPR = 4/5, FPR = 3/5
Ex 5       .24                   +     TPR = 5/5, FPR = 3/5
Ex 4       .11                   -
Ex 8       .01                   -     TPR = 5/5, FPR = 5/5

[figure: the resulting ROC curve, with false positive rate (0 to 1.0) on the x-axis and true positive rate (0 to 1.0) on the y-axis]
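A minimal sketch of this threshold sweep in Python (assuming no tied confidence values, with labels coded 1 for positive and 0 for negative); it reproduces the points in the table above:

```python
import numpy as np

def roc_points(confidences, labels):
    """Sweep a threshold down the sorted confidences and record
    (FPR, TPR) after each instance, mirroring the table above."""
    order = np.argsort(-np.asarray(confidences))   # decreasing confidence
    labels = np.asarray(labels)[order]
    n_pos = int(np.sum(labels == 1))
    n_neg = int(np.sum(labels == 0))
    tp = fp = 0
    points = [(0.0, 0.0)]                          # start at (FPR, TPR) = (0, 0)
    for lab in labels:                             # lower the threshold one step at a time
        if lab == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# e.g. for the ten instances above (1 = positive, 0 = negative):
# confs  = [.99, .98, .72, .70, .65, .51, .39, .24, .11, .01]
# labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
# roc_points(confs, labels)   # ends at (1.0, 1.0)
```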
Plotting an ROC curve
can interpolate between points to get the convex hull
• convex hull: repeatedly, while possible, perform interpolations that skip one data point, and discard any point that lies below an interpolated line
• interpolated points are achievable in theory: we can flip a weighted coin to choose between the classifiers represented by the plotted points

[figure: ROC points and their convex hull; axes are false positive rate and true positive rate]
Other accuracy metrics
recall (TP rate) = TP / actual pos = TP / (TP + FN)

precision = TP / predicted pos = TP / (TP + FP)
Precision/recall curves
A precision/recall curve plots the precision vs. recall (TP-rate) as a
threshold on the confidence of an instance being positive is varied
[figure: an example precision/recall curve; the ideal point is at the upper right (precision = recall = 1.0), and the default precision is determined by the fraction of instances that are positive]
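The same sweep can produce a precision/recall curve; a sketch under the same assumptions (scikit-learn's precision_recall_curve is an off-the-shelf alternative):

```python
import numpy as np

def pr_points(confidences, labels):
    """Sweep a threshold down the sorted confidences and record
    (recall, precision) after each instance."""
    order = np.argsort(-np.asarray(confidences))
    labels = np.asarray(labels)[order]
    n_pos = int(np.sum(labels == 1))
    tp = 0
    points = []
    for i, lab in enumerate(labels, start=1):      # i = number predicted positive
        if lab == 1:
            tp += 1
        points.append((tp / n_pos, tp / i))        # (recall, precision)
    return points
```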
Mammography Example: ROC
[figure: ROC curve for the mammography task]
Mammography Example: PR
[figure: precision/recall curve for the mammography task]
How do we get one ROC/PR curve
when we do cross validation?
Approach 1
• make assumption that confidence values are comparable
across folds
• pool predictions from all test sets
• plot the curve from the pooled predictions
Comments on ROC and PR curves
both
• allow predictive performance to be assessed at various levels of
confidence
• assume binary classification tasks
• sometimes summarized by calculating area under the curve
ROC curves
• insensitive to changes in class distribution (the ROC curve does not change if the proportion of positive and negative instances in the test set is varied)
• can identify optimal classification thresholds for tasks with differential
misclassification costs
precision/recall curves
• show the fraction of predictions that are false positives
• well suited for tasks with lots of negative instances
To Avoid Cross-Validation Pitfalls, Ask:
• 1. Is my held-aside test data really representative of what I would get if I went out and collected new data?
– Even if your methodology is fine, someone may have collected features for positive examples differently than for negatives; data collection should be randomized
– Example: samples from cancer patients processed by different people or on different days than samples from normal controls
To Avoid Pitfalls, Ask:
• 2. Did I repeat my entire data
processing procedure on every fold of
cross-validation, using only the
training data for that fold?
– On each fold of cross-validation, did I
ever access in any way the label of a test
case?
– Any preprocessing done over the entire data set (feature selection, parameter tuning, threshold selection) must not use the labels
To Avoid Pitfalls, Ask:
• 3. Have I modified my algorithm so
many times, or tried so many
approaches, on this same data set that
I (the human) am overfitting it?
– Have I continually modified my
preprocessing or learning algorithm until I
got some improvement on this data set?
– If so, I really need to get some additional
data now to at least test on
Confidence intervals on error
Given the observed error (accuracy) of a model over a limited
sample of data, how well does this error characterize its accuracy
over additional instances?
Suppose we have
• a learned model h
• a test set S containing n instances drawn independently of one
another and independent of h
• n ≥ 30
• h makes r errors over the n instances
errorS(h) = r / n
Confidence intervals on error
How do we get a confidence interval from this estimate?
1. Our estimate of the error follows a binomial distribution given by n and p (the true error rate over the data distribution)
2. For n ≥ 30, this binomial is well approximated by a normal distribution, so standard normal (z) values can be used to form the interval
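A hedged sketch of the resulting interval, using the normal approximation to the binomial (the 95% default and the corresponding z value are the usual choices, not values taken from the slides):

```python
import math
from scipy.stats import norm

def error_confidence_interval(r, n, confidence=0.95):
    """Normal-approximation confidence interval for the true error rate,
    given r errors on n independent test instances (intended for n >= 30)."""
    error = r / n
    z = norm.ppf(0.5 + confidence / 2)            # e.g. z ~ 1.96 for 95%
    half_width = z * math.sqrt(error * (1 - error) / n)
    return error - half_width, error + half_width

# e.g. 12 errors on 40 test instances:
# error_confidence_interval(12, 40)   # roughly (0.16, 0.44)
```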
Empirical Confidence Bounds
• Bootstrapping: Given n examples in
data set, randomly, uniformly,
independently (with replacement) draw
n examples – bootstrap sample
• Repeat 1000 (or 10,000) times:
– Draw bootstrap sample
– Repeat entire cross-validation process
• The lower (upper) bound is the result such that 2.5% of runs yield a lower (higher) value
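A sketch of this bootstrap procedure; `evaluate` is a placeholder for the full evaluation procedure (e.g., the cross-validation routine sketched earlier), while the 1000 replicates and 2.5%/97.5% bounds follow the slide:

```python
import numpy as np

def bootstrap_bounds(evaluate, X, y, n_boot=1000, alpha=0.05, seed=0):
    """Empirical confidence bounds: draw n examples with replacement,
    rerun the full evaluation on each bootstrap sample, and take the
    2.5th and 97.5th percentiles of the results."""
    rng = np.random.default_rng(seed)
    n = len(y)
    results = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # n examples, with replacement
        results.append(evaluate(X[idx], y[idx]))
    lower = np.percentile(results, 100 * alpha / 2)
    upper = np.percentile(results, 100 * (1 - alpha / 2))
    return lower, upper

# e.g. bootstrap_bounds(cross_validation_accuracy, X, y) reuses the
# cross-validation sketch from earlier as the evaluation procedure
```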
Comparing learning systems
Motivating example
[table: performance of two learning systems on each fold of cross validation; δi denotes the difference between the systems on fold i]
Comparing systems using a paired t test
• consider δ’s as observed values of a set of i.i.d.
random variables
• hypothesis test:
– use paired t-test to determine probability p that
mean of δ’s would arise from null hypothesis
– if p is sufficiently small (typically < 0.05) then reject the null hypothesis
Comparing systems using a paired t test
1. calculate the sample mean of the δ's:  δ̄ = (1/n) Σ δi

2. calculate the t statistic:  t = δ̄ / sqrt( Σ (δi - δ̄)² / (n(n - 1)) )
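A small sketch of this test in Python; the per-fold error arrays are placeholders, and scipy.stats.ttest_rel gives the same result directly:

```python
import numpy as np
from scipy import stats

def paired_t_test(errors_a, errors_b):
    """Paired t test on per-fold differences delta_i = a_i - b_i.
    Returns the t statistic and two-sided p-value; reject the null
    hypothesis (no difference) if p < 0.05."""
    deltas = np.asarray(errors_a) - np.asarray(errors_b)
    n = len(deltas)
    d_bar = deltas.mean()
    t = d_bar / np.sqrt(np.sum((deltas - d_bar) ** 2) / (n * (n - 1)))
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    # equivalent one-liner: stats.ttest_rel(errors_a, errors_b)
    return t, p
```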
Comparing systems using a paired t test
[figures: example comparisons of learning systems; from Freund & Mason, ICML 1999, and Noto & Craven, BMC Bioinformatics 2006]
Lesion studies
We can gain insight into what contributes to a learning system’s
performance by removing (lesioning) components of it
The ROC curves here show how performance is affected when various
feature types are removed from the learning representation
figure from Bockhorst et al., Bioinformatics 2003