Classification Evaluation
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class |          C1          |          ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

A\P |  C  | ¬C  |
C   | TP  | FN  | P
¬C  | FP  | TN  | N
    | P'  | N'  | All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
  Error rate = (FP + FN)/All
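A minimal sketch (in Python, with placeholder counts rather than values from the slides) of how both measures follow from the four confusion-matrix cells:

    # Minimal sketch: accuracy and error rate from confusion-matrix counts.
    # The counts below are illustrative placeholders; substitute real results.
    tp, fn = 90, 10      # actual positives: correctly / incorrectly classified
    fp, tn = 5, 95       # actual negatives: incorrectly / correctly classified

    all_tuples = tp + fn + fp + tn
    accuracy = (tp + tn) / all_tuples      # Accuracy = (TP + TN)/All
    error_rate = (fp + fn) / all_tuples    # Error rate = (FP + FN)/All = 1 - accuracy
    print(f"accuracy={accuracy:.4f}, error rate={error_rate:.4f}")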
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All = (6954 + 2588)/10000 = 9542/10000 = 0.9542
• Error rate: 1 – accuracy, or
  Error rate = (FP + FN)/All = 1 – 0.9542 = 0.0458
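A quick sketch checking the arithmetic above with the counts from the buy_computer table:

    # Counts taken from the buy_computer confusion matrix above.
    tp, fn, fp, tn = 6954, 46, 412, 2588
    total = tp + fn + fp + tn                   # 10000 test tuples
    print((tp + tn) / total)                    # accuracy   -> 0.9542
    print((fp + fn) / total)                    # error rate -> 0.0458 (= 1 - accuracy)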
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
• Class Imbalance Problem:
  – One class may be rare, e.g., fraud, or HIV-positive
  – The significant majority of the tuples belong to the negative class and the rest are positive

A\P |  C  |    ¬C    |
C   |  0  |      10  | 10
¬C  |  0  |  999990  | 999990
    |  0  | 1000000  | 1000000

• Sensitivity: True Positive recognition rate
  – Sensitivity = TP/P
• Specificity: True Negative recognition rate
  – Specificity = TN/N
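A short sketch of why accuracy alone is misleading on the imbalanced table above: a classifier that never predicts the rare class C still scores 99.999% accuracy, while its sensitivity is 0.

    # Counts from the imbalanced example above: the classifier predicts ¬C for everything.
    tp, fn = 0, 10               # P = 10 rare positives, all missed
    fp, tn = 0, 999_990          # N = 999,990 negatives, all correct
    p, n = tp + fn, fp + tn

    accuracy = (tp + tn) / (p + n)   # 0.99999
    sensitivity = tp / p             # TP/P = 0.0
    specificity = tn / n             # TN/N = 1.0
    print(accuracy, sensitivity, specificity)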
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000
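As a worked sketch for this table (the two values below are derived here from the counts, not quoted from a slide):

    # buy_computer table: TP=6954, FN=46, FP=412, TN=2588
    sensitivity = 6954 / 7000    # TP/P ≈ 0.9934
    specificity = 2588 / 3000    # TN/N ≈ 0.8627
    print(sensitivity, specificity)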
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000

• Recall: completeness (Sensitivity: True Positive recognition rate)
  Recall = 6954/7000 = 0.9934
Precision/Recall
• Inverse relationship between precision & recall
• A system that tags every tuple as positive has 100% recall!
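A sketch illustrating both points on the buy_computer table: the actual classifier has high precision and recall, while a degenerate classifier that labels everything positive gets perfect recall but much lower precision (the precision values below are derived here, not taken from a slide).

    # Actual classifier from the buy_computer table: TP=6954, FN=46, FP=412, TN=2588.
    tp, fn, fp, tn = 6954, 46, 412, 2588
    print("precision:", tp / (tp + fp))   # 6954/7366 ≈ 0.9441
    print("recall:   ", tp / (tp + fn))   # 6954/7000 ≈ 0.9934

    # Degenerate classifier that tags every tuple as positive: TP=7000, FP=3000.
    print("all-positive precision:", 7000 / (7000 + 3000))   # 0.70
    print("all-positive recall:   ", 7000 / 7000)            # 1.0 -> 100% recall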
F1 and other averages
• The simple (arithmetic) mean is 50% for a “return-everything” search engine, which is too high.

Precision | Recall | Average |  F1
   0.01   |  0.99  |   0.5   | 0.02
   0.1    |  0.9   |   0.5   | 0.18
   0.2    |  0.8   |   0.5   | 0.32
   0.5    |  0.5   |   0.5   | 0.5
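A small sketch reproducing the table: F1 is the harmonic mean 2PR/(P + R), which stays low whenever either component is low, unlike the arithmetic mean.

    # Reproduce the Precision/Recall/Average/F1 table above.
    pairs = [(0.01, 0.99), (0.1, 0.9), (0.2, 0.8), (0.5, 0.5)]
    for p, r in pairs:
        average = (p + r) / 2                  # arithmetic mean: 0.5 in every row
        f1 = 2 * p * r / (p + r)               # harmonic mean
        print(f"P={p:.2f} R={r:.2f} avg={average:.2f} F1={f1:.2f}")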
Combined Measures
• Desideratum: Punish really bad performance on either precision or recall.
  – Taking the minimum achieves this.
  – But the minimum is not smooth and hard to weight.
  – F (harmonic mean) is a kind of smooth minimum.
[Figure: Minimum, Maximum, Arithmetic, Geometric, and Harmonic means of precision and recall, plotted as precision varies with recall fixed at 70%]
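A sketch mirroring the figure: with recall fixed at 0.7, compare the candidate combinations as precision varies; only the minimum and the harmonic mean collapse when precision is terrible.

    import math

    recall = 0.7                                # recall fixed at 70%, as in the figure
    for precision in (0.05, 0.2, 0.5, 0.9):
        combos = {
            "minimum":    min(precision, recall),
            "maximum":    max(precision, recall),
            "arithmetic": (precision + recall) / 2,
            "geometric":  math.sqrt(precision * recall),
            "harmonic":   2 * precision * recall / (precision + recall),
        }
        print(precision, {k: round(v, 2) for k, v in combos.items()})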
ROC Curve
• The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers.
  – Plots the true positive rate (TPR) against the false positive rate (FPR)
• TPR (recall, sensitivity)
  – TP/P
• FPR (1 – true negative rate)
  – FP/N, or 1 – TN/N
ROC Curve
• TPR (recall, sensitivity)
  – TP/P (6954/7000 = 0.99)
• FPR (1 – true negative rate)
  – FP/N, or 1 – TN/N (412/3000 = 0.14)
ROC Curve
There is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
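A hedged sketch of how an ROC curve is typically obtained in practice, assuming scikit-learn is available; roc_curve sweeps the decision threshold over the classifier's scores and returns the (FPR, TPR) pairs that trace the trade-off described above. The labels and scores below are made-up toy values.

    from sklearn.metrics import roc_curve, roc_auc_score

    # Toy data: true labels and the classifier's positive-class scores.
    y_true  = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
    print(list(zip(fpr, tpr)))
    print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve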
Segmenting the dataset for training and
testing
• Holdout method
• Cross validation
Evaluating Classifier Accuracy:
Holdout
• Holdout method
– Given data is randomly partitioned into two
independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the
accuracies obtained
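A minimal holdout sketch, assuming scikit-learn and a decision tree as the model; train_test_split performs the random 2/3 training, 1/3 test partition described above.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Random holdout: 2/3 for model construction, 1/3 for accuracy estimation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=42)

    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))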
Evaluating Classifier Accuracy:
Cross-Validation Methods
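A hedged sketch of the usual k-fold procedure: the data is split into k folds, each fold serves once as the test set while the other k – 1 folds are used for training, and the k accuracies are averaged. Assuming scikit-learn and a decision tree as the model:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 10-fold cross-validation: each fold is held out once as the test set.
    scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())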
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^(–1) ≈ 0.368)
  – Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations of 0.632 × Acc(M_i) on the test set + 0.368 × Acc(M_i) on the training set
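A sketch of one .632 bootstrap iteration, assuming NumPy: sampling d indices with replacement leaves roughly 36.8% of the tuples out, and those out-of-sample tuples form the test set.

    import numpy as np

    rng = np.random.default_rng(42)
    d = 1000                                          # number of tuples in the data set

    # Sample d indices with replacement -> bootstrap training set.
    train_idx = rng.choice(d, size=d, replace=True)
    test_idx = np.setdiff1d(np.arange(d), train_idx)  # tuples never drawn form the test set

    print("fraction of tuples in training set:", len(np.unique(train_idx)) / d)  # ≈ 0.632
    print("fraction of tuples in test set:    ", len(test_idx) / d)              # ≈ 0.368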
Model Evaluation and Selection
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Estimating Confidence Intervals:
Null Hypothesis
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit, z = sig/2
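A hedged sketch of the kind of significance test these slides refer to: a paired t-test on the per-fold error rates of two models M1 and M2 over the same 10 cross-validation folds, assuming SciPy. The error values are made up for illustration.

    from scipy import stats

    # Hypothetical per-fold error rates for M1 and M2 over the same 10 folds.
    err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.10, 0.15, 0.11, 0.12]
    err_m2 = [0.15, 0.14, 0.16, 0.13, 0.15, 0.14, 0.13, 0.17, 0.14, 0.15]

    t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired (related-samples) t-test
    print("t =", t_stat, "p =", p_value)
    if p_value < 0.05:    # sig = 0.05
        print("M1 and M2 differ significantly at the 5% significance level")
    else:
        print("no significant difference at the 5% significance level")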
Estimating Confidence Intervals:
Statistical Significance
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules