Classification Evaluation

Model evaluation and selection involves assessing a classifier's accuracy using test data rather than training data. Key metrics include accuracy, error rate, sensitivity, specificity, precision, recall, and the F1 measure. These are often visualized using a confusion matrix or ROC curve. The area under the ROC curve (AUC) is used to compare classifiers, with higher AUC indicating better performance. Evaluation metrics help select the best performing model.


MODEL EVALUATION AND SELECTION

Model Evaluation and Selection


• Use test set of class-labeled tuples instead of training set
when assessing accuracy
• Classifier Evaluation Metrics
– Accuracy
– Error Rate
– Sensitivity (True Positive recognition rate)
– Specificity (True Negative recognition rate)
– Precision
– Recall
– F measure
– ROC Curve
– …

Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class \ Predicted class |          C1          |          ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:


Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |               6954 |                46 |  7000
buy_computer = no              |                412 |              2588 |  3000
Total                          |               7366 |              2634 | 10000

• Given m classes, an entry CM_{i,j} in a confusion matrix indicates the
number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A\P |  C | ¬C |
C   | TP | FN |   P
¬C  | FP | TN |   N
    | P’ | N’ | All

• Classifier accuracy, or recognition rate: percentage of test set tuples
that are correctly classified
Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |               6954 |                46 |  7000
buy_computer = no              |                412 |              2588 |  3000
Total                          |               7366 |              2634 | 10000

• Classifier accuracy, or recognition rate: percentage of test set tuples
that are correctly classified
Accuracy = (TP + TN)/All = (6954 + 2588)/10000 = 9542/10000 = 0.9542
• Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All = 1 – 0.9542 = 0.0458
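
The same arithmetic is easy to script. A minimal Python sketch (not part of the original slides), assuming the confusion-matrix counts above:

```python
# Accuracy and error rate from raw confusion-matrix counts (buy_computer example)
TP, FN = 6954, 46     # actual "yes" tuples: correctly / incorrectly classified
FP, TN = 412, 2588    # actual "no" tuples: incorrectly / correctly classified

total = TP + FN + FP + TN            # all test tuples (10000)
accuracy = (TP + TN) / total         # 0.9542
error_rate = (FP + FN) / total       # 0.0458, equals 1 - accuracy

print(f"accuracy   = {accuracy:.4f}")
print(f"error rate = {error_rate:.4f}")
```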
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A\P |  C |      ¬C |
C   |  0 |      10 |      10
¬C  |  0 |  999990 |  999990
    |  0 | 1000000 | 1000000

• Class Imbalance Problem:
– One class may be rare, e.g. fraud, or HIV-positive
– Significant majority of the negative class and minority of the positive class
• Accuracy = (0 + 999990)/1000000 = 99.999%, yet sensitivity = 0/10 = 0:
not a single positive tuple is recognized
• Sensitivity: True Positive recognition rate
– Sensitivity = TP/P
• Specificity: True Negative recognition rate
– Specificity = TN/N
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |               6954 |                46 |  7000
buy_computer = no              |                412 |              2588 |  3000
Total                          |               7366 |              2634 | 10000

• Sensitivity: True Positive recognition rate
Sensitivity = TP/P = 6954/7000 = 0.9934
• Specificity: True Negative recognition rate
Specificity = TN/N = 2588/3000 = 0.8627
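
To see why accuracy alone is not enough, a short Python sketch (illustrative only) recomputes the metrics for the imbalanced example from the previous slide, where TP = 0, FN = 10, FP = 0 and TN = 999990:

```python
# Sensitivity and specificity on the imbalanced example (rare positive class)
TP, FN = 0, 10          # all 10 actual positives are missed
FP, TN = 0, 999_990     # all negatives are classified correctly

P = TP + FN             # actual positives
N = FP + TN             # actual negatives

accuracy = (TP + TN) / (P + N)   # 0.99999 -- looks excellent
sensitivity = TP / P             # 0.0     -- not a single positive recognized
specificity = TN / N             # 1.0

print(accuracy, sensitivity, specificity)
```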
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive?
Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier
label as positive?
Recall = TP / (TP + FN) = TP / P

Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |               6954 |                46 |  7000
buy_computer = no              |                412 |              2588 |  3000
Total                          |               7366 |              2634 | 10000

• Precision: exactness
Precision = TP / (TP + FP) = 6954/7366 = 0.9441
• Recall: completeness
Recall = TP / P = 6954/7000 = 0.9934
(Recall is the same as sensitivity, the True Positive recognition rate.)
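
The same values can be reproduced with scikit-learn by expanding the confusion-matrix counts into label vectors; this is only an illustrative sketch (the slides themselves give just the counts):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Rebuild label vectors from the buy_computer confusion matrix
# (TP=6954, FN=46, FP=412, TN=2588); 1 = "buys computer", 0 = "does not"
y_true = np.concatenate([np.ones(6954 + 46), np.zeros(412 + 2588)])
y_pred = np.concatenate([np.ones(6954), np.zeros(46),    # actual positives
                         np.ones(412),  np.zeros(2588)]) # actual negatives

print(precision_score(y_true, y_pred))  # ~0.9441  (TP / (TP + FP))
print(recall_score(y_true, y_pred))     # ~0.9934  (TP / P, i.e. sensitivity)
```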
Precision/Recall
• Inverse relationship between precision & recall
• A system that tags every tuple as positive has 100% recall!
• In a good system, precision decreases as either the number of
documents retrieved or recall increases
– This is not a theorem, but a result with strong empirical
confirmation
A combined measure: F
• Combined measure that assesses
precision/recall tradeoff is F measure
(weighted harmonic mean):
F = 1 / (α · (1/P) + (1 − α) · (1/R)) = (β² + 1) · P · R / (β² · P + R)

• People usually use the balanced F1 measure
– i.e., with β = 1 or, equivalently, α = ½
Classifier Evaluation Metrics: Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   |           90 |         210 |   300 | 30.00 (sensitivity)
cancer = no                    |          140 |        9560 |  9700 | 98.56 (specificity)
Total                          |          230 |        9770 | 10000 | 96.40 (accuracy)

– Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%


– F1 = 1 / (0.5 · (1/0.3913) + (1 − 0.5) · (1/0.30)) ≈ 0.34

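A small Python sketch of the weighted harmonic mean, checked against the cancer example (P = 90/230, R = 90/300); the helper name f_measure is purely illustrative:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta)."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Cancer example from the slide: P = 90/230, R = 90/300
p, r = 90 / 230, 90 / 300
print(f_measure(p, r))          # ~0.34 (balanced F1, beta = 1)
print(f_measure(p, r, beta=2))  # weights recall higher than precision
```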
F1 and other averages

The simple (arithmetic) mean is 50% for a “return-everything” search
engine, which is too high.

Precision | Recall | Average |  F1
     0.01 |   0.99 |     0.5 | 0.02
     0.10 |   0.90 |     0.5 | 0.18
     0.20 |   0.80 |     0.5 | 0.32
     0.50 |   0.50 |     0.5 | 0.50

Combined Measures

[Chart: minimum, maximum, arithmetic, geometric and harmonic means of
precision and recall, plotted against precision with recall fixed at 70%]

• Desideratum: punish really bad performance on either precision or recall.
– Taking the minimum achieves this.
– But the minimum is not smooth and is hard to weight.
– F (harmonic mean) is a kind of smooth minimum.
ROC Curve
• The receiver operating characteristic (ROC)
curve is another common tool used with
binary classifiers.
– Plots the true positive rate (TPR) against the false positive rate (FPR)
• TPR (recall, sensitivity)
– TPR = TP/P
• FPR (1 – true negative rate)
– FPR = FP/N = 1 – TN/N
ROC Curve
• TPR (recall, sensitivity)
– TPR = TP/P = 6954/7000 = 0.99
• FPR (1 – true negative rate)
– FPR = FP/N = 412/3000 = 0.14
ROC Curve

There is a trade-off: the higher the recall (TPR), the more false
positives (higher FPR) the classifier produces.

• One way to compare classifiers is to measure the area under the
curve (AUC).
• A perfect classifier will have a ROC AUC equal to 1, whereas a
purely random classifier will have a ROC AUC equal to 0.5.
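
As a rough illustration of how the curve and AUC are computed in practice, a scikit-learn sketch with invented labels and scores (none of these numbers come from the slides):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels and classifier scores (probabilities for class 1)
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve

print(auc)  # scale: 1.0 = perfect ranking, 0.5 = random guessing
```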
Actual class \ Predicted class |  yes |  no | Total
yes                            |  800 | 200 |  1000
no                             |  250 | 150 |   400
Total                          | 1050 | 350 |  1400

• Classifier Evaluation Metrics


– Accuracy
– Error Rate
– Sensitivity
– Specificity
– Precision
– Recall
– F1 measure
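A possible worked solution in Python, taking "yes" as the positive class (TP = 800, FN = 200, FP = 250, TN = 150):

```python
# Metrics for the exercise confusion matrix above ("yes" is the positive class)
TP, FN = 800, 200
FP, TN = 250, 150

P, N = TP + FN, FP + TN              # 1000 actual yes, 400 actual no
accuracy    = (TP + TN) / (P + N)    # (800 + 150)/1400 ≈ 0.679
error_rate  = 1 - accuracy           # ≈ 0.321
sensitivity = TP / P                 # 800/1000 = 0.800  (= recall)
specificity = TN / N                 # 150/400  = 0.375
precision   = TP / (TP + FP)         # 800/1050 ≈ 0.762
recall      = sensitivity
f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.780

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("recall", recall), ("F1", f1)]:
    print(f"{name:12s} {value:.3f}")
```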
Model Evaluation and Selection
• How to select test data?

• Methods for estimating a classifier’s accuracy:


– Holdout method, random subsampling
– Cross-validation
– Bootstrap

Segmenting the dataset for training and
testing
• Holdout method
• Cross validation

Evaluating Classifier Accuracy:
Holdout
• Holdout method
– Given data is randomly partitioned into two
independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
• Repeat holdout k times; accuracy = avg. of the k accuracies obtained
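
A minimal scikit-learn sketch of the holdout method with random subsampling; the dataset, classifier and k = 10 repetitions are placeholders, not part of the slides:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # placeholder dataset

# Random subsampling: repeat the 2/3 train, 1/3 test holdout k times
k, accuracies = 10, []
for seed in range(k):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                              random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print(np.mean(accuracies))   # estimated accuracy = average over the k runs
```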

Evaluating Classifier Accuracy:
Cross-Validation Methods

• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets
D_1, ..., D_k, each of approximately equal size
– At the i-th iteration, use D_i as the test set and the others as the
training set
– Leave-one-out: k folds where k = # of tuples
This approach can closely estimate the true accuracy when the value
of k is large.

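A minimal sketch of 10-fold cross-validation with scikit-learn; the dataset and classifier are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # placeholder dataset
model = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: each tuple is used exactly once for testing
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())           # estimated accuracy and its spread
```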
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in
a training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap sample, and the remaining 36.8% form the test set
(since (1 – 1/d)^d ≈ e^(–1) = 0.368)
– Repeat the sampling procedure k times; overall accuracy of the model:
Acc(M) = (1/k) · Σ_{i=1..k} [0.632 · Acc(M_i)_test_set + 0.368 · Acc(M_i)_train_set]
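
A rough sketch of the .632 bootstrap in Python; the dataset, classifier and k = 10 rounds are placeholders chosen only to illustrate the resampling and the 0.632/0.368 weighting:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)        # placeholder dataset
d = len(X)
rng = np.random.default_rng(0)

k, estimates = 10, []
for _ in range(k):
    # Sample d tuples with replacement -> training set; the rest -> test set
    train_idx = rng.integers(0, d, size=d)
    test_idx = np.setdiff1d(np.arange(d), train_idx)

    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    acc_test = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    acc_train = accuracy_score(y[train_idx], model.predict(X[train_idx]))

    estimates.append(0.632 * acc_test + 0.368 * acc_train)   # .632 weighting

print(np.mean(estimates))   # overall .632 bootstrap accuracy estimate
```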

Model Evaluation and Selection

• Suppose we have two classifiers, M1 and M2. Which one is better?

Estimating Confidence Intervals:
Classifier Models M1 vs. M2

• Use 10-fold cross-validation to obtain mean error rates err(M1) and err(M2)
• These mean error rates are just estimates of error on the true
population of future data cases
• What if the difference between the two error rates is just due to chance?
– Use a test of statistical significance

– Obtain confidence limits for our error estimates

Estimating Confidence Intervals:
Null Hypothesis

• Perform 10-fold cross-validation


• Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
err ( M 1 )  err ( M 2 )
• Use t-test (or Student’s t-test) t
var(M 1  M 2 ) /( k  1)
• Null Hypothesis: M1 & M2 are the same
• If we can reject null hypothesis, then
– we conclude that the difference between M1 & M2 is statistically
significant
– Choose the model with the lower error rate

Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level,
e.g., sig = 0.05 or
5% means M1 & M2
are significantly
different for 95% of
population
• Confidence limit, z
= sig/2

Estimating Confidence Intervals:
Statistical Significance

• Are M1 & M2 significantly different?


– Compute t. Select significance level (e.g. sig = 5%)
– Consult table for t-distribution: Find t value corresponding to k-1
degrees of freedom (here, 9)
– t-distribution is symmetric: typically upper % points of distribution
shown → look up value for confidence limit z=sig/2 (here, 0.025)
– If t > z or t < -z, then t value lies in rejection region:
• Reject null hypothesis that mean error rates of M1 & M2 are
same
• Conclude: statistically significant difference between M1 & M2
– Otherwise, conclude that any difference is chance
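
A hedged Python sketch of this procedure, using scipy's paired t-test on per-fold error rates; the error-rate values are invented, and comparing the p-value to sig is equivalent to comparing t against the tabled value:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold error rates from the same 10-fold cross-validation
err_m1 = np.array([0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.10, 0.11, 0.12])
err_m2 = np.array([0.14, 0.13, 0.15, 0.14, 0.16, 0.13, 0.17, 0.12, 0.14, 0.15])

# Paired t-test on the fold-by-fold differences (k - 1 = 9 degrees of freedom)
t_stat, p_value = stats.ttest_rel(err_m1, err_m2)

sig = 0.05
if p_value < sig:
    print(f"Reject null hypothesis (t = {t_stat:.2f}, p = {p_value:.3f}): "
          "the difference between M1 and M2 is statistically significant")
else:
    print("Any observed difference may be attributed to chance")
```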

Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
