ML Model Evaluation

This document discusses methods for evaluating and comparing machine learning models, covering evaluation metrics such as the confusion matrix, accuracy, precision, recall, sensitivity, and specificity. It describes the holdout, cross-validation, and bootstrap methods for estimating a classifier's accuracy. Statistical tests such as the t-test are presented for deciding whether differences in error rate between two models are statistically significant. Factors such as speed, robustness, scalability, and interpretability must also be considered when selecting the best model.


Model Evaluation and Selection



• Evaluation metrics: How can we measure accuracy? What other metrics should we consider?
• Use a validation (test) set of class-labeled tuples, instead of the training set, when assessing accuracy
• Methods for estimating a classifier's accuracy:
  - Holdout method, random subsampling
  - Cross-validation
  - Bootstrap
• Comparing classifiers:
  - Confidence intervals
  - Cost-benefit analysis and ROC curves
Classifier Evaluation Metrics: Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Actual class \ Predicted class      C1                      ¬C1
C1                                  True Positives (TP)     False Negatives (FN)
¬C1                                 False Positives (FP)    True Negatives (TN)

There are two possible predicted classes, "yes" and "no". Suppose we are predicting the presence of a disease:

• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
• False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
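
As a small illustration, here is a minimal Python sketch (with hypothetical label lists, not data from the slides) of how the four counts are tallied from actual and predicted labels:

    # Minimal sketch: tally TP, FP, FN, TN for a binary ("yes"/"no") problem.
    def confusion_counts(actual, predicted, positive="yes"):
        tp = fp = fn = tn = 0
        for a, p in zip(actual, predicted):
            if p == positive and a == positive:
                tp += 1        # predicted yes, actually yes
            elif p == positive:
                fp += 1        # predicted yes, actually no (Type I error)
            elif a == positive:
                fn += 1        # predicted no, actually yes (Type II error)
            else:
                tn += 1        # predicted no, actually no
        return tp, fp, fn, tn

    actual    = ["yes", "no", "yes", "no", "no"]   # hypothetical ground truth
    predicted = ["yes", "yes", "no", "no", "no"]   # hypothetical classifier output
    print(confusion_counts(actual, predicted))     # -> (1, 1, 1, 2)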
Example of Confusion Matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                              6954                   46     7000
buy_computer = no                                412                 2588     3000
Total                                           7366                 2634    10000

• Given m classes, an entry CM(i,j) in the confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j
• The matrix may have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P     C     ¬C
C         TP    FN     P
¬C        FP    TN     N
          P'    N'     All

• Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: 1 − accuracy, or
  Error rate = (FP + FN) / All
• Class imbalance problem:
  - One class may be rare, e.g., fraud or HIV-positive
  - Significant majority of the negative class and minority of the positive class
  - Sensitivity: true positive recognition rate; Sensitivity = TP / P
  - Specificity: true negative recognition rate; Specificity = TN / N
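
A small Python sketch of how these rates follow from the confusion-matrix counts, checked here against the buy_computer example from the previous slide:

    # Accuracy, error rate, sensitivity, and specificity from TP, FP, FN, TN.
    def basic_rates(tp, fp, fn, tn):
        p, n = tp + fn, fp + tn            # actual positives (P) and negatives (N)
        total = p + n                      # All
        accuracy = (tp + tn) / total
        error_rate = (fp + fn) / total     # equals 1 - accuracy
        sensitivity = tp / p               # true positive recognition rate
        specificity = tn / n               # true negative recognition rate
        return accuracy, error_rate, sensitivity, specificity

    # buy_computer example: TP=6954, FP=412, FN=46, TN=2588
    print(basic_rates(6954, 412, 46, 2588))   # -> (0.9542, 0.0458, 0.9934..., 0.8626...)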
Classifier Evaluation Metrics: Precision and Recall, and F-measures

• Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive?
  Precision = TP / (TP + FP)
• Recall (completeness): what % of positive tuples did the classifier label as positive?
  Recall = TP / (TP + FN) = TP / P
• A perfect score is 1.0
• There is an inverse relationship between precision and recall
• F measure (F1 or F-score): harmonic mean of precision and recall
  F1 = 2 × precision × recall / (precision + recall)
• Fβ: weighted measure of precision and recall
  Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
  - assigns β times as much weight to recall as to precision
Classifier Evaluation Metrics: Example

Actual class \ Predicted class    cancer = yes    cancer = no    Total    Recognition (%)
cancer = yes                                90            210      300    30.00 (sensitivity)
cancer = no                                140           9560     9700    98.56 (specificity)
Total                                      230           9770    10000    96.40 (accuracy)

• Precision = 90/230 = 39.13%      Recall = 90/300 = 30.00%

A \ P     C     ¬C
C         TP    FN     P
¬C        FP    TN     N
          P'    N'     All

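
A small Python sketch that reproduces the precision and recall figures of the cancer example and adds the F1 score:

    # Precision, recall, and F1 from confusion-matrix counts.
    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
        return precision, recall, f1

    # Cancer example: TP=90, FP=140, FN=210
    p, r, f1 = precision_recall_f1(90, 140, 210)
    print(round(p, 4), round(r, 4), round(f1, 4))   # 0.3913 0.3 0.3396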
Suppose I have 10,000 emails in my mailbox, of which 300 are spam. The spam detection system flags 150 mails as spam, of which 50 actually are spam. What are the precision and recall of my spam detection system?

A \ P     C     ¬C
C         TP    FN     P
¬C        FP    TN     N
          P'    N'     All
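
For reference, a quick check of this exercise (treating spam as the positive class): the 150 flagged mails contain 50 true positives and 100 false positives, and the 250 spam mails that were not flagged are false negatives.

    # Worked check of the spam exercise (spam = positive class).
    tp, fp = 50, 100            # 150 mails flagged, 50 of them really spam
    fn = 300 - tp               # spam mails the system missed
    precision = tp / (tp + fp)  # 50 / 150 = 33.3%
    recall = tp / (tp + fn)     # 50 / 300 = 16.7%
    print(precision, recall)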
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods

• Holdout method
  - The given data is randomly partitioned into two independent sets
  - Training set (e.g., 2/3) for model construction
  - Test set (e.g., 1/3) for accuracy estimation
  - Random subsampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular); a sketch is given after this list
  - Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size
  - At the i-th iteration, use Di as the test set and the remaining subsets as the training set
  - Leave-one-out: k folds where k = # of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
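
A minimal sketch of the k-fold procedure described above, in plain Python; `data` and `train_and_eval` are hypothetical placeholders (a labeled dataset and a callable that trains a model and returns its error rate on the test split), not part of the slides:

    import random

    # Minimal k-fold cross-validation sketch.
    def k_fold_error(data, train_and_eval, k=10, seed=0):
        data = list(data)
        random.Random(seed).shuffle(data)            # random partition of the data
        folds = [data[i::k] for i in range(k)]       # k mutually exclusive, ~equal-size subsets
        errors = []
        for i in range(k):
            test = folds[i]                          # D_i is the test set at iteration i
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            errors.append(train_and_eval(train, test))
        return sum(errors) / k                       # average error over the k iterations

Stratified cross-validation would additionally partition within each class, so that every fold keeps roughly the original class distribution.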
Evaluating Classifier Accuracy: Bootstrap

• Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
• There are several bootstrap methods; a common one is the .632 bootstrap (sketched below)
  - A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data ends up in the bootstrap sample, and the remaining 36.8% forms the test set, since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368
  - Repeat the sampling procedure k times; the overall accuracy of the model is estimated as
    Acc(M) = (1/k) Σ_{i=1..k} [ 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set ]
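
A minimal sketch of one .632 bootstrap round (Python; `train_and_acc` is a hypothetical callable returning a model's accuracy, as in the cross-validation sketch):

    import random

    # One round of the .632 bootstrap: d tuples are sampled with replacement
    # to form the training set; tuples never drawn form the test set.
    def bootstrap_632_round(data, train_and_acc, seed=0):
        rng = random.Random(seed)
        d = len(data)
        drawn = [rng.randrange(d) for _ in range(d)]      # sample d times with replacement
        train = [data[i] for i in drawn]
        drawn_set = set(drawn)
        test = [data[i] for i in range(d) if i not in drawn_set]
        acc_test = train_and_acc(train, test)      # accuracy on the held-out tuples
        acc_train = train_and_acc(train, train)    # accuracy back on the training sample
        return 0.632 * acc_test + 0.368 * acc_train

Repeating this k times and averaging the returned values gives the overall .632 bootstrap accuracy estimate described above.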
Estimating Confidence Intervals: Classifier Models M1 vs. M2

• Suppose we have two classifiers, M1 and M2. Which one is better?
• Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
• These mean error rates are just estimates of the error on the true population of future data cases
• What if the difference between the two error rates is just attributable to chance?
  - Use a test of statistical significance
  - Obtain confidence limits for our error estimates
Estimating Confidence Intervals: Null Hypothesis

• Perform 10-fold cross-validation
• Assume the samples follow a t distribution with k−1 degrees of freedom (here, k = 10)
• Use the t-test (Student's t-test)
• Null hypothesis: M1 & M2 are the same
• If we can reject the null hypothesis, then
  - we conclude that the difference between M1 & M2 is statistically significant
  - choose the model with the lower error rate
Estimating Confidence Intervals: t-test

• If only one test set is available: pairwise comparison
  - For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)i and err(M2)i
  - Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
  - The t-test computes the t-statistic with k−1 degrees of freedom:
      t = [err(M1) − err(M2)] / sqrt( var(M1 − M2) / k )
    where
      var(M1 − M2) = (1/k) Σ_{i=1..k} [ err(M1)i − err(M2)i − (err(M1) − err(M2)) ]²
    (err(M1) and err(M2) here denote the mean error rates over the k rounds)
• If two test sets are available: use a non-paired t-test
      t = [err(M1) − err(M2)] / sqrt( var(M1)/k1 + var(M2)/k2 )
  where k1 and k2 are the number of cross-validation samples used for M1 and M2, respectively
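
A minimal sketch of the paired t-statistic defined above (Python; the two per-fold error-rate lists are hypothetical placeholders, not results from the slides):

    import math

    # Paired t-statistic for two classifiers evaluated on the same k folds.
    def paired_t_statistic(err1, err2):
        k = len(err1)
        diffs = [a - b for a, b in zip(err1, err2)]
        mean_diff = sum(diffs) / k                                   # err(M1) - err(M2)
        var_diff = sum((x - mean_diff) ** 2 for x in diffs) / k      # var(M1 - M2)
        return mean_diff / math.sqrt(var_diff / k)

    # Hypothetical per-fold error rates from one 10-fold cross-validation run:
    err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.10, 0.15, 0.11, 0.12]
    err_m2 = [0.11, 0.09, 0.13, 0.10, 0.12, 0.12, 0.09, 0.13, 0.10, 0.11]
    print(paired_t_statistic(err_m1, err_m2))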
Estimating Confidence Intervals: Table for t-Distribution

• The t-distribution is symmetric
• Significance level: e.g., sig = 0.05 or 5% means we conclude that M1 & M2 are significantly different with 95% confidence
• Confidence limit: look up z = sig/2, since the table typically shows upper-tail percentage points
Estimating Confidence Intervals: Statistical Significance

• Are M1 & M2 significantly different?
  - Compute t. Select a significance level (e.g., sig = 5%)
  - Consult the table for the t-distribution: find the t value corresponding to k−1 degrees of freedom (here, 9)
  - The t-distribution is symmetric: typically the upper percentage points of the distribution are shown, so look up the value for confidence limit z = sig/2 (here, 0.025)
• If t > z or t < −z, then the t value lies in the rejection region (a short sketch of this decision step follows below):
  - Reject the null hypothesis that the mean error rates of M1 & M2 are the same
  - Conclude: there is a statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is due to chance
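
A short sketch of the decision step, assuming SciPy is available for the critical value (scipy.stats.t.ppf); sig, k, and t_stat are placeholder values:

    from scipy.stats import t as t_dist

    sig, k = 0.05, 10
    t_stat = 2.9                               # hypothetical t-statistic from the paired t-test
    z = t_dist.ppf(1 - sig / 2, df=k - 1)      # upper sig/2 point of t with k-1 = 9 d.o.f.
    if abs(t_stat) > z:                        # t lies in the rejection region
        print("Reject H0: the difference in mean error rates is statistically significant")
    else:
        print("Fail to reject H0: the difference may be due to chance")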
Issues Affecting Model Selection

• Accuracy
  - classifier accuracy: predicting class label
• Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
  - understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Predictor Error Measures

• Measure predictor accuracy: how far off the predicted value is from the actual known value
• Loss function: measures the error between y_i and the predicted value y_i'
  - Absolute error: |y_i − y_i'|
  - Squared error: (y_i − y_i')²
• Test error (generalization error): the average loss over a test set of d tuples
  - Mean absolute error:  MAE = (1/d) Σ_{i=1..d} |y_i − y_i'|
  - Mean squared error:   MSE = (1/d) Σ_{i=1..d} (y_i − y_i')²
  - Relative absolute error:  Σ_{i=1..d} |y_i − y_i'|  /  Σ_{i=1..d} |y_i − ȳ|
  - Relative squared error:   Σ_{i=1..d} (y_i − y_i')²  /  Σ_{i=1..d} (y_i − ȳ)²
    where ȳ is the mean of the actual values y_1, ..., y_d
• The mean squared error exaggerates the presence of outliers
• The (square) root mean squared error is popularly used; similarly, the root relative squared error
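
A brief sketch of these predictor error measures in plain Python (y_true and y_pred are hypothetical example values):

    import math

    # Error measures for a numeric predictor over a test set of size d.
    def error_measures(y_true, y_pred):
        d = len(y_true)
        y_mean = sum(y_true) / d
        abs_err = [abs(a - p) for a, p in zip(y_true, y_pred)]
        sq_err = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
        mae = sum(abs_err) / d                                        # mean absolute error
        mse = sum(sq_err) / d                                         # mean squared error
        rae = sum(abs_err) / sum(abs(a - y_mean) for a in y_true)     # relative absolute error
        rse = sum(sq_err) / sum((a - y_mean) ** 2 for a in y_true)    # relative squared error
        return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
                "RAE": rae, "RSE": rse, "RRSE": math.sqrt(rse)}

    y_true = [3.0, 5.0, 2.5, 7.0]      # hypothetical actual values
    y_pred = [2.5, 5.0, 4.0, 8.0]      # hypothetical predictions
    print(error_measures(y_true, y_pred))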