ML Model Evaluation
There are two possible predicted classes: "yes" and "no". Suppose we are predicting the presence of a disease:
• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
• False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
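As a small illustration (not from the original slides), the four counts can be tallied from paired lists of actual and predicted labels; the function name and the "yes"/"no" labels are placeholders:

```python
# Sketch: tally TP, TN, FP, FN from actual vs. predicted labels.
# The function name and the "yes"/"no" labels are placeholders.
def confusion_counts(actual, predicted, positive="yes"):
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted yes, actually yes
        elif p != positive and a != positive:
            tn += 1      # predicted no, actually no
        elif p == positive and a != positive:
            fp += 1      # predicted yes, actually no (Type I error)
        else:
            fn += 1      # predicted no, actually yes (Type II error)
    return tp, tn, fp, fn

actual    = ["yes", "no", "yes", "no", "yes"]
predicted = ["yes", "no", "no",  "no", "yes"]
print(confusion_counts(actual, predicted))  # (2, 2, 0, 1)
```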
Example of Confusion Matrix:
Classifier Evaluation Metrics:
Precision, Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive; Precision = TP / (TP + FP)
Recall: completeness – what % of positive tuples the classifier labeled as positive; Recall = TP / (TP + FN)
F-measure (F1): harmonic mean of precision and recall; F = (2 × Precision × Recall) / (Precision + Recall)
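A minimal sketch of these metrics in terms of the confusion-matrix counts (function names are illustrative; the example reuses the counts from the sketch above):

```python
# Sketch: precision, recall, and F1 from confusion-matrix counts
# (function names are illustrative).
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0     # exactness

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0     # completeness

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean

# Reusing the counts from the sketch above: TP=2, FP=0, FN=1.
print(precision(2, 0), recall(2, 1), f1(2, 0, 1))   # 1.0 0.666... 0.8
```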
Classifier Evaluation Metrics: Example
A \ P     C     ¬C    Total
C         TP    FN    P
¬C        FP    TN    N
Total     P’    N’    All
(rows: actual class A; columns: predicted class P)
Suppose I have 10,000 emails in my mailbox, of which 300 are spam. The spam detection system flags 150 emails as spam, of which 50 are actually spam. What are the precision and recall of my spam detection system?
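Working through the numbers: the system flags 150 emails, of which TP = 50 are spam, so FP = 100; of the 300 actual spam emails, FN = 300 − 50 = 250 are missed.
Precision = TP / (TP + FP) = 50 / 150 ≈ 33.3%
Recall = TP / (TP + FN) = 50 / 300 ≈ 16.7%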
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
The given data is randomly partitioned into two independent sets: a training set (typically about 2/3 of the data) used for model construction, and a test set (the remaining 1/3) used for accuracy estimation.
Random subsampling: a variation of holdout in which the holdout is repeated k times and the overall accuracy is the average of the accuracies obtained.
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets (folds) of approximately equal size; at each iteration, one fold is used as the test set and the remaining k−1 folds as the training set (as in the sketch below).
Stratified cross-validation: the folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
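Below is a minimal sketch of plain k-fold cross-validation in Python (NumPy only; the `fit`/`predict` model interface and the function name `kfold_accuracy` are assumptions for illustration, and this version is not stratified):

```python
import numpy as np

# Sketch: k-fold cross-validation accuracy estimate (not stratified).
# `model` is assumed to expose fit(X, y) and predict(X); this interface
# is an assumption for illustration, not something the slides prescribe.
def kfold_accuracy(model, X, y, k=10, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))  # random partition
    folds = np.array_split(idx, k)                          # k mutually exclusive subsets
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        accs.append(np.mean(preds == y[test_idx]))          # accuracy on held-out fold
    return float(np.mean(accs))                             # average over the k folds
```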
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap:
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^−1 ≈ 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations:
Acc(M) = (1/k) Σ_{i=1..k} [0.632 × Acc(M_i)_test + 0.368 × Acc(M_i)_train]
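A rough sketch of one .632 bootstrap round under the same assumed `fit`/`predict` model interface (function name is illustrative):

```python
import numpy as np

# Sketch: one round of the .632 bootstrap (same assumed fit/predict interface).
def bootstrap_632_round(model, X, y, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    d = len(y)
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, d, size=d)             # sample d tuples with replacement
    test_idx = np.setdiff1d(np.arange(d), train_idx)   # tuples never drawn form the test set
    model.fit(X[train_idx], y[train_idx])
    acc_test = np.mean(model.predict(X[test_idx]) == y[test_idx])
    acc_train = np.mean(model.predict(X[train_idx]) == y[train_idx])
    return 0.632 * acc_test + 0.368 * acc_train        # weighted combination for this round

# Overall accuracy: average this quantity over k rounds with different seeds.
```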
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Suppose we have two classifiers, M1 and M2. Which one is better?
Estimating Confidence Intervals:
Null Hypothesis
Perform 10-fold cross-validation on each model.
Assume the samples follow a t-distribution with k−1 degrees of freedom (here, k = 10).
Use the t-test (Student's t-test).
Null hypothesis: M1 & M2 are the same.
If we can reject the null hypothesis, then we conclude that the difference between M1 & M2 is statistically significant, and we choose the model with the lower error rate.
Estimating Confidence Intervals: t-test
With a single set of cross-validation folds (pairwise comparison over the same k folds):
t = (err(M1) − err(M2)) / sqrt(var(M1 − M2) / k)
where var(M1 − M2) is the variance of the per-fold differences in error rate.
If two independent sets of samples are used:
t = (err(M1) − err(M2)) / sqrt(var(M1)/k1 + var(M2)/k2)
where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp., and err(·) and var(·) denote the mean and variance of the observed error rates.
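A small sketch of the paired-fold t statistic in Python (NumPy; the per-fold error rates shown are made-up illustrative numbers):

```python
import numpy as np

# Sketch: paired t statistic from per-fold error rates of two models.
def paired_t_statistic(err1, err2):
    err1, err2 = np.asarray(err1, float), np.asarray(err2, float)
    k = len(err1)                                  # number of cross-validation folds
    diff = err1 - err2                             # per-fold differences in error rate
    var_diff = np.mean((diff - diff.mean()) ** 2)  # var(M1 - M2) as in the formula above
    return diff.mean() / np.sqrt(var_diff / k)

# Made-up per-fold error rates for M1 and M2, purely for illustration.
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
err_m2 = [0.10, 0.13, 0.12, 0.11, 0.12, 0.14, 0.11, 0.13, 0.12, 0.11]
print(paired_t_statistic(err_m1, err_m2))
```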
Estimating Confidence Intervals:
Table for t-distribution
The t-distribution is symmetric; tables typically show only the upper percentage points of the distribution.
Significance level: e.g., sig = 0.05 (or 5%) means we conclude that M1 & M2 are significantly different with 95% confidence.
Confidence limit: look up z = sig/2 (for a two-sided test).
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
Compute t and select a significance level (e.g., sig = 5%).
Consult the t-distribution table: find the t value corresponding to k−1 degrees of freedom (here, 9).
The t-distribution is symmetric: typically only the upper percentage points of the distribution are shown, so look up the value for confidence limit z = sig/2 (here, 0.025).
If t > z or t < −z, then the t value lies in the rejection region:
Reject the null hypothesis that the mean error rates of M1 & M2 are the same.
Conclude: there is a statistically significant difference between M1 & M2.
Otherwise, conclude that any difference is due to chance.
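A sketch of the rejection-region check, assuming SciPy is available for the t-distribution percentage point (`scipy.stats.t.ppf`); the helper name and example t values are illustrative:

```python
from scipy import stats

# Sketch: two-sided rejection-region check at significance level sig,
# with k - 1 degrees of freedom (helper name and inputs are illustrative).
def significantly_different(t_value, k=10, sig=0.05):
    z = stats.t.ppf(1 - sig / 2, df=k - 1)   # upper critical value, ~2.262 for df = 9
    return t_value > z or t_value < -z       # True -> reject the null hypothesis

print(significantly_different(2.9))   # True: statistically significant difference
print(significantly_different(1.1))   # False: difference may be due to chance
```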
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from the
actual known value
Loss function: measures the error between y_i and the predicted value y_i'
Absolute error: |y_i − y_i'|
Squared error: (y_i − y_i')^2
Test error (generalization error): the average loss over the test set of size d
Mean absolute error: (1/d) Σ_{i=1..d} |y_i − y_i'|
Mean squared error: (1/d) Σ_{i=1..d} (y_i − y_i')^2
Relative absolute error: (Σ_{i=1..d} |y_i − y_i'|) / (Σ_{i=1..d} |y_i − ȳ|)
Relative squared error: (Σ_{i=1..d} (y_i − y_i')^2) / (Σ_{i=1..d} (y_i − ȳ)^2)
where ȳ is the mean of the actual values y_i over the test set.
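A compact sketch of these error measures with NumPy (function and array names are illustrative):

```python
import numpy as np

# Sketch: predictor error measures over a test set of d tuples.
def error_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    sq_err = (y_true - y_pred) ** 2
    y_mean = y_true.mean()                      # ȳ, the mean of the actual values
    return {
        "mean_absolute_error": abs_err.mean(),
        "mean_squared_error": sq_err.mean(),
        "relative_absolute_error": abs_err.sum() / np.abs(y_true - y_mean).sum(),
        "relative_squared_error": sq_err.sum() / ((y_true - y_mean) ** 2).sum(),
    }

print(error_measures([3.0, 5.0, 7.0], [2.5, 5.5, 6.0]))
```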