Lec07: Classification, Model Evaluation, Ensemble
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison
Model Evaluation and Selection
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
– How can we measure accuracy? Other metrics to consider?
– Use a validation/test set of class-labeled tuples, instead of the training set, when assessing accuracy
• Methods for Performance Evaluation
– How to obtain reliable estimates?
– Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Methods for Model Comparison
– How to compare the relative performance among competing models?
– Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:

                Predicted: C             Predicted: ¬C            Total
Actual: C       True Positives (TP)      False Negatives (FN)     P
Actual: ¬C      False Positives (FP)     True Negatives (TN)      N
Total           P'                       N'                       P + N
Classifier Evaluation Metrics:
Accuracy, Error Rate
Accuracy, recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN) / (TP+FP+TN+FN)
Error Rate (misclassification rate): 1 − Accuracy
Error Rate = (FP + FN) / (TP+FP+TN+FN)

                Predicted: C    Predicted: ¬C
Actual: C            TP              FN
Actual: ¬C           FP              TN
Classifier Evaluation Metrics:
Sensitivity, Specificity
Class Imbalance Problem: Main class of interest is rare.
• The data set distribution reflects a majority negative class and a minority positive class.
  – In medical data, there may be a rare class, such as "cancer."
• When there is a class imbalance problem, accuracy is NOT a good evaluation metric.
  – If only 3% of the training tuples are actually cancer, an overall accuracy of 97% is NOT informative:
    • the classifier could be correctly labeling only the non-cancer tuples and misclassifying all the cancer tuples.
• Instead, we need measures that assess how well the classifier recognizes the positive tuples (cancer = yes) and how well it recognizes the negative tuples (cancer = no):
  – Sensitivity (true positive rate) = TP / P = TP / (TP + FN)
  – Specificity (true negative rate) = TN / N = TN / (FP + TN)
Classifier Evaluation Metrics:
Precision, Recall, F-measure
Precision: a measure of exactness
– what % of tuples that the classifier labeled as positive are actually positive
Precision = TP / (TP+FP)
Recall: a measure of completeness (also called Sensitivity)
– what % of positive tuples did the classifier label as positive?
Recall = TP / (TP+FN)
• There is an inverse relationship between precision and recall.

F-measure (F1): harmonic mean of precision and recall
F-measure = (2 * Precision * Recall) / (Precision + Recall)

                Predicted: C    Predicted: ¬C
Actual: C            TP              FN
Actual: ¬C           FP              TN
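As a quick check of the definitions above, a minimal Python sketch (the counts are made-up example values, not from the slides):

```python
def binary_metrics(tp, fp, fn, tn):
    """Binary classifier evaluation metrics as defined on the slides."""
    p, n = tp + fn, fp + tn                  # actual positives / negatives
    accuracy    = (tp + tn) / (p + n)
    error_rate  = (fp + fn) / (p + n)
    sensitivity = tp / p                     # recall, true positive rate
    specificity = tn / n                     # true negative rate
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, error_rate=error_rate, sensitivity=sensitivity,
                specificity=specificity, precision=precision, recall=sensitivity, f1=f1)

# Hypothetical counts, just to exercise the formulas.
print(binary_metrics(tp=70, fp=30, fn=30, tn=870))
```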
Classifier Evaluation Metrics: Example
Actual \ Predicted     cancer = yes    cancer = no    Total
cancer = yes                90             210          300
cancer = no                140            9560         9700
Total                      230            9770        10000

Accuracy =
Error rate =
Sensitivity =
Specificity =
Precision =
Recall =
Classifier Evaluation Metrics: Example

Actual \ Predicted     cancer = yes    cancer = no    Total
cancer = yes                90             210          300
cancer = no                140            9560         9700
Total                      230            9770        10000

Accuracy    = (90 + 9560) / 10000 = 96.50%
Error rate  = (140 + 210) / 10000 = 3.50%
Sensitivity = 90 / 300   = 30.00%
Specificity = 9560 / 9700 = 98.56%
Precision   = 90 / 230   = 39.13%
Recall      = 90 / 300   = 30.00%
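To double-check the arithmetic, a sketch that rebuilds this test set from the confusion-matrix counts, assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Rebuild label vectors from the counts (order: TP, FN, FP, TN); 1 = cancer, 0 = no cancer.
y_true = np.repeat([1, 1, 0, 0], [90, 210, 140, 9560])
y_pred = np.repeat([1, 0, 1, 0], [90, 210, 140, 9560])

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))   # [[90, 210], [140, 9560]]
print("accuracy :", accuracy_score(y_true, y_pred))      # 0.965
print("precision:", precision_score(y_true, y_pred))     # ~0.391
print("recall   :", recall_score(y_true, y_pred))        # 0.30
```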
Classifier Evaluation Metrics: Confusion Matrix
with more than two classes
Confusion Matrix:
                Predicted: C1    Predicted: C2    …    Predicted: Cm    Total
Actual: C1         CM1,1            CM1,2         …       CM1,m          AC1
Actual: C2         CM2,1            CM2,2         …       CM2,m          AC2
  ⋮
Actual: Cm         CMm,1            CMm,2         …       CMm,m          ACm
Total               PC1              PC2          …        PCm        AC1+…+ACm
Classifier Evaluation Metrics: with more than two classes
Accuracy, Error Rate
                Predicted: C1    Predicted: C2    …    Predicted: Cm
Actual: C1         CM1,1            CM1,2         …       CM1,m
Actual: C2         CM2,1            CM2,2         …       CM2,m
  ⋮
Actual: Cm         CMm,1            CMm,2         …       CMm,m

Accuracy: fraction of correctly classified tuples (the diagonal of the matrix):
  Accuracy = (CM1,1 + CM2,2 + … + CMm,m) / (total number of test tuples)

Error Rate:
  Error Rate = 1 − Accuracy
Classifier Evaluation Metrics: with more than two classes
Precision, Recall
                Predicted: C1    Predicted: C2    …    Predicted: Cm
Actual: C1         CM1,1            CM1,2         …       CM1,m
Actual: C2         CM2,1            CM2,2         …       CM2,m
  ⋮
Actual: Cm         CMm,1            CMm,2         …       CMm,m

Precision and Recall values for each class:

Precision: fraction of tuples assigned to class i that are actually of class i (column i):
  Precision_Ci = CMi,i / Σj CMj,i

Recall: fraction of tuples actually of class i that are assigned to class i (row i):
  Recall_Ci = CMi,i / Σj CMi,j
Classifier Evaluation Metrics: with more than two classes
Microaveraging and Macroaveraging
• In order to derive a single metric that tells us how well the system is doing, we can
combine precision and recall values in two ways.
• In macroaveraging, compute performance for each class, and then average over
classes.
• In microaveraging, collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that matrix.
Classifier Evaluation Metrics: with more than two classes
Macroaveraging
Macroaverage: compute performance for each class, and then average over classes.
                Predicted: C1    Predicted: C2    …    Predicted: Cm
Actual: C1         CM1,1            CM1,2         …       CM1,m
Actual: C2         CM2,1            CM2,2         …       CM2,m
  ⋮
Actual: Cm         CMm,1            CMm,2         …       CMm,m

Macroaveraging:
  Precision = ( Σi Precision_Ci ) / m
  Recall    = ( Σi Recall_Ci ) / m
Classifier Evaluation Metrics: with more than two classes
Microaveraging
Microaverage: collect the decisions for all classes into a single confusion matrix, and then compute precision and recall from that matrix.
                Predicted: C1    Predicted: C2    …    Predicted: Cm
Actual: C1         CM1,1            CM1,2         …       CM1,m
Actual: C2         CM2,1            CM2,2         …       CM2,m
  ⋮
Actual: Cm         CMm,1            CMm,2         …       CMm,m
Classifier Evaluation Metrics:
with more than two classes - Example
Confusion matrix for a three-class e-mail classification task:

                  Predicted: urgent    Predicted: normal    Predicted: spam
Actual: urgent           30                    7                   3
Actual: normal           10                   40                  20
Actual: spam              5                   15                  70

Accuracy = (30 + 40 + 70) / 200 = 140 / 200 = 70%
Classifier Evaluation Metrics:
with more than two classes - Example
Per-class (one-vs.-rest) confusion matrices:

Urgent:                        Normal:                        Spam:
             yes    no                      yes    no                      yes    no
Actual yes    30    10         Actual yes    40    30         Actual yes    70    20
Actual no     15   145         Actual no     22   108         Actual no     23    87

Confusion matrix pooled over all classes:

             Predicted: yes    Predicted: no
Actual: yes        140               60
Actual: no          60              340

Microaverage Precision = 140 / 200 = 0.70
(For comparison, the per-class precisions are 30/45 ≈ 0.67, 40/62 ≈ 0.65, 70/93 ≈ 0.75, so the Macroaverage Precision ≈ 0.69.)
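The same numbers reproduced with a short NumPy sketch (per-class precision/recall plus micro- and macro-averages):

```python
import numpy as np

# Three-class e-mail confusion matrix from the slide: rows = actual, columns = predicted.
classes = ["urgent", "normal", "spam"]
cm = np.array([[30,  7,  3],
               [10, 40, 20],
               [ 5, 15, 70]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp            # column sums minus the diagonal
fn = cm.sum(axis=1) - tp            # row sums minus the diagonal

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(dict(zip(classes, np.round(precision, 3))))   # per-class precision
print(dict(zip(classes, np.round(recall, 3))))      # per-class recall

print("macro precision:", precision.mean())          # ~0.69
print("micro precision:", tp.sum() / cm.sum())       # 140/200 = 0.70
```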
Predictor Error Measures
• If a predictor returns a continuous value rather than a categorical label, it is difficult
to say exactly whether the predicted value is correct or not.
– Instead of focusing on whether the predicted value is an “exact” match with the correct
value, we look at how far off the predicted value is from the actual value.
• Error functions (yi is the actual value, yi' is the predicted value), computed over the d test tuples, for example:
  – Mean Absolute Error = ( Σi |yi − yi'| ) / d
  – Mean Squared Error  = ( Σi (yi − yi')² ) / d
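A few lines of Python (NumPy assumed) illustrating these error functions on made-up values:

```python
import numpy as np

# Hypothetical actual vs. predicted values, just to illustrate the error functions.
y      = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae  = np.mean(np.abs(y - y_pred))     # mean absolute error
mse  = np.mean((y - y_pred) ** 2)      # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
print(mae, mse, rmse)
```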
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution, Cost of misclassification, Size of training and test sets.
Evaluating Classifier Performance:
Holdout
• The purpose of evaluating classifier performance is to estimate the performance of a classifier on previously unseen data (the test set).
• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
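A minimal holdout sketch in Python, assuming scikit-learn is available (the iris data set and decision tree are just stand-ins):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # stand-in data set

# Holdout: 2/3 of the tuples for training, 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```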
Evaluating Classifier Performance:
Cross-Validation
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets (folds) D1, D2, …, Dk, each of approximately equal size
– At the i-th iteration, use Di as the test set and the other k−1 subsets together as the training set
– The accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
Evaluating Classifier Performance:
Cross-Validation
Variations of Cross-Validation:
• Leave-one-out:
  – k folds where k = the number of tuples; used for small data sets
  – i.e., if there are n tuples in the data set, one tuple is used as test data and the remaining n−1 tuples are used as training data in each iteration.
• Stratified cross-validation:
– Folds are stratified so that class distribution in each fold is approximately the
same as that in the initial data
– Example:
• Initial data contains 3000 tuples, and 600 tuples are positive (2400 tuples are negative).
• In 3-fold cross validation, each subset will randomly get 200 positive tuples and 800 negative
tuples.
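A sketch of stratified 10-fold cross-validation, assuming scikit-learn (the data set and classifier are placeholders):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)    # stand-in class-labeled data

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)
print(scores)                      # accuracy of each of the 10 folds
print("mean accuracy:", scores.mean())
```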
Evaluating Classifier Performance:
Bootstrap
• Bootstrap
– Works well with small data sets
– Tuples are randomly selected from initial data to create training data.
• Each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• There are several bootstrap methods; a common one is the .632 bootstrap:
  – A data set with d tuples is sampled d times, with replacement.
  – The data tuples that did not make it into the training set form the test set.
  – About 63.2% of the original tuples end up in the bootstrap sample (the training data), and the remaining 36.8% form the test set.
    • In each draw a tuple has probability 1/d of being selected, so the probability of never being chosen in d draws is (1 − 1/d)^d ≈ e^(−1) = 0.368.
  – Repeat the sampling procedure k times; the overall accuracy of the model is estimated as
      Acc(M) = Σ_{i=1..k} ( 0.632 · Acc(Mi)_test set + 0.368 · Acc(Mi)_train set )
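A small NumPy simulation of one bootstrap sample, confirming that roughly 63.2% of the tuples end up in the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                                   # number of tuples in the data set

# Sample d indices with replacement: the bootstrap training set.
train_idx = rng.integers(0, d, size=d)
in_train = np.unique(train_idx)
test_idx = np.setdiff1d(np.arange(d), in_train)   # tuples never drawn form the test set

print("fraction in training set:", len(in_train) / d)   # ~0.632
print("fraction in test set    :", len(test_idx) / d)   # ~0.368
```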
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison
Model Selection
using Statistical Tests of Significance
• Suppose we have two classifiers, M1 and M2. Which one is better?
  – Use, e.g., 10-fold cross-validation to obtain a mean error rate for each model.
• These mean error rates are just estimates of the error on the true population of future data cases.
  – Although the mean error rates obtained for M1 and M2 may appear different, the difference may NOT be statistically significant.
• What if the difference between the two error rates is just attributable to chance?
  – Use a test of statistical significance.
  – Obtain confidence limits for our error estimates.
Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation.
  – Each partitioning is independently drawn.
  – The averages of the 10 error rates obtained for M1 and for M2 are their mean error rates.
• Assume the samples follow a t-distribution with k−1 degrees of freedom (here, k = 10).
  – The individual error rates calculated in the cross-validation folds may be considered independent samples from a probability distribution.
  – In general, they follow a t-distribution with k−1 degrees of freedom, where k = 10.
• Use a significance test, the t-test (Student's t-test), to decide whether the difference between the two models is statistically significant.
• Null hypothesis: M1 and M2 are the same.
• If we can reject the null hypothesis, then
  – conclude that the difference between M1 and M2 is statistically significant, and
  – choose the model with the lower error rate.
Estimating Confidence Intervals: t-test
• If only one test set is available: pairwise comparison
  – For the i-th round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)_i and err(M2)_i.
  – Average over the 10 rounds to get the mean error rates err(M1) and err(M2).
  – The t-test computes the t-statistic with k−1 degrees of freedom (here k = 10):

      t = ( err(M1) − err(M2) ) / sqrt( var(M1 − M2) / k )

      var(M1 − M2) = (1/k) · Σ_{i=1..k} [ ( err(M1)_i − err(M2)_i ) − ( err(M1) − err(M2) ) ]²
Estimating Confidence Intervals:
Statistical Significance
• Are M1 and M2 significantly different?
  – Compute t.
  – Select a significance level (e.g., sig = 5%).
  – Consult the table for the t-distribution:
    • Find the row for k−1 degrees of freedom (here, 9).
    • The t-distribution is symmetric, so typically only the upper percentage points are shown: look up the critical value z for sig/2 (here, 0.025).
  – If t > z or t < −z, then the t-value lies in the rejection region:
    • Reject the null hypothesis that the mean error rates of M1 and M2 are the same.
    • Conclude: there is a statistically significant difference between M1 and M2.
  – Otherwise, conclude that any difference is due to chance.
Estimating Confidence Intervals:
Table for t-distribution
• The significance level, e.g., sig = 0.05 or 5%, means M1 and M2 are significantly different for 95% of the population.
• The confidence limit z is read from the t-table at sig/2 (two-tailed test), e.g., z = t_{0.025} for sig = 5%.
Estimating Confidence Intervals:
Statistical Significance - Example
• Results of the 10-fold cross-validations (per-fold error rates of M1 and M2):

  var(M1 − M2) = (1/k) · Σ_{i=1..k} [ ( err(M1)_i − err(M2)_i ) − ( err(M1) − err(M2) ) ]²

  t = ( err(M1) − err(M2) ) / sqrt( var(M1 − M2) / k )

• t ≯ z (0.154 ≯ 2.262), so we cannot reject the null hypothesis: there is no statistically significant difference between M1 and M2.
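A sketch of this paired t-test in Python (SciPy assumed); the per-fold error rates below are made-up placeholders, since the slide's table of values is not reproduced here:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold error rates for M1 and M2 (10-fold CV, same partitioning).
err_m1 = np.array([0.12, 0.10, 0.15, 0.11, 0.14, 0.09, 0.13, 0.12, 0.10, 0.14])
err_m2 = np.array([0.13, 0.11, 0.14, 0.12, 0.12, 0.10, 0.14, 0.11, 0.11, 0.13])

d = err_m1 - err_m2
k = len(d)
var_d = np.mean((d - d.mean()) ** 2)           # variance exactly as defined on the slide (1/k)
t = d.mean() / np.sqrt(var_d / k)
print("t statistic:", t)

# Critical value for a two-tailed test at significance 5%, k-1 = 9 degrees of freedom.
z = stats.t.ppf(1 - 0.025, df=k - 1)           # ~2.262
print("reject H0?", abs(t) > z)
# (scipy.stats.ttest_rel gives the textbook-standard paired t-test with the 1/(k-1) variance.)
```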
Confidence Limits
• Confidence limits for the standard normal distribution (mean 0, variance 1):

  1 − α :   0.99    0.98    0.95    0.90    0.80
  Z_{α/2}:  2.58    2.33    1.96    1.65    1.28
Confidence Interval for Accuracy
Standard normal distribution of the accuracy
• For large test sets (N > 100), the accuracy acc has an approximately normal distribution with mean p (the true accuracy) and variance p(1 − p)/N.
• Transformed value of the accuracy:
    (acc − p) / sqrt( p(1 − p)/N )
  has the standard normal distribution with mean 0 and variance 1.
• Confidence limits for significance level α:
    P( −Z_{α/2} < (acc − p) / sqrt( p(1 − p)/N ) < Z_{α/2} ) = 1 − α
Confidence Interval for Accuracy - Example
• Consider a model that produces an accuracy of 80% when evaluated on 100 test
instances:
– N = 100, accuracy acc = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96 (significance α = 5%)
– Solve the inequality above for p to obtain the confidence interval for the true accuracy.
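One way to obtain the interval is to solve the boundary equation (acc − p)² = Z²·p(1 − p)/N for p; a short NumPy sketch of that approach:

```python
import numpy as np

acc, N, z = 0.80, 100, 1.96      # accuracy, test-set size, Z_{alpha/2} for 95% confidence

# Boundary of the interval: (acc - p)^2 = z^2 * p(1-p)/N.  Solve the quadratic in p:
# p^2 (1 + z^2/N) - p (2*acc + z^2/N) + acc^2 = 0
a = 1 + z**2 / N
b = -(2 * acc + z**2 / N)
c = acc**2
lower, upper = sorted(np.roots([a, b, c]).real)
print(f"95% confidence interval for the true accuracy: [{lower:.3f}, {upper:.3f}]")
# -> roughly [0.711, 0.867]
```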
Comparing Performance of 2 Models
• Given two models, say, M1 and M2, which one is better?
– M1 is tested on data set D1 (size=n1), found error rate = e1 .
– M2 is tested on data set D2 (size=n2), found error rate = e2 .
– Assume D1 and D2 are independent.
– If n1 and n2 are sufficiently large, then e1 and e2 are approximately normally distributed:
    e1 ∼ NormalDist(μ1, σ1)
    e2 ∼ NormalDist(μ2, σ2)
– Approximate variance of an error rate:
    σi² ≈ ei · (1 − ei) / ni
Comparing Performance of 2 Models
• To test whether the performance difference is significant, let d = e1 − e2:
  – d ∼ NormalDist(dt, σt), where dt is the true difference
  – Since D1 and D2 are independent, their variances add up:
      σt² ≈ σ1² + σ2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
  – At the (1 − α) confidence level, dt = d ± Z_{α/2} · σt
Comparing Performance of 2 Models - Example
• Given: M1: n1=30 e1=0.15
M2: n2=5000 e2=0.25
• d = |e2 − e1| = 0.1
• σt² = 0.15·(1 − 0.15)/30 + 0.25·(1 − 0.25)/5000 = 0.0043
• dt = 0.1 ± 1.96 · sqrt(0.0043) = 0.1 ± 0.128
• The interval contains 0, so the observed difference is not statistically significant at the 95% confidence level.
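The same computation as a quick Python check (values taken from the slide):

```python
import math

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25
z = 1.96                                    # Z_{alpha/2} for 95% confidence

d = abs(e2 - e1)
var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
margin = z * math.sqrt(var_d)
print(f"d = {d:.3f}, variance = {var_d:.4f}, interval = [{d - margin:.3f}, {d + margin:.3f}]")
# The interval contains 0 -> the observed difference is not statistically significant.
```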
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics)
• ROC curves: for visual comparison of classification models
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• ROC curve characterizes the trade-off between positive hits and false alarms
• ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
• The performance of each classifier is represented as a point (TPR, FPR) on the ROC plot:
  – (0, 0): declare everything to be the negative class
  – (1, 1): declare everything to be the positive class
  – (1, 0): ideal (every positive found, no false alarms)
Using ROC for Model Comparison
• No model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR
How to Construct an ROC curve
• Sort the test records by the classifier's output score (e.g., the predicted probability of the positive class) and sweep a threshold over the scores; at each threshold, count TP, FP, TN, FN and compute
    TPR = TP / (TP + FN)
    FPR = FP / (FP + TN)
• Plot the resulting (FPR, TPR) points (FPR on the x-axis, TPR on the y-axis).
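A minimal sketch of this construction with NumPy (the scores and labels are made-up):

```python
import numpy as np

# Hypothetical positive-class scores and true labels for 10 test records.
scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1,    1,    0,    1,    1,    0,    1,    0,    0,    0])

P, N = labels.sum(), (1 - labels).sum()
order = np.argsort(-scores)                 # sort by decreasing score

tpr, fpr = [0.0], [0.0]
tp = fp = 0
for i in order:                             # lower the threshold one record at a time
    if labels[i] == 1:
        tp += 1
    else:
        fp += 1
    tpr.append(tp / P)
    fpr.append(fp / N)

print(list(zip(fpr, tpr)))                  # points of the ROC curve
```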
Ensemble Methods: Increasing the Accuracy
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an
improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict class label of previously unseen records by aggregating predictions made by
multiple classifiers
Why does Ensemble Classifier work?
• Suppose there are 25 base classifiers
  – Each classifier has an error rate ε = 0.35
  – Assume the classifiers are independent
  – The majority-vote ensemble is wrong only when at least 13 of the 25 base classifiers are wrong, so the probability that the ensemble makes a wrong prediction is

      P(wrong) = Σ_{i=13..25} C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
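Checking the 0.06 figure with a few lines of Python:

```python
from math import comb

eps, n = 0.35, 25
# The majority vote is wrong when at least 13 of the 25 base classifiers are wrong.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06
```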
Methods for Constructing an Ensemble Classifier
• The ensemble of classifiers can be constructed in many ways, e.g., by manipulating the training set (bagging, boosting), the input features (random forests), the class labels, or the learning algorithm itself.
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most
votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
  – Often significantly better than a single classifier derived from D
  – For noisy data: not considerably worse, and more robust
  – Proved to give improved accuracy for prediction
Bagging
• Bagging is a technique that repeatedly samples (with replacement) from a data set
according to a uniform probability distribution.
• Each bootstrap sample has the same size as the original data.
– Because each instance has a probability of 1 − (1 − 1/N)^N ≈ 0.632 of being selected, each bootstrap sample Di contains approximately 63% of the original training data.
– After training the k classifiers, a test instance is assigned to the class that receives
the highest number of votes.
• Bagging improves generalization error by reducing the variance of the base
classifiers.
– The performance of bagging depends on the stability of the base classifier.
– If a base classifier is unstable, bagging helps to reduce the errors associated with
random fluctuations in the training data.
– If a base classifier is stable, bagging may not be able to improve the performance
of the base classifiers.
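A bagging sketch with scikit-learn (decision-tree base classifiers; the data set is a placeholder):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
bag = BaggingClassifier(tree, n_estimators=50, random_state=0)   # 50 bootstrap samples

print("single tree :", cross_val_score(tree, X, y, cv=10).mean())
print("bagged trees:", cross_val_score(bag, X, y, cv=10).mean())
```

Unstable base classifiers such as unpruned decision trees are exactly the case where this variance reduction tends to pay off.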
Bagging - Example
• We build decision-tree classifiers with a single inner node (decision stumps) for the following training data.
• Any decision tree with a single inner node can reach at most 70% accuracy on this training set:
  – a split at attribute value > 35 produces 70% accuracy
  – a split at attribute value > 75 produces 70% accuracy
Boosting
• Analogy: consult several doctors and combine their weighted diagnoses, where each weight is based on that doctor's previous diagnosis accuracy.
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data
Adaboost
• In the AdaBoost algorithm, the importance of a base classifier Ci depends on its error rate εi, the weighted fraction of training records that Ci misclassifies:
    εi = Σ_{j=1..N} wj · δ( Ci(xj) ≠ yj )      (record weights wj normalized so that Σj wj = 1;
                                                δ(·) is 1 if Ci misclassifies xj and 0 otherwise)
• The importance (vote weight) of Ci is then αi = ½ · ln( (1 − εi) / εi ): classifiers with lower error get a larger vote.
Adaboost
• Given a set of N class-labeled tuples, (X1, y1), …, (XN, yN)
• Initially, all the weights of tuples are set the same (1/N)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased.
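An AdaBoost sketch with scikit-learn (its default base learner is a depth-1 decision stump; the data set is a placeholder):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# k = 50 boosting rounds; each round reweights the training tuples that the
# previous classifier misclassified.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print("AdaBoost 10-fold CV accuracy:", cross_val_score(boost, X, y, cv=10).mean())
```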
Random Forest
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated using
a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and
faster than bagging or boosting
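A random forest sketch with scikit-learn; max_features controls how many randomly chosen attributes are considered at each split (Forest-RI style). The data set is a placeholder:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# 100 trees; "sqrt" considers sqrt(#attributes) randomly chosen attributes per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("Random forest 10-fold CV accuracy:", cross_val_score(rf, X, y, cv=10).mean())
```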
Empirical Comparison among Ensemble Methods
Summary
• Classification is a form of data analysis that extracts models describing important
data classes.
• Effective and scalable methods have been developed for decision tree induction,
Naive Bayesian classification, rule-based classification, and many other
classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F
measure, and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.
Summary
• Bagging and boosting can be used to increase overall accuracy by learning and
combining a series of individual models.
• Significance tests are useful for model selection.
• There have been numerous comparisons of the different classification methods;
– The matter remains a research topic
– No single method has been found to be superior over all others for all data sets
• Issues such as accuracy, training time, robustness, scalability, and interpretability
must be considered and can involve trade-offs, further complicating the quest for an
overall superior method