0% found this document useful (0 votes)

6 views

Lec07 Classification ModelEvaluation Ensemble

The document discusses model evaluation and selection in data mining, focusing on performance metrics, evaluation methods, and model comparison techniques. It covers various metrics such as accuracy, precision, recall, and confusion matrices, along with methods like holdout and cross-validation for assessing classifier performance. Additionally, it addresses challenges like class imbalance and provides examples for better understanding of these concepts.

Uploaded by

ozcan8479

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views

Lec07 Classification ModelEvaluation Ensemble

Uploaded by

ozcan8479

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 62

Classification

• Model Evaluation and Selection

• Ensemble Methods: Increasing the Accuracy

Data Mining 1
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Machine Learning 2
Model Evaluation and Selection
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
– How can we measure accuracy? Other metrics to consider?
– Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
• Methods for Performance Evaluation
– How to obtain reliable estimates?
– Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Methods for Model Comparison
– How to compare the relative performance among competing models?
– Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves

Data Mining 3
Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:
Actual class\
C ¬C Total
Predicted class
True Positives False Negatives
C P
(TP) (FN)
False Positives True Negatives
¬C N
(FP) (TN)
Total P’ N’ P+N

True Positives (TP): Positive tuples correctly labeled by classifier.

True Negatives (TN): Negative tuples correctly labeled by classifier.
False Positives (FP): Negative tuples incorrectly labeled as positive by classifier.
False Negatives (FN): Positive tuples incorrectly labeled as negative by classifier.

Data Mining 4
Classifier Evaluation Metrics:
Confusion Matrix
Actual class\
C ¬C Total
Predicted class
True Positives False Negatives
C P
(TP) (FN)
False Positives True Negatives
¬C N
(FP) (TN)

Total P’ N’ P+N

Example of Confusion Matrix:

Actual class\Predicted class buy_computer = yes buy_computer = no Total

buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

Data Mining 5
Classifier Evaluation Metrics:
Accuracy, Error Rate
Accuracy, recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN) / (TP+FP+TN+FN)
Actual\
C ¬C
Predicted
Error Rate, misclassification rate: 1 – Accuracy C TP FN
Error Rate = (FP + FN) / (TP+FP+TN+FN) ¬C FP TN

Data Mining 6
Classifier Evaluation Metrics:
Sensitivity, Specificity
Class Imbalance Problem: Main class of interest is rare.
• Data set distribution reflects: majority  negative class and minority  positive class
– In medical data, there may be a rare class, such as “cancer.”
• If there is a class imbalance problem, Accuracy is NOT a good evaluation metric.
– If only 3% of training tuples are actually cancer, 97% accuracy is NOT acceptable.
• the classifier could be correctly labeling only the noncancer tuples, for instance, and
misclassifying all the cancer tuples.
• Instead, we need other measures, which access how well the classifier can recognize the positive
tuples(cancer=yes) and how well it can recognize the negative tuple(cancer=no).

Sensitivity: True Positive recognition rate, Recall

Actual\
Sensitivity = TP / (TP+FN) C ¬C
Predicted
Specificity: True Negative recognition rate C TP FN
Specificity = TN / (TN+FP) ¬C FP TN

Data Mining 7
Classifier Evaluation Metrics:
Precision, Recall, F-measure
Precision: a measure of exactness
– what % of tuples that the classifier labeled as positive are actually positive
Precision = TP / (TP+FP)
Recall: a measure of completeness, Sensivity
– what % of positive tuples did the classifier label as positive?
Recall = TP / (TP+FN)
Actual\
C ¬C
• Inverse relationship between precision & recall Predicted
C TP FN
F-measure, F1: harmonic mean of precision and recall,
F-measure = (2*Precision*Recall) / (Precision+Recall) ¬C FP TN

F𝜷: weighted measure of precision and recall

– assigns 𝛽 times as much weight to recall as to precision
F𝜷 = (1+ 𝛃2)*Precision*Recall) / (𝛃2 *Precision + Recall)

Data Mining 8
Classifier Evaluation Metrics: Example
Actual Class\Predicted Class cancer = yes cancer = no Total
cancer = yes 90 210 300
cancer = no 140 9560 9700
Total 230 9770 10000

Accuracy =
Error rate =
Sensivity =
Specificity =

Precision =
Recall

Data Mining 9
Classifier Evaluation Metrics: Example
Actual Class\Predicted Class cancer = yes cancer = no Total
cancer = yes 90 210 300
cancer = no 140 9560 9700
Total 230 9770 10000

Accuracy = (90+9560)/10000 = 96.50%

Error rate = 1 – Accuracy = 3.50% = (140+210)/10000
Sensivity = 90 /300 = 30.00%
Specificity = 9560/9700 = 98.56%

Precision = 90/230 = 39.13%

Recall = 90/300 = 30.00%

Data Mining 10
Classifier Evaluation Metrics: Confusion Matrix
with more than two classes
Confusion Matrix:
Actual class\
C1 C2 … Cm Total
Predicted class
C1 CM1,1 CM1,2 … CM1,m AC1
C2 CM2,1 CM2,2 … CM2,m AC2
⋮
Cm CMm,1 CMm,2 … CMm,m ACm
Total PC1 PC2 PCm AC1+…+ACm

• Given m classes, an entry CMi,j in a confusion matrix indicates # of tuples in class i

that were labeled by the classifier as class j.
• For a classifier to have good accuracy, ideally most of the tuples would be represented
along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with
the rest of the entries being zero or close to zero.

Data Mining 11
Classifier Evaluation Metrics: with more than two classes
Accuracy, Error Rate
Actual \
C1 C2 … Cm
Predicted
C1 CM1,1 CM1,2 … CM1,m
C2 CM2,1 CM2,2 … CM2,m
⋮
Cm CMm,1 CMm,2 … CMm,m

Accuracy: Fraction of documents classified correctly

σ𝐢 𝐂𝐌𝐢,𝐢
Accuracy =
σ𝐣 σ𝐢 𝐂𝐌𝐢,𝐣

Error Rate:
Error Rate = 1 – Accuracy

Data Mining 12
Classifier Evaluation Metrics: with more than two classes
Precision, Recall
Actual \
C1 C2 … Cm
Predicted
C1 CM1,1 CM1,2 … CM1,m
C2 CM2,1 CM2,2 … CM2,m
⋮
Cm CMm,1 CMm,2 … CMm,m
Precision and Recall values for each class:

Precision: Fraction of tuples assigned class i that are actually about class i:
𝐂𝐌𝐢,𝐢
𝐏𝐫𝐞𝐜𝐢𝐬𝐢𝐨𝐧𝐂𝐢 =
σ𝐣 𝐂𝐌𝐣,𝐢

Recall: Fraction of tuples in class i classified correctly:

𝐂𝐌𝐢,𝐢
𝐑𝐞𝐜𝐚𝐥𝐥𝐂𝐢 =
σ𝐣 𝐂𝐌𝐢,𝐣

Data Mining 13
Classifier Evaluation Metrics: with more than two classes
Microaveraging and Macroaveraging
• In order to derive a single metric that tells us how well the system is doing, we can
combine precision and recall values in two ways.

• In macroaveraging, compute performance for each class, and then average over
classes.
• In microaveraging, collect decisions for all classes into a single confusion matrix,
and then compute precision from that matrix.

Data Mining 14
Classifier Evaluation Metrics: with more than two classes
Macroaveraging
Macroaverage: compute performance for each class, and then average over classes.

Actual \
C1 C2 … Cm
Predicted
C1 CM1,1 CM1,2 … CM1,m
C2 CM2,1 CM2,2 … CM2,m
⋮
Cm CMm,1 CMm,2 … CMm,m

Macroaveraging:

σ𝒊 𝐏𝐫𝐞𝐜𝐢𝐬𝐢𝐨𝐧𝐂𝐢 σ𝒊 𝐑𝐞𝐜𝐚𝐥𝐥𝐂𝐢
𝐏𝐫𝐞𝐜𝐢𝐬𝐢𝐨𝐧 = 𝐑𝐞𝐜𝐚𝐥𝐥 =
𝐦 𝐦

Data Mining 15
Classifier Evaluation Metrics: with more than two classes
Microaveraging
Microaverage: collect decisions for all classes into a single confusion matrix, and then
compute precision from that matrix.
Actual \
C1 C2 … Cm
Predicted
C1 CM1,1 CM1,2 … CM1,m
C2 CM2,1 CM2,2 … CM2,m
⋮
Cm CMm,1 CMm,2 … CMm,m

Confusion Matrix for Ci:

Actual \ Confusion Matrix for all classes:
yes no
Predicted Actual \
yes no
FNi= Predicted
yes 𝐓𝐏𝐢 = 𝐂𝐌𝐢,𝐢
σ𝐣 𝐂𝐌𝐢,𝐣 − 𝐂𝐌𝐢,𝐢
yes 𝐓𝐏 = σ𝐢 𝐓𝐏𝐢 𝐅𝐍 = σ𝐢 𝐅𝐍𝐢
FPi= TNi = σ𝐣 σ𝐢 𝐂𝐌𝐢,𝐣 −
no σ𝐣 𝐂𝐌𝐣,𝐢 − 𝐂𝐌𝐢,𝐢 no 𝐅𝐏 = σ𝐢 𝐅𝐏𝐢 𝐓𝐍 = σ𝐢 𝐓𝐍𝐢
𝐅𝐏𝐢 − 𝐅𝐍𝐢 + 𝐂𝐌𝐢,𝐢

Data Mining 16
Classifier Evaluation Metrics:
with more than two classes - Example
Confusion matrix for a three-class Actual \
urgent normal spam
classification for e-mail classification task Predicted
urgent 30 7 3
Accuracy = 140/200 = 70% normal 10 40 20
spam 5 15 70

Confusion Matrix Confusion Matrix Confusion Matrix Confusion Matrix

for urgent: for normal: for spam: for all classes:
Actual \ Actual \ Actual \ Actual \
yes no yes no yes no yes no
Predicted Predicted Predicted Predicted
yes 𝟑𝟎 10 yes 𝟒𝟎 30 yes 𝟕𝟎 20 yes 𝟏𝟒𝟎 60
no 15 145 no 22 108 no 23 87 no 60 340

Data Mining 17
Classifier Evaluation Metrics:
with more than two classes - Example
Confusion Matrix Confusion Matrix Confusion Matrix
for urgent: for normal: for spam:
Actual \ Actual \ Actual \
yes no yes no yes no
Predicted Predicted Predicted
yes 𝟑𝟎 10 yes 𝟒𝟎 30 yes 𝟕𝟎 20
no 15 145 no 22 108 no 23 87

Precisionurgent=30/45=.67 Precisionnormal=40/62=.65 Precisionspam=70/93=.75

Macroaverage Precision = (.67+.65+.75) / 3 = .69

Confusion Matrix
for all classes:
Actual \
yes no
Predicted
Microaverage Precision = 140 / 200 = .70
yes 𝟏𝟒𝟎 60
no 60 340
Data Mining 18
Predictor Error Measures
• If a predictor returns a continuous value rather than a categorical label, it is difficult
to say exactly whether the predicted value is correct or not.
– Instead of focusing on whether the predicted value is an “exact” match with the correct
value, we look at how far off the predicted value is from the actual value.
• Error Functions: yi is the actualvalue, yi’ is the predicted value

• The mean squared error exaggerates

the presence of outliers, while the
mean absolute error does not.

• If we were to take the square root of

the mean squared error, the resulting
error measure is called the root
mean squared error.

Data Mining 19
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Machine Learning 20
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution, Cost of misclassification, Size of training and test sets.

shows how learning curve accuracy

changes with varying sample size

Effect of small sample size:

- Bias in the estimate
- Variance of estimate

Data Mining 21
Evaluating Classifier Performance:
Holdout
• The purpose of Evaluating Classifier Performance is to estimate the performance of a
classifier on previously unseen data (test set)

• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation

• Random sampling: a variation of holdout

– Repeat holdout k times,
– accuracy = avg. of the accuracies obtained

Data Mining 22
Evaluating Classifier Performance:
Cross-Validation
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and other k-1 subsets as training set
– The accuracy estimate is the overall number of correct classifications from the
k iterations, divided by the total number of tuples in the initial data.

• Example: 3-fold cross-validation

Data Mining 23
Evaluating Classifier Performance:
Cross-Validation
Variations of Cross-Validation:

• Leave-one-out:
– k folds where k = # of tuples, for small sized data
– i.e. If there are n tuples in data set, one tuple is used as test data and the rest n-1
tuples are used as training data in each iteration.

• Stratified cross-validation:
– Folds are stratified so that class distribution in each fold is approximately the
same as that in the initial data
– Example:
• Initial data contains 3000 tuples, and 600 tuples are positive (2400 tuples are negative).
• In 3-fold cross validation, each subset will randomly get 200 positive tuples and 800 negative
tuples.

Data Mining 24
Evaluating Classifier Performance:
Bootstrap
• Bootstrap
– Works well with small data sets
– Tuples are randomly selected from initial data to create training data.
• Each time a tuple is selected, it is equally likely to be selected again and re-added to training set

• There are several bootstrap methods, and a common one is .632 Boostrap
– A data set with d tuples is sampled d times, with replacement.
– The data tuples that did not make it into the training set end up forming the test set.
– About 63.2% of the original data end up in the bootstrap (training data), and the remaining
36.8% form the test set.
• Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is
1-1/d. Since (1 – 1/d)d ≈ e-1 = 0.368)
– Repeat the sampling procedure k times, overall accuracy of the model:

Data Mining 25
Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Machine Learning 26
Model Selection
using Statistical Tests of Significance
• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain mean error rates of models M1 and M2 :

• These mean error rates are just estimates of error on the true population of future data
cases
– Although the mean error rates obtained for M1 and M2 may appear different, that difference
may NOT be statistically significant.

• What if the difference between the 2 error rates is just attributed to chance?
– Use a test of statistical significance
– Obtain confidence limits for our error estimates

Data Mining 27
Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation.
– Each partitioning is independently drawn.
– Average 10 error rates obtained for M1 and M2, are their mean error rates.
• Assume samples follow a t-distribution with k–1 degrees of freedom (where k=10)
– Individual error rates calculated in cross-validations may be considered as independent
samples from a probability distribution.
– In general, they follow a t-distribution with k-1 degrees of freedom where k=10.
• Use significance test t-test (or Student’s t-test) to see the difference between two
models is statistically significant or not.
• Null Hypothesis: M1 & M2 are the same
• If we can Reject null hypothesis, then
– Conclude that the difference between M1 & M2 is statistically significant
– Chose model with lower error rate

Data Mining 28
Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
– For ith round of 10-fold cross-validation, the same cross partitioning is used to
obtain err(M1)i and err(M2)i
– Average over 10 rounds to get 𝐞𝐫𝐫(𝐌𝟏 ) and 𝐞𝐫𝐫(𝐌𝟐 )
– t-test computes t-statistic with k-1 degrees of freedom: (k is 10 here)

𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 )
𝐭 =
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 /𝐤

𝐤
𝟏
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 = ෍[ 𝐞𝐫𝐫(𝐌𝟏 )𝐢 −𝐞𝐫𝐫(𝐌𝟐 )𝐢 − 𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 ) ]𝟐
𝐤
𝐢=𝟏

Data Mining 29
Estimating Confidence Intervals:
Statistical Significance
• Are M1 & M2 significantly different?
– Compute t.
– Select significance level (e.g. significance = 5%)
– Consult table for t-distribution:
• Find t value corresponding to k-1 degrees of freedom (here, 9)
• t-distribution is symmetric: typically upper % points of distribution shown
 look up the value for confidence limit z with significance /2 (here, 0.025)
in table for t-distribution
– If (t > z or t < -z), then t value lies in rejection region:
• Reject null hypothesis that mean error rates of M1 & M2 are same
– Conclude: statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is chance

Data Mining 30
Estimating Confidence Intervals:
Table for t-distribution
• Significance level, e.g., significance = 0.05 or 5% means M1 & M2 are significantly
different for 95% of population
• Confidence limit z = significance / 2

Data Mining 31
Estimating Confidence Intervals:
Statistical Significance - Example
• Results of 10-fold cross validations
𝐤
𝟏
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 = ෍[ 𝐞𝐫𝐫(𝐌𝟏 )𝐢 −𝐞𝐫𝐫(𝐌𝟐 )𝐢 − 𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 ) ]𝟐
𝐤
𝐢=𝟏
𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 )
𝐭 =
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 /𝐤

• significance=5%, M1 & M2 are significantly different

for 95% of population

• value for confidence limit z with significance/2

from table for t-distribution
z = 2.262

t > z (13 > 2.262)

 statistically significant difference between M1 & M2

Data Mining 32
Estimating Confidence Intervals:
Statistical Significance - Example
• Results of 10-fold cross validations
𝐤
𝟏
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 = ෍[ 𝐞𝐫𝐫(𝐌𝟏 )𝐢 −𝐞𝐫𝐫(𝐌𝟐 )𝐢 − 𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 ) ]𝟐
𝐤
𝐢=𝟏
𝐞𝐫𝐫(𝐌𝟏 ) − 𝐞𝐫𝐫(𝐌𝟐 )
𝐭 =
𝐯𝐚𝐫 𝐌𝟏 − 𝐌𝟐 /𝐤

• significance=5%, M1 & M2 are significantly different

for 95% of population

• value for confidence limit z with significance/2

from table for t-distribution
z = 2.262

t ≯ z (0.154 ≯ 2.262)
 no difference between M1 & M2

Data Mining 33
Confidence Limits
• Confidence limits for the normal distribution with 0 mean and a variance of 1:

• Thus, a confidence limit for 10% significance (i.e. significance/2 = 5%)

P(-1.65  X  1.65) = 90%
– This means that 90% of the population in this range.

Data Mining 34
Confidence Interval for Accuracy
standard normal distribution of accuracy
• For large data sets (N>100), the accuracy has a normal distribution with mean p
and variance p(1-p)/N.

𝐚𝐜𝐜
• Transformed value for the accuracy acc:
– in order to have the normal distribution 𝐩(𝟏 − 𝐩)/𝐍
with 0 mean and a variance of 1:

𝐚𝐜𝐜
• Confidence limits for significance : 𝐏(−𝐙𝛂/𝟐 < < 𝐙𝛂/𝟐 ) = 1- 
𝐩(𝟏−𝐩)/𝐍

• Solving the equation above for p yields Confidence Interval for p:

𝟐∗𝐍∗𝐚𝐜𝐜 + 𝐙 𝟐 ± 𝐙 𝟐 + 𝟒∗𝐍∗𝐚𝐜𝐜 −𝟒∗𝐍∗𝐚𝐜𝐜 𝟐

𝐩= where Z is 𝐙𝛂/𝟐
𝟐∗(𝐍+𝐙𝟐 )

Data Mining 35
Confidence Interval for Accuracy - Example
• Consider a model that produces an accuracy of 80% when evaluated on 100 test
instances:
– N=100, accuracy acc=0.8
– Let 1-=0.95 (95% confidence)
– From probability table 𝐙𝛂/𝟐 =1.96 (where significance  = 5%)

N 100 500 1000 5000

Mean p lower 0.747 0.779 0.785 0.794
Mean p upper 0.830 0.817 0.812 0.806

Data Mining 36
Comparing Performance of 2 Models
• Given two models, say, M1 and M2, which one is better?
– M1 is tested on data set D1 (size=n1), found error rate = e1 .
– M2 is tested on data set D2 (size=n2), found error rate = e2 .
– Assume D1 and D2 are independent.
– If n1 and n2 are sufficiently large, then

𝐞𝟏 ∼ 𝐍𝐨𝐫𝐦𝐚𝐥𝐃𝐢𝐬𝐭(𝛍𝟏 , 𝛔𝟏 )
𝐞𝟐 ∼ 𝐍𝐨𝐫𝐦𝐚𝐥𝐃𝐢𝐬𝐭(𝛍𝟐 , 𝛔𝟐 )

𝐞𝐢 (𝟏 − 𝐞𝐢 )
– Approximate variance: ෝ𝟐𝐢
𝛔 =
𝐧𝐢

Data Mining 37
Comparing Performance of 2 Models
• To test if performance difference is significant: d = e1 - e2
– 𝐝 ∼ 𝐍𝐨𝐫𝐦𝐚𝐥𝐃𝐢𝐬𝐭(𝐝𝐭 , 𝛔𝐭 ) where dt is the true difference
– Since D1 and D2 are independent, their variance adds up:

𝛔𝟐𝐭 = 𝛔𝟐𝟏 + 𝛔𝟐𝟐 ≅ 𝛔

ෝ𝟐𝟏 + 𝛔
ෝ𝟐𝟐
𝐞𝟏 (𝟏−𝐞𝟏 ) 𝐞𝟐 (𝟏−𝐞𝟐 )
ෝ𝟐𝐭
𝛔 = +
𝐧𝟏 𝐧𝟐

– At (1-) confidence level, true difference range: 𝐝𝐭 = 𝐝 ± 𝐙𝛂/𝟐 ෝ

𝛔𝐭

Data Mining 38
Comparing Performance of 2 Models - Example
• Given: M1: n1=30 e1=0.15
M2: n2=5000 e2=0.25

• d = |e2-e1| = 0.1

𝟎.𝟏𝟓(𝟏−𝟎.𝟏𝟓) 𝟎.𝟐𝟓(𝟏−𝟎.𝟐𝟓)
ෝ𝟐𝐭 =
𝛔 + = 𝟎. 𝟎𝟎𝟒𝟑
𝟑𝟎 𝟓𝟎𝟎𝟎

• At 95% confidence level: 𝐙𝛂/𝟐 =1.96

𝐝𝐭 = 𝟎. 𝟏 ± 𝟏. 𝟗𝟔 ∗ 𝟎. 𝟎𝟎𝟒𝟑 = 𝟎. 𝟏 ± 𝟎. 𝟏𝟐𝟖

 Since interval contains 0, difference may NOT be statistically significant.

Data Mining 39
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics)
• ROC curves: for visual comparison of classification models
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• ROC curve characterizes the trade-off between positive hits and false alarms
• ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve

(TP,FP):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal

• Diagonal line: Random guessing

Data Mining 40
Using ROC for Model Comparison

 No model consistently
outperform the other
 M1 is better for small FPR
 M2 is better for large FPR

 Area Under the ROC curve

 Ideal: Area = 1
 Random guess: Area = 0.5

Data Mining 41
How to Construct an ROC curve

• Use classifier that produces

posterior probability for each
test instance P(+|A)
• Sort the instances according to
P(+|A) in decreasing order
• Apply threshold at each unique
value of P(+|A)
• Count the number of TP, FP,
TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)

Data Mining 42
How to Construct an ROC curve

TPR = TP/(TP+FN)
FPR = FP/(FP+TN)
TPR

FPR Data Mining 43

Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of
classification rules

Data Mining 44
Ensemble Methods: Increasing the Accuracy

Machine Learning 45
Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an
improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers

Data Mining 46
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict class label of previously unseen records by aggregating predictions made by
multiple classifiers

Data Mining 47
Why does Ensemble Classifier work?
• Suppose there are 25 base classifiers
– Each classifier has error rate,  = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction:

25
 25 i
  
 i 
i 13 
 (1   ) 25i
 0.06


Data Mining 48
Methods for Constructing an Ensemble Classifier
• The ensemble of classifiers can be constructed in many ways:

• By manipulating the training set.

– Multiple training sets are created by resampling the original data.
– A classifier is then built from each training set using a particular learning
algorithm.
– Bagging and boosting are two examples of ensemble methods that manipulate
their training sets

• By manipulating the input features.

– A subset of input features is chosen to form each training set.
– This approach works well with data sets that contain highly redundant features.
– Random forest is an ensemble method that manipulates its input features and
uses decision trees as its base classifiers.

Data Mining 49
Bagging: Boostrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most
votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noise data: not considerably worse, more robust
– Proved improved accuracy in prediction
Data Mining 50
Bagging
• Bagging is a technique that repeatedly samples (with replacement) from a data set
according to a uniform probability distribution.
• Each bootstrap sample has the same size as the original data.
– Because each sample has a probability 1-(1-1/N/)N of being selected in each Di, a
sample Di contains approximately 63% of the original training data.
– After training the k classifiers, a test instance is assigned to the class that receives
the highest number of votes.
• Bagging improves generalization error by reducing the variance of the base
classifiers.
– The performance of bagging depends on the stability of the base classifier.
– If a base classifier is unstable, bagging helps to reduce the errors associated with
random fluctuations in the training data.
– If a base classifier is stable, bagging may not be able to improve the performance
of the base classifiers.

Data Mining 51
Bagging - Example
• We create a single inner node decision tree classifiers for the following training data.
That tree is also known as a decision stump.

• Any decision tree with a single inner node can have maximum %70 accuracy for this
training set.
– Attribute >35 produces %70 accuracy
– Attribute >75 produces %70 accuracy

Data Mining 52
Bagging - Example

• 10 decision trees with single inner

nodes are generated.

• Each bootstrap sample has the same

size as the original data and randomly
generated from the original data.

Data Mining 53
Bagging - Example

The performance is increased by bagging approach

Data Mining 54
Boosting
• Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
• How boosting works?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data

Data Mining 55
Adaboost
• In the AdaBoost algorithm, the importance of a base classifier Ci depends on its error
rate, which is defined as

where I(p) = 1 if the predicate p is true, and 0 otherwise.

• The importance of a classifier Ci is given by the following parameter,

Data Mining 56
Adaboost
• Given a set of N class-labeled tuples, (X1, y1), …, (XN, yN)
• Initially, all the weights of tuples are set the same (1/N)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, o.w. it is decreased

• Classifier Mi error rate is the sum of

the weights of the misclassified tuples:

• The importance of classifier Mi’s vote:

• Weight update mechanism:

– where Zj is the normalization factor used to ensure that

Data Mining 57
Adaboost -Example
Training records

Training records chosen

during boosting

Weights of
training records

Data Mining 58
Random Forest
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated using
a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and
faster than bagging or boosting

Data Mining 59
Empirical Comparison among Ensemble Methods

• Empirical results obtained when

comparing the performance of a
decision tree classifier against
bagging, boosting, and random
forest.
• The base classifiers used in each
ensemble method consist of
fifty decision trees.
• The classification accuracies
reported in this table are
obtained from ten-fold cross-
validation.
• Notice that the ensemble
classifiers generally outperform
a single decision tree classifier
on many of the data sets.

Data Mining 60
Summary
• Classification is a form of data analysis that extracts models describing important
data classes.
• Effective and scalable methods have been developed for decision tree induction,
Naive Bayesian classification, rule-based classification, and many other
classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F
measure, and Fß measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.

Data Mining 61
Summary
• Bagging and boosting can be used to increase overall accuracy by learning and
combining a series of individual models.
• Significance tests are useful for model selection.
• There have been numerous comparisons of the different classification methods;
– The matter remains a research topic
– No single method has been found to be superior over all others for all data sets
• Issues such as accuracy, training time, robustness, scalability, and interpretability
must be considered and can involve trade-offs, further complicating the quest for an
overall superior method

Data Mining 62

Lab Report Specific Heat Capacity
100% (4)
Lab Report Specific Heat Capacity
12 pages
Iso TS 21749 2005
No ratings yet
Iso TS 21749 2005
13 pages
Technical Memorandum 63-106 Factors Affecting Measurement Reliability
No ratings yet
Technical Memorandum 63-106 Factors Affecting Measurement Reliability
13 pages
Classification Evaluation
No ratings yet
Classification Evaluation
28 pages
CST 42315 Dam - L9 1
No ratings yet
CST 42315 Dam - L9 1
15 pages
Chapter 3 Model Evaluation Final
No ratings yet
Chapter 3 Model Evaluation Final
30 pages
06 - ML - Classificaion Performance Evaluation Measures
No ratings yet
06 - ML - Classificaion Performance Evaluation Measures
19 pages
Lecture 11 Model Evaluation
No ratings yet
Lecture 11 Model Evaluation
11 pages
2 Supervised Learning
No ratings yet
2 Supervised Learning
52 pages
Unit 4 Learning
No ratings yet
Unit 4 Learning
100 pages
ML Model Evaluation
No ratings yet
ML Model Evaluation
17 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
41 pages
Ai DS 2 Book-Chpt-5
No ratings yet
Ai DS 2 Book-Chpt-5
17 pages
Classification - Performance Evlaution
No ratings yet
Classification - Performance Evlaution
13 pages
Lecture 01-Model Selection and Evaluation
No ratings yet
Lecture 01-Model Selection and Evaluation
29 pages
Session 2 Evaluation Boosting Bagging Contemporary Business Anaytics
No ratings yet
Session 2 Evaluation Boosting Bagging Contemporary Business Anaytics
17 pages
DM 09 Classification and Prediction 19112024 102854am
No ratings yet
DM 09 Classification and Prediction 19112024 102854am
21 pages
chapter 5 Model Evaluation
No ratings yet
chapter 5 Model Evaluation
21 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
37 pages
2 Supervised Learning
No ratings yet
2 Supervised Learning
48 pages
Instruction & Option Choice
No ratings yet
Instruction & Option Choice
6 pages
Lecture 16
No ratings yet
Lecture 16
36 pages
Lectures3 5
No ratings yet
Lectures3 5
57 pages
3-Performance Measures
No ratings yet
3-Performance Measures
35 pages
Unit 5 Classification PDF
No ratings yet
Unit 5 Classification PDF
131 pages
ML Unit 3
No ratings yet
ML Unit 3
127 pages
Unit 4 Model Evaluation
No ratings yet
Unit 4 Model Evaluation
24 pages
Data Mining Final
No ratings yet
Data Mining Final
25 pages
BSC ML CH1.pptx
No ratings yet
BSC ML CH1.pptx
63 pages
Module 2
No ratings yet
Module 2
151 pages
Session 1 Evaluation Model
No ratings yet
Session 1 Evaluation Model
58 pages
Classifier Performance Measures and Ensemble Method
No ratings yet
Classifier Performance Measures and Ensemble Method
7 pages
Prediction---accuracy
No ratings yet
Prediction---accuracy
33 pages
Model Evaluation and Selection
No ratings yet
Model Evaluation and Selection
22 pages
Lecture-(3-4) Evaluation Metrices Classification and Regression
No ratings yet
Lecture-(3-4) Evaluation Metrices Classification and Regression
28 pages
Lec 8
No ratings yet
Lec 8
35 pages
CSC4316 9
No ratings yet
CSC4316 9
40 pages
Confusion Matrix and Performance Evaluation Metrics
No ratings yet
Confusion Matrix and Performance Evaluation Metrics
13 pages
Confusion Matrix and Performance Evaluation Metrics
No ratings yet
Confusion Matrix and Performance Evaluation Metrics
13 pages
Unit3 7 Issues
No ratings yet
Unit3 7 Issues
24 pages
8c - Model Evaluation and Selection
No ratings yet
8c - Model Evaluation and Selection
15 pages
Slide Kuliah - Pemilihan Dan Evaluasi Model
No ratings yet
Slide Kuliah - Pemilihan Dan Evaluasi Model
46 pages
0 Machine Learning Overview and Metrics LT
No ratings yet
0 Machine Learning Overview and Metrics LT
84 pages
Lesson 3.2 - Supervised Learning Evaluation PDF
No ratings yet
Lesson 3.2 - Supervised Learning Evaluation PDF
38 pages
Chapter 7 - LAST
No ratings yet
Chapter 7 - LAST
29 pages
Unit 4 ML
No ratings yet
Unit 4 ML
28 pages
Accuracy and error measures
No ratings yet
Accuracy and error measures
14 pages
FALLSEM2024-25 BCSE334L TH VL2024250101768 2024-10-08 Reference-Material-I
No ratings yet
FALLSEM2024-25 BCSE334L TH VL2024250101768 2024-10-08 Reference-Material-I
18 pages
6.Data Mining - Classification Ppt
No ratings yet
6.Data Mining - Classification Ppt
37 pages
Introduction To Artificial Intelligence: Amna Iftikhar Fall ' 2019 1
No ratings yet
Introduction To Artificial Intelligence: Amna Iftikhar Fall ' 2019 1
33 pages
Jnn 5.2 Confusion Matrix and Performance Evaluation Metrics
No ratings yet
Jnn 5.2 Confusion Matrix and Performance Evaluation Metrics
13 pages
ML Unit 2
No ratings yet
ML Unit 2
31 pages
Basics of ML and Evaluation
No ratings yet
Basics of ML and Evaluation
42 pages
CS585 Lecture October03rd
No ratings yet
CS585 Lecture October03rd
146 pages
Model Performance Assessment
No ratings yet
Model Performance Assessment
13 pages
WINSEM2024-25_CBS3006_ETH_VL2024250505168_2025-01-09_Reference-Material-IV
No ratings yet
WINSEM2024-25_CBS3006_ETH_VL2024250505168_2025-01-09_Reference-Material-IV
20 pages
Pattern Recognition Application
No ratings yet
Pattern Recognition Application
43 pages
Unit Ii
No ratings yet
Unit Ii
118 pages
06-FSSR_DS610_2024=2025T1_ٍMetrics
No ratings yet
06-FSSR_DS610_2024=2025T1_ٍMetrics
24 pages
Lesson 4 - Performance Metrics
No ratings yet
Lesson 4 - Performance Metrics
46 pages
Week 5
No ratings yet
Week 5
72 pages
CEN-478 Project Assignment
No ratings yet
CEN-478 Project Assignment
4 pages
Week 6 History of Turkish Revolution and Ataturk's Principles I
No ratings yet
Week 6 History of Turkish Revolution and Ataturk's Principles I
6 pages
Week 3 History of Turkish Revolution and Ataturk's Principles I
No ratings yet
Week 3 History of Turkish Revolution and Ataturk's Principles I
6 pages
Week 2 His101 History of Turkish Revolution and Ataturk's Principles I
No ratings yet
Week 2 His101 History of Turkish Revolution and Ataturk's Principles I
6 pages
Week 1 His101 History of Turkish Revolution and Ataturk's Principles I
No ratings yet
Week 1 His101 History of Turkish Revolution and Ataturk's Principles I
4 pages
Lecture 08
No ratings yet
Lecture 08
32 pages
Week 4 History of Turkish Revolution and Ataturk's Principles I
No ratings yet
Week 4 History of Turkish Revolution and Ataturk's Principles I
6 pages
Lecture 05
No ratings yet
Lecture 05
27 pages
Lecture 01
No ratings yet
Lecture 01
25 pages
Lecture 03
No ratings yet
Lecture 03
19 pages
Literature Presentation
No ratings yet
Literature Presentation
6 pages
Lecture 04
No ratings yet
Lecture 04
16 pages
Customer Data
No ratings yet
Customer Data
4 pages
Math217 Midterm1 Practice
No ratings yet
Math217 Midterm1 Practice
2 pages
Lecture 02
No ratings yet
Lecture 02
20 pages
Prediction of HIV Status in Addis Ababa Using Data Mining Technology
No ratings yet
Prediction of HIV Status in Addis Ababa Using Data Mining Technology
7 pages
Physics N5 Assignment Sheet
No ratings yet
Physics N5 Assignment Sheet
31 pages
Garment Returns Prediction For AI-Based Processing and Waste Reduction in E-Commerce
No ratings yet
Garment Returns Prediction For AI-Based Processing and Waste Reduction in E-Commerce
9 pages
MODULE 1 Introduction To Surveying
No ratings yet
MODULE 1 Introduction To Surveying
5 pages
BS EN ISO 5667-14-2016--[2023-04-13--11-26-58 AM]
No ratings yet
BS EN ISO 5667-14-2016--[2023-04-13--11-26-58 AM]
46 pages
Precision Test Pressure Gauge Acc. EN 837-1 With Bourdon Tube
No ratings yet
Precision Test Pressure Gauge Acc. EN 837-1 With Bourdon Tube
12 pages
Metrology and Mechanical Measurements
No ratings yet
Metrology and Mechanical Measurements
118 pages
Anthropometry Measurement Error and Technical Error of Measurement (TEM)
No ratings yet
Anthropometry Measurement Error and Technical Error of Measurement (TEM)
24 pages
TAPPI
No ratings yet
TAPPI
92 pages
Lab 3 CHM130LL Accuracy and Measurement of Volume
No ratings yet
Lab 3 CHM130LL Accuracy and Measurement of Volume
13 pages
Siemens 1 GTX 100 Turbine
100% (1)
Siemens 1 GTX 100 Turbine
16 pages
CBC Web Development 2018 (Online)
100% (1)
CBC Web Development 2018 (Online)
53 pages
III. Quality Assessment & Management
100% (1)
III. Quality Assessment & Management
3 pages
Arc PDA-UV OQ, PQ
No ratings yet
Arc PDA-UV OQ, PQ
30 pages
Cambridge IGCSE™: Bahasa Indonesia 0538/02
No ratings yet
Cambridge IGCSE™: Bahasa Indonesia 0538/02
13 pages
Impresiones 3D
No ratings yet
Impresiones 3D
9 pages
Size Grading Jagung
No ratings yet
Size Grading Jagung
5 pages
Astm e 1182
No ratings yet
Astm e 1182
7 pages
Container Closure Integrity Testing
No ratings yet
Container Closure Integrity Testing
14 pages
Lab #2
No ratings yet
Lab #2
5 pages
Lab 1 PDF
No ratings yet
Lab 1 PDF
15 pages
Accounting Software in Computerized Business Envir
No ratings yet
Accounting Software in Computerized Business Envir
10 pages
Cambridge O Level: Additional Mathematics For Examination From 2020
No ratings yet
Cambridge O Level: Additional Mathematics For Examination From 2020
12 pages
Brazilian
No ratings yet
Brazilian
16 pages
Accuracy Precision Errors Interpolation and Significant Figures PDF
No ratings yet
Accuracy Precision Errors Interpolation and Significant Figures PDF
4 pages

Lec07 Classification ModelEvaluation Ensemble

Uploaded by

Lec07 Classification ModelEvaluation Ensemble

Uploaded by

Classification

• Model Evaluation and Selection

True Positives (TP): Positive tuples correctly labeled by classifier.

Example of Confusion Matrix:

Actual class\Predicted class buy_computer = yes buy_computer = no Total

Sensitivity: True Positive recognition rate, Recall

F𝜷: weighted measure of precision and recall

Accuracy = (90+9560)/10000 = 96.50%

Precision = 90/230 = 39.13%

• Given m classes, an entry CMi,j in a confusion matrix indicates # of tuples in class i

Accuracy: Fraction of documents classified correctly

Recall: Fraction of tuples in class i classified correctly:

Confusion Matrix for Ci:

Confusion Matrix Confusion Matrix Confusion Matrix Confusion Matrix

Precisionurgent=30/45=.67 Precisionnormal=40/62=.65 Precisionspam=70/93=.75

Macroaverage Precision = (.67+.65+.75) / 3 = .69

• The mean squared error exaggerates

• If we were to take the square root of

shows how learning curve accuracy

Effect of small sample size:

• Random sampling: a variation of holdout

• Example: 3-fold cross-validation

• Use 10-fold cross-validation to obtain mean error rates of models M1 and M2 :

• significance=5%, M1 & M2 are significantly different

• value for confidence limit z with significance/2

t > z (13 > 2.262)

• significance=5%, M1 & M2 are significantly different

• value for confidence limit z with significance/2

• Thus, a confidence limit for 10% significance (i.e. significance/2 = 5%)

• Solving the equation above for p yields Confidence Interval for p:

𝟐∗𝐍∗𝐚𝐜𝐜 + 𝐙 𝟐 ± 𝐙 𝟐 + 𝟒∗𝐍∗𝐚𝐜𝐜 −𝟒∗𝐍∗𝐚𝐜𝐜 𝟐

N 100 500 1000 5000

𝛔𝟐𝐭 = 𝛔𝟐𝟏 + 𝛔𝟐𝟐 ≅ 𝛔

– At (1-) confidence level, true difference range: 𝐝𝐭 = 𝐝 ± 𝐙𝛂/𝟐 ෝ

• At 95% confidence level: 𝐙𝛂/𝟐 =1.96

 Since interval contains 0, difference may NOT be statistically significant.

• Diagonal line: Random guessing

 Area Under the ROC curve

• Use classifier that produces

FPR Data Mining 43

• By manipulating the training set.

• By manipulating the input features.

• 10 decision trees with single inner

• Each bootstrap sample has the same

The performance is increased by bagging approach

where I(p) = 1 if the predicate p is true, and 0 otherwise.

• The importance of a classifier Ci is given by the following parameter,

• Classifier Mi error rate is the sum of

• The importance of classifier Mi’s vote:

• Weight update mechanism:

– where Zj is the normalization factor used to ensure that

Training records chosen

• Empirical results obtained when

You might also like