Lec07 Classification: Model Evaluation and Ensemble Methods

The document discusses model evaluation and selection in data mining, focusing on performance metrics, evaluation methods, and model comparison techniques. It covers various metrics such as accuracy, precision, recall, and confusion matrices, along with methods like holdout and cross-validation for assessing classifier performance. Additionally, it addresses challenges like class imbalance and provides examples for better understanding of these concepts.


Classification

• Model Evaluation and Selection


• Ensemble Methods: Increasing the Accuracy

Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Model Evaluation and Selection
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
– How can we measure accuracy? Other metrics to consider?
– Use a test set of class-labeled tuples (rather than the training set) when assessing accuracy
• Methods for Performance Evaluation
– How to obtain reliable estimates?
– Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Methods for Model Comparison
– How to compare the relative performance among competing models?
– Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves

Classifier Evaluation Metrics:
Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class     C                        ¬C                       Total
C                                  True Positives (TP)      False Negatives (FN)     P
¬C                                 False Positives (FP)     True Negatives (TN)      N
Total                              P'                       N'                       P + N

True Positives (TP): Positive tuples correctly labeled by classifier.


True Negatives (TN): Negative tuples correctly labeled by classifier.
False Positives (FP): Negative tuples incorrectly labeled as positive by classifier.
False Negatives (FN): Positive tuples incorrectly labeled as negative by classifier.

Classifier Evaluation Metrics:
Confusion Matrix
Actual class \ Predicted class     C                        ¬C                       Total
C                                  True Positives (TP)      False Negatives (FN)     P
¬C                                 False Positives (FP)     True Negatives (TN)      N
Total                              P'                       N'                       P + N

Example of Confusion Matrix:

Actual class \ Predicted class     buy_computer = yes     buy_computer = no     Total
buy_computer = yes                 6954                   46                    7000
buy_computer = no                  412                    2588                  3000
Total                              7366                   2634                  10000

Classifier Evaluation Metrics:
Accuracy, Error Rate
Accuracy (recognition rate): percentage of test set tuples that are correctly classified
    Accuracy = (TP + TN) / (TP + FP + TN + FN)

Error Rate (misclassification rate): 1 – Accuracy
    Error Rate = (FP + FN) / (TP + FP + TN + FN)

Actual \ Predicted     C      ¬C
C                      TP     FN
¬C                     FP     TN

Classifier Evaluation Metrics:
Sensitivity, Specificity
Class Imbalance Problem: the main class of interest is rare.
• The data set distribution is skewed: the majority of tuples belong to the negative class and the
  minority to the positive class.
  – In medical data, there may be a rare class, such as “cancer.”
• If there is a class imbalance problem, accuracy is NOT a good evaluation metric.
  – If only 3% of the training tuples are actually cancer, 97% accuracy is NOT acceptable:
    • the classifier could be correctly labeling only the noncancer tuples, for instance, and
      misclassifying all the cancer tuples.
• Instead, we need other measures, which assess how well the classifier can recognize the positive
  tuples (cancer = yes) and how well it can recognize the negative tuples (cancer = no).

Sensitivity (True Positive recognition rate, Recall):
    Sensitivity = TP / (TP + FN)

Specificity (True Negative recognition rate):
    Specificity = TN / (TN + FP)

Actual \ Predicted     C      ¬C
C                      TP     FN
¬C                     FP     TN

Classifier Evaluation Metrics:
Precision, Recall, F-measure
Precision: a measure of exactness
  – What % of tuples that the classifier labeled as positive are actually positive?
    Precision = TP / (TP + FP)

Recall: a measure of completeness (Sensitivity)
  – What % of positive tuples did the classifier label as positive?
    Recall = TP / (TP + FN)

• There is an inverse relationship between precision and recall.

F-measure (F1): harmonic mean of precision and recall
    F-measure = (2 × Precision × Recall) / (Precision + Recall)

Fβ: weighted measure of precision and recall
  – assigns β times as much weight to recall as to precision
    Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall)

Actual \ Predicted     C      ¬C
C                      TP     FN
¬C                     FP     TN

Classifier Evaluation Metrics: Example
Actual class \ Predicted class   cancer = yes   cancer = no   Total
cancer = yes                     90             210           300
cancer = no                      140            9560          9700
Total                            230            9770          10000

Accuracy =
Error rate =
Sensitivity =
Specificity =
Precision =
Recall =

Classifier Evaluation Metrics: Example
Actual class \ Predicted class   cancer = yes   cancer = no   Total
cancer = yes                     90             210           300
cancer = no                      140            9560          9700
Total                            230            9770          10000

Accuracy = (90 + 9560) / 10000 = 96.50%
Error rate = 1 – Accuracy = (140 + 210) / 10000 = 3.50%
Sensitivity = 90 / 300 = 30.00%
Specificity = 9560 / 9700 = 98.56%
Precision = 90 / 230 = 39.13%
Recall = 90 / 300 = 30.00%
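As a quick check, the same numbers can be reproduced with a few lines of plain Python. This is only a sketch of the definitions above, using the confusion-matrix counts from this example (TP = 90, FN = 210, FP = 140, TN = 9560); the F1 value is not given on the slide but follows from the same counts.

TP, FN, FP, TN = 90, 210, 140, 9560

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # (90 + 9560) / 10000 = 0.9650
error_rate  = 1 - accuracy                      # 0.0350
sensitivity = TP / (TP + FN)                    # recall: 90 / 300 = 0.3000
specificity = TN / (TN + FP)                    # 9560 / 9700 ≈ 0.9856
precision   = TP / (TP + FP)                    # 90 / 230 ≈ 0.3913
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, error_rate, sensitivity, specificity, precision, round(f1, 4))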

Classifier Evaluation Metrics: Confusion Matrix
with more than two classes
Confusion Matrix:

Actual class \ Predicted class   C1       C2       …    Cm       Total
C1                               CM1,1    CM1,2    …    CM1,m    AC1
C2                               CM2,1    CM2,2    …    CM2,m    AC2
…                                …        …        …    …        …
Cm                               CMm,1    CMm,2    …    CMm,m    ACm
Total                            PC1      PC2      …    PCm      AC1 + … + ACm

• Given m classes, an entry CMi,j in a confusion matrix indicates # of tuples in class i


that were labeled by the classifier as class j.
• For a classifier to have good accuracy, ideally most of the tuples would be represented
along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with
the rest of the entries being zero or close to zero.

Classifier Evaluation Metrics: with more than two classes
Accuracy, Error Rate
Actual \ Predicted   C1       C2       …    Cm
C1                   CM1,1    CM1,2    …    CM1,m
C2                   CM2,1    CM2,2    …    CM2,m
…                    …        …        …    …
Cm                   CMm,1    CMm,2    …    CMm,m

Accuracy: fraction of tuples classified correctly

    Accuracy = Σi CMi,i / Σi Σj CMi,j

Error Rate:
    Error Rate = 1 – Accuracy

Classifier Evaluation Metrics: with more than two classes
Precision, Recall
Actual \ Predicted   C1       C2       …    Cm
C1                   CM1,1    CM1,2    …    CM1,m
C2                   CM2,1    CM2,2    …    CM2,m
…                    …        …        …    …
Cm                   CMm,1    CMm,2    …    CMm,m

Precision and Recall values for each class:

Precision: fraction of tuples assigned class i that are actually of class i:
    PrecisionCi = CMi,i / Σj CMj,i        (column sum for class i)

Recall: fraction of tuples in class i classified correctly:
    RecallCi = CMi,i / Σj CMi,j           (row sum for class i)

Classifier Evaluation Metrics: with more than two classes
Microaveraging and Macroaveraging
• In order to derive a single metric that tells us how well the system is doing, we can
combine precision and recall values in two ways.

• In macroaveraging, compute performance for each class, and then average over
classes.
• In microaveraging, collect decisions for all classes into a single confusion matrix,
and then compute precision from that matrix.

Classifier Evaluation Metrics: with more than two classes
Macroaveraging
Macroaverage: compute performance for each class, and then average over classes.

Actual \ Predicted   C1       C2       …    Cm
C1                   CM1,1    CM1,2    …    CM1,m
C2                   CM2,1    CM2,2    …    CM2,m
…                    …        …        …    …
Cm                   CMm,1    CMm,2    …    CMm,m

Macroaveraging:

    Precision = (Σi PrecisionCi) / m          Recall = (Σi RecallCi) / m

Classifier Evaluation Metrics: with more than two classes
Microaveraging
Microaverage: collect decisions for all classes into a single confusion matrix, and then
compute precision from that matrix.
Actual \ Predicted   C1       C2       …    Cm
C1                   CM1,1    CM1,2    …    CM1,m
C2                   CM2,1    CM2,2    …    CM2,m
…                    …        …        …    …
Cm                   CMm,1    CMm,2    …    CMm,m

Confusion Matrix for class Ci:

Actual \ Predicted   yes                            no
yes                  TPi = CMi,i                    FNi = Σj CMi,j − CMi,i
no                   FPi = Σj CMj,i − CMi,i         TNi = Σi Σj CMi,j − TPi − FNi − FPi

Confusion Matrix for all classes:

Actual \ Predicted   yes              no
yes                  TP = Σi TPi      FN = Σi FNi
no                   FP = Σi FPi      TN = Σi TNi

Classifier Evaluation Metrics:
with more than two classes - Example
Confusion matrix for a three-class e-mail classification task:

Actual \ Predicted   urgent   normal   spam
urgent               30       7        3
normal               10       40       20
spam                 5        15       70

Accuracy = 140/200 = 70%

Confusion Matrix for urgent:
Actual \ Predicted   yes    no
yes                  30     10
no                   15     145

Confusion Matrix for normal:
Actual \ Predicted   yes    no
yes                  40     30
no                   22     108

Confusion Matrix for spam:
Actual \ Predicted   yes    no
yes                  70     20
no                   23     87

Confusion Matrix for all classes:
Actual \ Predicted   yes    no
yes                  140    60
no                   60     340

Classifier Evaluation Metrics:
with more than two classes - Example
Confusion Matrix for urgent:
Actual \ Predicted   yes    no
yes                  30     10
no                   15     145
Precision_urgent = 30/45 = .67

Confusion Matrix for normal:
Actual \ Predicted   yes    no
yes                  40     30
no                   22     108
Precision_normal = 40/62 = .65

Confusion Matrix for spam:
Actual \ Predicted   yes    no
yes                  70     20
no                   23     87
Precision_spam = 70/93 = .75

Macroaverage Precision = (.67 + .65 + .75) / 3 = .69

Confusion Matrix for all classes:
Actual \ Predicted   yes    no
yes                  140    60
no                   60     340

Microaverage Precision = 140 / 200 = .70
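A small plain-Python sketch of the two averaging schemes, using the three-class confusion matrix above (rows = actual, columns = predicted); it reproduces the macroaveraged precision of .69 and the microaveraged precision of .70.

CM = [[30, 7, 3],     # urgent
      [10, 40, 20],   # normal
      [5, 15, 70]]    # spam
m = len(CM)

# Per-class precision: diagonal cell divided by its column sum.
col_sums = [sum(CM[i][j] for i in range(m)) for j in range(m)]
precisions = [CM[j][j] / col_sums[j] for j in range(m)]

macro_precision = sum(precisions) / m                               # (.67 + .65 + .75) / 3 ≈ .69
micro_precision = sum(CM[i][i] for i in range(m)) / sum(col_sums)   # 140 / 200 = .70

print([round(p, 2) for p in precisions], round(macro_precision, 2), round(micro_precision, 2))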
Predictor Error Measures
• If a predictor returns a continuous value rather than a categorical label, it is difficult
to say exactly whether the predicted value is correct or not.
– Instead of focusing on whether the predicted value is an “exact” match with the correct
value, we look at how far off the predicted value is from the actual value.
• Error Functions (yi is the actual value, yi' is the predicted value):

• The mean squared error exaggerates the presence of outliers, while the mean absolute
  error does not.

• If we take the square root of the mean squared error, the resulting error measure is
  called the root mean squared error.
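The error-function formulas on this slide appear to have been images; the sketch below gives the standard definitions in Python. The sample values are hypothetical, chosen only to illustrate how one outlier inflates (R)MSE more than MAE.

import math

def mae(y, y_pred):   # mean absolute error: (1/n) * sum of |yi - yi'|
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)

def mse(y, y_pred):   # mean squared error: (1/n) * sum of (yi - yi')^2
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):  # root mean squared error: square root of MSE
    return math.sqrt(mse(y, y_pred))

y_true = [10.0, 12.0, 11.0, 50.0]   # hypothetical actual values (50.0 acts as an outlier)
y_hat  = [11.0, 12.5, 10.0, 20.0]   # hypothetical predicted values
print(mae(y_true, y_hat), mse(y_true, y_hat), rmse(y_true, y_hat))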

Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors besides the learning algorithm:
– Class distribution, Cost of misclassification, Size of training and test sets.

• Learning curve: shows how accuracy changes with varying sample size

• Effect of small sample size:
  - Bias in the estimate
  - Variance of the estimate

Evaluating Classifier Performance:
Holdout
• The purpose of Evaluating Classifier Performance is to estimate the performance of a
classifier on previously unseen data (test set)

• Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation

• Random subsampling: a variation of holdout
  – Repeat the holdout method k times
  – Overall accuracy = average of the accuracies obtained
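A short sketch of the holdout method and random subsampling, assuming scikit-learn is available; the breast-cancer data set and the decision tree are stand-ins used only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Holdout: 2/3 of the tuples train the model, the held-out 1/3 estimates its accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print("holdout accuracy:", round(model.score(X_te, y_te), 3))

# Random subsampling: repeat the holdout with different random splits and average.
accs = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    accs.append(DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
print("random subsampling estimate:", round(sum(accs) / len(accs), 3))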

Evaluating Classifier Performance:
Cross-Validation
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D1, D2, …, Dk, each of
  approximately equal size
– At the i-th iteration, use Di as the test set and the other k−1 subsets as the training set
– The accuracy estimate is the overall number of correct classifications from the
k iterations, divided by the total number of tuples in the initial data.

• Example: 3-fold cross-validation

Evaluating Classifier Performance:
Cross-Validation
Variations of Cross-Validation:

• Leave-one-out:
– k folds where k = # of tuples, for small sized data
– i.e. If there are n tuples in data set, one tuple is used as test data and the rest n-1
tuples are used as training data in each iteration.

• Stratified cross-validation:
– Folds are stratified so that class distribution in each fold is approximately the
same as that in the initial data
– Example:
• Initial data contains 3000 tuples, and 600 tuples are positive (2400 tuples are negative).
• In 3-fold cross validation, each subset will randomly get 200 positive tuples and 800 negative
tuples.
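A brief sketch of stratified 10-fold cross-validation, again assuming scikit-learn (the data set is an illustrative stand-in); each fold keeps roughly the same class distribution as the full data, and the reported estimate is the average accuracy over the 10 test folds, which approximates the overall-correct/total definition above when the folds are equal-sized.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("per-fold accuracies:", scores.round(3))
print("cross-validated accuracy estimate:", round(scores.mean(), 3))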

Evaluating Classifier Performance:
Bootstrap
• Bootstrap
– Works well with small data sets
– Tuples are randomly selected from initial data to create training data.
• Each time a tuple is selected, it is equally likely to be selected again and re-added to training set

• There are several bootstrap methods; a common one is the .632 bootstrap.
  – A data set with d tuples is sampled d times, with replacement.
  – The data tuples that did not make it into the training set end up forming the test set.
  – About 63.2% of the original data ends up in the bootstrap sample (training data), and the
    remaining 36.8% forms the test set.
    • Each tuple has a probability of 1/d of being selected at each draw, so the probability of
      never being chosen over the d draws is (1 − 1/d)^d ≈ e^−1 = 0.368.
– Repeat the sampling procedure k times, overall accuracy of the model:
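The overall-accuracy formula was likely shown as an image in the original slide; the commonly used .632 bootstrap estimate is Acc(M) = (1/k) · Σi [ 0.632 · Acc(Mi on its test set) + 0.368 · Acc(Mi on its training set) ]. A tiny plain-Python sketch of the 63.2% / 36.8% split and of that estimate:

import math, random

d = 1000
print((1 - 1/d) ** d, math.exp(-1))               # both ≈ 0.368: chance a tuple is never drawn

sample = [random.randrange(d) for _ in range(d)]  # one bootstrap training sample of size d
print(len(set(sample)) / d)                       # ≈ 0.632 of the distinct tuples appear in it

def acc_632(test_accs, train_accs):
    # .632 bootstrap accuracy over k repetitions, as described in the lead-in above
    k = len(test_accs)
    return sum(0.632 * te + 0.368 * tr for te, tr in zip(test_accs, train_accs)) / k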

Model Evaluation and Selection
• Metrics for Performance Evaluation
• Methods for Performance Evaluation
• Methods for Model Comparison

Model Selection
using Statistical Tests of Significance
• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain mean error rates of models M1 and M2 :

• These mean error rates are just estimates of error on the true population of future data
cases
– Although the mean error rates obtained for M1 and M2 may appear different, that difference
may NOT be statistically significant.

• What if the difference between the 2 error rates is just attributed to chance?
– Use a test of statistical significance
– Obtain confidence limits for our error estimates

Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation.
– Each partitioning is independently drawn.
– Average 10 error rates obtained for M1 and M2, are their mean error rates.
• Assume samples follow a t-distribution with k–1 degrees of freedom (where k=10)
– Individual error rates calculated in cross-validations may be considered as independent
samples from a probability distribution.
– In general, they follow a t-distribution with k-1 degrees of freedom where k=10.
• Use a significance test, the t-test (Student’s t-test), to determine whether the difference
  between the two models is statistically significant.
• Null hypothesis: M1 & M2 are the same.
• If we can reject the null hypothesis, then
  – conclude that the difference between M1 & M2 is statistically significant,
  – choose the model with the lower error rate.

Estimating Confidence Intervals: t-test
• If only 1 test set available: pairwise comparison
– For ith round of 10-fold cross-validation, the same cross partitioning is used to
obtain err(M1)i and err(M2)i
– Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
– The t-test computes the t-statistic with k−1 degrees of freedom (here k = 10):

      t = [ err(M1) − err(M2) ] / sqrt( var(M1 − M2) / k )

      var(M1 − M2) = (1/k) · Σi=1..k [ (err(M1)i − err(M2)i) − (err(M1) − err(M2)) ]²

  where err(M)i is the error rate in round i and err(M) is the mean error rate over the k rounds.

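A small plain-Python sketch of the paired t-test above, applied to two hypothetical lists of per-fold error rates obtained from the same 10-fold cross partitioning:

import math

err_m1 = [0.12, 0.10, 0.15, 0.11, 0.13, 0.09, 0.14, 0.12, 0.10, 0.11]   # assumed values
err_m2 = [0.15, 0.13, 0.16, 0.14, 0.15, 0.12, 0.17, 0.15, 0.13, 0.14]   # assumed values
k = len(err_m1)

diffs = [a - b for a, b in zip(err_m1, err_m2)]
mean_diff = sum(diffs) / k
var = sum((d - mean_diff) ** 2 for d in diffs) / k
t = mean_diff / math.sqrt(var / k)

# Compare |t| against the t-distribution critical value for k-1 = 9 degrees of freedom
# and significance 5% (two-sided), which is 2.262 as used on the following slides.
print("t =", round(t, 3), " reject H0:", abs(t) > 2.262)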
Estimating Confidence Intervals:
Statistical Significance
• Are M1 & M2 significantly different?
– Compute t.
– Select significance level (e.g. significance = 5%)
– Consult the table for the t-distribution:
  • Find the t value corresponding to k−1 degrees of freedom (here, 9)
  • The t-distribution is symmetric; typically the upper percentage points of the distribution
    are shown, so look up the value for the confidence limit z at significance/2 (here, 0.025)
    in the t-distribution table
– If (t > z or t < −z), then the t value lies in the rejection region:
  • Reject the null hypothesis that the mean error rates of M1 & M2 are the same
  • Conclude: there is a statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is due to chance

Estimating Confidence Intervals:
Table for t-distribution
• Significance level: e.g., significance = 0.05 or 5% means M1 & M2 are significantly
  different for 95% of the population
• The confidence limit z is looked up at significance/2 (two-tailed), e.g., 0.025

Estimating Confidence Intervals:
Statistical Significance - Example
• Results of 10-fold cross-validation:

      var(M1 − M2) = (1/k) · Σi=1..k [ (err(M1)i − err(M2)i) − (err(M1) − err(M2)) ]²

      t = [ err(M1) − err(M2) ] / sqrt( var(M1 − M2) / k )

• At significance = 5%, M1 & M2 are significantly different for 95% of the population.

• Value for the confidence limit z at significance/2 from the t-distribution table:
      z = 2.262

• t > z (13 > 2.262)
  → there is a statistically significant difference between M1 & M2

Estimating Confidence Intervals:
Statistical Significance - Example
• Results of 10-fold cross-validation:

      var(M1 − M2) = (1/k) · Σi=1..k [ (err(M1)i − err(M2)i) − (err(M1) − err(M2)) ]²

      t = [ err(M1) − err(M2) ] / sqrt( var(M1 − M2) / k )

• At significance = 5%, M1 & M2 are significantly different for 95% of the population.

• Value for the confidence limit z at significance/2 from the t-distribution table:
      z = 2.262

• t ≯ z (0.154 ≯ 2.262)
  → no statistically significant difference between M1 & M2

Confidence Limits
• Confidence limits for the normal distribution with 0 mean and a variance of 1:

• Thus, the confidence limit for 10% significance (i.e., significance/2 = 5% in each tail) is 1.65:

      P(−1.65 ≤ X ≤ 1.65) = 90%

  – This means that 90% of the population lies in this range.

Confidence Interval for Accuracy
(Figure: the standard normal distribution of the accuracy.)

• For large test sets (N > 100), the accuracy acc has a normal distribution with mean p and
  variance p(1 − p)/N.

• Transformed value for the accuracy acc, in order to have a normal distribution with
  mean 0 and variance 1:

      (acc − p) / sqrt( p(1 − p)/N )

• Confidence limits for significance α:

      P( −Zα/2  <  (acc − p) / sqrt( p(1 − p)/N )  <  Zα/2 )  =  1 − α

• Solving the equation above for p yields the confidence interval for p:

      p = [ 2·N·acc + Z² ± Z·sqrt( Z² + 4·N·acc − 4·N·acc² ) ] / [ 2·(N + Z²) ]        where Z is Zα/2

Confidence Interval for Accuracy - Example
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  – N = 100, accuracy acc = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table, Zα/2 = 1.96 (where significance α = 5%)

N            100      500      1000     5000
p (lower)    0.747    0.779    0.785    0.794
p (upper)    0.830    0.817    0.812    0.806

Comparing Performance of 2 Models
• Given two models, say, M1 and M2, which one is better?
– M1 is tested on data set D1 (size=n1), found error rate = e1 .
– M2 is tested on data set D2 (size=n2), found error rate = e2 .
– Assume D1 and D2 are independent.
– If n1 and n2 are sufficiently large, then

      e1 ~ Normal(μ1, σ1)
      e2 ~ Normal(μ2, σ2)

– Approximate variance:   σi² ≈ ei·(1 − ei) / ni

Comparing Performance of 2 Models
• To test whether the performance difference is significant: d = e1 − e2
  – d ~ Normal(dt, σt), where dt is the true difference
  – Since D1 and D2 are independent, their variances add up:

        σt² = σ1² + σ2² ≅ σ̂1² + σ̂2²

        σ̂t² = e1·(1 − e1)/n1 + e2·(1 − e2)/n2

  – At the (1 − α) confidence level, the true difference lies in the range:

        dt = d ± Zα/2 · σ̂t

Comparing Performance of 2 Models - Example
• Given: M1: n1=30 e1=0.15
M2: n2=5000 e2=0.25

• d = |e2 − e1| = 0.1

      σ̂t² = 0.15·(1 − 0.15)/30 + 0.25·(1 − 0.25)/5000 = 0.0043

• At the 95% confidence level: Zα/2 = 1.96

      dt = 0.1 ± 1.96 · sqrt(0.0043) = 0.1 ± 0.128

→ Since the interval contains 0, the difference may NOT be statistically significant.

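A quick plain-Python sketch that reproduces the numbers in this example:

import math

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25
z = 1.96                                           # Z_{alpha/2} for 95% confidence

d = abs(e2 - e1)                                   # observed difference: 0.1
var_t = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2    # ≈ 0.0043
margin = z * math.sqrt(var_t)                      # ≈ 0.128

print(f"d_t = {d:.3f} ± {margin:.3f}")
print("significant:", not (d - margin <= 0 <= d + margin))   # interval contains 0 → not significant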
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics)
• ROC curves: for visual comparison of classification models
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• ROC curve characterizes the trade-off between positive hits and false alarms
• ROC curve plots TP rate (on the y-axis) against FP rate (on the x-axis)
• Performance of each classifier represented as a point on the ROC curve

(TPR, FPR):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal

• Diagonal line: random guessing

Using ROC for Model Comparison

• No model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR

• Area Under the ROC Curve (AUC):
  – Ideal: Area = 1
  – Random guess: Area = 0.5

How to Construct an ROC curve

• Use a classifier that produces a posterior probability P(+|A) for each test instance A
• Sort the instances according to P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Count the number of TP, FP, TN, FN at each threshold
• TP rate, TPR = TP/(TP+FN)
• FP rate, FPR = FP/(FP+TN)
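A compact plain-Python sketch of this procedure: the posterior probabilities P(+|A) and labels below are hypothetical, and a ROC point (FPR, TPR) is emitted after lowering the threshold past each unique score.

from itertools import groupby

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]   # assumed P(+|A)
labels = [1,    1,    0,    1,    1,    0,    1,    0,    0,    1]      # 1 = positive class

P = sum(labels)            # number of positive instances
N = len(labels) - P        # number of negative instances

pairs = sorted(zip(scores, labels), reverse=True)   # sort by P(+|A), decreasing
roc = [(0.0, 0.0)]         # threshold above the highest score: nothing predicted positive
tp = fp = 0
for score, group in groupby(pairs, key=lambda p: p[0]):
    for _, label in group:                          # lower the threshold past this unique score
        if label == 1:
            tp += 1
        else:
            fp += 1
    roc.append((fp / N, tp / P))                    # add the point (FPR, TPR)

print(roc)                 # ends at (1.0, 1.0): everything predicted positive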

How to Construct an ROC curve

(Figure: a table of test instances sorted by P(+|A), with the TP, FP, TN, FN counts at each
threshold, and the resulting ROC curve plotting TPR = TP/(TP+FN) against FPR = FP/(FP+TN).)


Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of
classification rules

Ensemble Methods: Increasing the Accuracy

Ensemble Methods: Increasing the Accuracy
• Ensemble methods
– Use a combination of models to increase accuracy
– Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an
improved model M*
• Popular ensemble methods
– Bagging: averaging the prediction over a collection of classifiers
– Boosting: weighted vote with a collection of classifiers
– Ensemble: combining a set of heterogeneous classifiers

Ensemble Methods
• Construct a set of classifiers from the training data
• Predict class label of previously unseen records by aggregating predictions made by
multiple classifiers

Why does Ensemble Classifier work?
• Suppose there are 25 base classifiers
  – Each classifier has an error rate ε = 0.35
  – Assume the classifiers are independent
  – The ensemble makes a wrong prediction only when a majority (13 or more) of the 25 base
    classifiers are wrong, so the probability that the ensemble classifier makes a wrong
    prediction is:

        Σi=13..25  C(25, i) · ε^i · (1 − ε)^(25−i)  ≈  0.06

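A one-line check of this computation in plain Python (math.comb requires Python 3.8+):

from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ≈ 0.06, far below the 0.35 error rate of each base classifier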
Methods for Constructing an Ensemble Classifier
• The ensemble of classifiers can be constructed in many ways:

• By manipulating the training set.


– Multiple training sets are created by resampling the original data.
– A classifier is then built from each training set using a particular learning
algorithm.
– Bagging and boosting are two examples of ensemble methods that manipulate
their training sets

• By manipulating the input features.


– A subset of input features is chosen to form each training set.
– This approach works well with data sets that contain highly redundant features.
– Random forest is an ensemble method that manipulates its input features and
uses decision trees as its base classifiers.

Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
– Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
– A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
– Each classifier Mi returns its class prediction
– The bagged classifier M* counts the votes and assigns the class with the most
votes to X
• Prediction: can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
• Accuracy
– Often significantly better than a single classifier derived from D
– For noisy data: not considerably worse, more robust
– Proven to give improved accuracy in prediction
Bagging
• Bagging is a technique that repeatedly samples (with replacement) from a data set
according to a uniform probability distribution.
• Each bootstrap sample has the same size as the original data.
– Because each tuple has a probability of 1 − (1 − 1/N)^N of being selected for a given sample Di,
  each sample Di contains approximately 63% of the original training data.
– After training the k classifiers, a test instance is assigned to the class that receives
the highest number of votes.
• Bagging improves generalization error by reducing the variance of the base
classifiers.
– The performance of bagging depends on the stability of the base classifier.
– If a base classifier is unstable, bagging helps to reduce the errors associated with
random fluctuations in the training data.
– If a base classifier is stable, bagging may not be able to improve the performance
of the base classifiers.
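A short sketch using scikit-learn (assumed available; the data set is a stand-in): 50 trees, each trained on a bootstrap sample of the same size as the original data, with their predictions aggregated at classification time. Comparing against a single tree illustrates the variance-reduction effect described above.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=50, random_state=0)   # base estimator defaults to a decision tree

print("single tree :", round(cross_val_score(single_tree, X, y, cv=10).mean(), 3))
print("bagged trees:", round(cross_val_score(bagged, X, y, cv=10).mean(), 3))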

Bagging - Example
• We create single-inner-node decision tree classifiers for the following training data.
  Such a tree is also known as a decision stump.

• Any decision tree with a single inner node can achieve at most 70% accuracy on this
  training set.
  – The split Attribute > 35 produces 70% accuracy
  – The split Attribute > 75 produces 70% accuracy

Bagging - Example

• 10 decision trees with single inner nodes (decision stumps) are generated.

• Each bootstrap sample has the same size as the original data and is randomly generated
  from the original data.

Bagging - Example

The performance is increased by the bagging approach.

Boosting
• Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
• How does boosting work?
– Weights are assigned to each training tuple
– A series of k classifiers is iteratively learned
– After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
– The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data

Adaboost
• In the AdaBoost algorithm, the importance of a base classifier Ci depends on its error
rate, which is defined as

where I(p) = 1 if the predicate p is true, and 0 otherwise.

• The importance of a classifier Ci is given by the following parameter,
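The two equations referred to on this slide appear to have been images. In the standard AdaBoost formulation (stated here as an assumption about what the slide showed), with the tuple weights wj normalized to sum to 1:

    error rate:    εi = Σj wj · I( Ci(xj) ≠ yj )

    importance:    αi = (1/2) · ln( (1 − εi) / εi )

so αi is large and positive when εi is close to 0, zero when εi = 0.5 (random guessing), and negative when εi > 0.5.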

Adaboost
• Given a set of N class-labeled tuples, (X1, y1), …, (XN, yN)
• Initially, all the weights of tuples are set the same (1/N)
• Generate k classifiers in k rounds. At round i,
– Tuples from D are sampled (with replacement) to form a training set Di of the same size
– Each tuple’s chance of being selected is based on its weight
– A classification model Mi is derived from Di
– Its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased, o.w. it is decreased

• Classifier Mi's error rate is the sum of the weights of the misclassified tuples:

• The importance of classifier Mi's vote:

• Weight update mechanism:
  – where Zj is the normalization factor used to ensure that the tuple weights sum to the same
    total as before the update


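The formulas on this slide were also likely images. Below is a minimal plain-Python sketch of one boosting round following the description above: the error rate is the summed weight of misclassified tuples, correctly classified tuples have their weights multiplied by error/(1 − error) before renormalization, and the classifier's vote weight is taken as log((1 − error)/error). train_weak_learner, data, and labels are hypothetical stand-ins.

import math
import random

def boost_round(data, labels, weights, train_weak_learner, rng=random.Random(0)):
    N = len(data)
    # 1. Sample a training set Di of size N, with replacement, according to the tuple weights.
    idx = rng.choices(range(N), weights=weights, k=N)
    model = train_weak_learner([data[i] for i in idx], [labels[i] for i in idx])

    # 2. Error rate of Mi = sum of the weights of the misclassified tuples.
    wrong = [model.predict(data[j]) != labels[j] for j in range(N)]
    error = sum(w for w, bad in zip(weights, wrong) if bad)

    # 3. Weight of Mi's vote (one common choice): log((1 - error) / error).
    vote_weight = math.log((1 - error) / error)

    # 4. Decrease the weights of correctly classified tuples, then renormalize so that the
    #    weights sum to 1 again; misclassified tuples thereby gain relative weight.
    new_w = [w * (error / (1 - error)) if not bad else w for w, bad in zip(weights, wrong)]
    Z = sum(new_w)
    new_w = [w / Z for w in new_w]
    return model, vote_weight, new_w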
Adaboost -Example
(Figure: the original training records, the training records chosen in each boosting round, and
the weights of the training records as they change across rounds.)

Random Forest
• Random Forest:
– Each classifier in the ensemble is a decision tree classifier and is generated using
a random selection of attributes at each node to determine the split
– During classification, each tree votes and the most popular class is returned
• Two Methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
– Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to errors and outliers
• Insensitive to the number of attributes selected for consideration at each split, and
faster than bagging or boosting
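A brief scikit-learn sketch (library assumed available; the data set is a stand-in): 50 trees, each split considers a random subset of sqrt(#attributes) candidate features (in the spirit of Forest-RI), and the forest aggregates the trees' predictions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
print("random forest accuracy:", round(cross_val_score(forest, X, y, cv=10).mean(), 3))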

Empirical Comparison among Ensemble Methods

• Empirical results obtained when comparing the performance of a decision tree classifier
  against bagging, boosting, and random forest.
• The base classifiers used in each ensemble method consist of fifty decision trees.
• The classification accuracies reported in this table are obtained from ten-fold cross-validation.
• Notice that the ensemble classifiers generally outperform a single decision tree classifier on
  many of the data sets.

Summary
• Classification is a form of data analysis that extracts models describing important
data classes.
• Effective and scalable methods have been developed for decision tree induction,
Naive Bayesian classification, rule-based classification, and many other
classification methods.
• Evaluation metrics include: accuracy, sensitivity, specificity, precision, recall, F
measure, and Fβ measure.
• Stratified k-fold cross-validation is recommended for accuracy estimation.

Summary
• Bagging and boosting can be used to increase overall accuracy by learning and
combining a series of individual models.
• Significance tests are useful for model selection.
• There have been numerous comparisons of the different classification methods;
– The matter remains a research topic
– No single method has been found to be superior over all others for all data sets
• Issues such as accuracy, training time, robustness, scalability, and interpretability
must be considered and can involve trade-offs, further complicating the quest for an
overall superior method

