ML Model Evaluation
There are two possible predicted classes: "yes" and "no". Suppose we are predicting the presence of a disease:
• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
• False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
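As a small illustration (not from the original slides), the four counts can be tallied from paired lists of actual and predicted labels; the function name and the "yes"/"no" labels are placeholders:

```python
# Sketch: tally TP, TN, FP, FN from actual vs. predicted labels.
# The function name and the "yes"/"no" labels are placeholders.
def confusion_counts(actual, predicted, positive="yes"):
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted yes, actually yes
        elif p != positive and a != positive:
            tn += 1      # predicted no, actually no
        elif p == positive and a != positive:
            fp += 1      # predicted yes, actually no (Type I error)
        else:
            fn += 1      # predicted no, actually yes (Type II error)
    return tp, tn, fp, fn

actual    = ["yes", "no", "yes", "no", "yes"]
predicted = ["yes", "no", "no",  "no", "yes"]
print(confusion_counts(actual, predicted))  # (2, 2, 0, 1)
```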
Example of Confusion Matrix:
Classifier Evaluation Metrics:
Precision, Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive; Precision = TP / (TP + FP)
Recall: completeness – what % of positive tuples the classifier labeled as positive; Recall = TP / (TP + FN)
F-measure (F1): harmonic mean of precision and recall; F = (2 × Precision × Recall) / (Precision + Recall)
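A minimal sketch of these metrics in terms of the confusion-matrix counts (function names are illustrative; the example reuses the counts from the sketch above):

```python
# Sketch: precision, recall, and F1 from confusion-matrix counts
# (function names are illustrative).
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0     # exactness

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0     # completeness

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0  # harmonic mean

# Reusing the counts from the sketch above: TP=2, FP=0, FN=1.
print(precision(2, 0), recall(2, 1), f1(2, 0, 1))   # 1.0 0.666... 0.8
```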
Classifier Evaluation Metrics: Example
A \ P     C     ¬C    Total
C         TP    FN    P
¬C        FP    TN    N
Total     P’    N’    All
(rows: actual class A; columns: predicted class P)
Suppose I have 10,000 emails in my mailbox, of which 300 are spam. The spam detection system flags 150 emails as spam, of which 50 are actually spam. What are the precision and recall of my spam detection system?
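Working through the numbers: the system flags 150 emails, of which TP = 50 are spam, so FP = 100; of the 300 actual spam emails, FN = 300 − 50 = 250 are missed.
Precision = TP / (TP + FP) = 50 / 150 ≈ 33.3%
Recall = TP / (TP + FN) = 50 / 300 ≈ 16.7%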
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
The given data is randomly partitioned into two independent sets: a training set (typically about 2/3 of the data) used for model construction, and a test set (the remaining 1/3) used for accuracy estimation.
Random subsampling: a variation of holdout in which the holdout is repeated k times and the overall accuracy is the average of the accuracies obtained.
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets (folds) of approximately equal size; at each iteration, one fold is used as the test set and the remaining k−1 folds as the training set (as in the sketch below).
Stratified cross-validation: the folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
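Below is a minimal sketch of plain k-fold cross-validation in Python (NumPy only; the `fit`/`predict` model interface and the function name `kfold_accuracy` are assumptions for illustration, and this version is not stratified):

```python
import numpy as np

# Sketch: k-fold cross-validation accuracy estimate (not stratified).
# `model` is assumed to expose fit(X, y) and predict(X); this interface
# is an assumption for illustration, not something the slides prescribe.
def kfold_accuracy(model, X, y, k=10, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))  # random partition
    folds = np.array_split(idx, k)                          # k mutually exclusive subsets
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        accs.append(np.mean(preds == y[test_idx]))          # accuracy on held-out fold
    return float(np.mean(accs))                             # average over the k folds
```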
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap:
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^−1 ≈ 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations:
Acc(M) = (1/k) Σ_{i=1..k} [0.632 × Acc(M_i)_test + 0.368 × Acc(M_i)_train]
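A rough sketch of one .632 bootstrap round under the same assumed `fit`/`predict` model interface (function name is illustrative):

```python
import numpy as np

# Sketch: one round of the .632 bootstrap (same assumed fit/predict interface).
def bootstrap_632_round(model, X, y, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    d = len(y)
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, d, size=d)             # sample d tuples with replacement
    test_idx = np.setdiff1d(np.arange(d), train_idx)   # tuples never drawn form the test set
    model.fit(X[train_idx], y[train_idx])
    acc_test = np.mean(model.predict(X[test_idx]) == y[test_idx])
    acc_train = np.mean(model.predict(X[train_idx]) == y[train_idx])
    return 0.632 * acc_test + 0.368 * acc_train        # weighted combination for this round

# Overall accuracy: average this quantity over k rounds with different seeds.
```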
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Suppose we have two classifiers, M1 and M2. Which one is better?
Estimating Confidence Intervals:
Null Hypothesis
Perform 10-fold cross-validation on each model.
Assume the samples follow a t-distribution with k−1 degrees of freedom (here, k = 10).
Use the t-test (Student's t-test).
Null hypothesis: M1 & M2 are the same.
If we can reject the null hypothesis, then we conclude that the difference between M1 & M2 is statistically significant, and we choose the model with the lower error rate.
Estimating Confidence Intervals: t-test
With a single set of cross-validation folds (pairwise comparison over the same k folds):
t = (err(M1) − err(M2)) / sqrt(var(M1 − M2) / k)
where var(M1 − M2) is the variance of the per-fold differences in error rate.
If two independent sets of samples are used:
t = (err(M1) − err(M2)) / sqrt(var(M1)/k1 + var(M2)/k2)
where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp., and err(·) and var(·) denote the mean and variance of the observed error rates.
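A small sketch of the paired-fold t statistic in Python (NumPy; the per-fold error rates shown are made-up illustrative numbers):

```python
import numpy as np

# Sketch: paired t statistic from per-fold error rates of two models.
def paired_t_statistic(err1, err2):
    err1, err2 = np.asarray(err1, float), np.asarray(err2, float)
    k = len(err1)                                  # number of cross-validation folds
    diff = err1 - err2                             # per-fold differences in error rate
    var_diff = np.mean((diff - diff.mean()) ** 2)  # var(M1 - M2) as in the formula above
    return diff.mean() / np.sqrt(var_diff / k)

# Made-up per-fold error rates for M1 and M2, purely for illustration.
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13]
err_m2 = [0.10, 0.13, 0.12, 0.11, 0.12, 0.14, 0.11, 0.13, 0.12, 0.11]
print(paired_t_statistic(err_m1, err_m2))
```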
Estimating Confidence Intervals:
Table for t-distribution
The t-distribution is symmetric; tables typically show only the upper percentage points of the distribution.
Significance level: e.g., sig = 0.05 (or 5%) means we conclude that M1 & M2 are significantly different with 95% confidence.
Confidence limit: look up z = sig/2 (for a two-sided test).
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
Compute t and select a significance level (e.g., sig = 5%).
Consult the t-distribution table: find the t value corresponding to k−1 degrees of freedom (here, 9).
The t-distribution is symmetric: typically only the upper percentage points of the distribution are shown, so look up the value for confidence limit z = sig/2 (here, 0.025).
If t > z or t < −z, then the t value lies in the rejection region:
Reject the null hypothesis that the mean error rates of M1 & M2 are the same.
Conclude: there is a statistically significant difference between M1 & M2.
Otherwise, conclude that any difference is due to chance.
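A sketch of the rejection-region check, assuming SciPy is available for the t-distribution percentage point (`scipy.stats.t.ppf`); the helper name and example t values are illustrative:

```python
from scipy import stats

# Sketch: two-sided rejection-region check at significance level sig,
# with k - 1 degrees of freedom (helper name and inputs are illustrative).
def significantly_different(t_value, k=10, sig=0.05):
    z = stats.t.ppf(1 - sig / 2, df=k - 1)   # upper critical value, ~2.262 for df = 9
    return t_value > z or t_value < -z       # True -> reject the null hypothesis

print(significantly_different(2.9))   # True: statistically significant difference
print(significantly_different(1.1))   # False: difference may be due to chance
```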
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from the
actual known value
Loss function: measures the error between y_i and the predicted value y_i'
Absolute error: |y_i − y_i'|
Squared error: (y_i − y_i')^2
Test error (generalization error): the average loss over the test set of size d
Mean absolute error: (1/d) Σ_{i=1..d} |y_i − y_i'|
Mean squared error: (1/d) Σ_{i=1..d} (y_i − y_i')^2
Relative absolute error: (Σ_{i=1..d} |y_i − y_i'|) / (Σ_{i=1..d} |y_i − ȳ|)
Relative squared error: (Σ_{i=1..d} (y_i − y_i')^2) / (Σ_{i=1..d} (y_i − ȳ)^2)
where ȳ is the mean of the actual values y_i over the test set.
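A compact sketch of these error measures with NumPy (function and array names are illustrative):

```python
import numpy as np

# Sketch: predictor error measures over a test set of d tuples.
def error_measures(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    sq_err = (y_true - y_pred) ** 2
    y_mean = y_true.mean()                      # ȳ, the mean of the actual values
    return {
        "mean_absolute_error": abs_err.mean(),
        "mean_squared_error": sq_err.mean(),
        "relative_absolute_error": abs_err.sum() / np.abs(y_true - y_mean).sum(),
        "relative_squared_error": sq_err.sum() / ((y_true - y_mean) ** 2).sum(),
    }

print(error_measures([3.0, 5.0, 7.0], [2.5, 5.5, 6.0]))
```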