AI351 Lecture 2 - Common Evaluation Metrics

The document discusses various methods for evaluating machine learning models, including test sets, validation sets, cross validation, and metrics like accuracy, precision, recall, and confusion matrices. It explains how to use these methods and metrics to obtain unbiased estimates of a model's performance, understand types of mistakes, and ensure the model selection process is not biased. The goal is to properly evaluate models during both development and deployment.


Evaluating Machine Learning Methods

1
Goals for the lecture
you should understand the following concepts
• test sets
• learning curves
• validation (tuning) sets
• stratified sampling
• cross validation
• internal cross validation
• confusion matrices
• TP, FP, TN, FN
• ROC curves
• confidence intervals for error
• pairwise t-tests for comparing learning systems
• scatter plots for comparing learning systems
• lesion studies
2
Goals for the lecture (continued)

• recall/sensitivity/true positive rate (TPR)


• precision/positive predictive value (PPV)
• specificity and false positive rate (FPR or 1-specificity)
• precision-recall (PR) curves

3
Test sets revisited
How can we get an unbiased estimate of the accuracy of a learned model?

Diagram: the labeled data set is partitioned into a training set and a test set;
the learning method induces a learned model from the training set, and evaluating
that model on the test set yields the accuracy estimate.

4
Test sets revisited
How can we get an unbiased estimate of the accuracy of a
learned model?

• when learning a model, you should pretend that you don’t have the test data yet
  (it is “in the mail”)*

• if the test-set labels influence the learned model in any way, accuracy
  estimates will be biased

* In some applications it is reasonable to assume that you have access to the
  feature vector (i.e. x) but not the y part of each test instance.

5
Learning curves
How does the accuracy of a learning method change as a function of
the training-set size?

this can be assessed by plotting learning curves

6
Figure from Perlich et al. Journal of Machine Learning Research, 2003
Learning curves
given training/test set partition
• for each sample size s on learning curve
• (optionally) repeat n times
• randomly select s instances from training set
• learn model
• evaluate model on test set to determine accuracy a
• plot (s, a) or (s, avg. accuracy and error bars)

7
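A minimal sketch of this procedure in Python, assuming a scikit-learn-style model
(fit/predict) and NumPy arrays X_train, y_train, X_test, y_test (all names here
are illustrative):

```python
import numpy as np

def learning_curve_points(make_model, X_train, y_train, X_test, y_test,
                          sizes, n_repeats=5, seed=0):
    """For each sample size s, train on s randomly chosen training instances
    and record test-set accuracy, averaged over n_repeats resamples."""
    rng = np.random.default_rng(seed)
    points = []
    for s in sizes:
        accs = []
        for _ in range(n_repeats):
            idx = rng.choice(len(X_train), size=s, replace=False)
            model = make_model()                      # fresh model each time
            model.fit(X_train[idx], y_train[idx])
            accs.append(np.mean(model.predict(X_test) == y_test))
        points.append((s, np.mean(accs), np.std(accs)))
    return points   # plot sample size vs. average accuracy, with error bars
```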
Validation (tuning) sets revisited
Suppose we want unbiased estimates of accuracy during the learning
process (e.g. to choose the best level of decision-tree pruning)?

Diagram: the data are first split into a training set and a test set; within the
learning process, the training set is further partitioned into a (smaller)
training set and a validation set. Candidate models are learned on the training
portion, and the validation set is used to select among them, yielding the
learned model.

Partition training data into separate training/validation sets

8


Limitations of using a single
training/test partition
• we may not have enough data to make sufficiently large
training and test sets
• a larger test set gives us a more reliable estimate of accuracy (i.e. a lower
  variance estimate)
• but… a larger training set will be more representative of how much data we
  actually have for the learning process

• a single training set doesn’t tell us how sensitive accuracy is to a particular
  training sample

9
Random resampling
We can address the second issue by repeatedly randomly partitioning the available
data into training and test sets.

Diagram: the labeled data set is randomly partitioned several times, each random
partition giving a different training set / test set pair.

10
Stratified sampling
When randomly selecting training or validation sets, we may want to
ensure that class proportions are maintained in each selected set

Diagram: a labeled data set with 12 positive and 8 negative instances is split so
that the training set, test set, and validation set each preserve the original
class proportions.

This can be done via stratified sampling: first stratify instances by class, then
randomly select instances from each class proportionally.
11
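In scikit-learn this is exposed directly; a minimal sketch, assuming a feature
matrix X and label vector y:

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both resulting sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```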
Cross validation

Diagram: the labeled data set is partitioned into n subsamples (here s1 … s5).
We iteratively leave one subsample out for the test set and train on the rest:

iteration   train on        test on
1           s2 s3 s4 s5     s1
2           s1 s3 s4 s5     s2
3           s1 s2 s4 s5     s3
4           s1 s2 s3 s5     s4
5           s1 s2 s3 s4     s5

12
Cross validation example
Suppose we have 100 instances, and we want to estimate accuracy
with cross validation

iteration   train on        test on   correct
1           s2 s3 s4 s5     s1        11 / 20
2           s1 s3 s4 s5     s2        17 / 20
3           s1 s2 s4 s5     s3        16 / 20
4           s1 s2 s3 s5     s4        13 / 20
5           s1 s2 s3 s4     s5        16 / 20

accuracy = 73/100 = 73%

13
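The pooled accuracy is simply the total number of correct predictions over the
total number of test instances; a quick check of the arithmetic above:

```python
correct_per_fold = [11, 17, 16, 13, 16]      # from the table above
fold_size = 20
accuracy = sum(correct_per_fold) / (fold_size * len(correct_per_fold))
print(accuracy)                              # 0.73
```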
Cross validation
• 10-fold cross validation is common, but smaller values of
n are often used when learning takes a lot of time

• in leave-one-out cross validation, n = # instances

• in stratified cross validation, stratified sampling is used when partitioning
  the data

• CV makes efficient use of the available data for testing

• note that whenever we use multiple training sets, as in CV and random
  resampling, we are evaluating a learning method as opposed to an individual
  learned model
14
Internal cross validation
Instead of a single validation set, we can use cross-validation within a
training set to select a model (e.g. to choose the best level of decision-tree
pruning)

Diagram: the data are split into a training set and a test set; within the
learning process, the training set is partitioned into folds s1 … s5, candidate
models are learned via cross validation over the folds, and one is selected as
the learned model.

15
Example: using internal cross
validation to select k in k-NN
given a training set
1. partition training set into n folds, s1 … sn
2. for each value of k considered
     for i = 1 to n
        learn k-NN model using all folds but si
        evaluate accuracy on si
3. select k that resulted in best accuracy for s1 … sn
4. learn model using entire training set and selected k

the steps inside the box are run independently for each training set
(i.e. if we’re using 10-fold CV to measure the overall accuracy
of our k-NN approach, then the box would be executed 10 times)
16
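One way to express this procedure with scikit-learn (a sketch, not the only
option, assuming data X and labels y): GridSearchCV carries out the internal
cross-validation over k on each training set, and wrapping it in cross_val_score
runs the outer 10-fold evaluation of the whole approach.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# inner CV: selects k using only the training data of each outer fold
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=5)

# outer CV: estimates the accuracy of the whole k-selection procedure
scores = cross_val_score(inner, X, y, cv=10)
print(scores.mean())
```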
Confusion matrices
How can we understand what types of mistakes a learned model makes?

Figure: confusion matrix for activity recognition from video, with predicted
class along one axis and actual class along the other (figure from vision.jhu.edu)

17
Confusion matrix for 2-class problems
                              actual class
                              positive                  negative

predicted      positive       true positives (TP)       false positives (FP)
class          negative       false negatives (FN)      true negatives (TN)

accuracy = (TP + TN) / (TP + FP + FN + TN)

18
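A minimal sketch of computing these counts and the accuracy, assuming y_true and
y_pred are NumPy arrays of 0/1 labels (1 = positive):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```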
Is accuracy an adequate measure
of predictive performance?
• accuracy may not be useful measure in cases where
• there is a large class skew
• Is 98% accuracy good if 97% of the instances are negative?

• there are differential misclassification costs – say, getting a positive wrong
  costs more than getting a negative wrong
  • Consider a medical domain in which a false positive results in an extraneous
    test but a false negative results in a failure to treat a disease

• we are most interested in a subset of high-confidence predictions
19
Other accuracy metrics
                              actual class
                              positive                  negative

predicted      positive       true positives (TP)       false positives (FP)
class          negative       false negatives (FN)      true negatives (TN)

true positive rate (recall) = TP / actual pos = TP / (TP + FN)

false positive rate = FP / actual neg = FP / (TN + FP)
20
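Continuing the confusion_counts sketch above, the two rates come straight from the
same four counts:

```python
tpr = tp / (tp + fn)        # true positive rate (recall)
fpr = fp / (fp + tn)        # false positive rate
specificity = 1 - fpr       # equivalently tn / (tn + fp)
```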
ROC curves
A Receiver Operating Characteristic (ROC) curve plots the TP-rate vs. the
FP-rate as a threshold on the confidence of an instance being positive is
varied

Figure: ROC space, with the true positive rate on the y-axis and the false
positive rate on the x-axis (both ranging from 0 to 1.0). The ideal point is the
upper-left corner, the diagonal is the expected curve for random guessing, and
curves for two algorithms (Alg 1 and Alg 2) are shown.

Different methods can work better in different parts of ROC space. This depends
on the cost of false positives vs. false negatives.

21
ROC curve example

figure from Bockhorst et al., Bioinformatics 2003


22
ROC curves and misclassification costs

• best operating point when FN costs 10× FP

• best operating point when the cost of misclassifying positives and negatives
  is equal

• best operating point when FP costs 10× FN
23
Algorithm for creating an ROC curve

1. sort test-set predictions according to confidence that each instance is
   positive

2. step through the sorted list from high to low confidence

   i.   locate a threshold between instances with opposite classes (keeping
        instances with the same confidence value on the same side of the
        threshold)

   ii.  compute TPR, FPR for instances above the threshold

   iii. output (FPR, TPR) coordinate

24
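A minimal sketch of this algorithm, assuming a list of confidences and 0/1 labels.
For simplicity it emits a point after every run of equal confidence values rather
than only at class changes; the extra points fall on the line segments between the
points the algorithm above would output, so the resulting curve is the same.

```python
import numpy as np

def roc_points(confidences, labels):
    """Return (FPR, TPR) points for thresholds swept from high to low confidence."""
    conf = np.asarray(confidences, dtype=float)
    y = np.asarray(labels)
    order = np.argsort(-conf)              # sort by decreasing confidence
    conf, y = conf[order], y[order]
    n_pos, n_neg = int((y == 1).sum()), int((y == 0).sum())
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in range(len(y)):
        tp += int(y[i] == 1)
        fp += int(y[i] == 0)
        # place a threshold only between instances with different confidences
        if i == len(y) - 1 or conf[i] != conf[i + 1]:
            points.append((fp / n_neg, tp / n_pos))
    return points
```

Running this on the ten examples of the next slide reproduces the listed
(FPR, TPR) coordinates, plus intermediate points along the same segments.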
Plotting an ROC curve

instance   confidence positive   correct class
Ex 9       .99                   +
Ex 7       .98                   +        TPR = 2/5, FPR = 0/5
Ex 1       .72                   -        TPR = 2/5, FPR = 1/5
Ex 2       .70                   +
Ex 6       .65                   +        TPR = 4/5, FPR = 1/5
Ex 10      .51                   -
Ex 3       .39                   -        TPR = 4/5, FPR = 3/5
Ex 5       .24                   +        TPR = 5/5, FPR = 3/5
Ex 4       .11                   -
Ex 8       .01                   -        TPR = 5/5, FPR = 5/5

Figure: the resulting (FPR, TPR) coordinates plotted as an ROC curve, with the
true positive rate and false positive rate both ranging from 0 to 1.0.

25
Plotting an ROC curve
can interpolate between points to get convex hull
• convex hull: repeatedly, while possible, perform interpolations that
skip one data point and discard any point that lies below a line
• interpolated points are achievable in theory: can flip weighted coin
to choose between classifiers represented by plotted points

Figure: ROC plot (true positive rate vs. false positive rate) showing the plotted
points and the interpolated convex hull.

26


ROC curves
Does a low false-positive rate indicate that most positive predictions
(i.e. predictions with confidence > some threshold) are correct?

suppose our TPR is 0.9, and FPR is 0.01

fraction of instances          fraction of positive
that are positive              predictions that are correct
0.5                            0.989
0.1                            0.909
0.01                           0.476
0.001                          0.083

27
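The right-hand column follows from rewriting precision in terms of the rates and
the fraction π of instances that are positive:
precision = TPR·π / (TPR·π + FPR·(1 − π)). A quick check of the table:

```python
def precision_from_rates(tpr, fpr, pos_fraction):
    tp = tpr * pos_fraction            # expected true positives per instance
    fp = fpr * (1 - pos_fraction)      # expected false positives per instance
    return tp / (tp + fp)

for pi in (0.5, 0.1, 0.01, 0.001):
    print(pi, round(precision_from_rates(0.9, 0.01, pi), 3))
# prints 0.989, 0.909, 0.476, 0.083
```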
Other accuracy metrics
                              actual class
                              positive                  negative

predicted      positive       true positives (TP)       false positives (FP)
class          negative       false negatives (FN)      true negatives (TN)

recall (TP rate) = TP / actual pos = TP / (TP + FN)

precision = TP / predicted pos = TP / (TP + FP)

28
Precision/recall curves
A precision/recall curve plots the precision vs. recall (TP-rate) as a
threshold on the confidence of an instance being positive is varied

Figure: precision (y-axis, 0 to 1.0) plotted against recall / TPR (x-axis, 0 to
1.0). The ideal point is at the upper right; the default precision is determined
by the fraction of instances that are positive.

29
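A minimal sketch of producing both curves from a model's confidence scores, using
scikit-learn (y_true and scores are assumed to exist):

```python
from sklearn.metrics import (precision_recall_curve, roc_curve,
                             roc_auc_score, average_precision_score)

fpr, tpr, _ = roc_curve(y_true, scores)                        # ROC curve points
precision, recall, _ = precision_recall_curve(y_true, scores)  # PR curve points

print("area under ROC curve:", roc_auc_score(y_true, scores))
print("average precision   :", average_precision_score(y_true, scores))
```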
Mammography Example: ROC
Mammography Example: PR
How do we get one ROC/PR curve
when we do cross validation?

Approach 1
• make assumption that confidence values are comparable
across folds
• pool predictions from all test sets
• plot the curve from the pooled predictions

Approach 2 (for ROC curves)


• plot individual curves for all test sets
• view each curve as a function
• plot the average curve for this set of functions

32
Comments on ROC and PR curves
both
• allow predictive performance to be assessed at various levels of
confidence
• assume binary classification tasks
• sometimes summarized by calculating area under the curve

ROC curves
• insensitive to changes in class distribution (ROC curve does not
change if the proportion of positive and negative instances in the test
set are varied)
• can identify optimal classification thresholds for tasks with differential
misclassification costs

precision/recall curves
• show the fraction of positive predictions that are false positives
• well suited for tasks with lots of negative instances 33
To Avoid Cross-Validation Pitfalls, Ask:
• 1. Is my held-aside test data really
representative of going out to collect
new data?
– Even if your methodology is fine,
someone may have collected features for
positive examples differently than for
negatives – should be randomized
– Example: samples from cancer processed
by different people or on different days
than samples for normal controls
34
To Avoid Pitfalls, Ask:
• 2. Did I repeat my entire data
processing procedure on every fold of
cross-validation, using only the
training data for that fold?
– On each fold of cross-validation, did I
ever access in any way the label of a test
case?
– Any preprocessing done over entire data
set (feature selection, parameter tuning,
threshold selection) must not use labels
35
To Avoid Pitfalls, Ask:
• 3. Have I modified my algorithm so
many times, or tried so many
approaches, on this same data set that
I (the human) am overfitting it?
– Have I continually modified my
preprocessing or learning algorithm until I
got some improvement on this data set?
– If so, I really need to get some additional
data now to at least test on

36
Confidence intervals on error
Given the observed error (accuracy) of a model over a limited
sample of data, how well does this error characterize its accuracy
over additional instances?

Suppose we have
• a learned model h
• a test set S containing n instances drawn independently of one
another and independent of h
• n ≥ 30
• h makes r errors over the n instances

our best estimate of the error of h is

   error_S(h) = r / n
37
Confidence intervals on error

With approximately N% probability, the true error lies in the interval

   error_S(h) ± z_N · sqrt( error_S(h) · (1 − error_S(h)) / n )

where z_N is a constant that depends on N (e.g. for 95% confidence, z_N = 1.96)

38
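A minimal sketch of the interval computation, e.g. for a model that makes r = 10
errors on n = 100 test instances (illustrative numbers):

```python
import math

def error_confidence_interval(r, n, z=1.96):      # z = 1.96 for 95% confidence
    err = r / n
    half_width = z * math.sqrt(err * (1 - err) / n)
    return err - half_width, err + half_width

print(error_confidence_interval(10, 100))         # roughly (0.041, 0.159)
```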
Confidence intervals on error
How did we get this?
1. Our estimate of the error follows a binomial distribution given by n
and p (the true error rate over the data distribution)

2. Simplest (and most common) way to determine a binomial confidence interval is
   to use the normal approximation

39
Confidence intervals on error
2. When n ≥ 30, and p is not too extreme, the normal distribution is a
good approximation to the binomial

3. We can determine the N% confidence interval by determining what bounds contain
   N% of the probability mass under the normal

40
Empirical Confidence Bounds
• Bootstrapping: Given n examples in
data set, randomly, uniformly,
independently (with replacement) draw
n examples – bootstrap sample
• Repeat 1000 (or 10,000) times:
– Draw bootstrap sample
– Repeat entire cross-validation process
• Lower (upper) bound is the result such that 2.5% of runs yield a lower (higher)
  value
41
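A minimal sketch of this percentile bootstrap, where run_cv is assumed to be a
function that runs the entire cross-validation process on a data set and returns
an accuracy (the name is illustrative):

```python
import numpy as np

def bootstrap_bounds(data, run_cv, n_boot=1000, seed=0):
    """Draw bootstrap samples, rerun the whole CV procedure on each, and
    return the 2.5th and 97.5th percentiles of the results."""
    rng = np.random.default_rng(seed)
    n = len(data)
    results = []
    for _ in range(n_boot):
        sample = [data[i] for i in rng.integers(0, n, size=n)]  # with replacement
        results.append(run_cv(sample))
    return np.percentile(results, 2.5), np.percentile(results, 97.5)
```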
Comparing learning systems

How can we determine if one learning system provides better performance than
another
• for a particular task?
• across a set of tasks / data sets?

42
Motivating example

Accuracies on test sets

System 1:   80   50   75   …   99
System 2:   79   49   74   …   98
δ:          +1   +1   +1   …   +1

• Mean accuracy for System 1 is better, but the standard deviations for the two
  clearly overlap
• Notice that System 1 is always better than System 2

43
Comparing systems using a paired t test
• consider δ’s as observed values of a set of i.i.d. random variables

• null hypothesis: the 2 learning systems have the same accuracy
• alternative hypothesis: one of the systems is more accurate than the other

• hypothesis test:
  – use a paired t-test to determine the probability p that the mean of the δ’s
    would arise under the null hypothesis
  – if p is sufficiently small (typically < 0.05) then reject the null hypothesis

44
Comparing systems using a paired t test
1. calculate the sample mean of the δ’s:

      δ̄ = (1/n) · Σ_{i=1}^{n} δ_i

2. calculate the t statistic:

      t = δ̄ / sqrt( (1 / (n(n−1))) · Σ_{i=1}^{n} (δ_i − δ̄)² )

3. determine the corresponding p-value, by looking up t in a table of values for
   the Student's t-distribution with n−1 degrees of freedom

45
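A sketch of the same computation on illustrative per-fold accuracies (not the
numbers from the motivating example); scipy.stats.ttest_rel returns the two-tailed
p-value directly.

```python
import numpy as np
from scipy import stats

acc_sys1 = np.array([0.81, 0.52, 0.78, 0.90, 0.66])   # illustrative fold accuracies
acc_sys2 = np.array([0.79, 0.50, 0.77, 0.85, 0.64])

# by hand, following the three steps above
delta = acc_sys1 - acc_sys2
n = len(delta)
t = delta.mean() / np.sqrt(np.sum((delta - delta.mean()) ** 2) / (n * (n - 1)))

# or directly: paired t-test, two-tailed p-value
t_stat, p_value = stats.ttest_rel(acc_sys1, acc_sys2)
```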
Comparing systems using a paired t test

The null distribution of our t statistic looks like this (figure: density f(t)
plotted against t)

The p-value indicates how far out in a tail our t statistic is

If the p-value is sufficiently small, we reject the null hypothesis, since it is
unlikely we’d get such a t by chance

for a two-tailed test, the p-value represents the probability mass in these two
regions

46
Why do we use a two-tailed test?

• a two-tailed test asks the question: is the accuracy of the two systems
  different?
• a one-tailed test asks the question: is system A better than system B?
• a priori, we don’t know which learning system will be more accurate (if there
  is a difference) – we want to allow that either one might be

47
Sign Test
• If we have fewer than 300 examples, we won’t have 30 test examples per fold
• Prefer leave-one-out cross-validation
• Count “wins” for Algorithm A and B over the N test examples on which they
  disagree
• Let M be the larger of these counts
• What is the probability under b(N, 0.5) that either A or B would win at least
  M times?
48
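A minimal sketch with illustrative win counts; the p-value is the probability,
under a Binomial(N, 0.5) null, that either algorithm would win at least M of the
N examples on which they disagree.

```python
from scipy.stats import binom

wins_a, wins_b = 17, 8        # illustrative counts on examples where A, B disagree
N = wins_a + wins_b
M = max(wins_a, wins_b)

# P(X >= M) under Binomial(N, 0.5), doubled to cover either algorithm winning
p_value = 2 * binom.sf(M - 1, N, 0.5)
```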
Scatter plots for pairwise
method comparison
We can compare the performance of two methods A and B by plotting (A
performance, B performance) across numerous data sets

figure from Freund & Mason, ICML 1999 figure from Noto & Craven, BMC Bioinformatics 2006

49
Lesion studies
We can gain insight into what contributes to a learning system’s
performance by removing (lesioning) components of it

The ROC curves here show how performance is affected when various
feature types are removed from the learning representation

50
figure from Bockhorst et al., Bioinformatics 2003
