Classification Evaluation
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class |          C1          |          ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

A\P |  C  | ¬C  |
C   | TP  | FN  | P
¬C  | FP  | TN  | N
    | P'  | N'  | All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
• Error rate: 1 – accuracy, or
  Error rate = (FP + FN)/All
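A minimal sketch (in Python, with placeholder counts rather than values from the slides) of how both measures follow from the four confusion-matrix cells:

    # Minimal sketch: accuracy and error rate from confusion-matrix counts.
    # The counts below are illustrative placeholders; substitute real results.
    tp, fn = 90, 10      # actual positives: correctly / incorrectly classified
    fp, tn = 5, 95       # actual negatives: incorrectly / correctly classified

    all_tuples = tp + fn + fp + tn
    accuracy = (tp + tn) / all_tuples      # Accuracy = (TP + TN)/All
    error_rate = (fp + fn) / all_tuples    # Error rate = (FP + FN)/All = 1 - accuracy
    print(f"accuracy={accuracy:.4f}, error rate={error_rate:.4f}")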
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All = (6954 + 2588)/10000 = 9542/10000 = 0.9542
• Error rate: 1 – accuracy, or
  Error rate = (FP + FN)/All = 1 – 0.9542 = 0.0458
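A quick sketch checking the arithmetic above with the counts from the buy_computer table:

    # Counts taken from the buy_computer confusion matrix above.
    tp, fn, fp, tn = 6954, 46, 412, 2588
    total = tp + fn + fp + tn                   # 10000 test tuples
    print((tp + tn) / total)                    # accuracy   -> 0.9542
    print((fp + fn) / total)                    # error rate -> 0.0458 (= 1 - accuracy)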
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
• Class Imbalance Problem:
  – One class may be rare, e.g., fraud, or HIV-positive
  – The significant majority of the tuples belong to the negative class and the rest are positive

A\P |  C  |    ¬C    |
C   |  0  |      10  | 10
¬C  |  0  |  999990  | 999990
    |  0  | 1000000  | 1000000

• Sensitivity: True Positive recognition rate
  – Sensitivity = TP/P
• Specificity: True Negative recognition rate
  – Specificity = TN/N
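A short sketch of why accuracy alone is misleading on the imbalanced table above: a classifier that never predicts the rare class C still scores 99.999% accuracy, while its sensitivity is 0.

    # Counts from the imbalanced example above: the classifier predicts ¬C for everything.
    tp, fn = 0, 10               # P = 10 rare positives, all missed
    fp, tn = 0, 999_990          # N = 999,990 negatives, all correct
    p, n = tp + fn, fp + tn

    accuracy = (tp + tn) / (p + n)   # 0.99999
    sensitivity = tp / p             # TP/P = 0.0
    specificity = tn / n             # TN/N = 1.0
    print(accuracy, sensitivity, specificity)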
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000
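As a worked sketch for this table (the two values below are derived here from the counts, not quoted from a slide):

    # buy_computer table: TP=6954, FN=46, FP=412, TN=2588
    sensitivity = 6954 / 7000    # TP/P ≈ 0.9934
    specificity = 2588 / 3000    # TN/N ≈ 0.8627
    print(sensitivity, specificity)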
Example
Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             |        6954        |         46        |  7000
buy_computer = no              |         412        |       2588        |  3000
Total                          |        7366        |       2634        | 10000

• Recall: completeness (Sensitivity: True Positive recognition rate)
  Recall = 6954/7000 = 0.9934
Precision/Recall
• Inverse relationship between precision & recall
• A system that tags every tuple as positive has 100% recall!
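A sketch illustrating both points on the buy_computer table: the actual classifier has high precision and recall, while a degenerate classifier that labels everything positive gets perfect recall but much lower precision (the precision values below are derived here, not taken from a slide).

    # Actual classifier from the buy_computer table: TP=6954, FN=46, FP=412, TN=2588.
    tp, fn, fp, tn = 6954, 46, 412, 2588
    print("precision:", tp / (tp + fp))   # 6954/7366 ≈ 0.9441
    print("recall:   ", tp / (tp + fn))   # 6954/7000 ≈ 0.9934

    # Degenerate classifier that tags every tuple as positive: TP=7000, FP=3000.
    print("all-positive precision:", 7000 / (7000 + 3000))   # 0.70
    print("all-positive recall:   ", 7000 / 7000)            # 1.0 -> 100% recall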
F1 and other averages
• The simple (arithmetic) mean is 50% for a “return-everything” search engine, which is too high.

Precision | Recall | Average |  F1
   0.01   |  0.99  |   0.5   | 0.02
   0.1    |  0.9   |   0.5   | 0.18
   0.2    |  0.8   |   0.5   | 0.32
   0.5    |  0.5   |   0.5   | 0.5
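A small sketch reproducing the table: F1 is the harmonic mean 2PR/(P + R), which stays low whenever either component is low, unlike the arithmetic mean.

    # Reproduce the Precision/Recall/Average/F1 table above.
    pairs = [(0.01, 0.99), (0.1, 0.9), (0.2, 0.8), (0.5, 0.5)]
    for p, r in pairs:
        average = (p + r) / 2                  # arithmetic mean: 0.5 in every row
        f1 = 2 * p * r / (p + r)               # harmonic mean
        print(f"P={p:.2f} R={r:.2f} avg={average:.2f} F1={f1:.2f}")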
Combined Measures
• Desideratum: Punish really bad performance on either precision or recall.
  – Taking the minimum achieves this.
  – But the minimum is not smooth and hard to weight.
  – F (harmonic mean) is a kind of smooth minimum.
[Figure: Minimum, Maximum, Arithmetic, Geometric, and Harmonic means of precision and recall, plotted as precision varies with recall fixed at 70%]
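A sketch mirroring the figure: with recall fixed at 0.7, compare the candidate combinations as precision varies; only the minimum and the harmonic mean collapse when precision is terrible.

    import math

    recall = 0.7                                # recall fixed at 70%, as in the figure
    for precision in (0.05, 0.2, 0.5, 0.9):
        combos = {
            "minimum":    min(precision, recall),
            "maximum":    max(precision, recall),
            "arithmetic": (precision + recall) / 2,
            "geometric":  math.sqrt(precision * recall),
            "harmonic":   2 * precision * recall / (precision + recall),
        }
        print(precision, {k: round(v, 2) for k, v in combos.items()})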
ROC Curve
• The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers.
  – Plots the true positive rate (TPR) against the false positive rate (FPR)
• TPR (recall, sensitivity)
  – TP/P
• FPR (1 – true negative rate)
  – FP/N, or 1 – TN/N
ROC Curve
• TPR (recall, sensitivity)
  – TP/P (6954/7000 = 0.99)
• FPR (1 – true negative rate)
  – FP/N, or 1 – TN/N (412/3000 = 0.14)
ROC Curve
There is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces.
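A hedged sketch of how an ROC curve is typically obtained in practice, assuming scikit-learn is available; roc_curve sweeps the decision threshold over the classifier's scores and returns the (FPR, TPR) pairs that trace the trade-off described above. The labels and scores below are made-up toy values.

    from sklearn.metrics import roc_curve, roc_auc_score

    # Toy data: true labels and the classifier's positive-class scores.
    y_true  = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
    y_score = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
    print(list(zip(fpr, tpr)))
    print("AUC:", roc_auc_score(y_true, y_score))       # area under the ROC curve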
Segmenting the dataset for training and
testing
• Holdout method
• Cross validation
Evaluating Classifier Accuracy:
Holdout
• Holdout method
– Given data is randomly partitioned into two
independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the
accuracies obtained
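A minimal holdout sketch, assuming scikit-learn and a decision tree as the model; train_test_split performs the random 2/3 training, 1/3 test partition described above.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Random holdout: 2/3 for model construction, 1/3 for accuracy estimation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=42)

    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))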
Evaluating Classifier Accuracy:
Cross-Validation Methods
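A hedged sketch of the usual k-fold procedure: the data is split into k folds, each fold serves once as the test set while the other k – 1 folds are used for training, and the k accuracies are averaged. Assuming scikit-learn and a decision tree as the model:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 10-fold cross-validation: each fold is held out once as the test set.
    scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())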
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
  – A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^(–1) ≈ 0.368)
  – Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations of 0.632 × Acc(M_i) on the test set + 0.368 × Acc(M_i) on the training set
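A sketch of one .632 bootstrap iteration, assuming NumPy: sampling d indices with replacement leaves roughly 36.8% of the tuples out, and those out-of-sample tuples form the test set.

    import numpy as np

    rng = np.random.default_rng(42)
    d = 1000                                          # number of tuples in the data set

    # Sample d indices with replacement -> bootstrap training set.
    train_idx = rng.choice(d, size=d, replace=True)
    test_idx = np.setdiff1d(np.arange(d), train_idx)  # tuples never drawn form the test set

    print("fraction of tuples in training set:", len(np.unique(train_idx)) / d)  # ≈ 0.632
    print("fraction of tuples in test set:    ", len(test_idx) / d)              # ≈ 0.368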
Model Evaluation and Selection
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Estimating Confidence Intervals:
Null Hypothesis
Estimating Confidence Intervals:
Table for t-distribution
• Symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit, z = sig/2
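A hedged sketch of the kind of significance test these slides refer to: a paired t-test on the per-fold error rates of two models M1 and M2 over the same 10 cross-validation folds, assuming SciPy. The error values are made up for illustration.

    from scipy import stats

    # Hypothetical per-fold error rates for M1 and M2 over the same 10 folds.
    err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.10, 0.15, 0.11, 0.12]
    err_m2 = [0.15, 0.14, 0.16, 0.13, 0.15, 0.14, 0.13, 0.17, 0.14, 0.15]

    t_stat, p_value = stats.ttest_rel(err_m1, err_m2)   # paired (related-samples) t-test
    print("t =", t_stat, "p =", p_value)
    if p_value < 0.05:    # sig = 0.05
        print("M1 and M2 differ significantly at the 5% significance level")
    else:
        print("no significant difference at the 5% significance level")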
Estimating Confidence Intervals:
Statistical Significance
Issues Affecting Model Selection
• Accuracy
– classifier accuracy: predicting class label
• Speed
– time to construct the model (training time)
– time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules