
Evaluating Performance of

Machine Learning Models


Mehul Motani
Electrical & Computer Engineering
National University of Singapore
Email: [email protected]

© Mehul Motani Model Performance 1

Supervised learning paradigm


Note the three different datasets:
• Training Data – Optimize model parameters and train the model
• Validation Data – Hyperparameter tuning and model selection
• Test Data – Evaluate the final trained model

© Mehul Motani Model Performance 2


Training and evaluating the learning algorithm
• Divide available labeled data into three sets:
• Training set:
– Used for model parameter optimization and training
• Validation set:
– Used for hyperparameter tuning and model selection
• Test set:
– Used only for final evaluation of the trained model
– Done after training and validation are completely finished
• Avoid data leakage
– The test data should not influence the choice of model structure or
optimization of parameters.
– If after evaluating on the test set, you don’t like the results, you must set
aside a new test set before training a new model.
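As a concrete illustration of the three-way split, here is a minimal sketch assuming scikit-learn and pre-existing arrays `X` (features) and `y` (labels); the 60/20/20 proportions are an arbitrary choice for the example.

```python
from sklearn.model_selection import train_test_split

# First carve out the held-out test set (20% of the data).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then split the remainder into training (60%) and validation (20%) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 of the remaining 80% = 20%
```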
© Mehul Motani Model Performance 3

More on the validation set


• Validation set is used when you have enough labeled
data available
• Validation set
– Used to gauge the generalization error
– Used to optimize a small number of high-level meta-parameters
• regularization constants; number of gradient descent iterations
• model structure: number of nodes and connections
• types and numbers of parameters: coefficients, weights, etc.
– Used to perform model-selection
• For example, linear vs polynomial regression

More at: https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets
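For instance, the linear vs. polynomial regression selection mentioned above could be sketched as follows, assuming scikit-learn and pre-existing `X_train`, `y_train`, `X_val`, `y_val` arrays (the candidate degrees are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

best_degree, best_mse = None, np.inf
for degree in [1, 2, 3, 4, 5]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)                            # fit on training data only
    mse = mean_squared_error(y_val, model.predict(X_val))  # compare on validation data
    if mse < best_mse:
        best_degree, best_mse = degree, mse

print("Selected polynomial degree:", best_degree)
```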

© Mehul Motani Model Performance 4


K-fold cross-validation

[Figure: 5-fold cross-validation, with the data split into five block folds and each fold used once as the validation set.]

• K-fold cross validation is used when we have little data


– Typically, we use block folds (shown above) as this allows every sample to be in the validation set.
– We can also use random folds if samples are independent.
• Report average performance over the different experiments
• Or use cross-validation for hyperparameter tuning and then report results on a held-out test set (a minimal sketch follows below).
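A minimal 5-fold cross-validation sketch, assuming scikit-learn, arrays `X` and `y`, and logistic regression purely as a placeholder model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# shuffle=False gives contiguous block folds; set shuffle=True for random
# folds when the samples are independent.
cv = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Note that `cross_val_score` fits a fresh clone of the model on each fold, so no state carries over between experiments.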
© Mehul Motani Model Performance 5

Important – Avoid Data leakage


• Data leakage is when the test set (or validation set) leaks
information to the model. This gives you an optimistic
performance prediction and invalidates your entire experiment.
• If you pre-process your data (e.g., normalization), you must do
this on the training set only, not on the entire dataset.
– For example, if you include the test set in normalization, then information about the test set will leak into the training set and the model.
– This also applies to K-fold CV with the training and validation sets.
• In K-fold cross validation, you must discard the model and
restart after every experiment.
• If after testing on the test set, you want to train a new model,
you must restart with a new test set. Otherwise, information
about the test set can leak into your model tweaking.
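One common way to avoid preprocessing leakage is to put the normalization step inside a pipeline so that its statistics are learned from the training data only. A minimal sketch, assuming scikit-learn and pre-existing `X_train`, `y_train`, `X_test`, `y_test` arrays (the SVM is just a placeholder model):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The scaler's mean and standard deviation are computed from X_train only,
# so no information from the test set leaks into preprocessing.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```

The same pipeline can be passed to cross-validation routines, so the scaler is re-fit on the training folds of each split.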

© Mehul Motani Model Performance 6


How good is a classifier?
• Accuracy
– a = No. of test samples with label correctly predicted
– b = No. of test samples with label incorrectly predicted
– accuracy = a / (a + b)
• Example: 75 samples in test set
– correct class label predicted for 62 samples
– wrong class label predicted for 13 samples
– accuracy = 62 / 75 = 82.67%
• Limitations of accuracy
– Consider a two-class problem
• number of class 1 test samples = 9990
• number of class 2 test samples = 10
– What if model predicts everything to be class 1?
• accuracy is extremely high: 9990 / 10000 = 99.9 %
• but model will never correctly predict any sample in class 2
• in this case accuracy is misleading and does not give a good picture of
model quality
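The imbalanced-class example can be reproduced numerically; a minimal sketch assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 9990 samples of class 1 (coded 0) and 10 samples of class 2 (coded 1).
y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros_like(y_true)   # a model that predicts everything as class 1

print(accuracy_score(y_true, y_pred))             # 0.999 -- looks excellent
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0   -- class 2 is never detected
```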

© Mehul Motani Model Performance 7

Metrics for classifier performance


Confusion matrix for binary classification (rows = predicted class, columns = actual class):

                               actual class 1 (negative)    actual class 2 (positive)
predicted class 1 (negative)   21 (TN)                       6 (FN)
predicted class 2 (positive)    7 (FP)                      41 (TP)

TN: true negatives   FN: false negatives
FP: false positives  TP: true positives

https://en.wikipedia.org/wiki/Confusion_matrix
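In code, the same counts can be obtained with scikit-learn; a minimal sketch assuming `y_true` and `y_pred` label arrays from your test set (note that scikit-learn's convention is rows = actual class and columns = predicted class, the transpose of the slide's layout):

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # 0 = negative, 1 = positive
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```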
© Mehul Motani Model Performance 8
Recall, Specificity, and Precision
Recall and Specificity
• Recall → True positive rate
• Specificity → True negative rate
• These are useful when false positives and false negatives have different consequences.
• 'Stupid' methods can achieve large recall at the expense of low specificity (and vice versa).
• Which one is more important depends upon the application.
• Recall is important when false negatives are catastrophic (e.g., missed cancer detection).
• Specificity is important when false positives are bad (e.g., identifying the wrong person in a DNA test).

Recall and Precision
• Recall → True positive rate
• Precision → Positive predictive value
• 'Stupid' methods can achieve large recall at the expense of low precision (and vice versa).
• Which one is more important depends upon the application.
• Recall is important when false negatives are catastrophic and you want to detect all positive cases.
• Precision is important when being right (a positive prediction is correct) outweighs detecting all positives.
• F1-Score is the harmonic mean of precision and recall (used when both are important).
© Mehul Motani Model Performance 9

[Figure: a test set of 12 samples with actual labels, and the predicted labels (P/N) from two algorithms.]

Algorithm 1 (TP = 3, FP = 3, FN = 2, TN = 4):
• Accuracy = (TP + TN) / (TP + TN + FP + FN) = 7 / 12 = 0.58
• Recall = TP / (TP + FN) = 3 / 5 = 0.6
• Specificity = TN / (TN + FP) = 4 / 7 = 0.57
• Precision = TP / (TP + FP) = 3 / 6 = 0.5

Algorithm 2 (TP = 4, FP = 5, FN = 1, TN = 2):
• Accuracy = (TP + TN) / (TP + TN + FP + FN) = 6 / 12 = 0.5
• Recall = TP / (TP + FN) = 4 / 5 = 0.8
• Specificity = TN / (TN + FP) = 2 / 7 = 0.29
• Precision = TP / (TP + FP) = 4 / 9 = 0.44

Which algorithm is better?


© Mehul Motani Model Performance 10
Exploring the performance tradeoffs
• In a classification problem, we may decide to predict the class values directly
or predict the probabilities for each class instead.
• Computing probabilities allows us to trade off false positives and false negatives using a threshold.
• Two diagnostic tools that help in the interpretation of probabilistic forecast
for binary classification problems are Receiver Operating Characteristic (ROC)
curves and Precision-Recall curves.
• ROC Curves summarize the trade-off between the true positive rate (Recall)
and false positive rate (1-Specificity) for a predictive model using different
probability thresholds.
• Precision-Recall curves summarize the trade-off between the true positive
rate (Recall) and the positive predictive value (Precision) for a predictive
model using different probability thresholds.
• ROC curves are appropriate when the observations are balanced between
each class, whereas Precision-Recall curves are appropriate for imbalanced
datasets.
• For both tradeoffs, the area under the curve (AUC) can be used as a summary
of the tradeoff.
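Both curves and their summary areas can be computed from predicted probabilities. A minimal sketch assuming scikit-learn, a binary `y_true`, and `y_score` holding the predicted probability of the positive class (e.g. `model.predict_proba(X_test)[:, 1]`):

```python
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)                        # ROC curve points
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)   # PR curve points

print("AUC-ROC:", roc_auc_score(y_true, y_score))
print("Area under the PR curve (average precision):",
      average_precision_score(y_true, y_score))
```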

© Mehul Motani Model Performance 11

More on the ROC Curve and AUC-ROC


• Many algorithms (e.g., logistic regression) return a probability which can then
be mapped to two or more classes.
• Other algorithms (e.g., SVM and Random Forest) can be configured to return
probabilities instead of class decisions.
• Mapping from probabilities is done by comparing to a threshold. For
example, a value below the threshold can be class 0 (negative) and a value
above the threshold can be class 1 (positive).
• You might think that a threshold of 0.5 is right but thresholds are problem
dependent and must be tuned based on the impact of false positives and
missed detections.
• Varying the threshold allows us to explore tradeoffs.
– For example, lowering the threshold classifies more items as positive, thus increasing
both False Positives and True Positives.
• The ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at different classification thresholds. This curve plots the true positive rate against the false positive rate.
– The area under the ROC (AUC-ROC) is a single metric to evaluate a classifier. An AUC-
ROC value closer to 1 indicates a good classification algorithm.
– A random classifier has an AUC-ROC of 0.5.
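To make the thresholding step concrete, a minimal sketch (plain NumPy, with a hypothetical `predict_with_threshold` helper and made-up probabilities):

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.5):
    """Map predicted probabilities of the positive class to 0/1 labels."""
    return (np.asarray(probabilities) >= threshold).astype(int)

probs = [0.1, 0.4, 0.6, 0.9]                          # made-up example scores
print(predict_with_threshold(probs, threshold=0.5))   # [0 0 1 1]
print(predict_with_threshold(probs, threshold=0.3))   # [0 1 1 1] -- more positives
# Lowering the threshold flags more samples as positive, raising both the
# true positive and false positive counts.
```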

© Mehul Motani Model Performance 12


ROC Curves and PR Curves
[Figure: an ROC curve plotting Recall (True Positive Rate) against 1-Specificity (False Positive Rate), and a Precision-Recall curve plotting Precision against Recall, each shown with the random-classifier baseline.]

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html
- How to Use ROC Curves and Precision-Recall Curves for Classification in Python: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

© Mehul Motani Model Performance 13

Underfitting and overfitting

• Fit of model to training and test sets is controlled by:


– model capacity/complexity (≈ number of parameters)
• Example: number of nodes/levels in decision tree
• Example: polynomial degree for regression
• Example: Number of nodes/layers in neural network
– stage of optimization
• example: number of iterations in a gradient descent optimization
• Underfitting leads to poor performance
– On both training and test sets
• Overfitting leads to poor generalization
– Good on training set, bad on test set
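One way to see the effect of model capacity is to vary the depth of a decision tree and compare training vs. test accuracy. A minimal sketch on synthetic data, assuming scikit-learn (the dataset and depths are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 3, 10, None]:          # None lets the tree grow to full depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
# Very shallow trees underfit (both scores are low); very deep trees overfit
# (training accuracy near 1.0 with a noticeably lower test accuracy).
```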

© Mehul Motani Model Performance 14


Underfitting and overfitting
[Figure: model fit vs. model complexity, with the underfitting region on the left, the overfitting region on the right, and the optimal fit in between.]

© Mehul Motani Model Performance 15

Causes of Overfitting

[Figure: two examples of causes of overfitting.]
• Noise: the decision boundary is distorted by a noise point.
• Insufficient data: a lack of data points in the lower half of the diagram makes it difficult to correctly predict class labels in that region.
© Mehul Motani Model Performance 16
Occam’s Razor

• Given two models with similar generalization errors, one should


prefer the simpler model over the more complex model.
• For a complex model, there is a greater chance that it was fitted accidentally to errors in the data.
• Model complexity should therefore be considered when
evaluating a model.
– More “complex” models tend to overfit the training data, and thus have
higher variance, but have lower bias.
• For example: Full depth decision tree or Large C soft margin SVM
– Less “complex” models tend to underfit the training data, and thus have
lower variance, but have higher bias.
• For example: Limited depth decision tree or Small C soft margin SVM
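The slide's SVM example can be sketched as follows (scikit-learn, synthetic data with some label noise; the C values are arbitrary): a smaller C gives a softer margin and a simpler, more regularized boundary, and if two settings generalize about equally well, Occam's razor says to prefer the simpler one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for C in [0.01, 1.0, 1000.0]:   # small C = softer margin, simpler model
    clf = SVC(C=C, kernel="rbf").fit(X_train, y_train)
    print(C, clf.score(X_train, y_train), clf.score(X_test, y_test))
```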

© Mehul Motani Model Performance 17

Model Fit and Bias / Variance

[Figure: two panels titled Underfitting and Overfitting.]

• Underfitting leads to high training error and high test error.


• Underfitting is bad as it means we have not learned enough from our
data. This error is known as bias.
• Overfitting leads to low training error and high test error.
• Overfitting is bad as it means we are too sensitive to our data. This error
is known as variance.
• We want both low bias (no underfitting) and low variance (no
overfitting)!
© Mehul Motani Model Performance 18
Bias-Variance Tradeoff

[Figure: dartboard diagram with four panels combining Low Bias (high accuracy) vs. High Bias (low accuracy) and Low Variance (high precision) vs. High Variance (low precision).]

• Bias – Measures the accuracy of the model. It is the error due to underfitting.
• Variance – Measures how precise the model is. It is the error due to overfitting.
• We want to reduce both bias and variance!

© Mehul Motani Model Performance 19

Bias-Variance Tradeoff

[Figure: test error vs. model complexity, with a high-bias/low-variance underfitting region, a low-bias/high-variance overfitting region, and the optimal fit at the minimum.]

• Error on the dataset used to fit the model doesn't predict future performance.
• Too much complexity can diminish the model's accuracy on future data.
• Complex model:
– Low 'bias': the model fit is good, i.e., the model value is close to the data's expected value.
– High 'variance': the model is more likely to make a wrong prediction.
• Sweet spot: the best complexity lies where the test error reaches a minimum, that is, somewhere in between a very simple and a very complex model.
• Data science is both art and science!
• https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/

© Mehul Motani Model Performance 20


The Bias squared-Variance Curve

• A curve of squared bias and variance as the model gets more complex, showing the inverse correlation that is typical of the relation between the two.
• It is not uncommon for the resulting total error to follow some variant of the U-shape shown in the figure.
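For reference, the standard decomposition behind this curve (not written out on the slide) splits the expected squared error into a bias-squared term, a variance term, and irreducible noise:

```latex
% Expected squared error of a learned predictor \hat{f} at a point x,
% for data y = f(x) + noise with noise variance \sigma^2:
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```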
© Mehul Motani Model Performance 21

XKCD: Computers vs. Humans

© Mehul Motani Model Performance 22
