09 - ML-Model Evaluation
Machine Learning
Model Evaluation
1. Metrics for Classification
Evaluation for Classification
Metrics for Classification
Accuracy score
Confusion matrix
Precision and Recall
F1 score
ROC curve
Area Under the Curve
Accuracy Metrics
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 0, 2, 1, 1, 0, 2, 1, 2])
print(accuracy_score(y_true, y_pred))  # 6 of 10 predictions are correct -> 0.6
Limitation of Accuracy
Plain accuracy treats every outcome equally, so it can be misleading, e.g. on imbalanced datasets where always predicting the majority class already scores highly.
Solution: weight each type of outcome:
Weighted Accuracy = (w_TP·TP + w_TN·TN) / (w_TP·TP + w_FP·FP + w_TN·TN + w_FN·FN)
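A minimal sketch of the weighted-accuracy formula above; the counts and weights below are made-up values for illustration only:

# Hypothetical confusion-matrix counts and outcome weights (illustrative only)
TP, FP, TN, FN = 30, 10, 50, 10
w_TP, w_FP, w_TN, w_FN = 1.0, 2.0, 1.0, 5.0

weighted_acc = (w_TP * TP + w_TN * TN) / (w_TP * TP + w_FP * FP + w_TN * TN + w_FN * FN)
print(weighted_acc)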
Confusion Matrix
                        Predicted Class
                        Positive              Negative
Actual Class  Positive  True Positive (TP)    False Negative (FN)
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
Confusion Matrix
                        Predicted Class
                        Positive                             Negative
Actual Class  Positive  True Positive (TP)                   False Negative (FN) (Type II error)
              Negative  False Positive (FP) (Type I error)   True Negative (TN)

Accuracy  = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-score  = 2 * precision * recall / (precision + recall)
          = 2TP / (2TP + FP + FN)
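A small sketch of computing these quantities with scikit-learn; the binary labels below are illustrative (1 = positive class):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 2 1 2 5
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 1/2
print(f1_score(y_true, y_pred))         # 2TP / (2TP + FP + FN) = 4/7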
Precision-Recall
F1-score
The F1 score is the harmonic mean of precision and recall.
F1-score ∈ (0, 1]
Example: (precision = 0.5, recall = 0.5) gives a higher F1 score than (precision = 0.3, recall = 0.8).
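A quick numeric check of that claim using the harmonic-mean formula:

def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.5))  # 0.5
print(f1(0.3, 0.8))  # ~0.436, so the balanced pair has the higher F1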
Type I and II error
Normalized confusion matrix
                        Predicted Class
                        Positive                Negative
Actual Class  Positive  TPR = TP / (TP + FN)    FNR = FN / (TP + FN)
              Negative  FPR = FP / (FP + TN)    TNR = TN / (FP + TN)
The False Positive Rate is also called the False Alarm Rate.
The False Negative Rate is also called the Miss Detection Rate.
In mine detection, "better a false alarm than a miss": a high False Alarm Rate is acceptable if it yields a low Miss Detection Rate.
In spam filtering, sending an important email to the trash is more serious than letting a spam message through as a normal email.
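A minimal sketch of obtaining these rates with scikit-learn, reusing the binary labels from the earlier example; normalize='true' divides each row by the number of actual samples in that class:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Each row sums to 1: row 0 is [TNR, FPR], row 1 is [FNR, TPR]
print(confusion_matrix(y_true, y_pred, normalize='true'))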
ROC curve
ROC curve
Diagonal line = random guessing (AUC = 0.5)
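A minimal sketch of drawing an ROC curve and computing its AUC with scikit-learn; the labels and scores below are illustrative:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.35, 0.4, 0.8, 0.45, 0.6, 0.7, 0.9]  # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))   # Area Under the Curve

plt.plot(fpr, tpr, label='model')
plt.plot([0, 1], [0, 1], '--', label='random guessing')  # diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()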
For multi-class classification
Micro-average: pool the TP/FP/FN counts over all classes, then compute the metric once.
Macro-average: compute the metric per class, then take the unweighted mean (see the sketch below).
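A small sketch contrasting the two averaging schemes on the 3-class example from the accuracy slide:

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 2, 1, 1, 0, 2, 1, 2]

print(f1_score(y_true, y_pred, average='micro'))  # pool TP/FP/FN over all classes
print(f1_score(y_true, y_pred, average='macro'))  # unweighted mean of per-class F1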
2. Metrics for Regression
2.1. Bias
Figure 1 presents the relationship between a target variable (y) and a single feature (x)
2.2. Mean squared error (MSE)
Pros:
MSE uses the mean (instead of the sum) to keep the metric independent of the dataset size.
As the residuals are squared, MSE puts a significantly heavier penalty on large errors.
The metric is useful for optimization algorithms.
Cons:
Some large errors may come from outliers, so MSE is not robust to their presence.
MSE is not measured in the original units, which can make it harder to interpret.
MSE cannot be used to compare performance between different datasets.
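A minimal sketch with scikit-learn; the values are illustrative:

from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_squared_error(y_true, y_pred))  # mean of squared residuals = 0.875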
2.3. Root mean squared error
Root mean squared error (RMSE)
Pros:
Taking the square root of MSE brings the metric back to the scale of the target variable, so it is easier to interpret and understand.
Cons:
Take caution: although RMSE is on the same scale as the target, an RMSE of 10 does not actually mean you are off by 10 units on average.
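RMSE is simply the square root of MSE; a minimal sketch reusing the illustrative values above:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # sqrt(0.875) ≈ 0.935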
https://developer.nvidia.com/
2.4. Mean absolute error (MAE)
Pros:
Due to the lack of squaring, the metric is expressed on the same scale as the target variable, making it easier to interpret.
All errors are treated equally, so the metric is robust to outliers.
Cons:
The absolute value disregards the direction of the errors, so underforecasting and overforecasting are penalized equally.
Similar to MSE and RMSE, MAE is scale-dependent, so you cannot compare it between different datasets.
When you optimize for MAE, the prediction should be as many times higher than the actual value as it is lower; you are effectively looking for the median, i.e. a value that splits the dataset into two equal parts.
As the formula contains absolute values, MAE is not easily differentiable.
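A minimal sketch with scikit-learn, reusing the same illustrative values:

from sklearn.metrics import mean_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # mean of |residuals| = 0.75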
2.5. R-squared (R²)
2.5. R-squared
Measures how well the model fits the data:
R² = 1 − RSS / TSS
• RSS: the residual sum of squares
• TSS: the total sum of squares
Pros:
Model fit assessment & model comparison
o A higher R-squared means a better fit.
Helps in feature selection
o If adding a variable improves R-squared a lot, it is likely a good predictor.
Cons:
Sensitive to outliers
Depends on sample size
Does not distinguish between different types of relationships
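A minimal sketch with scikit-learn, reusing the same illustrative values:

from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(r2_score(y_true, y_pred))  # 1 - RSS / TSS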
2.6. Some other metrics
Mean squared log error (MSLE)
Root mean squared log error (RMSLE)
Symmetric mean absolute percentage error (sMAPE)
…
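MSLE (and hence RMSLE) is available in scikit-learn; sMAPE is not, so the sketch below codes one commonly used variant by hand (this particular sMAPE definition is an assumption, not taken from the slides):

import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

msle = mean_squared_log_error(y_true, y_pred)
rmsle = np.sqrt(msle)

# One common sMAPE variant: mean of |error| / ((|y_true| + |y_pred|) / 2)
smape = np.mean(np.abs(y_pred - y_true) / ((np.abs(y_true) + np.abs(y_pred)) / 2))
print(msle, rmsle, smape)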
3. Metrics for Clustering
Rand index (RI)
Given the ground-truth class assignments labels_true and our clustering algorithm's assignments of the same samples labels_pred, the (adjusted or unadjusted) Rand index measures the similarity of the two assignments, ignoring permutations.
If C is a ground-truth class assignment and K the clustering, define a and b as:
a: the number of pairs of elements that are in the same set in C and in the same set in K
b: the number of pairs of elements that are in different sets in C and in different sets in K
The unadjusted Rand index is then RI = (a + b) / C(n_samples, 2), i.e. the fraction of all sample pairs on which the two assignments agree.
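A minimal sketch with scikit-learn (rand_score is available in recent versions); the label assignments below are illustrative:

from sklearn.metrics import rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(rand_score(labels_true, labels_pred))  # agreement over all sample pairs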
Adjusted Rand index (ARI)
However, the Rand index does not guarantee that
random label assignments will get a value close to zero
(esp. if the number of clusters is in the same order of
magnitude as the number of samples).
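A small sketch contrasting the unadjusted and adjusted indices on the same illustrative labels:

from sklearn.metrics import rand_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(rand_score(labels_true, labels_pred))           # unadjusted
print(adjusted_rand_score(labels_true, labels_pred))  # corrected for chance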
Rand index (RI) & Adjusted Rand Index (ARI)
Rand index is a function that measures the similarity of the two assignments,
ignoring permutations:
The Rand index does not guarantee a value close to 0.0 for a random labelling.
The adjusted Rand index corrects for chance and provides such a baseline.
Silhouette Score
The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
a: the mean distance between a sample and all other points in the same cluster
b: the mean distance between a sample and all other points in the next nearest cluster
The Silhouette Coefficient for a single sample is then s = (b − a) / max(a, b).
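A minimal sketch, assuming we cluster some synthetic data with KMeans (the dataset and n_clusters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # higher means better-defined clusters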
Some other metrics
References:
https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation