Binary Classification Metrics Cheatsheet
Blog post: 22 Evaluation Metrics for Binary Classification (And When to Use Them)
False positive rate | Type-I error
Usually, it is not used alone but rather with some other metric.
How many false alerts your model raises.
If the cost of dealing with an alert is high, you should consider increasing the threshold to get fewer alerts.
How to calculate
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
False negative rate | Type-II error
Usually, it is not used alone but rather with some other metric.
How often your model misses truly fraudulent transactions.
If the cost of letting fraudulent transactions through is high and the value you get from the users isn't, you can consider focusing on this number.
How to calculate
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
False discovery rate
Reversed precision (1 - precision).
How to calculate
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)
True negative rate | Specificity
Usually, you don't use it alone but rather as an auxiliary metric.
Think of it as recall for the negative class.
When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient "you are healthy". Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.
Negative predictive value
Usually, you don't use it alone but rather as an auxiliary metric.
Think of it as precision for the negative class.
When we care about high precision on negative predictions. For example, imagine we really don't want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.
How to calculate
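Both can be read off the confusion matrix; a minimal sketch, assuming the same y_true and y_pred_class arrays used in the other cells:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)  # recall for the negative class (specificity)
negative_predictive_value = tn / (tn + fn)  # precision for the negative class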
Log loss
Usually used as an objective, not a metric.
Averaged difference between the ground truth and the logarithm of the predicted score for every observation. Heavily penalizes the model when it is confident about something yet wrong.
Brier score
When you care about calibrated probabilities.
Mean squared error between the ground truth and the predicted score.
How to calculate
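A minimal sketch with scikit-learn, assuming y_pred holds predicted probabilities for both classes (as in the KS statistic cell below), so y_pred[:, 1] is the positive-class score:
from sklearn.metrics import log_loss, brier_score_loss
logloss = log_loss(y_true, y_pred[:, 1])  # heavily penalizes confident wrong predictions
brier_score = brier_score_loss(y_true, y_pred[:, 1])  # mean squared error of the predicted score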
ROC AUC
You should use it when you ultimately care about ranking predictions.
You should not use it when your data is heavily imbalanced.
You should use it when you care equally about positive and negative class.
Rank correlation between predictions and target. Tells you how good your model is at ranking predictions (positive over negative).
PR AUC | Average precision
When you want to communicate the precision/recall decision to other stakeholders and want to choose the threshold that fits the business problem.
When your data is heavily imbalanced.
When you care more about the positive than the negative class.
The average precision over all recall values.
How to calculate
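A minimal sketch with scikit-learn, again assuming y_pred[:, 1] holds the predicted positive-class probabilities:
from sklearn.metrics import roc_auc_score, average_precision_score
roc_auc = roc_auc_score(y_true, y_pred[:, 1])
pr_auc = average_precision_score(y_true, y_pred[:, 1])  # average precision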
Kolmogorov-Smirnov statistic
When your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative class.
It helps to assess the separation between prediction distributions for positive and negative class.
How to calculate
from scikitplot.helpers import binary_ks_curve
# y_pred holds predicted probabilities for both classes (e.g. the output of predict_proba)
res = binary_ks_curve(y_true, y_pred[:, 1])
ks_stat = res[3]
F-beta score
When you want to combine precision and recall in one metric and would like to be able to adjust how much focus you put on one or the other.
Weighted harmonic mean of precision and recall with a weight beta (0 < beta < 1 favours precision; beta > 1 favours recall).
How to calculate
from sklearn.metrics import fbeta_score
fbeta_score(y_true, y_pred_class, beta=beta)  # set beta to control the precision/recall trade-off
F1 score
Pretty much in every binary classification problem. It is my go-to metric when working on those problems.
Harmonic mean of precision and recall.
How to calculate
from sklearn.metrics import fbeta_score
fbeta_score(y_true, y_pred_class, beta=1)
F2 score
When recalling positive observations (fraudulent transactions) is more important than being precise about it, but you still want a nice and simple metric that combines precision and recall.
Harmonic mean of precision and recall with twice as much weight on recall.
How to calculate
from sklearn.metrics import fbeta_score
fbeta_score(y_true, y_pred_class, beta=2)
Cohen Kappa
How much better your model is over the random classifier that predicts based on class frequencies.
How to calculate
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(y_true, y_pred_class)
Matthews correlation coefficient
Pretty much in every binary classification problem. It is my go-to metric when working on those problems.
Correlation between predicted classes and ground truth.
How to calculate
from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(y_true, y_pred_class)
Recall
Usually, you will not use it alone but rather coupled with other metrics like precision.
That being said, recall is a go-to metric when you really care about catching all fraudulent transactions, even at the cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when a transaction goes unseen.
Put all guilty in prison.
Precision
Again, it usually doesn't make sense to use it alone but rather coupled with other metrics like recall.
When raising false alerts is costly and you really want all the positive predictions to be worth looking at, you should optimize for precision.
Make sure that people who go to prison are guilty.
How to calculate
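A minimal sketch with scikit-learn, using the same y_true and y_pred_class as the other cells (equivalently tp / (tp + fn) and tp / (tp + fp) from the confusion matrix):
from sklearn.metrics import recall_score, precision_score
recall = recall_score(y_true, y_pred_class)
precision = precision_score(y_true, y_pred_class)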
Accuracy
The fraction of predictions your model got right.
How to calculate
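A minimal sketch with scikit-learn, equivalently (tp + tn) / (tp + fp + fn + tn) from the confusion matrix:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred_class)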
Confusion Matrix
Pretty much always. Get the gist of model imbalance and where the predictions fall.
Table that contains true negative (tn), false positive (fp), false negative (fn), and true positive (tp) predictions.
How to calculate
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
Cumulative gain chart
Whenever you want to use your model to choose the best customers/transactions to target by sorting all predictions, you should consider using cumulative gain charts.
In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.
How to calculate
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain
# y_pred holds predicted probabilities for both classes (e.g. the output of predict_proba)
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
Lift curve
Whenever you want to use your model to choose the best customers/transactions to target by sorting all predictions, you should consider using a lift curve.
In simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.
It tells you how much better your model is than a random model for the given percentile of top scored predictions.
How to calculate
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
Kolmogorov-Smirnov chart
When your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative class.
It helps to assess the separation between prediction distributions for positive and negative class.
So it works similarly to the cumulative gain chart, but instead of just looking at the positive class, it looks at the separation between the positive and negative classes.
How to calculate
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_ks_statistic
fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)
ROC curve
You should use it when you ultimately care about ranking predictions.
You should not use it when your data is heavily imbalanced.
You should use it when you care equally about positive and negative class.
It is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot them on one chart.
How to calculate
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
Precision-Recall curve
When you want to communicate the precision/recall decision to other stakeholders and want to choose the threshold that fits the business problem.
When your data is heavily imbalanced.
When you care more about the positive than the negative class.
It is a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher your curve is on the y-axis, the better your model performance.
How to calculate
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)