Article history:
Received 4 May 2018
Revised 7 August 2018
Accepted 17 August 2018
Available online xxxx

Keywords:
Receiver operating characteristics (ROC)
Confusion matrix
Precision-Recall (PR) curve
Classification
Assessment methods

Abstract: Classification techniques have been applied to many applications in various fields of science. There are several ways of evaluating classification algorithms, and the analysis of such metrics and their significance must be interpreted correctly when evaluating different learning algorithms. Most of these measures are scalar metrics and some of them are graphical methods. This paper introduces a detailed overview of the classification assessment measures, with the aim of providing the basics of these measures and showing how they work, to serve as a comprehensive source for researchers who are interested in this field. This overview starts by highlighting the definition of the confusion matrix in binary and multi-class classification problems. Many classification measures are then explained in detail, and the influence of balanced and imbalanced data on each metric is presented. An illustrative example is introduced to show (1) how to calculate these measures in binary and multi-class classification problems, and (2) the robustness of some measures against balanced and imbalanced data. Moreover, graphical measures such as the Receiver operating characteristics (ROC), Precision-Recall (PR), and Detection error trade-off (DET) curves are presented in detail. Additionally, in a step-by-step approach, different numerical examples are demonstrated to explain the preprocessing steps of plotting ROC, PR, and DET curves.

© 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
https://doi.org/10.1016/j.aci.2018.08.003
Fig. 3. Visualization of different metrics and the relations between them. Given two classes, a red class and a blue class, the black circle represents a classifier that classifies the samples inside the circle as red (belonging to the red class) and the samples outside the circle as blue (belonging to the blue class). Green regions indicate correctly classified regions and red regions indicate misclassified regions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
of the confusion matrix; thus, as data distributions change, these metrics will change as well, even if the classifier performance does not. Therefore, such metrics cannot distinguish between the numbers of correct labels from different classes [11]. This is only partially true, because some metrics, such as the Geometric Mean (GM) and Youden's index (YI)², use values from both columns, and these metrics can be used with both balanced and imbalanced data. This can be interpreted as follows: in the metrics which use values from one column, the changes in the class distribution cancel out; moreover, some metrics which use values from both columns are also insensitive to imbalanced data, because the changes in the class distribution cancel each other. For example, the accuracy is defined as $Acc = \frac{TP+TN}{TP+TN+FP+FN}$ and the GM is defined as $GM = \sqrt{TPR \times TNR} = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}}$; thus, both metrics use values from both columns of the confusion matrix. Changing the class distribution can be obtained by increasing/decreasing the number of samples of the negative/positive class. With the same classification performance, assume that the negative class samples are increased $a$ times; thus, the TN and FP values become $aTN$ and $aFP$, respectively, and the accuracy becomes $Acc = \frac{TP+aTN}{TP+aTN+aFP+FN} \neq \frac{TP+TN}{TP+TN+FP+FN}$. This means that the accuracy is affected by the changes in the class distribution. On the other hand, the GM metric becomes $GM = \sqrt{\frac{TP}{TP+FN} \times \frac{aTN}{aTN+aFP}} = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}}$, and hence the changes in the negative class cancel each other. This is the reason why the GM metric is suitable for imbalanced data. Similarly, any metric can be checked to see whether it is sensitive to imbalanced data or not.
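This check can be run directly. The following short Python sketch (with hypothetical confusion-matrix counts, not taken from the paper) scales the negative-class column by a factor a and prints both metrics; the accuracy shifts while the GM stays fixed:

```python
import math

def acc(tp, tn, fp, fn):
    # Accuracy: fraction of correctly classified samples (uses both columns).
    return (tp + tn) / (tp + tn + fp + fn)

def gm(tp, tn, fp, fn):
    # Geometric Mean of TPR and TNR; each rate stays within one column.
    tpr = tp / (tp + fn)   # sensitivity (positive column only)
    tnr = tn / (tn + fp)   # specificity (negative column only)
    return math.sqrt(tpr * tnr)

tp, tn, fp, fn = 70, 80, 20, 30          # hypothetical balanced counts
for a in (1, 10):                        # a = 10: ten-fold larger negative class
    print(a, round(acc(tp, a * tn, a * fp, fn), 3),
             round(gm(tp, a * tn, a * fp, fn), 3))
# a = 1  -> Acc = 0.75,  GM = 0.748
# a = 10 -> Acc = 0.791, GM = 0.748 (accuracy moves, GM does not)
```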
2.2. Accuracy and error rate

Accuracy (Acc) is one of the most commonly used measures of classification performance, and it is defined as the ratio between the correctly classified samples and the total number of samples as follows [20]:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{P + N} \qquad (1)$$

where P and N indicate the number of positive and negative samples, respectively.

The complement of the accuracy metric is the Error rate (ERR), or misclassification rate. This metric represents the number of misclassified samples from both positive and negative classes, and it is calculated as $ERR = 1 - Acc = \frac{FP + FN}{TP + TN + FP + FN}$ [4]. Both the accuracy and error rate metrics are sensitive to imbalanced data. Another problem with the accuracy is that two classifiers can yield the same accuracy but perform differently with respect to the types of correct and incorrect decisions they provide [9]. However, Takaya Saito and Marc Rehmsmeier reported that the accuracy is suitable with imbalanced data, because they found that the accuracy values of the balanced and imbalanced data in their example were identical [17]. The reason the accuracy values were identical in their example is that the sum of TP and TN in the balanced and imbalanced data was the same.
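That last observation is easy to reproduce: whenever TP + TN and the total sample count coincide, the accuracies coincide as well. A tiny sketch with made-up counts (ours, not Saito and Rehmsmeier's):

```python
# Two hypothetical test sets with equal TP + TN and equal totals:
balanced   = dict(tp=400, tn=400, fp=100, fn=100)   # P = N = 500
imbalanced = dict(tp=100, tn=700, fp=100, fn=100)   # P = 200, N = 800
for counts in (balanced, imbalanced):
    print((counts["tp"] + counts["tn"]) / sum(counts.values()))  # 0.8 both times
```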
2.3. Sensitivity and specificity
Matthews correlation coefficient (MCC): this metric was introduced by Brian W. Matthews in 1975 [14], and it represents the correlation between the observed and predicted classifications; it is calculated directly from the confusion matrix as in Eq. (11). A coefficient of $+1$ indicates a perfect prediction, $-1$ represents total disagreement between prediction and true values, and zero means the prediction is no better than random [16,3]. This metric is sensitive to imbalanced data.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} = \frac{TP/N - TPR \times PPV}{\sqrt{TPR \times PPV\,(1-TPR)(1-PPV)}} \qquad (11)$$
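As a quick numerical check, the first form of Eq. (11) transcribes directly to Python; returning 0 when a marginal sum vanishes is our convention for the undefined case, not something the paper prescribes:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient, first form of Eq. (11).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(round(mcc(70, 80, 20, 30), 3))   # 0.503 for the earlier example counts
```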
Discriminant power (DP): this measure depends on the sensitivity and specificity, and it is defined as follows, $DP = \frac{\sqrt{3}}{\pi}\left(\log\frac{TPR}{1-TNR} + \log\frac{TNR}{1-TPR}\right)$ [20]. This metric evaluates how well the classifier distinguishes between positive and negative samples.

Balanced classification rate or balanced accuracy (BCR): this metric combines the sensitivity and specificity metrics, and it is calculated as follows, $BCR = \frac{1}{2}(TPR + TNR) = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$. Also, the Balance error rate (BER) or Half total error rate (HTER) represents $1 - BCR$. Both the BCR and BER metrics can be used with imbalanced datasets.

Geometric Mean (GM): the main goal of all classifiers is to improve the sensitivity without sacrificing the specificity. However, the aims of sensitivity and specificity are often conflicting, which may not work well, especially when the dataset is imbalanced. Hence, the Geometric Mean (GM) metric aggregates both sensitivity and specificity measures according to Eq. (15) [3]. The Adjusted Geometric Mean (AGM) is proposed to obtain as much information as possible about each class [11]; the AGM metric is defined according to Eq. (16).

$$GM = \sqrt{TPR \times TNR} = \sqrt{\frac{TP}{TP+FN} \times \frac{TN}{TN+FP}} \qquad (15)$$
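The three rate-based metrics above reduce to a few lines once TPR and TNR are known; a minimal sketch (the base-10 logarithm in DP is an assumption on our part):

```python
import math

def bcr(tpr, tnr):
    return 0.5 * (tpr + tnr)            # balanced accuracy; BER = 1 - BCR

def gm(tpr, tnr):
    return math.sqrt(tpr * tnr)         # Eq. (15)

def dp(tpr, tnr):
    # Discriminant power; log base 10 assumed.
    x = tpr / (1 - tnr)                 # equals LR+
    y = tnr / (1 - tpr)                 # equals 1 / LR-
    return (math.sqrt(3) / math.pi) * (math.log10(x) + math.log10(y))

tpr, tnr = 0.7, 0.8                     # rates from the earlier example counts
print(bcr(tpr, tnr), round(gm(tpr, tnr), 3), round(dp(tpr, tnr), 3))
# 0.75 0.748 0.535
```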
outside the circle are classified as blue class samples. Additionally, from the figure, it is clear that many assessment methods depend on the TPR and TNR metrics, and that all assessment methods can be estimated from the confusion matrix.

2.9. Illustrative example

In this section, two examples are introduced. These examples explain how to calculate the classification metrics using two classes or multiple classes.
2.9.1. Binary classification example

In this example, assume we have two classes (A and B), i.e., binary classification, and each class has 100 samples. The A class represents the positive class while the B class represents the negative class. The numbers of correctly classified samples in classes A and B are 70 and 80, respectively. Hence, the values of TP, TN, FP, and FN are 70, 80, 20, and 30, respectively. The values of the different classification metrics are as follows: $Acc = \frac{70+80}{70+80+20+30} = 0.75$, $TPR = \frac{70}{70+30} = 0.7$, $TNR = \frac{80}{80+20} = 0.8$, $PPV = \frac{70}{70+20} \approx 0.78$, $NPV = \frac{80}{80+30} \approx 0.73$, $Err = 1 - Acc = 0.25$, $BCR = \frac{1}{2}(0.7 + 0.8) = 0.75$, $FPR = 1 - 0.8 = 0.2$, $FNR = 1 - 0.7 = 0.3$, $F\text{-}measure = \frac{2 \times 70}{2 \times 70 + 20 + 30} \approx 0.74$, $OP = Acc - \frac{|TPR-TNR|}{TPR+TNR} = 0.75 - \frac{|0.7-0.8|}{0.7+0.8} \approx 0.683$, $LR+ = \frac{0.7}{1-0.8} = 3.5$, $LR- = \frac{1-0.7}{0.8} = 0.375$, $DOR = \frac{3.5}{0.375} \approx 9.33$, $YI = 0.7 + 0.8 - 1 = 0.5$, and $Jaccard = \frac{70}{70+20+30} \approx 0.583$.
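All of the values above can be reproduced mechanically from the four counts; a minimal Python sketch:

```python
TP, TN, FP, FN = 70, 80, 20, 30                    # balanced example counts

acc = (TP + TN) / (TP + TN + FP + FN)              # 0.75
tpr = TP / (TP + FN)                               # 0.7  (sensitivity/recall)
tnr = TN / (TN + FP)                               # 0.8  (specificity)
ppv = TP / (TP + FP)                               # ~0.78 (precision)
npv = TN / (TN + FN)                               # ~0.73
err = 1 - acc                                      # 0.25
bcr = 0.5 * (tpr + tnr)                            # 0.75
fpr, fnr = 1 - tnr, 1 - tpr                        # 0.2, 0.3
f_measure = 2 * TP / (2 * TP + FP + FN)            # ~0.74
op = acc - abs(tpr - tnr) / (tpr + tnr)            # ~0.683 (optimized precision)
lr_pos = tpr / (1 - tnr)                           # 3.5   (LR+)
lr_neg = (1 - tpr) / tnr                           # 0.375 (LR-)
dor = lr_pos / lr_neg                              # ~9.33 (diagnostic odds ratio)
yi = tpr + tnr - 1                                 # 0.5   (Youden's index)
jaccard = TP / (TP + FP + FN)                      # ~0.583
```

Re-running the same lines with TP, TN, FP, FN = 70, 800, 200, 30 reproduces the imbalanced-case values discussed next.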
We increased the number of samples of the B class to 1000 to show how the classification metrics change when using imbalanced data; 800 samples from class B were correctly classified. As a consequence, the values of TP, TN, FP, and FN are 70, 800, 200, and 30, respectively. Consequently, only the values of the accuracy, precision/PPV, NPV, error rate, Optimized precision, F-measure, and Jaccard are changed, as follows: $Acc = \frac{70+800}{70+800+200+30} \approx 0.79$, $PPV = \frac{70}{70+200} \approx 0.26$, $NPV = \frac{800}{800+30} \approx 0.96$, $Err = 1 - Acc = 0.21$, $OP = 0.79 - \frac{|0.7-0.8|}{0.7+0.8} \approx 0.723$, $F\text{-}measure = \frac{2 \times 70}{2 \times 70 + 200 + 30} \approx 0.378$, and $Jaccard = \frac{70}{70+200+30} \approx 0.233$. This example reflects that the accuracy, precision, NPV, F-measure, and Jaccard metrics are sensitive to imbalanced data.
2.9.2. Multi-class classification example

In this example, there are three classes A, B, and C; the results of a classification test are shown in Fig. 4. From the figure, the values of $TP_A$, $TP_B$, and $TP_C$ are 80, 70, and 90, respectively, which represent the diagonal in Fig. 4. The value of the false negative for each class (true class) is calculated, as mentioned before, by adding all errors in the column of that class. For example, $FN_A = E_{AB} + E_{AC} = 15 + 5 = 20$, and similarly $FN_B = E_{BA} + E_{BC} = 15 + 15 = 30$ and $FN_C = E_{CA} + E_{CB} = 0 + 10 = 10$. The value of the false positive for each class (predicted class) is calculated, as mentioned before, by adding all errors in the row of that class. For example, $FP_A = E_{BA} + E_{CA} = 15 + 0 = 15$, and similarly $FP_B = E_{AB} + E_{CB} = 15 + 10 = 25$ and $FP_C = E_{AC} + E_{BC} = 5 + 15 = 20$. The value of the true negative for class A ($TN_A$) can be calculated by adding all columns and rows excluding the row and column of class A; this is similar to the TN in the $2 \times 2$ confusion matrix. Hence, $TN_A = 70 + 90 + 10 + 15 = 185$, and similarly $TN_B = 80 + 0 + 5 + 90 = 175$ and $TN_C = 80 + 70 + 15 + 15 = 180$. Using TP, TN, FP, and FN we can calculate all classification measures. For example, the accuracy is $\frac{80+70+90}{100+100+100} = 0.8$. The sensitivity and specificity are calculated for each class. For example, the sensitivity of A is $\frac{TP_A}{TP_A+FN_A} = \frac{80}{80+15+5} = 0.8$, and similarly the sensitivities of the B and C classes are $\frac{70}{70+15+15} = 0.7$ and $\frac{90}{90+0+10} = 0.9$, respectively, and the specificity values of A, B, and C are $\frac{185}{185+15} \approx 0.93$, $\frac{175}{175+25} = 0.875$, and $\frac{180}{180+20} = 0.9$, respectively.
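The per-class counts above follow from a single confusion matrix. In the sketch below, the matrix is reconstructed from the quoted values (we assume rows are the predicted classes and columns the true classes, matching the row/column error definitions in the text):

```python
import numpy as np

# Rows = predicted class, columns = true class, order A, B, C;
# entries reconstructed from the values quoted from Fig. 4.
cm = np.array([[80, 15,  0],
               [15, 70, 10],
               [ 5, 15, 90]])

total = cm.sum()
for i, name in enumerate("ABC"):
    tp = cm[i, i]
    fn = cm[:, i].sum() - tp      # errors in the column of the true class
    fp = cm[i, :].sum() - tp      # errors in the row of the predicted class
    tn = total - tp - fn - fp     # everything outside row i and column i
    print(name, tp, fn, fp, tn,
          round(tp / (tp + fn), 3),    # sensitivity: 0.8, 0.7, 0.9
          round(tn / (tn + fp), 3))    # specificity: 0.925, 0.875, 0.9
print("accuracy:", np.trace(cm) / total)   # 0.8
```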
3. Receiver operating characteristics (ROC)

The receiver operating characteristics (ROC) curve is a two-dimensional graph in which the TPR represents the y-axis and the FPR is the x-axis. The ROC curve has been used to evaluate many systems such as diagnostic systems, medical decision-making systems, and machine learning systems [26]. It is used to make a balance between the benefits, i.e., true positives, and the costs, i.e., false positives. Any classifier that has discrete outputs, such as decision trees, is designed to produce only a class decision, i.e., a decision for each testing sample; hence, it generates only one confusion matrix, which in turn corresponds to one point in the ROC space. However, many methods were introduced for generating a full ROC curve from a classifier instead of only a single point, such as using class proportions [26] or using some combinations of scoring and voting [8]. On the other hand, in continuous-output classifiers such as the Naive Bayes classifier, the output is represented by a numeric value, i.e., a score, which represents the degree to which a sample belongs to a specific class. The ROC curve is generated by changing the threshold on the confidence score; hence, each threshold generates only one point in the ROC curve [8].

Fig. 5. A basic ROC curve showing important points, and the optimistic, pessimistic and expected ROC segments for equally scored samples.

Fig. 5 shows an example of the ROC curve. As shown, there are four important points in the ROC curve. The point A, in the lower left corner (0,0), represents a classifier where there is no positive classification, while all negative samples are correctly classified; hence TPR = 0 and FPR = 0. The point C, in the top right corner (1,1), represents a classifier where all positive samples are correctly classified, while the negative samples are misclassified. The point D, in the lower right corner (1,0), represents a classifier where all positive and negative samples are misclassified. The point B, in the upper left corner (0,1), represents a classifier where all positive and negative samples are correctly classified; thus, this point represents the perfect classification, or the ideal operating point. Fig. 5 also shows the perfect classification performance: the green curve, which rises vertically from (0,0) to (0,1) and then horizontally to (1,1). This curve reflects that the classifier perfectly ranked the positive samples relative to the negative samples.

A point in the ROC space is better than all other points that are in the southeast, i.e., the points that have lower TPR, higher FPR, or
both (see Fig. 5). Therefore, any classifier that appears in the lower right triangle performs worse than a classifier that appears in the upper left triangle.

Fig. 6 shows an example of the ROC curve. In this example, a test set consists of 20 samples from two classes; each class has ten samples, i.e., ten positive and ten negative samples. As shown in the table in Fig. 6, the initial step to plot the ROC curve is to sort the samples according to their scores. Next, the threshold value is changed from maximum to minimum to plot the ROC curve. To scan all samples, the threshold ranges from $\infty$ to $-\infty$. The samples are classified into the positive class if their scores are higher than or equal to the threshold; otherwise, they are classified as negative [8]. Figs. 7 and 8 show how changing the threshold value changes the TPR and FPR. As shown in Fig. 6, the threshold value is first set at maximum ($t_1 = \infty$); hence, all samples are classified as negative samples, the values of FPR and TPR are zero, and the position of $t_1$ is in the lower left corner (the point (0,0)). The threshold value is decreased to 0.82, and the first sample is classified correctly as a positive sample (see Figs. 6–8(a)). The TPR increases to 0.1, while the FPR remains zero. As the threshold is further reduced to 0.8, the TPR increases to 0.2 and the FPR remains zero. As shown in Fig. 7, increasing the TPR moves the ROC curve up, while increasing the FPR moves the ROC curve to the right, as at $t_4$. The ROC curve must pass through the point (0,0), where the threshold value is $\infty$ (in which all samples are classified as negative samples), and the point (1,1), where the threshold is $-\infty$ (in which all samples are classified as positive samples).

Fig. 6. An illustrative example to calculate the TPR and FPR when the threshold value is changed.

Fig. 7. An illustrative example of the ROC curve. The values of TPR and FPR of each point/threshold are calculated in Table 1.

Fig. 8. A visualization of how changing the threshold changes the TP, TN, FP, and FN values.

Fig. 8 shows graphically the performance of the classification model with different threshold values. From this figure, the following remarks can be drawn.

- $t_1$: The value of this threshold was $\infty$ as shown in Fig. 8(a); hence, all samples are classified as negative samples. This means that (1) all positive samples are incorrectly classified, so the value of TP is zero, and (2) all negative samples are correctly classified, so there is no FP (see also Fig. 6).

- $t_3$: The threshold value is decreased as shown in Fig. 8(b), and two positive samples are now correctly classified. Therefore, with respect to the positive class, only the positive samples whose scores are greater than or equal to this threshold ($t_3$) are correctly classified, i.e., TP, while the other positive samples are incorrectly classified, i.e., FN. At this threshold, all negative samples are still correctly classified; thus, the value of FP is still zero.

- $t_8$: As the threshold is further decreased to 0.54, the threshold line moves to the left. This means that more positive samples have the chance to be correctly classified; on the other hand, some negative samples are misclassified. As a consequence, the values of TP and FP increase as shown in Fig. 8(c), and the values of TN and FN decrease.

- $t_{11}$: This is an important threshold value, where the numbers of errors from both positive and negative classes are equal (see Fig. 8(d)): TP = TN = 6 and FP = FN = 4.

- $t_{14}$: Reducing the value of the threshold to 0.37 results in more correctly classified positive samples, which increases TP and reduces FN, as shown in Fig. 8(e). On the contrary, more negative samples are misclassified, which increases FP and reduces TN.

- $t_{20}$: As shown in Fig. 8(f), decreasing the threshold value hides the FN area. This is because all positive samples are correctly classified. Also, from the figure, it is clear that the FP area is much larger than the area of TN. This is because 90% of the negative samples are incorrectly classified, and only 10% of the negative samples are correctly classified.

From Fig. 7 it is clear that the ROC curve is a step function. This is because we only used 20 samples (a finite set of samples) in our example, and a true curve can be obtained when the number of
samples increased. The figure also shows that the best accuracy (70%) (see Table 1) is obtained at (0.1,0.5), when the threshold value was $\geq 0.6$, rather than at $\geq 0.5$ as we might expect with balanced data. This means that the given learning model identifies positive samples better than negative samples. Since the ROC curve depends mainly on changing the threshold value, comparing classifiers with different score ranges will be meaningless. For example, assume we have two classifiers, the first generating scores in the range [0,1] and the other generating scores in the range [-1,+1]; hence, we cannot compare these classifiers using the ROC curve.

The steps of generating the ROC curve are summarized in Algorithm 1. The algorithm requires O(n log n) for sorting the samples and O(n) for scanning them, resulting in O(n log n) total complexity, where n is the number of samples. As shown, the two main steps to generate ROC points are (1) sorting the samples according to their scores and (2) changing the threshold value from maximum to minimum to process one sample at a time and update the values of TP and FP each time. The algorithm shows that the TP and FP counts start at zero. The algorithm scans all samples, and the value of TP is increased for each positive sample while the value of FP is increased for each negative sample. Next, the values of TPR and FPR are calculated and pushed onto the ROC stack (see step 6). When the threshold becomes very low (threshold → $-\infty$), all samples are classified as positive samples and hence the values of both TPR and FPR are one.

Steps 5–8 handle sequences of equally scored samples. Assume we have a test set which consists of P positive samples and N negative samples, and within this test set we have p positive samples and n negative samples with the same score value. There are two extreme cases. In the first case, the optimistic case, all the positive samples end up at the beginning of the sequence; this case represents the upper L segment of the rectangle in Fig. 5. In the second case, the pessimistic case, all the negative samples end up at the beginning of the sequence; this case represents the lower L segment of the rectangle in Fig. 5. The ROC curve represents the expected performance, which is the average of the two cases, and it corresponds to the diagonal of the rectangle in Fig. 5. The size of this rectangle is $\frac{pn}{PN}$, and the number of errors in both the optimistic and pessimistic cases can be calculated as $\frac{pn}{2PN}$.
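Algorithm 1 itself is not reproduced in this extract, but the description above maps onto a short implementation, loosely following Fawcett [8]: sort once, sweep the threshold, and emit a point only when the score changes, so that runs of equally scored samples land on the expected diagonal segment rather than the optimistic or pessimistic L segments:

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) for binary labels (1 = positive, 0 = negative);
    O(n log n) for the sort, O(n) for the sweep. Assumes both classes occur."""
    P = sum(labels)
    N = len(labels) - P
    ranked = sorted(zip(scores, labels), reverse=True)  # threshold: max -> min
    points, tp, fp, prev = [], 0, 0, None
    for score, label in ranked:
        if score != prev:                    # new threshold: emit one ROC point;
            points.append((fp / N, tp / P))  # first emitted point is (0, 0)
            prev = score
        if label == 1:
            tp += 1    # this positive sample is now classified correctly
        else:
            fp += 1    # this negative sample is now misclassified
    points.append((fp / N, tp / P))          # final point is (1, 1)
    return points
```

For the 20-sample example above, this sketch yields exactly the step function of Fig. 7; because ties share a single emitted point, their internal ordering never matters.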
Table 1
Values of TP, FN, TN, FP, TPR, FPR, FNR, precision (PPV), and accuracy (Acc, in %) of our ROC example as the threshold value changes.
In multi-class classification problems, plotting the ROC becomes much more complex than in binary classification problems. One of the well-known methods to handle this problem is to produce one ROC curve for each class. For plotting the ROC of class $i$ ($c_i$), the samples from $c_i$ represent the positive samples and all the other samples are negative samples.

ROC curves are robust against any changes to the class distributions. Hence, if the ratio of positive to negative samples changes in a test set, the ROC curve will not change. In other words, ROC curves are insensitive to imbalanced data. This is because the ROC depends on TPR and FPR, and each of them is a columnar ratio³.
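Using the roc_points sketch defined above, this insensitivity can be verified directly: replicating every negative sample leaves all (FPR, TPR) points unchanged, since FP and N scale by the same factor (toy data of ours, not from the paper):

```python
labels = [1, 1, 0, 1, 0]                 # toy labels/scores, not from the paper
scores = [0.9, 0.8, 0.7, 0.55, 0.4]
big_labels, big_scores = list(labels), list(scores)
for lab, sc in zip(labels, scores):
    if lab == 0:                         # nine extra copies of each negative
        big_labels += [0] * 9
        big_scores += [sc] * 9
print(roc_points(labels, scores) == roc_points(big_labels, big_scores))  # True
```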
The following example compares the ROC using balanced and imbalanced data. Assume the data is balanced and consists of two classes, each with 1000 samples. The point (0.2,0.5) on the ROC curve means that the classifier obtained 50% sensitivity

Comparing different classifiers using the ROC curve is not easy. This is because there is no scalar value that represents the expected performance. Therefore, the Area under the ROC curve (AUC) metric is used to calculate the area under the ROC curve. The AUC score is always bounded between zero and one, and no realistic classifier has an AUC lower than 0.5 [4,15].

Fig. 9 shows the AUC values of two classifiers, A and B. As shown, the AUC of classifier B is greater than that of A; hence, it achieves better performance. Moreover, the gray shaded area is common to both classifiers, while the red shaded area represents the area where classifier B outperforms classifier A. It is possible for a lower-AUC classifier to outperform a higher-AUC classifier in a specific region. For example, in Fig. 9, classifier B outperforms A except at FPR > 0.6, where A has a slight advantage (blue shaded area). However, two classifiers with two different ROC curves may have the same AUC score.

Fig. 9. An illustrative example of the AUC metric.

³ As mentioned before, $TPR = \frac{TP}{TP+FN} = \frac{TP}{P}$, and both TP and FN are in the same column; similarly for FNR.
⁴ The AUC metric will be explained in Section 4.

The AUC value is calculated as in Algorithm 2. As shown, the steps in Algorithm 2 represent a slight modification from
of $c_i$ [10]. This method of calculating the AUC score is simple and fast, but it is sensitive to class distributions and error costs.
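Since Algorithm 2 is not shown in this extract, the following trapezoidal-rule sketch is offered as a stand-in for computing the same quantity from a curve generated by roc_points above; it is our substitute, not necessarily the paper's Algorithm 2:

```python
def auc(points):
    # Trapezoidal integration of (FPR, TPR) points sorted by FPR.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between two points
    return area

print(auc(roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.55, 0.4])))  # ~0.833
```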
5. Precision-Recall (PR) curve

Precision and recall metrics are widely used for evaluating classification performance. The Precision-Recall (PR) curve has the same concept as the ROC curve, and it can be generated by changing the threshold as in the ROC. However, the ROC curve shows the relation between sensitivity/recall (TPR) and 1 − specificity (FPR), while the PR curve shows the relationship between recall and precision. Thus, in the PR curve, the x-axis is the recall and the y-axis is the precision, i.e., the y-axis of the ROC curve is the x-axis of the PR curve [8]. Hence, in the PR curve, there is no need for the TN value.

In the PR curve, the precision value for the first point is undefined because the number of positive predictions is zero, i.e., TP = 0 and FP = 0. This problem can be solved by estimating the first point in the PR curve from the second point. There are two cases for estimating the first point, depending on the value of TP of the second point.
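A PR curve comes out of the same threshold sweep, needing only TP and FP at each step. In the sketch below, the undefined first point simply copies the precision of the second point, which is one way to realize the estimate-from-the-second-point rule described above (the exact two-case rule is not spelled out in this extract):

```python
def pr_points(labels, scores):
    # (recall, precision) points from a max-to-min threshold sweep.
    P = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [], 0, 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / P, tp / (tp + fp)))
    # First point (TP = FP = 0) is undefined; estimate it from the second.
    return [(0.0, points[0][1])] + points

print(pr_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.55, 0.4])[:3])
# [(0.0, 1.0), (0.333..., 1.0), (0.666..., 1.0)] -- recalls are 1/3 and 2/3
```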
Fig. 10. An illustrative example of the PR curve. The values of precision and recall of each point/threshold are calculated in Table 1.

6. Biometrics measures

Fig. 11. Illustrative example to test the influence of changing the threshold value on the values of FAR, FRR, and EER.
Fig. 13. Results of our experiment. (a) ROC curve, (b) Precision-Recall curve.
confusion matrix, different measures are introduced with detailed explanations. The relations between these measures and the robustness of each of them against imbalanced data are also introduced. Additionally, an illustrative numerical example was used for explaining how to calculate different classification measures with binary and multi-class problems, and also to show the robustness of different measures against imbalanced data. Graphical measures such as the ROC, PR, and DET curves are also presented with illustrative examples and visualizations. Finally, various classification measures for evaluating biometric models are also presented.

References

[1] C. Blake, UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[2] R.M. Bolle, J.H. Connell, S. Pankanti, N.K. Ratha, A.W. Senior, Guide to Biometrics, Springer Science & Business Media, 2013.
[3] S. Boughorbel, F. Jarray, M. El-Anbari, Optimal classifier for imbalanced data using Matthews correlation coefficient metric, PLoS One 12 (6) (2017) e0177678.
[4] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn. 30 (7) (1997) 1145–1159.
[5] J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 233–240.
[6] J.J. Deeks, D.G. Altman, Diagnostic tests 4: likelihood ratios, Brit. Med. J. 329 (7458) (2004) 168–169.
[7] R.O. Duda, P.E. Hart, D.G. Stork, et al., Pattern Classification, second ed., vol. 2, Wiley, New York, 2001.
[8] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27 (8) (2006) 861–874.
[9] V. Garcia, R.A. Mollineda, J.S. Sanchez, Theoretical analysis of a performance measure for imbalanced data, in: 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 617–620.
[10] D.J. Hand, R.J. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2) (2001) 171–186.
[11] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowledge Data Eng. 21 (9) (2009) 1263–1284.
[12] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[13] A. Maratea, A. Petrosino, M. Manzo, Adjusted F-measure and kernel scaling for imbalanced data learning, Inf. Sci. 257 (2014) 331–341.
[14] B.W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta 405 (2) (1975) 442–451.
[15] C.E. Metz, Basic principles of ROC analysis, in: Seminars in Nuclear Medicine, vol. 8, Elsevier, 1978, pp. 283–298.
[16] D.M. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, J. Mach. Learn. Technol. 2 (1) (2011) 37–63.
[17] T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, PLoS One 10 (3) (2015) e0118432.
[18] A. Shaffi, Measures derived from a 2 × 2 table for an accuracy of a diagnostic test, J. Biometr. Biostat. 2 (2011) 1–4.
[19] S. Shaikh, Measures derived from a 2 × 2 table for an accuracy of a diagnostic test, J. Biometr. Biostat. 2 (2011) 128.
[20] M. Sokolova, N. Japkowicz, S. Szpakowicz, Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation, in: Australasian Joint Conference on Artificial Intelligence, Springer, 2006, pp. 1015–1021.
[21] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Inf. Process. Manage. 45 (4) (2009) 427–437.
[22] A. Srinivasan, Note on the location of optimal classifiers in n-dimensional ROC space, Technical Report PRG-TR-2-99, Oxford University Computing Laboratory, Oxford, England, 1999.
[23] A. Tharwat, Principal component analysis – a tutorial, Int. J. Appl. Pattern Recogn. 3 (3) (2016) 197–240.
[24] A. Tharwat, A.E. Hassanien, Chaotic antlion algorithm for parameter optimization of support vector machine, Appl. Intell. 48 (3) (2018) 670–686.
[25] A. Tharwat, Y.S. Moemen, A.E. Hassanien, Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines, J. Biomed. Inf. 68 (2017) 132–149.
[26] K.H. Zou, Receiver operating characteristic (ROC) literature research, 2002. On-line bibliography available from: http://splweb.bwh.harvard.edu:8000.