0% found this document useful (0 votes)
64 views13 pages

Lectura 1

This document discusses methods for evaluating classification algorithms. It introduces the confusion matrix and describes several common evaluation metrics. It also covers graphical methods like ROC curves and Precision-Recall curves. Examples are provided to demonstrate calculating the metrics and plotting the curves.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
64 views13 pages

Lectura 1

This document discusses methods for evaluating classification algorithms. It introduces the confusion matrix and describes several common evaluation metrics. It also covers graphical methods like ROC curves and Precision-Recall curves. Examples are provided to demonstrate calculating the metrics and plotting the curves.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Applied Computing and Informatics xxx (2018) xxx–xxx

Contents lists available at ScienceDirect

Applied Computing and Informatics


journal homepage: www.sciencedirect.com

Classification assessment methods


Alaa Tharwat
Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, 60318 Frankfurt am Main, Germany

a r t i c l e i n f o a b s t r a c t

Article history: Classification techniques have been applied to many applications in various fields of sciences. There are
Received 4 May 2018 several ways of evaluating classification algorithms. The analysis of such metrics and its significance
Revised 7 August 2018 must be interpreted correctly for evaluating different learning algorithms. Most of these measures are
Accepted 17 August 2018
scalar metrics and some of them are graphical methods. This paper introduces a detailed overview of
Available online xxxx
the classification assessment measures with the aim of providing the basics of these measures and to
show how it works to serve as a comprehensive source for researchers who are interested in this field.
Keywords:
This overview starts by highlighting the definition of the confusion matrix in binary and multi-class clas-
Receiver operating characteristics (ROC)
Confusion matrix
sification problems. Many classification measures are also explained in details, and the influence of bal-
Precision-Recall (PR) curve anced and imbalanced data on each metric is presented. An illustrative example is introduced to show (1)
Classification how to calculate these measures in binary and multi-class classification problems, and (2) the robustness
Assessment methods of some measures against balanced and imbalanced data. Moreover, some graphical measures such as
Receiver operating characteristics (ROC), Precision-Recall, and Detection error trade-off (DET) curves
are presented with details. Additionally, in a step-by-step approach, different numerical examples are
demonstrated to explain the preprocessing steps of plotting ROC, PR, and DET curves.
Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an
open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction Precision-Recall curves give different interpretations of the classi-


fication performance.
Classification techniques have been applied to many applica- Some of the measures which are derived from the confusion
tions in various fields of sciences. In classification models, the matrix for evaluating a diagnostic test are reported in [19]. In that
training data are used for building a classification model to predict paper, only eight measures were introduced. Powers introduced an
the class label for a new sample. The outputs of classification mod- excellent discussion of the precision, Recall, F-score, ROC,
els can be discrete as in the decision tree classifier or continuous as Informedness, Markedness and Correlation assessment methods
the Naive Bayes classifier [7]. However, the outputs of learning with details explanations [16]. Sokolova et al. reported some met-
algorithms need to be assessed and analyzed carefully and this rics which are used in medical diagnosis [20]. Moreover, a good
analysis must be interpreted correctly, so as to evaluate different investigation of some measures and the robustness of these mea-
learning algorithms. sures against different changes in the confusion matrix are intro-
The classification performance is represented by scalar values duced in [21]. Tom Fawcett presented a detailed introduction to
as in different metrics such as accuracy, sensitivity, and specificity. the ROC curve including (1) good explanations of the basics of
Comparing different classifiers using these measures is easy, but it the ROC curve, (2) clear example for generating the ROC curve,
has many problems such as the sensitivity to imbalanced data and (3) comprehensive discussions, and (4) good explanations of the
ignoring the performance of some classes. Graphical assessment Area under curve (AUC) metric [8]. Jesse Davis and Mark Goadrich
methods such as Receiver operating characteristics (ROC) and reported the relationship between the ROC and Precision-Recall
curves [5]. Our paper introduces a detailed overview of the classi-
fication assessment methods with the goal of providing the basic
Peer review under responsibility of King Saud University. principles of these measures and to show how it works to serve
as a comprehensive source for researchers who are interested in
this field. This paper has details of most of the well-known classi-
fication assessment methods. Moreover, this paper introduces (1)
Production and hosting by Elsevier the relations between different assessment methods, (2) numerical
examples to show how to calculate these assessment methods, (3)
E-mail address: [email protected]

https://doi.org/10.1016/j.aci.2018.08.003
2210-8327/Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
2 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

the robustness of each method against imbalanced data which is


one of the most important problems in real-time applications,
and (4) explanations of different curves in a step-by-step approach.
This paper is divided into eight sections. Section 2 gives an over-
view of the classification assessment methods. This section begins
by explaining the confusion matrix for binary and multi-class clas-
sification problems. Based on the data that can be extracted from
the confusion matrix, many classification metrics can be calcu-
lated. Moreover, the influence of balanced and imbalanced data
Fig. 1. An illustrative example of the 2  2 confusion matrix. There are two true
on each assessment method is introduced. Additionally, an illustra- classes P and N. The output of the predicted class is true or false.
tive numerical example is presented to show (1) how to calculate
these measures in both binary and multi-class classification prob-
lems, and (2) the robustness of some measures against balanced
and imbalanced data. Section 3 introduces the basics of the ROC
curve, which are required for understanding how to plot and inter-
pret it. This section also presents visualized steps with an illustra-
tive example for plotting the ROC curve. The AUC measure is
presented in Section 4. In this section, the AUC algorithm with
detailed steps is explained. Section 5 presents the basics of the
Precision-Recall curve and how to interpret it. Further, in a step-
by-step approach, different numerical examples are demonstrated
to explain the preprocessing steps of plotting ROC and PR curves in
Sections 3 and 5. Classification assessment methods for biometric
models including steps of plotting the DET curve are presented in Fig. 2. An illustrative example of the confusion matrix for a multi-class classifica-
tion test.
Section 6. In Section 7, results in terms of different assessment
methods of a simple experiment are presented. Finally, concluding
remarks will be given in Section 8. The green diagonal represents correct predictions and the pink
diagonal indicates the incorrect predictions. If the sample is posi-
tive and it is classified as positive, i.e., correctly classified positive
2. Classification performance sample, it is counted as a true positive (TP); if it is classified as neg-
ative, it is considered as a false negative (FN) or Type II error. If the
The assessment method is a key factor in evaluating the classi- sample is negative and it is classified as negative it is considered as
fication performance and guiding the classifier modeling. There are true negative (TN); if it is classified as positive, it is counted as false
three main phases of the classification process, namely, training positive (FP), false alarm or Type I error. As we will present in the
phase, validation phase, and testing phase. The model is trained next sections, the confusion matrix is used to calculate many com-
using input patterns and this phase is called the training phase. mon classification metrics.
These input patterns are called training data which are used for Fig. 2 shows the confusion matrix for a multi-class classification
training the model. During this phase, the parameters of a classifi- problem with three classes (A, B, and C). As shown, TPA is the num-
cation model are adjusted. The training error measures how well ber of true positive samples in class A, i.e., the number of samples
the trained model fits the training data. However, the training error that are correctly classified from class A, and EAB is the samples
always smaller than the testing error and the validation error from class A that were incorrectly classified as class B, i.e., misclas-
because the trained model fits the same data which are used in sified samples. Thus, the false negative in the A class (FN A ) is the
the training phase. The goal of a learning algorithm is to learn from sum of EAB and EAC (FN A ¼ EAB þ EAC ) which indicates the sum of
the training data to predict class labels for unseen data; this is in all class A samples that were incorrectly classified as class B or C.
the testing phase. However, the testing error or out-of-sample Simply, FN of any class which is located in a column can be calcu-
error cannot be estimated because the class labels or outputs of lated by adding the errors in that class/column. Whereas the false
testing samples are unknown. This is the reason why the validation positive for any predicted class which is located in a row repre-
phase is used for evaluating the performance of the trained model. sents the sum of all errors in that row. For example, the false pos-
In the validation phase, the validation data provide an unbiased itive in class A (FP A ) is calculated as follows, FP A ¼ EBA þ ECA . With
evaluation of the trained model while tuning the model’s m  m confusion matrix there are m correct classifications and
hyperparameters. m2  m possible errors [22].
According to the number of classes, there are two types of clas-
sification problems, namely, binary classification where there are
only two classes, and multi-class classification where the number 2.1. Classification metrics with imbalanced data
of classes is higher than two. Assume we have two classes, i.e., bin-
ary classification, P for positive class and N for negative class. An Different assessment methods are sensitive to the imbalanced
unknown sample is classified to P or N. The classification model data when the samples of one class in a dataset outnumber the
that was trained in the training phase is used to predict the true samples of the other class(es) [25]. To explain this is so, consider
classes of unknown samples. This classification model produces the confusion matrix in Fig. 1. The class distribution is the ratio
continuous or discrete outputs. The discrete output that is gener- between the positive and negative samples (NP ) represents the rela-
ated from a classification model represents the predicted discrete tionship between the left column to the right column. Any assess-
class label of the unknown/test sample, while continuous output ment metric that uses values from both columns will be sensitive
represents the estimation of the sample’s class membership to the imbalanced data as reported in [8]. For example, some met-
probability. rics such as accuracy and precision1 use values from both columns
Fig. 1 shows that there are four possible outputs which repre-
sent the elements of a 2  2 confusion matrix or a contingency table. 1
More details about these two metrics are in Sections 2.2 and 2.5.

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 3

Fig. 3. Visualization of different metrics and the relations between these metrics. Given two classes, red class and blue class. The black circle represents a classifier that
classifies the sample inside the circle as red samples (belong to the red class) and the samples outside the circle as blue samples (belong to the blue class). Green regions
indicate the correctly classified regions and the red regions indicate the misclassified regions. (For interpretation of the references to colour in this figure legend, the reader is
referred to the web version of this article.)

of the confusion matrix; thus, as data distributions change, these 2.2. Accuracy and error rate
metrics will change as well, even if the classifier performance does
not. Therefore, such these metrics cannot distinguish between the Accuracy (Acc) is one of the most commonly used measures for
numbers of corrected labels from different classes [11]. This fact is the classification performance, and it is defined as a ratio between
partially true because there are some metrics such as Geometric the correctly classified samples to the total number of samples as
Mean (GM) and Youden’s index (YI)2 use values from both columns follows [20]:
and these metrics can be used with balanced and imbalanced data.
This can be interpreted as that the metrics which use values from TP þ TN
Acc ¼ ð1Þ
one column cancel the changes in the class distribution. However, TP þ TN þ FP þ FN
some metrics which use values from both columns are not sensitive where P and N indicate the number of positive and negative sam-
to the imbalanced data because the changes in the class distribution ples, respectively.
cancel each other. For example, the accuracy is defined as follows, The complement of the accuracy metric is the Error rate (ERR) or
Acc ¼ TPþTNþFPþFN
TPþTN
and the GM is defined as follows, misclassification rate. This metric represents the number of misclas-
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
GM ¼ TPR  TNR ¼ TPþFN  TNþFP; thus, both metrics use values
TP TN sified samples from both positive and negative classes, and it is cal-
culated as follows, EER ¼ 1  Acc ¼ ðFP þ FNÞ=ðTP þ TN þ FP þ FNÞ
from both columns of the confusion matrix. Changing the class dis-
[4]. Both accuracy and error rate metrics are sensitive to the imbal-
tribution can be obtained by increasing/decreasing the number of
anced data. Another problem with the accuracy is that two classi-
samples of negative/positive class. With the same classification per-
fiers can yield the same accuracy but perform differently with
formance, assume that the negative class samples are increased by a
respect to the types of correct and incorrect decisions they provide
times; thus, the TN and FP values will be aTN and aFP, respectively;
aTN [9]. However, Takaya Saito and Marc Rehmsmeier reported that the
thus, the accuracy will be, Acc ¼ TPþaTPþ TNþaFPþFN
– TPþTNþFPþFN
TPþTN
. This
accuracy is suitable with imbalanced data because they found that
means that the accuracy is affected by the changes in the class dis- the accuracy values of the balanced and imbalanced data in their
tribution. On the other hand, the GM metric will be, example were identical [17]. The reason why the accuracy values
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
GM ¼ TPþFN TP
 aTNþ aTN ¼ TP
 TNþFPTN
and hence the changes in the were identical in their example is that the sum of TP and TN in
aFP TPþFN

negative class cancel each other. This is the reason why the GM met- the balanced and imbalanced data was the same.
ric is suitable for the imbalanced data. Similarly, any metric can be
checked to know if it is sensitive to the imbalanced data or not. 2.3. Sensitivity and specificity

Sensitivity, True positive rate (TPR), hit rate, or recall, of a classifier


2
More details about these two metrics are in Section 2.8. represents the positive correctly classified samples to the total

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
4 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

number of positive samples, and it is estimated according to Eq. (2) TN


NPV ¼ ¼ 1  FOR ð7Þ
[20]. Whereas specificity, True negative rate (TNR), or inverse recall is FN þ TN
expressed as the ratio of the correctly classified negative samples to The accuracy can also be defined in terms of precision and
the total number of negative samples as in Eq. (2) [20]. Thus, the inverse precision as follows [16]:
specificity represents the proportion of the negative samples that
were correctly classified, and the sensitivity is the proportion of TP þ FP TN þ FN
Acc ¼  PPV þ  NPV
the positive samples that were correctly classified. Generally, we PþN PþN
can consider sensitivity and specificity as two kinds of accuracy, TP þ FP TP TN þ FN TN
¼  þ 
where the first for actual positive samples and the second for actual PþN TP þ FP PþN TN þ FN
negative samples. Sensitivity depends on TP and FN which are in the TP þ TN
¼ ð8Þ
same column of the confusion matrix, and similarly, the specificity TP þ TN þ FP þ FN
metric depends on TN and FP which are in the same column; hence,
both sensitivity and specificity can be used for evaluating the clas-
sification performance with imbalanced data [9].
TP TP TN TN 2.6. Likelihood ratio
TPR ¼ ¼ ; TNR ¼ ¼ ð2Þ
TP þ FN P FP þ TN N
The likelihood ratio combines both sensitivity and specificity,
The accuracy can also be defined in terms of sensitivity and and it is used in diagnostic tests. In that tests, not all positive
specificity as follows [20]: results are true positives and also the same for negative results;
TP þ TN hence, the positive and negative results change the probability/
Acc ¼ likelihood of diseases. Likelihood ratio measures the influence of
TP þ TN þ FP þ FN
P N a result on the probability. Positive likelihood (LRþ) measures
¼TPR  þ TNR  how much the odds of the disease increases when a diagnostic
PþN PþN
TP P TN N test is positive, and it is calculated as in Eq. (9) [20]. Similarly,
¼ þ Negative likelihood (LR) measures how much the odds of the
TP þ FN P þ N TN þ FP P þ N
disease decreases when a diagnostic test is negative, and it is
TP TN TP þ TN
¼ þ ¼ ð3Þ calculated as in Eq. (9). Both measures depend on the sensitivity
P þ N P þ N TP þ TN þ FP þ FN
and specificity measures; thus, they are suitable for balanced and
imbalanced data [6].
TPR TPR 1  TPR
LRþ ¼ ¼ ; LR ¼ ð9Þ
2.4. False positive and false negative rates 1  TNR FPR TNR
Both LRþ and LR are combined into one measure which sum-
False positive rate (FPR) is also called false alarm rate (FAR), or
marizes the performance of the test, this measure is called Diagnos-
Fallout, and it represents the ratio between the incorrectly classi-
tic odds ratio (DOR). The DOR metric represents the ratio between
fied negative samples to the total number of negative samples
the positive likelihood ratio to the negative likelihood ratio as in
[16]. In other words, it is the proportion of the negative samples
Eq. (10). This measure is utilized for estimating the discriminative
that were incorrectly classified. Hence, it complements the speci-
ability of the test and also for comparing between two diagnostic
ficity as in Eq. (4) [21]. The False negative rate (FNR) or miss rate
tests. From Eq. (10) it can be remarked that the value of DOR
is the proportion of positive samples that were incorrectly classi-
increases when (1) the TP and TN are high and (2) the FP and FN
fied. Thus, it complements the sensitivity measure and it is defined
are low [18].
in Eq. (5). Both FPR and FNR are not sensitive to changes in data
distributions and hence both metrics can be used with imbalanced LRþ TPR TNR TP  TN
DOR ¼ ¼  ¼ ð10Þ
data [9]. LR 1  TNR 1  TPR FP  FN
FP FP
FPR ¼ 1  TNR ¼ ¼ ð4Þ
FP þ TN N 2.7. Youden’s index

FN FN Youden’s index (YI) or Bookmaker Informedness (BM) metric is


FNR ¼ 1  TPR ¼ ¼ ð5Þ
FN þ TP P one of the well-known diagnostic tests. It evaluates the discrimina-
tive power of the test. The formula of Youden’s index combines the
sensitivity and specificity as in the DOR metric, and it is defined as
2.5. Predictive values
follows, YI ¼ TPR þ TNR  1 [20]. The YI metric is ranged from zero
when the test is poor to one which represents a perfect diagnostic
Predictive values (positive and negative) reflect the performance
test. It is also suitable with imbalanced data. One of the major dis-
of the prediction. Positive prediction value (PPV) or precision repre-
advantages of this test is that it does not change concerning the
sents the proportion of positive samples that were correctly classi-
differences between the sensitivity and specificity of the test. For
fied to the total number of positive predicted samples as indicated
example, given two tests, the sensitivity values for the first and
in Eq. (6) [20]. On the contrary, Negative predictive value (NPV),
second tests are 0.7 and 0.9, respectively, and the specificity values
inverse precision, or true negative accuracy (TNA) measures the pro-
for the first and second tests are 0.8 and 0.6, respectively; the YI
portion of negative samples that were correctly classified to the total
value for both tests is 0.5.
number of negative predicted samples as indicated in Eq. (7) [16].
These two measures are sensitive to the imbalanced data [21,9].
False discovery rate (FDR) and False omission rate (FOR) measures 2.8. Another metrics
complements the PPV and NPV, respectively (see Eq. (6) and (7)).
There are many different metrics that can be calculated from
TP the previous metrics. Some details about each measure are as
PPV ¼ Precision ¼ ¼ 1  FDR ð6Þ
FP þ TP follow:

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 5

 Matthews correlation coefficient (MCC): this metric was intro-  Balanced classification rate or balanced accuracy (BCR): this met-
duced by Brian W. Matthews in 1975 [14], and it represents ric combines the sensitivity and specificity metrics and it is cal-
the correlation between the observed and predicted classifica- culated as follows, BCR ¼ 12 ðTPR þ TNRÞ ¼ 12 ðTPþFN
TP
þ TNþFP
TN
Þ. Also,
tions, and it is calculated directly from the confusion matrix Balance error rate (BER) or Half total error rate (HTER) represents
as in Eq. (11). A coefficient of þ1 indicates a perfect prediction, 1  BCR. Both BCR and BER metrics can be used with imbalanced
1 represents total disagreement between prediction and true datasets.
values and zero means that no better than random prediction  Geometric Mean (GM): The main goal of all classifiers is to
[16,3]. This metric is sensitive to imbalanced data. improve the sensitivity, without sacrificing the specificity. How-
TP  TN  FP  FN ever, the aims of sensitivity and specificity are often conflicting,
MCC ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi which may not work well, especially when the dataset is imbal-
ðTP þ FPÞðTP þ FNÞðTN þ FPÞðTN þ FNÞ
anced. Hence, the Geometric Mean (GM) metric aggregates both
TP
 TPR  PPV sensitivity and specificity measures according to Eq. (15) [3].
¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
N
ffi ð11Þ
PPV  TPRð1  TPRÞð1  PPVÞ Adjusted Geometric Mean (AGM) is proposed to obtain as much
information as possible about each class [11]. The AGM metric
is defined according to Eq. (16).
 Discriminant power (DP): this measure depends on the sensitiv- rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi TP TN
ity and specificity and it is defined as follows, DP ¼ GM ¼ TPR  TNR ¼  ð15Þ
pffiffi
TP þ FN TN þ FP
p ðlogð1TNRÞ þ logð1TPRÞÞ [20]. This metric evaluates how well
3 TPR TNR

the classification model distinguishes between positive and (


GMþTNRðFPþTNÞ
negative samples. Since this metric depends on the sensitivity if TPR > 0
AGM ¼ 1þFPþTN ð16Þ
and specificity metrics; it can be used with imbalanced data. 0 if TPR ¼ 0
 F-measure: this is also called F 1 -score, and it represents the har-
monic mean of precision and recall as in Eq. (12) [20]. The value GM metric can be used with imbalanced datasets. Lopez et al.
of F-measure is ranged from zero to one, and high values of F- reported that the AGM metric is suitable with the imbalanced
measure indicate high classification performance. This measure data [12]. However, changing the distribution of negative class
has another variant which is called F b -measure. This variant has a small influence on the AGM metric and hence it is not suit-
represents the weighted harmonic mean between precision able with the imbalanced data. This is can be proved simply by
and recall as in Eq. (13). This metric is sensitive to changes in assuming that the negative class samples are increased by a
data distributions. Assume that the negative class samples are times. Thus, the AGM metric is calculated as follows,
aFPþaTNÞ
increased by a times; thus, the F  measure is calculated as fol- AGM ¼ GMþTNRð
1þaFPþaTN
; as a consequence, the AGM metric is
lows, F  measure ¼ 2TPþa2TP FPþaFN
and hence this metric is affected slightly affected by the changes in the class distribution.
by the changes in the class distribution.  Optimization precision (OP): This metric is defined as follows:
2PPV  TPR jTPR  TNRj
F  measure ¼ OP ¼ Acc  ð17Þ
PPV þ TPR TPR þ TNR
2TP
¼ ð12Þ where the second term jTPRTNRj computes how balanced both
2TP þ FP þ FN TPRþTNR
class accuracies are and this metric represents the difference
between the global accuracy and that term [9]. High OP value
indicates high accuracy and well-balanced class accuracies.
PPV:TPR
F b  measure ¼ð1 þ b2 Þ Since the OP metric depends on the accuracy metric, it is not
b2 PPV þ TPR suitable for imbalanced data.
ð1 þ b2 ÞTP  Jaccard: This metric is also called Tanimoto similarity coeffi-
¼ ð13Þ
ð1 þ b ÞTP þ b2 FN þ FP
2 cient. Jaccard metric explicitly ignores the correct classification
of negative samples as follows, Jaccard ¼ TPþFPþFN
TP
. Jaccard metric
is sensitive to changes in data distributions.
Adjusted F-measure (AGF) was introduced in [13]. The F-
measures used only three of the four elements of the confusion Fig. 4 shows the relations between different classification
matrix and hence two classifiers with different TNR values may assessment methods. As shown, all assessment methods can be
have the same F-score. Therefore, the AGF metric is introduced calculated from the confusion matrix. As shown, there are two
to use all elements of the confusion matrix and provide more classes; red class and blue class. After applying a classifier, the clas-
weights to samples which are correctly classified in the minority sifier is represented by a black circle and the samples which are
class. This metric is defined as follows: inside the circle are classified as red class samples and the samples
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
AGF ¼ F 2 :Inv F 0:5 ð14Þ
where F 2 is the F-measure where b ¼ 2 and Inv F 0:5 is calculated
by building a new confusion matrix where the class label of each
sample is switched (i.e. positive samples become negative and
vice versa).
 Markedness (MK): this is defined based on PPV and NPV metrics
as follows, MK ¼ PPV þ NPV  1 [16]. This metric sensitive to
data changes and hence it is not suitable for imbalanced data.
This is because the Markedness metric depends on PPV and
NPV metrics and both PPV and NPV are sensitive to changes
in data distributions. Fig. 4. Results of a multi-class classification test (our example).

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
6 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

outside the circle are classified as blue class samples. Additionally, specificity values of A, B, and C are 185
185þ15
 0:93; ð175þ25Þ
175
¼ 0:875,
from the figure, it is clear that many assessment methods depend and 180
¼ 0:9, respectively.
ð180þ20Þ
on the TPR and TNR metrics, and all assessment methods can be
estimated from the confusion matrix.
3. Receiver operating characteristics (ROC)
2.9. Illustrative example
The receiver operating characteristics (ROC) curve is a two-
dimensional graph in which the TPR represents the y-axis and
In this section, two examples are introduced. These examples
FPR is the x-axis. The ROC curve has been used to evaluate many
explain how to calculate classification metrics using two classes
systems such as diagnostic systems, medical decision-making sys-
or multiple classes.
tems, and machine learning systems [26]. It is used to make a bal-
ance between the benefits, i.e., true positives, and costs, i.e., false
2.9.1. Binary classification example positives. Any classifier that has discrete outputs such as decision
In this example, assume we have two classes (A and B), i.e., bin- trees is designed to produce only a class decision, i.e., a decision
ary classification, and each class has 100 samples. The A class rep- for each testing sample, and hence it generates only one confusion
resents the positive class while the B class represents the negative matrix which in turn corresponds to one point into the ROC space.
class. The number of correctly classified samples in class A and B However, there are many methods that were introduced for gener-
are 70 and 80, respectively. Hence, the values of TP; TN; FP, and ating full ROC curve from a classifier instead of only a single point
FN are 70, 80, 20, and 30, respectively. The values of different such as using class proportions [26] or using some combinations of
classification metrics are as follows, Acc ¼ 70þ80þ20þ30 70þ80
¼ scoring and voting [8]. On the other hand, in continuous output
0:75; TPR ¼ 70þ30
70
¼ 0:7; TNR ¼ 80þ20
80
¼ 0:8; PPV ¼ 70þ20
70
 0:78; classifiers such as the Naive Bayes classifier, the output is repre-
NPV ¼ 80
 0:73; Err ¼ 1  Acc ¼ 0:25; BCR ¼ 12 ð0:7 þ 0:8Þ ¼ 0:75; sented by a numeric value, i.e., score, which represents the degree
80þ30
to which a sample belongs to a specific class. The ROC curve is gen-
FPR¼10:8¼0:2; FNR¼10:7¼0:3; F measure¼ ð270þ20þ30Þ
270
¼0:74;
erated by changing the threshold on the confidence score; hence,
OP ¼ Acc  jTPRTNRj
TPRþTNR
¼ 0:75  j0:70:8j
0:7þ0:8
 0:683; LRþ ¼ 10:8
0:7
¼ 3:5; LR ¼ each threshold generates only one point in the ROC curve [8].
10:7
0:8
¼ 0:375; DOR ¼ 0:375
3:5
 9:33; YI ¼ 0:7 þ 0:8  1 ¼ 0:5, and Fig. 5 shows an example of the ROC curve. As shown, there are
Jaccard ¼ 70þ20þ30
70
 0:583. four important points in the ROC curve. The point A, in the lower
We increased the number of samples of the B class to 1000 to left corner ð0; 0Þ represents a classifier where there is no positive
show how the classification metrics are changed when using classification, while all negative samples are correctly classified
imbalanced data, and there are 800 samples from class B were cor- and hence TPR ¼ 0 and FPR ¼ 0. The point C, in the top right corner
rectly classified. As a consequence, the values of TP; TN; FP, and FN (1,1), represents a classifier where all positive samples are cor-
are 70, 800, 200, and 30, respectively. Consequently, only the rectly classified, while the negative samples are misclassified.
values of accuracy, precision/PPV, NPV, error rate, Optimization The point D in the lower right corner ð1; 0Þ represents a classifier
precision, F-measure, and Jaccard are changed as follows, where all positive and negative samples are misclassified. The
point B in the upper left corner ð0; 1Þ represents a classifier where
Acc ¼ 70þ800þ200þ30
70þ800
 0:79; PPV ¼ 70þ200
70
 0:26; NPV ¼ 800þ30
800
 0:96;
all positive and negative samples are correctly classified; thus, this
Err ¼ 1  Acc ¼ 0:21; OP ¼ 0:79  j0:70:8j
0:7þ0:8
 0:723; F  measure ¼ point represents the perfect classification or the Ideal operating
270
ð270þ200þ30Þ
¼ 0:378, and Jaccard ¼ 70
70þ200þ30
 0:233. This example point. Fig. 5 shows the perfect classification performance. It is the
reflects that the accuracy, precision, NPV, F-measure, and Jaccard green curve which rises vertically from (0,0) to (0,1) and then hor-
metrics are sensitive to imbalanced data. izontally to (1,1). This curve reflects that the classifier perfectly
ranked the positive samples relative to the negative samples.
A point in the ROC space is better than all other points that are
2.9.2. Multi-classification example
in the southeast, i.e., the points that have lower TPR, higher FPR, or
In this example, there are three classes A, B, and C, the results of
a classification test are shown in Fig. 4. From the figure, the values
of TP A ; TP B , and TBC are 80, 70, and 90, respectively, which repre-
sent the diagonal in Fig. 4. The values of false negative for each
class (true class) are calculated as mentioned before by adding
all errors in the column of that class. For example,
FNA ¼ EAB þ EAC ¼ 15 þ 5 ¼ 20, and similarly FNB ¼ EBA þ EBC ¼
15 þ 15 ¼ 30 and FNC ¼ ECA þ ECB ¼ 0 þ 10 ¼ 10. The values of false
positive for each class (predicted class) are calculated as men-
tioned before by adding all errors in the row of that class. For
example, FP A ¼ EBA þ ECA ¼ 15 þ 0 ¼ 15, and similarly
FP B ¼ EAB þ ECB ¼ 15 þ 10 ¼ 25 and FP C ¼ EAC þ EBC ¼ 5 þ 15 ¼ 20.
The value of true negative for the class A (TN A ) can be calculated
by adding all columns and rows excluding the row and column
of class A; this is similar to the TN in the 2  2 confusion matrix.
Hence, the value of TN A is calculated as follows,
TN A ¼ 70 þ 90 þ 10 þ 15 ¼ 185, and similarly TN B ¼ 80 þ 0þ
5 þ 90 ¼ 175 and TN C ¼ 80 þ 70 þ 15 þ 15 ¼ 180. Using
TP; TN; FP, and FN we can calculate all classification measures. For
80þ70þ90
example, the accuracy is 100þ100þ100 ¼ 0:8. The sensitivity and speci-
ficity are calculated for each class. For example, the sensitivity of A
is TPATPþFN
A
¼ 80þ15þ5
80
¼ 0:8, and similarly the sensitivity of B and C
A Fig. 5. A basic ROC curve showing important points, and the optimistic, pessimistic
classes are 70
70þ15þ15
¼ 0:7 and 90
90þ0þ10
¼ 0:9, respectively, and the and expected ROC segments for equally scored samples.

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 7

both (see Fig. 5). Therefore, any classifier appears in the lower right
triangle performs worse than the classifier appears in the upper
left triangle.
Fig. 6 shows an example of the ROC curve. In this example, a test
set consists of 20 samples from two classes; each class has ten
samples, i.e., ten positive and ten negative samples. As shown in
the table in Fig. 6, the initial step to plot the ROC curve is to sort
the samples according to their scores. Next, the threshold value
is changed from maximum to minimum to plot the ROC curve.
To scan all samples, the threshold is ranged from 1 to 1. The
samples are classified into the positive class if their scores are
higher than or equal the threshold; otherwise, it is estimated as
negative [8]. Figs. 7 and 8 shows how changing the threshold value
changes the TPR and FPR. As shown in Fig. 6, the threshold value is
set at maximum (t1 ¼ 1); hence, all samples are classified as neg-
ative samples and the values of FPR and TPR are zeros and the posi-
tion of t 1 is in the lower left corner (the point (0,0)). The threshold
value is decreased to 0:82, and the first sample is classified cor- Fig. 7. An illustrative example of the ROC curve. The values of TPR and FPR of each
rectly as a positive sample (see Figs. 6–8(a)). The TPR increased point/threshold are calculated in Table 1.
to 0:1, while the FPR remains zero. As the threshold is further
reduced to be 0:8, the TPR is increased to 0:2 and the FPR remains
zero. As shown in Fig. 7, increasing the TPR moves the ROC curve  t8 : As the threshold further decreased to be 0:54, the threshold
up while increasing the FPR moves the ROC curve to the right as line moves to the left. This means that more positive samples
in t 4 . The ROC curve must pass through the point (0,0) where the have the chance to be correctly classified; on the other hand,
threshold value is 1 (in which all samples are classified as nega- some negative samples are misclassified. As a consequence,
tive samples) and the point (1,1) where the threshold is 1 (in the values of TP and FP are increased as shown in Fig. 8(c),
which all samples are classified as positive samples). and the values of TN and FN decreased.
Fig. 8 shows graphically the performance of the classification  t11 : This is an important threshold value where the numbers of
model with different threshold values. From this figure, the follow- errors from both positive and negative classes are equal (see
ing remarks can be drawn. Fig. 8(d)) TP ¼ TN ¼ 6 and FP ¼ FN ¼ 4).
 t14 : Reducing the value of the threshold to 0:37 results more
 t 1 : The value of this threshold was 1 as shown in Fig. 8a) and correctly classified positive samples and this increases TP and
hence all samples are classified as negative samples. This means reduces FN as shown in Fig. 8(e). On the contrary, more negative
that (1) all positive samples are incorrectly classified; hence, the samples are misclassified and this increases FP and reduces TN.
value of TP is zero, (2) all negative samples are correctly classi-  t20 : As shown in Fig. 8(f), decreasing the threshold value hides
fied and hence there is no FP (see also Fig. 6). the FN area. This is because all positive samples are correctly
 t 3 : The threshold value decreased as shown in Fig. 8b) and as classified. Also, from the figure, it is clear that the FP area is
shown there are two positive samples are correctly classified. much larger than the area of TN. This is because 90% of the neg-
Therefore, according to the positive class, only the positive sam- ative samples are incorrectly classified, and only 10% of negative
ples which have scores more than or equal this threshold (t3 ) samples are correctly classified.
will be correctly classified, i.e., TP, while the other positive sam-
ples are incorrectly classified, i.e., FN. In this threshold, also all From Fig. 7 it is clear that the ROC curve is a step function. This is
negative samples are correctly classified; thus, the value of FP because we only used 20 samples (a finite set of samples) in our
is still zero. example and a true curve can be obtained when the number of

Fig. 6. An illustrative example to calculate the TPR and FPR when the threshold value is changed.

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
8 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

Fig. 8. A visualization of how changing the threshold changes the TP; TN; FP, and FN values.

samples increased. The figure also shows that the best accuracy each positive sample while the value of FP is increased for each
(70%) (see Table 1) is obtained at (0.1,0.5) when the threshold value negative sample. Next, the values of TPR and FPR are calculated
was P 0:6, rather than at P 0:5 as we might expect with a balanced and pushed into the ROC stack (see step 6). When the threshold
data. This means that the given learning model identifies positive becomes very low (threshold ! 1), all samples are classified as
samples better than negative samples. Since the ROC curve depends positive samples and hence the values of both TPR and FPR are one.
mainly on changing the threshold value, comparing classifiers with Steps 5–8 handle sequences of equally scored samples. Assume
different score ranges will be meaningless. For example, assume we we have a test set which consists of P positive samples and N neg-
have two classifiers, the first generates scores in the range [0,1] and ative samples. In this test set, assume we have p positive samples
the other generates scores in the range [-1,+1] and hence we cannot and n negative samples with the same score value. There are two
compare these classifiers using the ROC curve. extreme cases. In the first case which is the optimistic case, all pos-
The steps of generating ROC curve are summarized in Algorithm itive samples end up at the beginning of the sequence, and this
1. The algorithm requires OðnlognÞ for sorting samples, and OðnÞ for case represents the upper L segment of the rectangle in Fig. 5. In
scanning them; resulting in OðnlognÞ total complexity, where n is the second case, i.e., pessimistic case, all the negative samples
the number of samples. As shown, the two main steps to generate end up at the beginning of the sequence, and this case represents
ROC points are (1) sorting samples according to their scores and (2) the lower L segment of the rectangle in Fig. 5. The ROC curve rep-
changing the threshold value from maximum to minimum to pro- resents the expected performance which is the average of the two
cess one sample at a time and update the values of TP and FP in cases, and it represents the diagonal of the rectangle in Fig. 5. The
pn
each time. The algorithm shows that the TP and the FP start at zero. size of this rectangle is PN , and the number of errors in both opti-
The algorithm scans all samples and the value of TP is increased for pn
mistic and pessimistic cases can be calculated as follows, 2PN .

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 9

Table 1
Values of TP; FN; TN; FP; TPR; FPR; FNR, precision (PPV), and accuracy (Acc in %) of our ROC example when changes the threshold value.

Threshold TP FN TN FP TPR FPR FNR PPV Acc


t1 ¼ 1 0 10 10 0 0 0 1 – 50
t2 ¼ 0:82 1 9 10 0 0.1 0 0.9 1.0 55
t3 ¼ 0:80 2 8 10 0 0.2 0 0.8 1.0 60
t4 ¼ 0:75 2 8 9 1 0.2 0.1 0.8 0.67 55
t5 ¼ 0:70 3 7 9 1 0.3 0.1 0.7 0.75 60
t6 ¼ 0:62 4 6 9 1 0.4 0.1 0.6 0.80 65
t7 ¼ 0:60 5 5 9 1 0.5 0.1 0.5 0.83 70
t8 ¼ 0:54 5 5 8 2 0.5 0.2 0.5 0.71 65
t9 ¼ 0:50 5 5 7 3 0.5 0.3 0.5 0.63 60
t10 ¼ 0:49 6 4 7 3 0.6 0.3 0.4 0.67 65
t11 ¼ 0:45 6 4 6 4 0.6 0.4 0.4 0.60 60
t12 ¼ 0:40 7 3 6 4 0.7 0.4 0.3 0.64 65
t13 ¼ 0:39 7 3 5 5 0.7 0.5 0.3 0.58 60
t14 ¼ 0:37 8 2 5 5 0.8 0.5 0.2 0.62 65
t15 ¼ 0:32 8 2 4 6 0.8 0.6 0.2 0.57 60
t16 ¼ 0:30 8 2 3 7 0.8 0.7 0.2 0.53 55
t17 ¼ 0:26 8 2 2 8 0.8 0.8 0.2 0.50 50
t18 ¼ 0:23 9 1 2 8 0.9 0.8 0.1 0.53 55
t19 ¼ 0:21 9 1 1 9 0.9 0.9 0.1 0.50 50
t20 ¼ 0:19 10 0 1 9 1.0 0.9 0 0.53 55
t21 ¼ 0:10 10 0 0 10 1.0 1.0 0 0.50 50

(500 positive samples are correctly classified from 1000 positive


Algorithm 1: Generating ROC Curve. samples) and 80% specificity (800 negative samples are correctly
1: Given a set of test samples (Stest ¼ fs1 ; s2 ; . . . ; sN g), where N classified from 1000 negative samples). If the class distribution
is the total number of test samples, f ðiÞ is the classifier that changed to be imbalanced and the first and second classes have
classify the ith sample to positive or negative classes, P and 1000 and 10,000 samples, respectively. Hence, the same point
N represent the total number of positive and negative (0.2, 0.5) means that the classifier obtained 50% sensitivity (500
samples, respectively. positive samples are correctly classified from 1000 positive sam-
2: Sort the samples corresponding to their scores, where Ssorted ples) and 80% specificity (8000 negative samples are correctly clas-
is the sorted samples. sified from 1000 negative samples). The AUC4 score for both cases
3: FP 0, TP 0, f prev 1, and ROC ¼ ½ . 4: for i ¼ 1 to are the same while the other metrics which are sensitive to the
jSsorted j do imbalanced data will be changed. For example, the accuracy rates
5: if f ðiÞ – f prev then of the classifier using the balanced and imbalanced data are 65
FP TP and 77.3%, respectively, and the precision values will be  0:71
6: ROCðiÞ N ; P , f prev f ðiÞ
and 0.20, respectively. These results reflect how the precision and
7: end if accuracy metrics are sensitive to the imbalanced data as mentioned
8: if Ssorted ðiÞ is a positive sample then in Section 2.1.
9: TP TP þ 1. It is worth mentioning that the comparison between different
10: else classifiers using ROC is valid only when (1) there is only single
11: FP FP þ 1. dataset, (2) there are multiple datasets with the same data size
12: end if and the same positive:negative ratio.
13: end for
FP TP
14: ROCðiÞ N ; P .
4. Area under the ROC curve (AUC)

In multi-class classification problems, plotting ROC becomes Comparing different classifiers in the ROC curve is not easy. This
much more complex than in binary classification problems. One is because there is no scalar value represents the expected perfor-
of the well-known methods to handle this problem is to produce mance. Therefore, the Area under the ROC curve (AUC) metric is
one ROC curve for each class. For plotting ROC of the class i (ci ), used to calculate the area under the ROC curve. The AUC score is
the samples from ci represent positive samples and all the other always bounded between zero and one, and there is no realistic
samples are negative samples. classifier has an AUC lower than 0.5 [4,15].
ROC curves are robust against any changes to class distribu- Fig. 9 shows the AUC value of two classifiers, A and B. As shown,
tions. Hence, if the ratio of positive to negative samples changes the AUC of B classifier is greater than A; hence, it achieves better
in a test set, the ROC curve will not change. In other words, ROC performance. Moreover, the gray shaded area is common in both
curves are insensitive with the imbalanced data. This is because classifiers, while the red shaded area represents the area where
ROC depends on TPR and FPR, and each of them is a columnar the B classifier outperforms the A classifier. It is possible for a
ratio3. lower AUC classifier to outperform a higher AUC classifier in a
The following example compares between the ROC using bal- specific region. For example, in Fig. 9, the classifier B outperforms
anced and imbalanced data. Assume the data is balanced and it A except at FPR > 0:6 where A has a slight difference (blue shaded
consists of two classes each has 1000 samples. The point (0.2,0.5) area). However, two classifiers with two different ROC curves may
on the ROC curve means that the classifier obtained 50% sensitivity have the same AUC score.
The AUC value is calculated as in Algorithm 2. As shown, the
steps in Algorithm 2 represent a slight modification from
3
As mentioned before TPR ¼ TPþFN
TP
¼ TP
P and both TP and FN are in the same column,
4
and similarly FNR. The AUC metric will be explained in Section 4

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
10 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

of ci [10]. This method of calculating the AUC score is simple and fast
but it is sensitive to class distributions and error costs.

5. Precision-Recall (PR) curve

Precision and recall metrics are widely used for evaluating the
classification performance. The Precision-Recall (PR) curve has the
same concept of the ROC curve, and it can be generated by changing
the threshold as in ROC. However, the ROC curve shows the relation
between sensitivity/recall (TPR) and 1-specificity (FPR) while the PR
curve shows the relationship between recall and precision. Thus, in
the PR curve, the x-axis is the recall and the y-axis is the precision,
i.e., the x-axis of ROC curve is the y-axis of PR curve [8]. Hence, in
the PR curve, there is no need for the TN value.
In the PR curve, the precision value for the first point is unde-
fined because the number of positive predictions is zero, i.e.,
TP ¼ 0 and FP ¼ 0. This problem can be solved by estimating the
first point in the PR curve from the second point. There are two
cases for estimating the first point depending on the value of TP
of the second point.
Fig. 9. An illustrative example of the AUC metric.

1. The number of true positives of the second point is zero: In this


Algorithm 1. In other words, instead of generating ROC points in
case, since the second point is (0,0), the first point is also (0,0).
Algorithm 1, Algorithm 2 adds areas of trapezoids5 of the ROC curve
2. The number of true positives of the second point is not zero:
[4]. As shown in Algorithm 2, the AUC score can be calculated by
this is similar to our example where the second point is (0.1,
adding the areas of trapezoids of the AUC measure. Fig. 9 shows an
1.0). The first point can be estimated by drawing a horizontal
example of one trapezoid; the base of this trapezoid is
line from the second point to the y-axis. Thus, the first point
ðFPR2  FPR1 Þ, and the height of the trapezoid is ðTPR1 þ TPR2 Þ=2;
is estimated as (0.0, 1.0).
hence, the total area of this trapezoid is calculated as follows,
A ¼ Base  Height ¼ ðFPR2  FPR1 Þ  ðTPR1 þ TPR2 Þ=2.
As shown in Fig. 10, the PR curve is often zigzag curve; hence,
PR curves tend to cross each other much more frequently than
Algorithm 2: Calculating the AUC measure.
ROC curves. In the PR curve, a curve above the other has a better
1: The same first two steps in Algorithm 1. classification performance. The perfect classification performance
2: FP 0, TP 0, f prev 1, FP prev 0, TPprev 0, and in the PR curve is represented in Fig. 10 by a green curve. As shown,
A 0, where A is the area under the ROC curve, i.e., AUC this curve starts from the (0,1) horizontally to (1,1) and then verti-
score. 3: for i ¼ 1 to jSsorted j do cally to (1,0), where (0,1) represents a classifier that achieves 100%
4: if f ðiÞ – f prev then precision and 0% recall, (1,1) represents a classifier that obtains
5: A A þ Trapezoid AreaðFP; FP prev ; TP; TPprev Þ. 100% precision and sensitivity and this is the ideal point in the
6: f prev f ðiÞ, FP prev FP, TPprev TP PR curve, and (1,0) indicates the classifier obtains 100% sensitivity
and 0% precision. Hence, we can say that the closer the PR curve is
7: end if
to the upper right corner, the better the classification performance
8: if Ssorted ðiÞ is a positive sample then
is. Since the PR curve depends only on the precision and recall
9: TP TP þ 1.
measures, it ignores the performance of correctly handling nega-
10: else
tive examples (TN) [16].
11: FP FP þ 1.
Eq. (18) indicates the nonlinear interpolation of the PR curve
12: end if
that was introduced by Davis and Goadrich [5].
13: end for
14: A ðA þ Trapezoid AreaðFP; FP prev ; TP; TPprev ÞÞ=ðP  NÞ.
TPA þ x
15: function Trapezoid AreaðX 1 ; X 2 ; Y 1 ; Y 2 Þ y¼ FP B FPA
ð18Þ
16: Base ! jX 1  X 2 j, Height ! ðY 1 þ Y 2 Þ=2 TP A þ x þ FPA þ TP B TP A
:x
17: return Base  Height.
where TP A and TPB represent the true positives of the first and sec-
ond points, respectively, FPA and FPB represent the false positives of
The AUC can be also calculated under the PR curve using the the first and second points, respectively, y is the precision of the
trapezoidal rule as in the ROC curve, and the AUC score of the per- new point, and x is the recall of the new point. The value of x can
fect classifier in PR curves is one as in ROC curves. be any value between zero and jTPB  TPA j. A smooth curve can be
In multi-class classification problems, Provost and Domingos obtained by calculating many intermediate points between two
calculated the total AUC of all classes by generating a ROC curve points A and B. In our example in Fig. 10, assume the first point is
for each class and calculate the AUC value for each ROC curve [10]. the fifth point and the second point is the sixth point (see Table 1).
The total AUC (AUC total ) is the summation of all AUC scores weighted From Table 1, the point A is (0.3,0.75) and the point B is (0.4,0.8).
by the prior probability of each class as follows, AUC total ¼ The value of jTP B  TP A j ¼ j4  3j ¼ 1 and hence the value of x can
P be any value between zero and one. Let x ¼ 0:5, which is the middle
ci 2C AUCðci Þ:pðci Þ, where AUCðci Þ is the AUC under the ROC curve
point between A and B and hence the recall for the new point is
of the class ci ; C is a set of classes, and pðci Þ is the prior probability 0:3þ0:4
2
¼ 0:35. The precision of the new point is calculated as follows,
5
y ¼ 3þxþ1þ
3þx
11x ¼ 3þ0:5þ1þ0  0:778, where the new point using the
3þ0:5
A trapezoid is a 4-sided shape with two parallel sides. 43

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 11

the ratio of positives and negatives defines the baseline. Hence,


changing the ratio between the positive and negative classes
changes that line and hence changes the classification performance.
As indicated in Eq. (6), according to the precision metric, lower-
ing the threshold value increases the TP or FP. Increasing TP
increases the precision while increasing the FP decreases the preci-
sion. Hence, lowering the threshold value fluctuates the precision.
On the other hand, as indicated in Eq. (2), lowering the threshold
may leave the recall value unchanged or increase it. Due to the pre-
cision axis in the PR curve; hence, the PR curve is sensitive to the
imbalanced data. In other words, the PR curves and their AUC val-
ues are different between balanced and imbalanced data.

6. Biometrics measures

Biometrics matching is slightly different than the other classifi-


cation problems and hence it is sometimes called two-instance

Fig. 10. An illustrative example of the PR curve. The values of precision and recall of
each point/threshold are calculated in Table 1.

linear interpolation is (0:3þ0:4


2
; 0:75þ0:8
2
Þ ¼ ð0:35; 0:775). In our exam-
ple, for simplicity, we used the linear interpolation.
The end point in the PR curve is calculated as follows, ð1; PþN P
Þ.
This is because (1) the recall increases by increasing the threshold
value and at the end point the recall reaches to the maximum
recall, (2) increasing the threshold value increases both TP and
FP. Therefore, if the data are balanced, the precision of the end
point is PþNP
¼ 12. The horizontal line which passes through PþN P

represents a classifier with the random performance level. This line


separates the area of the PR curve into (1) the area above the
line and this is the area of good performance and (2) the area below Fig. 12. An illustrative example of the DET curve. The values of FRR and FAR of each
the line and this is the area of poor performance (see Fig. 10). Thus, point/threshold are calculated in Table 1.

Fig. 11. Illustrative example to test the influence of changing the threshold value on the values of FAR; FRR, and EER.

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
12 A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx

Fig. 13. Results of our experiment. (a) ROC curve, (b) Precision-Recall curve.

problem. In this problem, instead of classifying one sample into


one of c groups or classes, biometric determines if the two samples
are in the same group. This can be achieved by identifying an
unknown sample by matching it with all the other known samples.
Fig. 14. Confusion matrices of the three classes in our experiments.
This step generates a score or similarity distance between the
unknown sample and the other samples. The model assigns the
unknown sample to the person which has the most similar score.
between FAR and FRR. Fig. 12 shows an example of the DET curve.
If this level of similarity is not reached, the sample is rejected. In
As shown, as in the ROC curve, the DET curve is plotted by chang-
other words, if the similarity score exceeds a pre-defined thresh-
ing the threshold on the confidence score; thus, each threshold
old; hence, the corresponding sample is said to be matched; other-
generates only one point in the DET curve. The ideal point in this
wise, the sample is not matched. Theoretically, scores of clients
curve is the origin point where the values of both FRR and FAR
(persons known by the biometric system) should always be higher
are zeros and hence the perfect classification performance in the
than the scores of imposters (persons who are not known by the
DET curve is represented in Fig. 12 by a green curve. As shown, this
system). In biometric systems, a single threshold separates the
curve starts from the point (0,1) vertically to (0,0) and then hori-
two groups of scores; thus, it can be utilized for differentiating
zontally to (1,0), where (1) the point (0,1) represents a classifier
between clients and imposters. In real applications, for many rea-
that achieves 100% FAR and 0% FRR, (2) the point (0,0) represents
sons sometimes imposter samples generate scores higher than the
a classifier that obtains 0% FAR and FRR, and (3) the point (1,0) rep-
scores of some client samples. Accordingly, it is a fact that however
resents a classifier that indicates 0% FAR and 100% FRR. Thus, we
the classification threshold is perfectly chosen, some classification
can say that the closer a DET curve is to the lower left corner,
errors occur. For example, given a high threshold; hence, the
the better the classification performance is.
imposters’ scores will not exceed this limit. As a result, no impos-
ters are incorrectly accepted by the model. On the contrary, some
clients are falsely rejected (see Fig. 11 (top panel)). In opposition 7. Experimental results
to this, lowering the threshold value accepts all clients and also
some imposters are falsely accepted. In this section, an experiment was conducted to evaluate the
Two of the most commonly used measures in biometrics are the classification performance using different assessment methods. In
False acceptance rate (FAR) and False rejection/recognition rate (FRR). this experiment, we used Iris dataset which is one of the standard
The FAR is also called false match rate (FMR) and it is the ratio classification datasets and it is obtained from the University of Cal-
between the number of false acceptance to the total number of ifornia at Irvin (UCI) Machine Learning Repository [1]. This dataset
imposters attempts. Hence, it measures the likelihood that the bio- has three classes, each class has 50 samples, and each sample is
metric model will incorrectly accept an access by an imposter or an represented by four features. We used (1) the Principal component
unauthorized user. Hence, to prevent imposter samples from being analysis (PCA) [23] for reducing the features to two features and
easily correctly identified by the model, the similarity score has to (2) Support vector machine (SVM)6 for classification.
exceed a certain level (see Fig. 11) [2]. The FRR or false non-match In our experiment, we used different assessment methods for
rate (F NMR) measures the likelihood that the biometric model will evaluating the learning model. Fig. 13 shows the ROC and
incorrectly reject a client, and it represents the ratio between the Precision-Recall curves. As shown, there are three curves, one
number of false recognitions to the total number of clients’ curve for each class and as shown, the first class obtained results
attempts [2]. For example, if FAR ¼ 10% this means that for one better than the other two classes. Fig. 14 shows the confusion
hundred attempts to access the system by imposters, only ten will matrix for each class. From these confusion matrices we can calcu-
be succeeded and hence increasing FAR decreases the accuracy of late different metrics as mentioned before (see Fig. 3). For example,
the model. On the other hand, with FRR ¼ 10%, ten authorized per- the results of the first class were as follows, Acc; TPR; TNR; PPV, and
sons will be rejected from 100 attempts and hence reducing FRR NPV were 99.33, 100, 98.0, 99.01, 100, respectively. Similarly, the
will help to avoid a high number of trails of authorized clients. results of the other two classes can be calculated.
As a consequence, FAR and FRR in biometrics are similar to false
positive rate (FPR) and false negative rate (FNR), respectively (see
Section 2.4). Equal error rate (EER) measure solves the problem 8. Conclusions
of selecting a threshold value partially, and it represents the failure
rate when the values of FMR and F NMR are equal. Fig. 11 shows the In this paper, the definition, mathematics, and visualizations of
FAR and FRR curves and also the EER measure. the most well-known classification assessment methods were pre-
Detection Error Trade-off (DET) curve is used for evaluating bio- sented and explained. The paper aimed to give a detailed overview
metric models. In this curve, as in the ROC and PR curves, the of the classification assessment measures. Moreover, based on the
threshold value is changed and the values of FAR and FRR are cal-
culated at each threshold. Hence, this curve shows the relation 6
More details about SVM can be found in [24].

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003
A. Tharwat / Applied Computing and Informatics xxx (2018) xxx–xxx 13

confusion matrix, different measures are introduced with detailed [11] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowledge Data
Eng. 21 (9) (2009) 1263–1284.
explanations. The relations between these measures and the
[12] V. López, A. Fernández, S. Garća, V. Palade, F. Herrera, An insight into
robustness of each of them against imbalanced data are also intro- classification with imbalanced data: empirical results and current trends on
duced. Additionally, an illustrative numerical example was used using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
for explaining how to calculate different classification measures [13] A. Maratea, A. Petrosino, M. Manzo, Adjusted f-measure and kernel scaling for
imbalanced data learning, Inf. Sci. 257 (2014) 331–341.
with binary and multi-class problems and also to show the robust- [14] B.W. Matthews, Comparison of the predicted and observed secondary
ness of different measures against the imbalanced data. Graphical structure of t4 phage lysozyme, Biochim. Biophys. Acta 405 (2) (1975) 442–
measures such as ROC, PR, and DET curves are also presented with 451.
[15] C.E. Metz, Basic principles of roc analysis, in: Seminars in nuclear medicine,
illustrative examples and visualizations. Finally, various classifica- vol. 8, Elsevier, 1978, pp. 283–298.
tion measures for evaluating biometric models are also presented. [16] D.M. Powers, Evaluation: from precision, recall and f-measure to roc,
informedness, markedness and correlation 2 (1) (2011) 37–63.
[17] T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the
References roc plot when evaluating binary classifiers on imbalanced datasets, PLoS One
10 (3) (2015) e0118432.
[1] C. Blake, Uci repository of machine learning databases, 1998. http://www. ics. [18] A. Shaffi, Measures derived from a 2 x 2 table for an accuracy of a diagnostic
uci. edu/ mlearn/MLRepository. html. test, J. Biometr. Biostat. 2 (2011) 1–4.
[2] R.M. Bolle, J.H. Connell, S. Pankanti, N.K. Ratha, A.W. Senior, Guide to [19] S. Shaikh, Measures derived from a 2 x 2 table for an accuracy of a diagnostic
biometrics, Springer Science & Business Media, 2013. test, J. Biometr. Biostat. 2 (2011) 128.
[3] S. Boughorbel, F. Jarray, M. El-Anbari, Optimal classifier for imbalanced data [20] M. Sokolova, N. Japkowicz, S. Szpakowicz, Beyond accuracy, f-score and roc: a
using matthews correlation coefficient metric, PLoS One 12 (6) (2017) family of discriminant measures for performance evaluation, in: Australasian
e0177678. Joint Conference on Artificial Intelligence, Springer, 2006, pp. 1015–1021.
[4] A.P. Bradley, The use of the area under the roc curve in the evaluation of [21] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for
machine learning algorithms, Pattern Recogn. 30 (7) (1997) 1145–1159. classification tasks, Inf. Process. Manage. 45 (4) (2009) 427–437.
[5] J. Davis, M. Goadrich, The relationship between precision-recall and roc curves, [22] A. Srinivasan, Note on the location of optimal classifiers in n-dimensional roc
in: Proceedings of the 23rd International Conference on Machine Learning, space. Technical Report PRG-TR-2-99, Oxford University Computing
ACM, 2006, pp. 233–240. Laboratory, Oxford, England, 1999.
[6] J.J. Deeks, D.G. Altman, Diagnostic tests 4: likelihood ratios, Brit. Med. J. 329 [23] A. Tharwat, Principal component analysis-a tutorial, Int. J. Appl. Pattern
(7458) (2004) 168–169. Recogn. 3 (3) (2016) 197–240.
[7] R.O. Duda, P.E. Hart, D.G. Stork, et al., Pattern Classification, vol. 2, Wiley, New [24] A. Tharwat, A.E. Hassanien, Chaotic antlion algorithm for parameter
York, 2001, second ed. optimization of support vector machine, Appl. Intelligence 48 (3) (2018)
[8] T. Fawcett, An introduction to roc analysis, Pattern Recogn. Lett. 27 (8) (2006) 670–686.
861–874. [25] A. Tharwat, Y.S. Moemen, A.E. Hassanien, Classification of toxicity effects of
[9] V. Garcia, R.A. Mollineda, J.S. Sanchez, Theoretical analysis of a performance biotransformed hepatic drugs using whale optimized support vector
measure for imbalanced data, in: 20th International Conference on Pattern machines, J. Biomed. Inf. 68 (2017) 132–149.
Recognition (ICPR), IEEE, 2010, pp. 617–620. [26] K.H. Zou, Receiver operating characteristic (roc) literature research, 2002. On-
[10] D.J. Hand, R.J. Till, A simple generalisation of the area under the roc curve for line bibliography available from: http://splweb.bwh.harvard.edu 8000.
multiple class classification problems, Mach. Learn. 45 (2) (2001) 171–186.

Please cite this article in press as: A. Tharwat, Applied Computing and Informatics (2018), https://doi.org/10.1016/j.aci.2018.08.003

You might also like