
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 11, NOVEMBER 2020

A Framework for Supervised Classification Performance Analysis with Information-Theoretic Methods

Francisco J. Valverde-Albacete, Member, IEEE, and Carmen Peláez-Moreno, Member, IEEE

Abstract—We introduce a framework for the evaluation of multiclass classifiers by exploring their confusion matrices. Instead of using error-counting measures of performance, we concentrate on quantifying the information transfer from true to estimated labels using information-theoretic measures. First, the Entropy Triangle allows us to visualize the balance of mutual information, variation of information, and the deviation from uniformity in the true and estimated label distributions. Next, the Entropy-Modified Accuracy allows us to rank classifiers by performance, while the Normalized Information Transfer rate allows us to evaluate classifiers by the amount of information accrued during learning. Finally, if the question arises of elucidating which errors are systematically committed by the classifier, we use a generalization of Formal Concept Analysis to elicit such knowledge. All such techniques can be applied either to artificially or biologically embodied classifiers—e.g., human performance on perceptual tasks. We instantiate the framework in a number of examples to provide guidelines for the use of these tools in the case of assessing single classifiers or populations of them—whether induced with the same technique or not—either on single tasks or on a set of them. These include well-known UCI tasks and the more complex KDD Cup 99 competition on Intrusion Detection.

Index Terms—Performance evaluation, classification algorithms, information entropy, mutual information, formal concept analysis

1 INTRODUCTION AND MOTIVATION

THE last 60 years of engineering practice have revealed that the applicability of Information Theory [1], [2] is far broader than initially envisaged, and many problems, both theoretical and applied, can be characterized as "relating to the transmission of information", that is, in information-theoretical terms [3], [4]. In particular, there is a strong current to use information-theoretic principles and heuristics in machine learning [5] and statistical inference [6], and several methods for evaluation and analysis based on entropic measures have been recently published [7], [8], [9], [10], [11], [12].

For the latter aim, Confusion Matrices (CM) are a long-standing technique to quantify the performance of supervised multiclass classifiers. CM are data aggregates that measure the performance of any type of classifier—whether embodied or artificial—by means of counting the errors and successes of iterated acts of classification on specific classification datasets. In spite of their pervasiveness in the social and natural sciences—where they are commonly treated as a special kind of contingency table—tools for manipulating and extracting knowledge from these performance-recording data are scarce.

Advances in data science have made evident how important it is for data collections to closely capture the phenomena they sample. But often overlooked issues are: 1) how appropriate collections are for further downstream processing techniques, and 2) how they impact any eventually discovered items of knowledge. A significant example of their relevance is the growing attention to the problem of learning from imbalanced data [13], [14], [15], [16]. Though most of the time the accent is placed on the learning procedure, the importance of the choice of suitable assessment metrics has already been highlighted. For instance, there exist a number of assessment alternatives for binary classification problems (precision and recall, F-measure, G-mean, ROC curves, etc.), but their generalization to multiclass classification often relies on simple, pair-wise combinations of such figures-of-merit that fail to capture multiclass peculiarities [13].

In this paper, we present a framework for multi-class performance analysis based on information-theoretic metrics extracted from confusion matrices. By adopting such a framework the resulting assessment is distribution-agnostic and can therefore deal with imbalanced as well as balanced datasets. Furthermore, we distill the theoretical groundwork for the visualization and assessment of confusion matrices (also based on information-theoretic concepts) developed in [7], [11] into a systematic, exploratory, pointwise mutual information-based technique to elicit the errors in confusion matrices first attempted in [17]. This framework facilitates a number of steps usually present in a data mining process: assessment of a single classifier on a single task, several classifiers on the same task, a single classification technique on different tasks, and several classification techniques on different tasks. Moreover, the complexity of the algorithms involved only depends on the number of classes and it is therefore scalable to big databases.

We first motivate and establish our proposal for the assessment and evaluation of classifiers through their confusion matrices and classification errors (Section 2.1), and then use the rest of the paper to flesh out and show how to use such a proposal. For this purpose, we recall the standard way to interpret and represent a confusion matrix (Section 2.2), then the basics of analyzing a confusion matrix by information-theoretic means, first by visualization (Section 2.3), then by assessment (Section 2.4), and finally by exploratory analysis (Section 2.5). A related-work review with a discussion (Section 2.6) completes the theoretical contribution. This methodology is then applied to different use cases (Section 3) as a ready guide for each different use intended: the assessment of single classifiers on single tasks (Section 3.1), several classifiers on the same task (Section 3.2), a single classifier induction technique on different tasks (Section 3.3), and several classifier induction techniques on different tasks (Section 3.4). We finish with some Conclusions.

The authors are with the Department of Signal Theory and Communications, University Carlos III Madrid, Leganés 28911, Spain. E-mail: {fva, carmen}@tsc.uc3m.es.
Manuscript received 3 Oct. 2016; revised 4 Mar. 2019; accepted 28 Apr. 2019. Date of publication 8 May 2019; date of current version 6 Oct. 2020. (Corresponding author: Carmen Peláez-Moreno.) Recommended for acceptance by Q. He. Digital Object Identifier no. 10.1109/TKDE.2019.2915643.

Fig. 1. Information-Theoretic Model of supervised multiclass classification. $K$ is the true class label, $\hat K$ the estimated class label. $X$ are the observations and $Y$ the transformed observations, when applicable. (a) (adapted from [18]) Channel Model. The enclosed area is a black box for classifier evaluation. (b) (adapted from [11]) Entropy (above) and perplexity (below) decomposition chains for $P_{K\hat K}$. To the left, perplexity reduction due to learning effectiveness; right, perplexity increase in the output chain, related to classifier specialization.

2 AN INFORMATION-THEORETIC FRAMEWORK FOR CLASSIFIER ASSESSMENT

2.1 A Proposal for the Information-Theoretic Assessment of Classifiers on Datasets

Consider the scheme of Fig. 1a of a multi-class classification task cast as a transmission channel that renders it amenable to information-theoretic analysis. There is a set of $S$ independent, identically distributed (iid) realizations of a random vector $X$ of (observed) variables or features paired with as many iid realizations of a class variable $K$. The set of pairs of instances $\{(k_s, x_s)\}_{s=1}^{S}$ will be called a dataset (of samples). The feature instances $X = x_s$ may be further transformed to obtain instances of a random vector $Y$, through a transformation function $f : X \to Y;\ x_s \mapsto y_s = f(x_s)$ with desired characteristics, e.g., statistical independence among the transformed features.

For supervised classification, classifier induction is the subtask of inducing a function $\hat k : Y \to K;\ y_s \mapsto \hat k_s = \hat k(y_s)$ that tries to estimate the original $K = k_s$ but can only obtain the estimate $\hat K = \hat k(y_s)$. In this model, classification is a multi-step process: given an input label $K = k_s$, its co-indexed vector is produced as a proxy $X = x_s$, perhaps the proxy is transformed into $Y = y_s$, and, finally, the classifier obtains an output label $\hat K = \hat k_s$.

In this sense, such a classification model is an analogue of a (random) discrete, memoryless virtual communication channel between input labels $K$ and output labels $\hat K$, and such models of communication are the turf of information theory [19]. Inducing this classifier function can be the object of Information Theory, as in Information-Theoretic Learning [5] or Maximum Entropy Modeling [6], but is not the focus of this paper.

Rather, for assessment purposes, the basic conceptual experiment is: "presenting a true label $k_i$ to the channel to obtain an estimated label $\hat k_j$ from it," that is, generating data $(K = k_i, \hat K = \hat k_j)$ from a joint distribution $P_{K\hat K}(i,j)$. We suppose that there are $k$ true labels and $k'$ possible estimated labels, and—though not strictly necessary—that $k = k'$ for this application.

The properties of $P_{K\hat K}$ can be studied with the methods of signal detection theory, multidimensional scaling or cluster analysis, among others [20]. In this paper we focus on the methods of Information Theory [19]. For instance, it was soon recognized that the joint distribution $P_{K\hat K}(i,j)$ needs a correction to account for random confusions [21]. The original proposal was to compare it to the product of the marginals, effectively comparing the distribution to that obtained supposing its marginals $P_K(i)$ and $P_{\hat K}(j)$ were independent. This is exactly the definition of the pointwise mutual information between $K$ and $\hat K$ [22, Section 2.3]

$$MI_{P_{K\hat K}}(i,j) = \log \frac{P_{K\hat K}(i,j)}{P_K(i) \cdot P_{\hat K}(j)}. \tag{1}$$

Its expected value $MI_{P_{K\hat K}} = \sum_{i,j} P_{K\hat K}(i,j) \cdot MI_{P_{K\hat K}}(i,j)$ is the mutual information between $K$ and $\hat K$ and has sometimes been used to quantify the amount of transinformation in classification experiments [23], [24], using a simple evaluation heuristic: the more information transmitted from $K$ to $\hat K$, the better the classifier.
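To make the heuristic concrete, the following minimal sketch computes the pointwise mutual information of Eq. (1) and its expectation from a given joint distribution. It is written in Python assuming numpy and base-2 logarithms (so all quantities are in bits); the function names are our own for illustration and do not come from any of the packages listed in Section 3.5.

```python
import numpy as np

def pointwise_mi(P):
    """Pointwise mutual information matrix of Eq. (1), in bits.

    P is a k x k' joint distribution over (true, estimated) labels.
    Cells with P(i, j) == 0 come out as -inf, their limiting value.
    """
    P = np.asarray(P, dtype=float)
    p_k = P.sum(axis=1, keepdims=True)     # marginal of true labels, P_K
    p_khat = P.sum(axis=0, keepdims=True)  # marginal of estimated labels
    with np.errstate(divide="ignore"):
        return np.log2(P) - np.log2(p_k) - np.log2(p_khat)

def mutual_information(P):
    """Expected pointwise MI: the mutual information MI_P, in bits."""
    P = np.asarray(P, dtype=float)
    pmi = pointwise_mi(P)
    mask = P > 0                           # 0 * log 0 is taken as 0
    return float(np.sum(P[mask] * pmi[mask]))
```

As a quick check, a perfect classifier on a balanced three-class task, P = np.eye(3) / 3, yields mutual_information(P) = log2(3) ≈ 1.585 bits, the maximum the task allows.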

In this paper we propound a framework for the evaluation of classifiers based on the exploration of the transmission of information along the steps of the virtual chain depicted in Fig. 1b. As in all good exploratory practices, a number of tools should be brought to bear to sustain the conclusions, so we suggest a many-pronged approach to classifier assessment:

(1) Estimating $P_{K\hat K}(i,j)$ and $MI_{P_{K\hat K}}(i,j)$ by means of iterated experiments (Section 2.2).
(2) Visualizing an information-theoretic characterization of $P_{K\hat K}(i,j)$ (Section 2.3).
(3) Characterizing and ranking classifiers using $MI_{P_{K\hat K}}$ (Section 2.4).
(4) Exploring an estimate of the pointwise mutual information $MI_{P_{K\hat K}}(k_i, \hat k_j)$ to analyze classifier errors (Section 2.5).

We believe this combination of exploratory assessment methods can provide a more complete diagnosis of classifiers and datasets, especially when different combinations of them are used. Next we explain the concepts and tools underlying it.

2.2 Estimating $P_{K\hat K}$ and $MI_{K\hat K}$ for Classifiers

To estimate $P_{K\hat K}$, the results of the $S$ iid realizations are aggregated into a CM, $N_{K\hat K} \in \mathbb{N}^{k \times k'}$, where $N_{K\hat K}(i,j) = \sum_{s=1}^{S} \delta(k_i^s, \hat k_j^s)$—with $\delta$ the Kronecker delta over true and estimated labels. A confusion matrix is, therefore, a particular kind of contingency table for a classification process [25, p. 51], the contingency being that $\hat k_j$ was returned as a response to $k_i$.

For a count matrix $N_{K\hat K}$ call $n_{ij} = N_{K\hat K}(i,j)$, $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$. Note that in our setting (see footnote 1), $S = \sum_{ij} n_{ij}$. With these data, the empirical estimate or maximum-likelihood estimator of the joint probability distribution between inputs and outputs is $\hat P_{K\hat K}(i,j) = \frac{n_{ij}}{S}$, and its marginals are $\hat P_K(i) = \sum_j \frac{n_{ij}}{S}$ and $\hat P_{\hat K}(j) = \sum_i \frac{n_{ij}}{S}$.

In turn these estimates are plugged into (1) to give an estimate of the empirical pointwise mutual information $\widehat{MI}_{P_{K\hat K}}(i,j)$ and its expectation $\widehat{MI}_{P_{K\hat K}}$, the empirical mutual information.

Footnote 1: In the context of machine learning, it is customary to have $m$ as the number of feature vectors to be classified and $n$ as their dimension. In a statistical context $n$ is often taken to mean a count frequency.

2.3 Visualizing Classifiers with Entropy Triangles

For the joint probability distribution $P_{K\hat K}$ of any two random variables $K$ and $\hat K$, Valverde-Albacete et al. [7] gave a new Shannon-type entropy decomposition (see footnote 2)

$$H_{U_K U_{\hat K}} = \Delta H_{P_K P_{\hat K}} + 2\,MI_{P_{K\hat K}} + VI_{P_{K\hat K}}, \tag{2}$$

where $H_{P_K} = -\sum_i P_K(i) \log P_K(i)$ is the entropy of the distribution $P_K(i)$, $U_K$ (respectively, $U_{\hat K}$) is a uniform distribution with the same support as $P_K$ (respectively, as $P_{\hat K}$) and $U_K \cdot U_{\hat K}$ is their product, so their entropies are

$$H_{U_K U_{\hat K}} = H_{U_K} + H_{U_{\hat K}}; \tag{3}$$

also, $\Delta H_{P_K} = H_{U_K} - H_{P_K}$ and $\Delta H_{P_{\hat K}} = H_{U_{\hat K}} - H_{P_{\hat K}}$ are the divergences of $P_K$ and $P_{\hat K}$ with respect to uniformity, so

$$\Delta H_{P_K P_{\hat K}} = \Delta H_{P_K} + \Delta H_{P_{\hat K}}; \tag{4}$$

the mutual information between the random variables is $MI_{P_{K\hat K}}$, and the variation of information is

$$VI_{P_{K\hat K}} = H_{P_{K|\hat K}} + H_{P_{\hat K|K}}. \tag{5}$$

Equation (2) can be further normalized by $H_{U_K U_{\hat K}}$ as

$$1 = \Delta H'_{P_K P_{\hat K}} + 2\,MI'_{P_{K\hat K}} + VI'_{P_{K\hat K}}, \tag{6}$$

meaning that entropy balances are compositional data [26] that can be represented in a de Finetti or ternary diagram as the equation of the 2-simplex in normalized space $[\Delta H'_{P_K P_{\hat K}},\ 2MI'_{P_{K\hat K}},\ VI'_{P_{K\hat K}}]$, hence the name entropy triangle (ET). See [7] for further information on this construction.

Footnote 2: We have modified the notation to adapt it to the general framework introduced in this paper: $X$ in the original formulation takes the role of the true label $K$ while $Y$ is the estimated label $\hat K$.

In our framework, the performance of a particular classifier on a particular dataset characterized by $P_{K\hat K}$ shows as a point in an entropy triangle whose entropic components can be read off the axes (see Supplemental Materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2019.2915643, for indications about how to read an ET). Furthermore, several different classifiers on the same task are visually comparable in terms of the information they transmit from $K$ to $\hat K$. Due to normalization, this is even true for the same classifier on different datasets. Indeed, it is also true for comparing different classifiers on different tasks. In each of these instances, though, the interpretation becomes more nuanced and subject to provisos: possibly different cardinalities of the datasets, underlying technologies and parameters for the classifiers, etc. This makes it at the same time easier to carry out cross-comparisons of classifiers on actual datasets, and more difficult to interpret the results.

In this paper, we introduce a new result that enhances the interpretation of the triangle: every dataset has an inherent theoretical limit for performance. To define it we need to start from the so-called split entropy triangle [7]: due to (3), (4) and (5) the balance Equation (2) can be split into two equations, each related to one of the marginals,

$$H_{U_K} = \Delta H_{P_K} + MI_{P_{K\hat K}} + H_{P_{K|\hat K}} \qquad H_{U_{\hat K}} = \Delta H_{P_{\hat K}} + MI_{P_{K\hat K}} + H_{P_{\hat K|K}}, \tag{7}$$

which may be normalized by each of $H_{U_K}$ and $H_{U_{\hat K}}$,

$$1 = \Delta H'_{P_K} + MI'_{P_{K\hat K}} + H'_{P_{K|\hat K}} \qquad 1 = \Delta H''_{P_{\hat K}} + MI''_{P_{K\hat K}} + H''_{P_{\hat K|K}},$$

so that they can be represented in the same entropy triangle. An explanation of how to interpret this for confusion matrices of $k = k'$ classes can be found in [7]: the points $[\Delta H'_{P_K},\ MI'_{P_{K\hat K}},\ H'_{P_{K|\hat K}}]$ and $[\Delta H''_{P_{\hat K}},\ MI''_{P_{K\hat K}},\ H''_{P_{\hat K|K}}]$ lie on a horizontal segment whose middle point is $[\Delta H'_{P_K P_{\hat K}},\ 2MI'_{P_{K\hat K}},\ VI'_{P_{K\hat K}}]$.
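Continuing the Python sketch above, the normalized coordinates of Eq. (6) can be computed directly from a count matrix through the maximum-likelihood estimate of Section 2.2. The names are again ours; this is only a sketch of the construction, not the reference implementation of [7]:

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as an array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                       # 0 * log 0 taken as 0
    return float(-np.sum(p * np.log2(p)))

def et_coordinates(N):
    """Normalized ET coordinates [DH', 2MI', VI'] of Eq. (6).

    N is a k x k' count (confusion) matrix; the joint distribution is
    estimated as P = N / S, with S the total number of classifications.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    p_k, p_khat = P.sum(axis=1), P.sum(axis=0)
    h_uk, h_ukhat = np.log2(P.shape[0]), np.log2(P.shape[1])  # uniform entropies
    h_k, h_khat, h_joint = entropy(p_k), entropy(p_khat), entropy(P)
    mi = h_k + h_khat - h_joint                  # MI of Eq. (1) in expectation
    vi = (h_joint - h_khat) + (h_joint - h_k)    # VI of Eq. (5): H(K|Khat) + H(Khat|K)
    dh = (h_uk - h_k) + (h_ukhat - h_khat)       # Delta H of Eq. (4)
    return np.array([dh, 2 * mi, vi]) / (h_uk + h_ukhat)  # normalizer of Eq. (3)
```

By construction the three coordinates are nonnegative and sum to one, so each confusion matrix maps to a point of the 2-simplex; the split coordinates of Eq. (7) follow analogously by normalizing [h_uk - h_k, mi, h_joint - h_khat] by h_uk alone.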
Notice that for a particular dataset, $\Delta H_{P_K}$ is a constant across classifiers. In the context of this paper, we find it useful to represent it as a line from $[\Delta H'_{P_K},\ 0,\ 1 - \Delta H'_{P_K}]$ on the bottom side of the ET to $[\Delta H'_{P_K},\ 1 - \Delta H'_{P_K},\ 0]$ on the right side of the triangle. This line constrains the amount of mutual information transferred by the classifier to this last quantity, and establishes a limit on the learning capability of any classifier on that task as represented by the mutual information.

2.4 Ranking Classifiers with the Entropy-Modified Accuracy and the Normalized Information Transfer

Simple figures-of-merit are sometimes preferable to complex diagrams. Given the aforementioned unsuitability of classical metrics—such as accuracy or error rate—especially on imbalanced datasets [13], we next obtain figures-of-merit particularly meaningful for our purposes, as introduced in [11] (see footnote 3). We may write the split entropy in (7) in multiplicative form

$$2^{H_{U_K}} = 2^{\Delta H_{P_K}} \cdot 2^{MI_{P_{K\hat K}}} \cdot 2^{H_{P_{K|\hat K}}} \tag{8}$$

$$2^{H_{U_{\hat K}}} = 2^{\Delta H_{P_{\hat K}}} \cdot 2^{MI_{P_{K\hat K}}} \cdot 2^{H_{P_{\hat K|K}}}. \tag{9}$$

Footnote 3: As in Section 2.3, we have modified the notation to adapt it to the general framework of this paper.

Recall that, in language modeling [27], the perplexity of a predictive distribution $P_{X|Z}$—where $Z$ is the language-modeling context—is the apparent number of options $k_{X|Z}$ available for prediction in case they were equally likely, $k_{P_{X|Z}} = 2^{H_{P_{X|Z}}}$, where $H_{P_{X|Z}}$ is the conditional entropy of the prediction. Now for the uniform input distribution $U_K$ entropy is maximal and therefore $k = 2^{H_{U_K}}$, and likewise $k' = 2^{H_{U_{\hat K}}}$, for datasets with $k$ true labels and (possibly different) $k'$ estimated labels.

Since classification acts are iid (see Section 2), the previous paragraph suggests that (8) is the decomposition of the theoretical perplexity $k$ of a classifier. To arrive at this, first call $d_K = 2^{\Delta H_{P_K}}$ and $d_{\hat K} = 2^{\Delta H_{P_{\hat K}}}$, to quantify the decrease in perplexity due to the non-uniformity of the marginals. Then, the quantity $\mu_{K\hat K} = 2^{MI_{P_{K\hat K}}}$ has the interpretation of an information transfer rate, and since $H_{P_{K|\hat K}}$ has the interpretation of a remanent entropy—that is, information not transferred to the output—then $k_{K|\hat K} = 2^{H_{P_{K|\hat K}}}$ can be interpreted as a remanent perplexity, the actual number of equally probable label alternatives for the classifier after the learning process. Note that the perplexity and the remanent perplexity can be dually defined for the output distribution,

$$k = d_K \cdot \mu_{K\hat K} \cdot k_{K|\hat K} \qquad k' = d_{\hat K} \cdot \mu_{K\hat K} \cdot k_{\hat K|K}, \tag{10}$$

whence Fig. 1b represents both the transmission of entropy from the input distribution of true labels to the output distribution of estimated labels and the perplexity reduction throughout this process. From (10) we get $\frac{1}{k} = \frac{1}{d_K \cdot \mu_{K\hat K} \cdot k_{K|\hat K}}$, which suggests the definition of the Entropy-Modified Accuracy (EMA),

$$a'(P_{K\hat K}) = \frac{1}{k_{K|\hat K}} = \frac{1}{2^{H_{P_{K|\hat K}}}}, \qquad \frac{1}{k} \le a'(P_{K\hat K}) \le 1, \tag{11}$$

and the Normalized Information Transfer (NIT) factor, or rate,

$$q(P_{K\hat K}) = \frac{\mu_{K\hat K}}{k} = \frac{2^{MI_{P_{K\hat K}}}}{k}, \qquad \frac{1}{k} \le q(P_{K\hat K}) \le a'(P_{K\hat K}) \le 1, \tag{12}$$

whence we get the EMA-NIT equation

$$a'(P_{K\hat K}) = d_K \cdot q(P_{K\hat K}). \tag{13}$$

The interpretation of these quantities is clear:

- The EMA $a'(P_{K\hat K})$ is the expected proportion of times the classifier will guess the output class correctly.
- The NIT factor $q(P_{K\hat K})$ is the proportion of available information transferred from input to output.
- The $d_K$ is a factor that constrains how much information is available for learning.

As such, the EMA can be used to provide a ranking to compare classifiers by their performance on a task, whereas the NIT factor provides an estimate of how efficient the classifier induction process was. Likewise, when the task is balanced across classes, $P_K = U_K$, then $\Delta H_{P_K} = 0$, whence $d_K = 1$, so $a'(P_{K\hat K}) = q(P_{K\hat K})$ and all the (entropy-modulated) accuracy of the classifier comes from the learning process.

For the same reasons, the perplexity of the task is $k_K = 2^{H_{P_K}} = \mu_{K\hat K} \cdot k_{K|\hat K}$, the apparent number of equiprobable classes in the task as embodied in the input distribution, an important quantity to assess the potential performance of a classifier [11], since

$$k = d_K \cdot k_K \qquad k' = d_{\hat K} \cdot k_{\hat K}.$$

Further insights into EMA, the NIT factor and perplexity are described in [11].
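In code, these figures-of-merit reduce to a few exponentials of the entropies already computed. The following sketch reuses entropy() from the previous block; the names are again ours, chosen for illustration:

```python
import numpy as np
# entropy() as defined in the Section 2.3 sketch above

def ema_nit(N):
    """Perplexities, EMA (Eq. 11) and NIT factor (Eq. 12) from counts N."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    p_k, p_khat = P.sum(axis=1), P.sum(axis=0)
    h_k, h_khat, h_joint = entropy(p_k), entropy(p_khat), entropy(P)
    mi = h_k + h_khat - h_joint
    k = P.shape[0]                        # number of true labels
    k_rem = 2.0 ** (h_joint - h_khat)     # remanent perplexity k_{K|Khat}
    return {
        "k_K": 2.0 ** h_k,                # task perplexity
        "k_rem": k_rem,
        "mu": 2.0 ** mi,                  # information transfer rate
        "d_K": 2.0 ** (np.log2(k) - h_k),
        "EMA": 1.0 / k_rem,               # Eq. (11)
        "NIT": 2.0 ** mi / k,             # Eq. (12)
    }
```

As a sanity check, Eq. (13) holds identically: EMA = d_K * NIT, since 2^(log2 k - H_K) * 2^MI / k = 2^(MI - H_K) = 1 / k_rem. A majority-class classifier transmits mu = 1 and hence gets the worst NIT factor, 1/k, as observed for rules.ZeroR in Table 2 below.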

2.5 Exploring Errors with Generalized Formal Concept Analysis

Questions often addressed about confusion matrices are:

- Is there a sensible way to cluster classification events? Are there any subclusters or a hierarchical organization of such clusters?
- Can we reorder the columns and rows so that these clusters and relations are made more evident?
- Are there any row- or column-transformations that provide insights into the underlying confusions?

We contend that Formal Concept Analysis (FCA) [28] and, specifically, a generalization of it for non-binary matrices called K-Formal Concept Analysis (KFCA) (see footnote 4) [29], [30] can help answer these questions.

Footnote 4: K stands for the underlying algebra employed by the algorithm and should not be confused with the random variable of Section 2.1. Since in this paper it will only appear attached to the name FCA, we have decided to keep it consistent with the literature on the matter.

In FCA, $I \in 2^{k \times k'}$ is interpreted as a binary confusion matrix between a set of real classes $K$ (i.e., the domain of the possible outcomes of the random variable $K$) and a set of predicted or recognised classes $\hat K$ (i.e., the domain of the possible outcomes of the random variable $\hat K$), where $I(i,j) = 1$ (or likewise $I^t(j,i) = 1$) can be interpreted in many different ways:

- (input) class $k_i$ is recognized as (output) class $\hat k_j$, or class $k_i$ is classified as $\hat k_j$, or class $k_i$ is confused with $\hat k_j$, or class $\hat k_j$ is predicted of $k_i$,
- class $k_i$ is read as $\hat k_j$, class $\hat k_j$ is read for $k_i$,
- class $k_i$ is substituted by class $\hat k_j$, class $\hat k_j$ is substituted for class $k_i$.

The triple $\mathbb{K} = (K, \hat K, I)$ is called a formal context and summarizes all the available information about $K$, $\hat K$ and $I$. In Formal Concept Analysis one specially studies two mappings between the powersets of true and estimated classes, called the polars: given a set of true classes $A \subseteq K$, the estimated class polar $A^{\uparrow_I} = \{\hat k_j \in \hat K \mid I(i,j) = 1,\ \forall k_i \in A\}$ is the set of estimated classes that the classifier predicted from $A$, and the true class polar $B^{\downarrow_I} = \{k_i \in K \mid I(i,j) = 1,\ \forall \hat k_j \in B\}$ is the set of true classes that are confused with a given set of estimated classes $B \subseteq \hat K$. Specifically,

$$\hat k_j \in A^{\uparrow_I} \iff \hat k_j \text{ is predicted of every } k_i \in A$$
$$k_i \in B^{\downarrow_I} \iff k_i \text{ is confused with every } \hat k_j \in B.$$

Pairs of sets of objects $A \in 2^{K}$ and attributes $B \in 2^{\hat K}$ that map to each other, $A^{\uparrow_I} = B$ and $B^{\downarrow_I} = A$, are called formal concepts. For a concept $(A, B)$, the set of objects $A$ is called its extent while the set of attributes $B$ is its intent, and the set of formal concepts is written $\mathfrak{B}(K, \hat K, I)$:

$$(A, B) \in \mathfrak{B}(K, \hat K, I) \iff A^{\uparrow_I} = B \iff B^{\downarrow_I} = A.$$

Formal concepts $(A_1, B_1), (A_2, B_2) \in \mathfrak{B}(K, \hat K, I)$ are partially ordered by the inclusion (resp. reverse inclusion) of extents (resp. intents)

$$(A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 \iff B_2 \subseteq B_1. \tag{14}$$

With the concept order, the set of formal concepts $\langle \mathfrak{B}(K, \hat K, I), \le \rangle$ is actually a complete lattice called the concept lattice (CL) of the formal context $\mathbb{K}$.

This technique on boolean matrices is used as a stepping stone to develop KFCA, a technique to build concept lattices from multi-valued confusion matrices. We claim that the inclusion order of confusions between true and estimated labels in a CM can be efficiently represented by concept lattices. After the pointwise mutual information defined in Equation (1), the true and estimated labels can be confused to a certain degree $\varphi$. So it makes sense to sweep over the different values of $\varphi$ observed for the data, using each as a threshold to obtain a binary matrix for each value of $\varphi$. In this manner, we obtain a sequence of contexts $\{I_\varphi\}$ indexed by $\varphi$, each amenable to FCA:

$$I_\varphi \equiv \left( \widehat{MI}_{P_{K\hat K}}(k_i, \hat k_j) \ge \varphi \right).$$

A detailed description of exploring contingency matrices with KFCA is presented in [17] for the analysis of human phonetic confusions. In [31] recipes are explained to choose a judicious $\varphi$, so that it can be used as an illustration of the affordances of FCA to support data-induced scientific enquiry and discovery in Gene Expression Data analysis. An example of this is shown in Fig. 3, and a guide on how to read confusions off it can be found in the Supplemental Materials, available online. A brute-force sketch of these constructions is given below.
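The following sketch illustrates both constructions at a small scale: thresholding the empirical PMI matrix at a degree phi as above, and enumerating the formal concepts of the resulting binary context by brute force. This is adequate for confusion matrices, whose dimensions are the (small) number of classes, but it is emphatically not the KFCA algorithm of [29], [30]; the helper names are ours.

```python
import numpy as np

def kfca_context(pmi, phi):
    """Binary context I_phi: which (true, estimated) label pairs are
    confused to degree at least phi, per the empirical PMI of Eq. (1)."""
    return (np.asarray(pmi) >= phi).astype(int)

def formal_concepts(I):
    """All formal concepts (extent, intent) of a 0/1 context I, brute force.

    Every intent is the closure of some attribute subset, so enumerating
    all 2^k' candidate intents and closing them finds every concept.
    """
    k, kp = I.shape
    def extent(B):   # polar: rows related to every column in B
        return frozenset(i for i in range(k) if all(I[i, j] for j in B))
    def intent(A):   # polar: columns related to every row in A
        return frozenset(j for j in range(kp) if all(I[i, j] for i in A))
    concepts = set()
    for bits in range(2 ** kp):
        B = frozenset(j for j in range(kp) if (bits >> j) & 1)
        A = extent(B)
        concepts.add((A, intent(A)))   # (A, A^up) is always a concept
    return sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0])))
```

Sweeping phi over the distinct values observed in the PMI matrix then yields the sequence of contexts I_phi described above, from coarse groupings at high phi to finer confusion structure at low phi.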
2.6 Related Work and Discussion

The main contribution of this paper is the introduction of an assessment framework using information-theoretic tools to analyze the performance of classifiers, grounded on the metaphor/model of an information-transmitting channel for classification tasks.

The existence of such frameworks and metaphors is capital for the use of assessment techniques, since metaphors "enable" a way of thinking about the task, while frameworks "prescribe" a way to use metaphors to advantage. These are capital questions for the progress of science and we believe widely-cited works like [32] are just about proposing and understanding such frameworks. Incidentally, we also adopt the suggestion that the need for evaluation arises from typical classification tasks [32, Section 9.1], which are well exemplified in the archetypal use cases to be presented in the next Section.

Note that a model for motivating information-theoretic assessment procedures is missing in [32]—although individual information measures are not. Indeed, our work could be conceived as fleshing out that framework for assessment based on a completely different model of what a good classifier is, viz. one which is capable of maximizing the mutual information between the real and predicted class labels. This is an alternate, but certainly highly correlated, criterion to that of minimizing classification errors. But in our opinion, the assumptions and heuristics flowing from our hypothesis and their entailments are all new and perspective-opening.

An initial step towards the development of this framework was implicitly taken in [17] by using an exploratory, pointwise mutual information-based technique to elicit the errors in confusion matrices with the help of formal concept analysis [28]. This was greatly expanded in [31] towards a full-fledged exploratory technique whose application possibilities somewhat exceed the present ones. Later, [7], [11] provided the theoretical groundwork for the visualization and assessment of confusion matrices based on information-theoretic concepts. In particular, [11] discusses the suitability of information-theoretic metrics on imbalanced datasets, evaluating synthetically generated multi-class classification problems with progressively increased degrees of imbalance in their classes. These ideas had been previously applied to Automatic Speech Recognition in [33].

There are alternatives to all of the tools and techniques we recommend in this paper, but most have strong objections to their use, as we outline in the following paragraphs.

A detailed confusion matrix is often presented as an output of a machine learning task, either in the form of a heatmap (see an example in Fig. 3b) or merely numerically. Also, the behavioral sciences frequently choose this representation to illustrate the results of perception experiments. Those representations do not answer the questions proposed in Section 2.5 satisfactorily, since their exploration is not principled or systematic and is usually guided (and biased) by researcher intuition. The hierarchical ordering of input and output classes as interpreted by the classifier is what is revealed by a concept lattice. Its representation as an order diagram provides an entirely data-driven visualization.

In a classification setting, the optimal classifier's distribution would be diagonal, $\hat P_{K\hat K}(k_i, \hat k_j) = \frac{n_{i\cdot}}{S}\,\delta(i,j)$, and the worst possible is row-constant, $\hat P_{K\hat K}(k_i, \hat k_j) = \frac{n_{i\cdot}}{S\,k'}$, with each true class spread uniformly over the estimated labels. But we are at a loss as to how to assess the classifier at a glance from such an estimated distribution in intermediate cases. Direct observation by human experts is usually the way to induce conclusions from these data. Sometimes this involves some sort of block diagonalization procedure using mostly heuristic rules involving the choice of a particular metric that should be corroborated a posteriori. In [34], for example, a taxicab metric was defined in order to find an optimal ordering: the metric weighs down the off-diagonal values of the CM in proportion to their distance to the diagonal. Another representation, the directed threshold graph, is proposed in [35, p. 52]. Note that these issues are laid to rest by the use of the Entropy Triangle as a plotting technique and its interpretation (see Supplemental Materials, available online).

Custom has it that CM be evaluated by an error-counting measure like accuracy, which has several problems [36]. For instance, [13] highlights the problem of accuracy on imbalanced sets and [11] shows how in this case accuracy is grossly overestimated. EMA was devised to correct for this effect.

The balance Equation (2) implies that mutual information maximization entails the minimization of the variation of information of (5), since the divergence from uniformity $\Delta H_{P_K}$ is an inherent quantity of the marginals of $P_{K\hat K}$ and unmodifiable by classifiers. The variation of information was found to be an important quantity for the evaluation of unsupervised clustering in [37]. There it was proven to be a true metric or measure of dissimilarity, which agrees with the intuition that the mutual information is a similarity between distributions. The mutual information maximization process entails an increase in the NIT factor parallel to an increase in EMA.

How classifiers can actually "cheat" in reporting accuracy is by artificially (optimistically) increasing the divergence from uniformity of the predicted labels, $\Delta H_{P_{\hat K}}$: this increases accuracy—in a non-generalizable way—at the expense of the variation of information, but leaves the mutual information unchanged, so the NIT factor and EMA are unaffected.

Note that the present endeavor is different from that of inducing classifiers, e.g., by mutual information maximization. The work in [38] provides the groundwork to relate mutual information maximization to error probability minimization independently of dataset class cardinality using Fano's inequality [22]: as a general rule, the higher the mutual information, the lower the bound on the error probability, and this seems to be aligned with our intuition. Along these lines, [39] provides a comparison of Bayesian and maximum mutual information classification for two-class tasks. It would be interesting to develop a systematic comparison of information-based induction schemes, e.g., MaxEnt or logistic regression, to others based on non-entropic loss functions under this framework. This is left for future work.

Some reviewers of this work expressed a concern about the algorithmic complexity of the techniques presented: we believe this is not a concern at present. On the one hand, the starting data for our analysis are the joint distributions of prediction errors as obtained from count distributions by the maximum likelihood estimator. The complexity of these assessment procedures is due to the analysis of the confusion matrices, which is very conservatively estimated as $O(k^2)$ in the number of classes $k$. Given that datasets typically contain at present at most hundreds of classes but millions—sometimes billions—of instances/samples for induction, the cost of the assessment is negligible compared to the cost of learning. On the other hand, the assessment measures and visualizations are straightforward computations of mathematical formulae from Information Theory. Their interest lies in the communication channel metaphor and the framework introduced to make the assessment principled and systematic, and not in their algorithmic implementations—indeed we provide four packages which have been developed by three different people yet have similar time complexity.

The size of the dataset $S$, however, has an important impact on the accuracy of the probability estimates from the error counts collected for each experiment. As a rule of thumb, we trust the probability estimators if the sampling procedure is correct and the number of classes respects the rule $k < \sqrt{S}$. Note also that the estimation of entropy is a well-known problem in the field, and off-the-shelf implementations are available for a number of major data processing platforms.

3 EXAMPLE USE CASES AND EXPERIMENTS

We next present several different use cases for the framework presented above: these are archetypal cases in which to apply the techniques. Code for each different use is available on request. A full example spanning several of these cases, around the KDD Cup 99 on intrusion detection, is presented in the Supplemental Materials, available online.

3.1 Assessing a Single Classifier on a Single Task

Using the ET for single classifiers is an interesting technique when the practitioner cannot control the induction of the classifier, or when there is a particular instantiation of it which is "natural", as in the case of classifiers embodied in living beings.

On such classifiers it also makes sense to use exploratory analysis to try to find systematic confusions between true and estimated labels. For this purpose, the ET and K-Formal Concept Analysis have already been successfully used in the evaluation of phoneme recognition by humans [7], [40]. Here, the ET is just a tool to visualize the entropies related to the perceptual channel, while the confusion lattice is an effective tool to elicit significant errors committed by the classifier.

Note that in the absence of a model of classification by humans, or a comparison of the performance of the different perceptual modalities, it makes no sense to rank systems using EMA or the NIT factor.

Example 1. In the following examples we present the performance of humans in several perceptual tasks:

- Visual distinction of segmented numeral digits in humans. In this task from [41], the $k = 10$ seven-segment digits of digital displays were presented to testers under different conditions designed to guarantee that confusions would appear.

- Acoustic-perceptual distinction of Morse codes in humans. Likewise, in [42] human testers would listen to $k = 36$ Morse codes for English letters and digits and provide the transcodified symbol under test conditions that induced confusions.
- Odorant confusions in healthy humans. In [20], $k = 10$ different odorants, selected on the basis that they could be identified by a population, were presented to 10 female and 10 male subjects and their confusions noted down.

TABLE 1
Perceptual Confusion Matrix Analysis

perceptual confusions |  k |   k_K  | k_{K|K̂} | μ_{KK̂} |   a   |   a'  |   q
odorants              | 10 | 10.000 |   1.378 |   7.256 | 0.926 | 0.726 | 0.726
seg. numerals         | 10 | 10.000 |   2.985 |   3.350 | 0.699 | 0.335 | 0.335
morsecodes            | 36 | 35.178 |  25.336 |   1.388 | 0.127 | 0.039 | 0.039

(Perplexities $k$, $k_K$, $k_{K|\hat K}$, accuracy $a$, modified accuracy $a'$, and normalized information transfer factor $q$ for perceptual confusion matrices. The number of instances in the datasets is 997, 9,997, and 25,438, respectively, and therefore $k < \sqrt{S}$ in every case.)

Fig. 2. Entropy triangle for some human perceptual classification tasks.

Information-Theoretical Metrics. Fig. 2 represents the human performance in the tasks mentioned above by means of the entropy triangle. The levels of performance in each task are incomparable.

Notice from Table 1 that $k \approx k_K$, implying that the experimental designs made sure that the task was as hard as possible, by balancing stimuli in the input distribution, since the perplexity of the classification problem is almost exactly equal to the number of classes. As a consequence, the EMA equals the NIT factor, since $d_K = 1$. This is also evident from Fig. 2, since all of the tasks are situated along the left-hand side of the ET. It is remarkable how their designers were already aware that, to correctly assess the confusability of the classes and to avoid the influence of prior knowledge about their distribution, they should keep these values similar.

Classification Error Analysis. When the classifier is as interesting to grasp as those enacting human performance, it is useful to know whether it commits systematic confusions, with the help of KFCA (Section 2.5). A confusion lattice for the odorant confusion matrix from [20] is presented in Fig. 3a for $\varphi = 1.78$. Different values of $\varphi$ allow us to observe the perceptual confusions with different levels of detail. See [17] or [43] for a detailed explanation of the evolution of the CL with this variable. The Supplemental Materials, available online, contain a short guide to interpreting CL diagrams.

Fig. 3. Two information-theoretic representations of the Odorant confusions: (a) the Concept Lattice diagram at $\varphi = 1.78$ (stimuli are labelled in white and responses in gray) and (b) a gray-scale heatmap of its mutual information $\widehat{MI}_{K\hat K}$.

Fig. 3a allows us to observe three differentiated groups as adjunct sub-lattices. This is an important observation since it means that at the level of detail chosen there are no confusions across these groups:

- Right sub-lattice: stimulus "Licorice" (white label) is only correctly perceived as "Licorice" (gray label), meaning that it is the only perfectly characterized odorant.
- Center sub-lattice: stimulus "Vanilla" (white label) is correctly perceived as "Vanilla" and wrongly as "Peppermint" (reading the gray labels from the node upwards); however, stimulus "Peppermint" (white label) is only rightly perceived as "Peppermint" (gray label).
- Left sub-lattice: including the rest of the odorants, it is much more complex. Two loosely profiled concepts appear right below the top, with the
responses "Turpentine" and "Clorox" (gray labels), respectively, since they are elicited from a large number of stimuli. On the one hand, "Vicks" and "Vinegar" are wrongly perceived as "Turpentine" together with the right answer "Turpentine" (white labels), and on the other hand "Vinegar", "Ammonia" and "Roses" are (wrongly) perceived as "Clorox" as well as correctly as themselves (gray labels). The stimulus "Clorox" is, however, only correctly perceived as itself ("Clorox", gray label). The leftmost concept ({Mothballs, Vicks}, {Mothballs}) represents an odorant ("Mothballs") that is always correctly recognized as itself but does not constitute a separate sublattice, since the subjects of the experiment also perceived stimulus "Vicks" as "Mothballs". We can say that the lower concept ({Vicks}, {Mothballs, Vicks, Turpentine}) is more defined than the former since it has three attributes (responses) instead of one, or that stimulus "Vicks" is more generic than "Mothballs", since the former is confused with the latter whilst the latter is only returned as a single response (itself). It is therefore more specific and, in general, we can see more specific stimuli appearing in upper nodes of the lattice.

The structure and dimensionality of the odorant perceptual spaces has remained elusive [44], and an explanation of the large number of human odorant receptor types (about 1,000) and the comparatively reduced dimensionality observed in perception and behavior [45] is still an open research question. Wright et al. [20] pioneered the use of confusion matrices for this task, and this is why we have chosen their data to showcase the potential of our framework. Its interpretation is, however, beyond the scope of this paper. We shall just mention that in that work the authors point out that "Clorox", "Ammonia" and "Vinegar" have a strong trigeminal component, and we can observe them interrelated in the same sub-lattice. Also, the fact that "Vicks" contains "Turpentine" is clearly portrayed in the lattice diagram.

3.2 Assessing Several Classifiers on the Same Task

The next performance assessment case demonstrates the use of a population of classifiers trying to solve a single task. This is the envisioned use for people who need to solve a specific task using classification. It is perhaps its most widespread use, scientifically speaking.

The assessment process after classifier population induction (see footnote 5) consists basically in the following steps (a code sketch tying them together follows the list):

1) Using $k_K$ to assess the effective number of classes in the data. This gives an idea of how difficult the task is to solve: the closer $k_K$ is to $k$, the more difficult the task. On skewed, easier tasks this number will be much less than $k$ and closer to 1. As a limit, $k_K \approx 1$ would declare this to be a detection, not a classification, task.
2) Using the ET to individually assess each classifier with respect to the rest of the population of classifiers. This particular type of ET should include the straight line at $\Delta H_{P_K}$ described in Section 2.3, and provide an overall visual summary of the performance of the chosen set of classifiers on the task.
3) Using EMA to rank classifiers. This provides a distribution-independent approximation to the classical concept of accuracy, hence leveraging previous experience of the human evaluator.
4) Using the NIT factor to assess whether the population of classifiers has solved the task. This part of the evaluation should conclude at what stage of solving the task the population of classifiers being assessed is.

Footnote 5: Note this population can also include several instances of the same classification algorithm with different values for its free parameters.
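The sketch below strings the previous helpers into this workflow for a toy population on one binary task, covering steps 1, 3 and 4 (step 2 is the ET plot itself). The two count matrices are made-up placeholders standing in for confusion matrices exported from, e.g., Weka; ema_nit() is the function defined in the Section 2.4 sketch.

```python
import numpy as np

# Hypothetical population: name -> confusion matrix (rows = true labels).
population = {
    "majority-rule": np.array([[70,  0],    # always answers class 0, like ZeroR
                               [30,  0]]),
    "learner":       np.array([[60, 10],
                               [ 5, 25]]),
}

results = {name: ema_nit(N) for name, N in population.items()}
k_task = results["learner"]["k_K"]          # step 1: effective number of classes
print(f"task perplexity k_K = {k_task:.3f}, fair baseline a' = {1 / k_task:.3f}")

# steps 3-4: rank by EMA, then inspect NIT to see how much was actually learned
for name, r in sorted(results.items(), key=lambda kv: -kv[1]["EMA"]):
    print(f"{name:>13}: EMA = {r['EMA']:.3f}  NIT = {r['NIT']:.3f}")
```

On this toy data the majority-rule matrix gets mu = 1, so its EMA collapses to 1/k_K and its NIT to 1/k, reproducing in miniature the ZeroR behavior discussed in the example that follows.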

Examples of the use of this evaluation scheme can already be found in [46], where it was applied to the results of a Sentiment Analysis task. We now present an example on a well-known task and classifier induction schemes for explanatory purposes.

Example 2. Consider the UCI Annealing dataset as a classification task [47] (see footnote 6). This task has $k = 6$ classes, but it is highly imbalanced and has empty classes. In fact, from Table 2 we can see that its actual perplexity is $k_K = 2.281$, implying that the task is slightly more difficult than a two-class balanced one, hence a fair baseline should be $a'(P_{K\hat K}) = \frac{1}{2.281} = 0.43$.

Footnote 6: https://archive.ics.uci.edu/ml/datasets/Annealing. Access date: 09/19/2016.

TABLE 2
Ranking of Weka Classifiers on Accuracy

Weka classifier    |  k_K  | k_{K|K̂} | μ_{KK̂} |   a   |   a'  |   q
trees.RandomForest | 2.281 |   1.042 |   2.189 | 0.993 | 0.959 | 0.365
lazy.IBk           | 2.281 |   1.056 |   2.160 | 0.991 | 0.947 | 0.360
functions.Logistic | 2.281 |   1.049 |   2.174 | 0.991 | 0.953 | 0.362
meta.Bagging       | 2.281 |   1.077 |   2.117 | 0.986 | 0.928 | 0.353
trees.J48          | 2.281 |   1.088 |   2.096 | 0.984 | 0.919 | 0.349
rules.JRip         | 2.281 |   1.090 |   2.093 | 0.983 | 0.917 | 0.349
functions.SMO      | 2.281 |   1.131 |   2.017 | 0.974 | 0.884 | 0.336
bayes.NaiveBayes   | 2.281 |   1.224 |   1.864 | 0.863 | 0.817 | 0.311
meta.AdaBoostM1    | 2.281 |   1.749 |   1.304 | 0.836 | 0.572 | 0.217
rules.ZeroR        | 2.281 |   2.281 |   1.000 | 0.762 | 0.438 | 0.167

(Perplexities $k_K$, $k_{K|\hat K}$, accuracy $a$, EMA $a'$, and NIT factor $q$ for some classifiers in Weka on the UCI anneal task.)

Using the Weka explorer [48] we induced the 10 different classifiers in the first column of Table 2, with the default parameters supplied by the software, to produce the ranking induced by accuracy $a(P_{K\hat K})$.

The choice of classifiers was suggested by similar experiments reported in [32], to which trees.RandomForest and rules.ZeroR were added. The first of these had proven a good generic classifier in dataset exploration, and the second one is a specialized classifier in the sense of the definitions of Section 2.3.

The highest-ranking classifiers provided almost perfect accuracy, yet somewhat lower EMA. Although the example does not show it in full, in our experience a common finding in these population evaluations is that the top-ranked classifiers would be slightly re-ranked by EMA. Occasionally, the second or third classifiers are better by EMA than the one ranked first by accuracy [46], although in this example we would only see a re-ranking of the second and third classifiers, since most classifiers solve the problem very well.

However, it is remarkable that the lowest-ranking classifier actually did not use any information in the induction process, in spite of its claiming to offer an accuracy of $a(P_{K\hat K}) = 0.762$. Indeed, its remanent perplexity is the same as the actual perplexity, $k_{K|\hat K} = k_K$, entailing that there is no mutual information between input and output, confirmed by the fact that $\mu_{K\hat K} = 1$. Note that its NIT factor is the worst possible, at $q(P_{K\hat K}) = \frac{1}{k}$ (see footnote 7).

Footnote 7: Of course, this is the typical behavior of the ZeroR algorithm, since it selects the majority class in the dataset and uses it to make all the predictions. In this case, this is very profitable from the point of view of classical accuracy, since the database is very imbalanced. This trivial algorithm is usually employed as the minimum threshold of performance for this reason; an example of an extreme behavior that we can observe in a more moderate way in AdaBoostM1 as well.

Fig. 4. Entropy triangle for some Weka classifiers on the UCI anneal task.

The entropy triangles in Fig. 4 provide a more visual assessment. For example, we can observe how ZeroR appears naturally at the bottom side, providing the right intuition that no properly trained algorithm should be below it. Recall that the height amounts to mutual information transferred from input to output labels, so the rules.ZeroR classifier actually transmits no information from input to output.

The best classifiers in this population, for this task, seem to be trees.RandomForest, lazy.IBk and functions.Logistic, almost indistinctly. Note that the line at $\Delta H'_{P_K} = 0.54$ implies any classifier can at most reach a maximum $MI'_{P_{K\hat K}} = 1 - 0.54 = 0.46$, inducing a NIT factor of $q(P_{K\hat K}) = 2^{0.46 \log_2 6}/6 = 0.38$, due to the lack of balance in the dataset. Indeed, the best three classifiers are not far away from these values, implying that such techniques actually solve this classification task satisfactorily. We can also clearly observe in the ET how the three lowest-ranked algorithms of the table employ different strategies to solve the problem from the information-theoretic point of view: while ZeroR and AdaBoostM1 minimize the error count by specializing on the majority class (note their positions towards the lower right-hand angle of the ET), NaiveBayes actually obtains a fairly good NIT and an EMA close to the one obtained by the best classifiers. If we had only paid attention to the classical accuracy metric, we would have discarded this classifier, which is actually not that bad. This can also be observed in Fig. 4b, where the colour bar to the right is proportional to the NIT.

3.3 Assessing a Single Classifier Induction Technique on Different Tasks

This is the type of use we envision for classification algorithm research and design. The practitioner has a battery of tasks for evaluation with different class cardinalities and, if possible, with different input marginals, ranging from uniform to highly skewed (or imbalanced) ones. The purpose of the uniform marginals is, of course, to subject the induction technique to the most stringent learning conditions.

Example 3. Consider the functions.Logistic classifier of Weka, implementing multiclass logistic regression. We obtained confusion matrices for 10 different UCI tasks and plotted them in the ET in Fig. 5.

The assessment process after classifier induction consists basically in:

1) Using $k_K$ to assess the effective number of classes in the data. This allows us to see the real complexity of the tasks.
2) Using the ET to individually assess the classifier with respect to the population of tasks. This provides a visual summary of the performance on different tasks with respect to the tasks' class balance.
3) Using the NIT factor to rank the tasks. The rank provides insight into the learning capabilities of the algorithm as expressed in each of the tasks.
4) Balancing EMA versus NIT factor. EMA is the criterion for performance in solving the task, so the previous ranking based on NIT should be set against the expected performance. We disregard here the use of accuracy altogether.
Fig. 5. A study of the logistic classifier on a population of UCI tasks.

TABLE 3
Performance of functions.Logistic on Some UCI Tasks

functions.logistic |  k |   k_K  | k_{K|K̂} | μ_{KK̂} |   a   |   a'  |   q
iris               |  3 |  3.000 |   1.161 |   2.584 | 0.960 | 0.861 | 0.861
soybean            | 19 | 14.276 |   1.243 |  11.485 | 0.939 | 0.805 | 0.604
hepatitis          |  2 |  1.664 |   1.389 |   1.198 | 0.890 | 0.720 | 0.599
diabetes           |  2 |  1.909 |   1.705 |   1.120 | 0.772 | 0.586 | 0.560
liver disorders    |  2 |  1.975 |   1.835 |   1.076 | 0.704 | 0.545 | 0.538
breast cancer      |  2 |  1.838 |   1.803 |   1.019 | 0.689 | 0.555 | 0.510
vowels             | 11 | 11.000 |   2.027 |   5.428 | 0.818 | 0.493 | 0.493
anneal             |  6 |  2.281 |   1.049 |   2.174 | 0.991 | 0.953 | 0.362
glass              |  7 |  4.521 |   2.532 |   1.786 | 0.640 | 0.395 | 0.255
audiology          | 24 | 10.718 |   1.759 |   6.094 | 0.792 | 0.569 | 0.254

(Number of classes $k$, perplexities $k_K$, $k_{K|\hat K}$, accuracy $a$, EMA $a'$ and NIT factor $q$ for functions.Logistic confusion matrices over a sample of UCI tasks. Ranking is by NIT factor. The number of instances in the datasets is 150, 307, 155, 442, 345, 569, 640, 798, 214 and 226, respectively, and therefore $k < \sqrt{S}$ in every case except soybean and audiology.)

We can see that the range of source information hypothetically available for transmission ($H_{P_K} = \log k_K$) is wide. We might hypothesize that those tasks with fewer or more imbalanced classes would be easier if we only attend to the classical accuracy metric. The analysis of Table 3 dispels this idea for logistic regression.

For instance, iris ($k_K = 3$) and vowels ($k_K = 11$) are balanced but reach the highest information transfer from input to output. On the other hand, anneal, glass and audiology are very imbalanced and they show very poor NIT factors (see footnote 8). Note that the NIT factor takes precedence as ranking criterion because we are specifically interested in how good the algorithm is at learning information.

Footnote 8: Of course, apart from the perplexity, there are other aspects of the classification problem that influence how the classifier is learning that are not being considered here, notably the representation capabilities of the features.

The EMA and NIT factor also show almost linear correlation for many tasks, with notable outliers, in particular anneal and audiology, where the EMA is unreasonably large with respect to the learned information. Note that this is also the case with respect to standard accuracy. Very clearly, in the cases of anneal and audiology the classifier has been able to leverage the imbalance of the dataset to produce a high accuracy. But this is not the case, e.g., for glass.

3.4 Assessing Several Classifier Induction Techniques on Different Tasks

This mode of comparison is useful to demonstrate, for instance, the "no free lunch" in supervised classifier induction: several classifier techniques are shown side by side on different tasks, with different class cardinalities and marginals, which should be enough to manifest differences between the classifier techniques. Since each task has a different range of EMA and NIT factor, there is no point in showing a ranking on these or, for that matter, any other ranking. For practical purposes we suggest to:

1) Select tasks that are somehow comparable. For instance, choose tasks with the same class cardinality and similar input entropy. In a different direction, choose tasks of the same class cardinality but a gradation in input entropy.
2) Represent each of these tasks and the classifiers chosen as suggested in Section 3.2, to get the gist of how the population of classifiers is capable of solving these tasks.
3) Represent all tasks together in the same ET to get a better grasp of the complexities of the parameters being investigated, for instance as in Fig. 6.

Fig. 6. A study of three UCI tasks, liver-disorders (red), breast-cancer (blue) and hepatitis (purple), with a sample of classifier technologies on the ET.

Example 4. We consider three UCI tasks: liver-disorders and breast-cancer—as suggested by [32, p. 83]—and we add hepatitis for illustration purposes. All are binary but have different actual perplexities, $k_K^{ld} = 1.97$, $k_K^{bc} = 1.84$, and $k_K^{he} = 1.66$, respectively, whence the different dashed lines in Fig. 6. Fig. 6 also shows the performances of a population of classifiers induced with different techniques on these three tasks. The different colors help tell the tasks apart.

A number of inferences can be extracted from Fig. 6 and Table 4, which show perplexities, accuracies, EMA and NIT factors on these tasks.

VALVERDE-ALBACETE AND PELAEZ-MORENO: A FRAMEWORK FOR SUPERVISED CLASSIFICATION PERFORMANCE ANALYSIS WITH... 2085

TABLE 4
Performance of Several Weka Classifiers on Some UCI Tasks

liver-disorders
Classifier           k_K    k_{K|K^}  m_{KK^}  a      a'     q
functions.Logistic   1.975  1.835     1.076    0.704  0.545  0.538
trees.RandomForest   1.975  1.831     1.078    0.699  0.546  0.539
meta.Bagging         1.975  1.846     1.070    0.696  0.542  0.535
trees.J48            1.975  1.860     1.062    0.687  0.538  0.531
meta.AdaBoostM1      1.975  1.896     1.042    0.661  0.527  0.521
rules.JRip           1.975  1.911     1.033    0.646  0.523  0.517
lazy.IBk             1.975  1.918     1.029    0.629  0.521  0.515
functions.SMO        1.975  1.970     1.003    0.583  0.508  0.501
rules.ZeroR          1.975  1.975     1.000    0.580  0.506  0.500
bayes.NaiveBayes     1.975  1.931     1.023    0.568  0.518  0.511

breast-cancer
Classifier           k_K    k_{K|K^}  m_{KK^}  a      a'     q
functions.Logistic   1.838  1.803     1.019    0.689  0.555  0.510
trees.RandomForest   1.838  1.150     1.598    0.969  0.870  0.799
meta.Bagging         1.838  1.570     1.171    0.832  0.637  0.585
trees.J48            1.838  1.744     1.053    0.755  0.573  0.527
meta.AdaBoostM1      1.838  1.726     1.065    0.755  0.579  0.532
rules.JRip           1.838  1.715     1.071    0.769  0.583  0.536
lazy.IBk             1.838  1.106     1.661    0.979  0.904  0.831
functions.SMO        1.838  1.720     1.068    0.762  0.581  0.534
rules.ZeroR          1.838  1.838     1.000    0.703  0.544  0.500
bayes.NaiveBayes     1.838  1.767     1.040    0.717  0.566  0.520

hepatitis
Classifier           k_K    k_{K|K^}  m_{KK^}  a      a'     q
functions.Logistic   1.664  1.389     1.198    0.890  0.720  0.599
trees.RandomForest   1.664  1.068     1.557    0.987  0.936  0.779
meta.Bagging         1.664  1.485     1.121    0.865  0.674  0.560
trees.J48            1.664  1.313     1.267    0.923  0.762  0.634
meta.AdaBoostM1      1.664  1.384     1.203    0.897  0.723  0.601
rules.JRip           1.664  1.367     1.217    0.877  0.732  0.609
lazy.IBk             1.664  1.146     1.452    0.968  0.873  0.726
functions.SMO        1.664  1.401     1.188    0.884  0.714  0.594
rules.ZeroR          1.664  1.664     1.000    0.794  0.601  0.500
bayes.NaiveBayes     1.664  1.406     1.183    0.858  0.711  0.592

Perplexities $k_K$ and $k_{K|\hat K}$, accuracy $a$, EMA $a'$, and NIT factor $q$ for some classifiers in Weka on the UCI liver-disorders, breast-cancer, and hepatitis tasks. Rows are ranked by accuracy on liver-disorders.
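As a reading aid for Table 4, the following minimal sketch recomputes its quantities from a raw confusion matrix of counts. It is an illustration rather than any of the published toolboxes (the function name is ours), relying on the relations $a' = 1/k_{K|\hat K}$, $m_{K\hat K} = k_K/k_{K|\hat K}$ and $q = m_{K\hat K}/k$ that the table itself obeys.

import numpy as np

def entropic_measures(N):
    """Quantities of Table 4 from a matrix of counts
    N[i, j] = #(true class i, estimated class j)."""
    P = N / N.sum()                    # joint distribution P_{KK^}
    p_k = P.sum(axis=1)                # true-label marginal P_K
    p_khat = P.sum(axis=0)             # estimated-label marginal P_K^

    def H(p):                          # Shannon entropy in bits
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    h_cond = H(P.ravel()) - H(p_khat)  # H(P_{K|K^}) = H(P_{KK^}) - H(P_K^)
    k_K = 2.0 ** H(p_k)                # perplexity of the true labels
    k_cond = 2.0 ** h_cond             # conditional perplexity k_{K|K^}
    m = k_K / k_cond                   # m_{KK^} = 2^{MI}, information transfer
    a = np.trace(P)                    # standard accuracy
    ema = 1.0 / k_cond                 # EMA a'
    q = m / len(p_k)                   # NIT factor, with k = number of classes
    return dict(k_K=k_K, k_cond=k_cond, m=m, a=a, ema=ema, q=q)

For instance, any rules.ZeroR confusion matrix yields $k_{K|\hat K} = k_K$, hence $m_{K\hat K} = 1$ and $q = 1/k$, exactly the 1.000 and 0.500 entries of its rows above.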

The first thing to be noticed is that the liver-disorders task proves unsolved by any of the classifiers chosen, a result already noticed in other works and blamed on the poor quality of the features [49]. Neither are breast-cancer and hepatitis satisfactorily solved. This could have been seen in isolated triangles similar to those of Section 3.2.

A recurrent finding is that the majority rule rules.ZeroR is essentially a classifier that transmits no information, whatever the accuracy it may claim on any task. This is an inherent characteristic of that classifier rule, made all the more evident by the entropic diagrams; but note as well that functions.SMO shows a similar behavior on liver-disorders, which is more unexpected and relevant.
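This is easy to verify numerically: a majority-rule classifier produces a confusion matrix with a single non-empty column, for which the mutual information vanishes no matter the accuracy. A minimal sketch, assuming an illustrative 58/42 prevalence akin to that of liver-disorders:

import numpy as np

# "ZeroR-like" behavior: every sample is assigned to the majority class 0,
# so only the first column of the count matrix is populated.
N = np.array([[58, 0],
              [42, 0]])
P = N / N.sum()
p_k, p_khat = P.sum(axis=1), P.sum(axis=0)

H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
mi = H(p_k) + H(p_khat) - H(P.ravel())   # mutual information in bits

print(np.trace(P))    # accuracy a = 0.58: looks respectable
print(2.0 ** mi)      # m_{KK^} = 1.0: not a single bit is transferred
print(2.0 ** mi / 2)  # NIT factor q = 0.5, the floor for a binary task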
With careful reading we may also notice a manifestation of the no-free-lunch theorem: the mutual information transmitted by any classifier is not consistently superior to that of every other classifier. Fixing on a particular classifier, say trees.RandomForest or lazy.IBk, you can see this in action, since the tasks were chosen, among others, for this purpose. Should you see some consistency in the relative behavior of two classifiers, it is a mirage easily dispelled by including other tasks.

3.5 Materials

The results we present in this work have been obtained with software operating over the results of the Weka framework [48] for machine learning on some of the UCI tasks:

• The ET, EMA and NIT can be calculated with publicly available software:
  - There is a Weka plugin for the Entropy Triangle, EMA and NIT.9
  - There is a Matlab toolbox for the Entropy Triangle.10
  - There is an experimental R package for the Entropy Triangle.11
  - There is a Python module for the Entropy Triangle.12
• Some of the UCI tasks [47] used in this work are distributed with Weka itself and some have been downloaded from the UCI repository.13
• To carry out a K-Formal Concept Analysis exploration of a matrix, a web service with an advanced interface is available. Confusion matrices—among other types of contingency matrices—can be analyzed.14 Also, Matlab code is available from the authors on request.
• There are many open-source resources for FCA of binary data matrices.15

4 CONCLUSIONS AND FURTHER WORK

The main contribution of this paper is a framework for classifier-on-a-dataset performance assessment and visualization based on information-theoretic principles. In order to do so, we make explicit a model of a virtual channel between the true and estimated class labels. In this model, our notion of what a good classifier is aligns with the heuristic of mutual information maximization between real and estimated class label distributions: the ET visually favors the vertical dimension that displays mutual information, and the NIT factor actually quantifies the learning performance of classifiers. EMA, as a measure of error, is tied to an expectation obtained from "whitening" the perplexity of the dataset, first by extracting the dataset entropy and then by extracting the mutual information as captured by the classifier. In this sense it is a measure derived from both mutual information and actual task perplexity.

Notice that the actual description of $K = k_s$ in terms of a vector of features $X = x_s$ and the classifier that embodies the classification process are not taken into consideration for the tallying, and this is what makes possible an evaluation independent of the technological decisions around classifier induction. Therefore the model applies whether the agent carrying out the classification is natural, e.g., in perceptual tasks, or artificial, e.g., a machine.
9. http://apastor.github.io/entropy-triangle-weka-package/
10. http://www.mathworks.com/matlabcentral/fileexchange/30914-entropy-triangle
11. https://github.com/FJValverde/entropies
12. https://github.com/Jaimedlrm/entropytriangle
13. http://repository.seasr.org/Datasets/UCI/arff/
14. https://webgenekfca.com/general/changetype/kfca
15. http://www.upriss.org.uk/fca/fca.html
The assessment framework is completed with a more detailed exploration of the structure of the confusions of a single classifier: concept lattices are presented as an adequate means for displaying clusters of confusions and their hierarchical organization, so as to provide insights into the underlying natural phenomena being analyzed or into the way a machine classifier has captured the essence of the task being learned. This provides guidance for improving its performance, either in the design of the classifier or in the feature extraction procedure for its representation, in the case of machine classifiers. For human perceptual tasks, it may help in eliciting how our sensory apparatus processes stimuli. But how to make this knowledge flow back into the induction step is left for future work.

With respect to previous assessment tools on populations of classifiers with the ET, we have added in this paper the segment at constant $\Delta H_{P_K}$ in Sections 2.3 and 3.4. The segment provides a clearer view on how good classifiers are by highlighting its intersection with the side $VI'_{P_{K\hat K}} = 0$ where the best classifiers for a task should lie.
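For reference, the coordinates that place a classifier on the ET can be recomputed as follows. This is a minimal sketch of the balance equation $\Delta H_{P_{K\hat K}} + 2\,MI_{P_{K\hat K}} + VI_{P_{K\hat K}} = H_{U_K} + H_{U_{\hat K}}$ that underlies the diagram; function and variable names are ours.

import numpy as np

def et_coordinates(N):
    """Normalized ET coordinates (DeltaH', 2MI', VI') of a confusion
    matrix of counts; the three coordinates sum to 1."""
    P = N / N.sum()
    p_k, p_khat = P.sum(axis=1), P.sum(axis=0)

    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    h_k, h_khat, h_joint = H(p_k), H(p_khat), H(P.ravel())

    h_u = np.log2(N.shape[0] * N.shape[1])  # entropy of the uniform marginals
    delta_h = h_u - h_k - h_khat            # divergence from uniformity
    mi = h_k + h_khat - h_joint             # mutual information
    vi = h_joint - mi                       # variation of information,
                                            #   H(K|K^) + H(K^|K)
    return delta_h / h_u, 2 * mi / h_u, vi / h_u

On the $VI'_{P_{K\hat K}} = 0$ side the third coordinate vanishes: all the entropy not lost to the divergence from uniformity has been transferred as mutual information.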
To illustrate the framework with reproducibility in mind, we have chosen publicly available datasets and classifier implementations. In particular, the former include human performance tasks such as the segmented-numerals digits, Morse codes and odorant confusion, the KDD cup 99 on intrusion detection, and well-known UCI examples. For the classifiers, we have used some from the contestants in the KDD cup 99 and from the pool of those readily available in Weka [48], extending the list that [32, Section 3.2] used to include trees.RandomForest and rules.ZeroR. The former because it is a quite successful, extensively used type of classifier, and the latter because it receives a quite extreme characterization by the methods proposed herein: no other classifier systematically learns less than it does. Note, however, that our aim was to make the tools collected in this paper available to the community and not really to evaluate specific classifiers or datasets.

Note also that the ET, EMA, and NIT factor are distribution-agnostic devices and measures in the sense that they deal with imbalanced and balanced data in the same way, unlike previous work that specifically corrects for imbalance [14], [15], [16]. Imbalance in the task manifests itself in a lower $k_K$—entailing an easier classification task than if it were balanced16—and in a higher EMA-penalizing factor—which means a heavier correction on standard accuracy.

16. In the sense that it is easier to obtain a higher number of correctly classified samples by concentrating the learning capabilities of the machine in the majority classes.
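The effect on $k_K$ is plain arithmetic; a quick sketch, assuming a merely illustrative 90/10 prior:

import numpy as np

# Perplexity k_K = 2^{H(P_K)} of a balanced versus an imbalanced binary prior.
H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))

print(2.0 ** H(np.array([0.5, 0.5])))  # 2.00: both classes fully in play
print(2.0 ** H(np.array([0.9, 0.1])))  # ~1.38: effectively fewer classes
                                       #        to tell apart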
Still rarely—but increasingly—there are classification tasks with a high number of classes, e.g., image classification. Since the sizes of the datasets have a great impact on the adequacy of the probability estimators, this question should be revisited in future work.

The field of information theory for classification is really vast. Unlike other works which propose a measure of performance, we are proposing a framework to understand and sustain such measures. It would be a good exercise for a review paper to try to cast previously proposed measures in our framework, but this is also intended for future efforts.

Finally, many of those measures refer to non-classification tasks [37], [50] or to the measurement of multivariate information [50]. These are matters we are already trying to integrate in this framework.

ACKNOWLEDGMENTS

This work was partly supported by the Spanish Ministry of Economy & Competitiveness projects TEC2014-53390-P and TEC2017-84395-P.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell Syst. Tech. J., vol. XXVII, no. 3, pp. 379–423, 1948.
[2] C. E. Shannon, "A mathematical theory of communication," The Bell Syst. Tech. J., vol. XXVII, no. 3, pp. 623–656, 1948.
[3] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, Sep. 2003.
[4] L. Brillouin, Science and Information Theory, 2nd ed. New York, NY, USA: Academic Press, 1962.
[5] J. C. Principe, Information Theoretic Learning. New York, NY, USA: Springer, 2010.
[6] E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[7] F. J. Valverde-Albacete and C. Peláez-Moreno, "Two information-theoretic tools to assess the performance of multi-class classifiers," Pattern Recognit. Lett., vol. 31, no. 12, pp. 1665–1671, 2010.
[8] M. Zhou, Z. Tian, K. Xu, X. Yu, and H. Wu, "Theoretical entropy assessment of fingerprint-based Wi-Fi localization accuracy," Expert Syst. Appl., vol. 40, no. 15, pp. 6136–6149, 2013.
[9] W. Rödder, D. Brenner, and F. Kulmann, "Entropy based evaluation of net structures-deployed in social network analysis," Expert Syst. Appl., vol. 41, no. 17, pp. 7968–7979, 2014.
[10] T. Chen, Y. Jin, X. Qiu, and X. Chen, "A hybrid fuzzy evaluation method for safety assessment of food-waste feed based on entropy and the analytic hierarchy process methods," Expert Syst. Appl., vol. 41, no. 16, pp. 7328–7337, 2014.
[11] F. J. Valverde-Albacete and C. Peláez-Moreno, "100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox," PLoS One, vol. 9, no. 1, pp. 1–10, Jan. 2014.
[12] C. F. Hempelmann, U. Sakoglu, V. P. Gurupur, and S. Jampana, "An entropy-based evaluation method for knowledge bases of medical information systems," Expert Syst. Appl., vol. 46, pp. 262–273, 2016.
[13] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[14] V. García, R. A. Mollineda, and J. S. Sánchez, "A bias correction function for classification performance assessment in two-class imbalanced problems," Knowl.-Based Syst., vol. 59, pp. 66–74, Mar. 2014.
[15] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics," Inf. Sci., vol. 250, pp. 113–141, Nov. 2013.
[16] N. Tomašev and D. Mladenić, "Class imbalance and the curse of minority hubs," Knowl.-Based Syst., vol. 53, pp. 157–172, 2013.
[17] C. Peláez-Moreno, A. I. García-Moral, and F. J. Valverde-Albacete, "Analyzing phonetic confusions using Formal Concept Analysis," J. Acoustical Soc. Amer., vol. 128, no. 3, pp. 1377–1390, Sep. 2010.
[18] F. J. Valverde-Albacete and C. Peláez-Moreno, "The evaluation of data sources using multivariate entropy tools," Expert Syst. Appl., vol. 78, pp. 145–157, 2017.
[19] T. M. Cover, Elements of Information Theory. Hoboken, NJ, USA: Wiley, 2006.
[20] H. N. Wright, "Characterization of olfactory dysfunction," Archives Otolaryngology, vol. 113, no. 2, pp. 163–168, Feb. 1987.
[21] D. Baxter and B. Keiser, "A speech channel evaluation divorced from talker-listener influence," IEEE Trans. Commun. Technol., vol. CT-14, no. 2, pp. 101–113, Jul. 1966.
[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communication. Cambridge, MA, USA: MIT Press, Jan. 1961.
[23] M. Wang and R. Bilger, "Consonant confusions in noise: A study of perceptual features," J. Acoustical Soc. Amer., vol. 54, no. 5, pp. 1248–1266, Jan. 1973.
[24] M. Wang, C. Reed, and R. Bilger, "A comparison of the effects of filtering and sensorineural hearing loss on patterns of consonant confusions," J. Speech Hearing Res., vol. 21, pp. 5–36, Jan. 1978.
[25] B. Mirkin, Clustering for Data Mining. A Data Recovery Approach. London, U.K.: Chapman & Hall, 2005.
[26] V. Pawlowsky-Glahn, J. J. Egozcue, and R. Tolosana-Delgado, Modeling and Analysis of Compositional Data. Chichester, U.K.: Wiley, Feb. 2015.
[27] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA, USA: MIT Press, 1997.
[28] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations. Berlin, Germany: Springer, 1999.
[29] F. J. Valverde-Albacete and C. Peláez-Moreno, "Towards a generalisation of formal concept analysis for data mining purposes," in Proc. Int. Conf. Formal Concept Anal., Dec. 2006, vol. LNAI 3874, pp. 161–176.
[30] F. J. Valverde-Albacete and C. Peláez-Moreno, "Extending conceptualisation modes for generalised Formal Concept Analysis," Inf. Sci., vol. 181, pp. 1888–1909, May 2011.
[31] F. J. Valverde-Albacete, J. M. González-Calabozo, A. Peñas, and C. Peláez-Moreno, "Supporting scientific knowledge discovery with extended, generalized formal concept analysis," Expert Syst. Appl., vol. 44, pp. 198–216, 2016.
[32] N. Japkowicz and M. Shah, Evaluating Learning Algorithms: A Classification Perspective. Cambridge, U.K.: Cambridge Univ. Press, Jan. 2011.
[33] A. I. García-Moral, R. Solera-Ureña, C. Peláez-Moreno, and F. Díaz-de-María, "Data balancing for efficient training of hybrid ANN/HMM automatic speech recognition systems," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 3, pp. 468–481, Mar. 2011.
[34] A. Lovitt and J. Allen, "50 years later: Repeating Miller-Nicely 1955," in Proc. 9th Int. Conf. Spoken Lang. Process., Jan. 2006, pp. 2154–2157.
[35] B. Mirkin, Mathematical Classification and Clustering, vol. 11. Norwell, MA, USA: Kluwer Academic Publishers, 1996.
[36] A. Ben-David, "A lot of randomness is hiding in accuracy," Eng. Appl. Artif. Intell., vol. 20, no. 7, pp. 875–885, 2007.
[37] M. Meilă, "Comparing clusterings—an information based distance," J. Multivariate Anal., vol. 28, pp. 875–893, 2007.
[38] G. Brown, "A new perspective for information theoretic feature selection," in Proc. 12th Int. Conf. Artif. Intell. Statist., 2009, pp. 49–56.
[39] B.-G. Hu, "What are the differences between Bayesian classifiers and mutual-information classifiers?" IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 249–264, Feb. 2014.
[40] D. Mejía-Navarrete, A. Gallardo-Antolín, C. Peláez-Moreno, and F. J. Valverde-Albacete, "Feature extraction assessment for an acoustic-event classification task using the entropy triangle," in Proc. 12th Annu. Conf. Int. Speech Commun. Assoc., 2011, pp. 309–312.
[41] G. Keren and S. Baggen, "Recognition models of alphanumeric characters," Perception Psychophysics, vol. 29, pp. 234–246, 1981.
[42] E. Rothkopf, "A measure of stimulus similarity and errors in some paired-associate learning tasks," J. Exp. Psychology, vol. 53, no. 2, pp. 94–101, 1957.
[43] J. M. González-Calabozo, F. J. Valverde-Albacete, and C. Peláez-Moreno, "Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis," BMC Bioinf., vol. 17, no. 1, pp. 1–15, 2016.
[44] D. B. Kurtz, P. R. Sheehe, P. F. Kent, T. L. White, D. E. Hornung, and H. N. Wright, "Odorant quality perception: A metric individual differences approach," Perception Psychophysics, vol. 62, no. 5, pp. 1121–1129, Jul. 2000.
[45] L. Secundo, K. Snitz, and N. Sobel, "The perceptual logic of smell," Current Opinion Neurobiology, vol. 25, pp. 107–115, Apr. 2014.
[46] F. J. Valverde-Albacete, J. C. de Albornoz, and C. Peláez-Moreno, "A proposal for new evaluation metrics and result visualization technique for sentiment analysis tasks," in Proc. Int. Conf. Cross-Lang. Eval. Forum Eur. Lang., 2013, pp. 41–52.
[47] K. Bache and M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[48] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," ACM SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009.
[49] B. Venkata Ramana, M. S. P. Babu, and N. B. Venkateswarlu, "A critical study of selected classification algorithms for liver disease diagnosis," Int. J. Database Manage. Syst., vol. 3, no. 2, pp. 101–114, May 2011.
[50] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Comput. Appl., vol. 24, no. 1, pp. 175–186, Jan. 2014.

Francisco J. Valverde-Albacete received the Eng. degree in telecommunications from the Universidad Politécnica de Madrid, in 1992, and the DEng degree in telecommunications from Universidad Carlos III de Madrid, in 2002. He has been a researcher with the Computer Science Dept., UNED, Madrid, and an associate professor with the Signal Theory and Communications Dept., Universidad Carlos III. He has also visited the University of Strathclyde, United Kingdom, the University of Trento, Italy, and the International Computer Science Institute (ICSI), Berkeley. At present, he is a researcher with the Dept. de Teoría de la Señal y de las Comunicaciones, Universidad Carlos III de Madrid, Spain. He has published more than 60 papers in applied maths, cognitive, speech and language processing, machine learning, and data mining. He is a member of the IEEE Computational Intelligence Society, the ACM, and the IEEE.

Carmen Peláez-Moreno received the telecommunication Eng. degree from the Public University of Navarre, in 1997, and the PhD degree from the University Carlos III of Madrid, in 2002. Her PhD thesis was awarded the 2002 Best Doctoral Thesis Prize from the Spanish Official Telecom. Eng. Association (COIT-AEIT). From March to Dec. 2004, she participated in the International Computer Science Institute's (ICSI, Berkeley, CA) Fellowship Program. Since Nov. 2009, she has been an associate professor with the Dept. of Signal Theory & Communications, University Carlos III of Madrid. Her research interests include speech recognition and perception, multimedia processing, machine learning, and data analysis, and she has co-authored more than 70 papers. She is a member of the IEEE Signal Processing Society and the IEEE.