A Framework For Supervised Classification Performance Analysis
Abstract—We introduce a framework for the evaluation of multiclass classifiers by exploring their confusion matrices. Instead of using error-counting measures of performance, we concentrate on quantifying the information transfer from true to estimated labels using information-theoretic measures. First, the Entropy Triangle allows us to visualize the balance of mutual information, variation of information, and the deviation from uniformity in the true and estimated label distributions. Next, the Entropy-Modified Accuracy allows us to rank classifiers by performance, while the Normalized Information Transfer rate allows us to evaluate classifiers by the amount of information accrued during learning. Finally, if the question arises of which errors are systematically committed by the classifier, we use a generalization of Formal Concept Analysis to elicit such knowledge. All these techniques can be applied either to artificially or biologically embodied classifiers—e.g., human performance on perceptual tasks. We instantiate the framework in a number of examples to provide guidelines for the use of these tools in the case of assessing single classifiers or populations of them—whether induced with the same technique or not—either on single tasks or on a set of them. These include well-known UCI tasks and the more complex KDD Cup 99 competition on intrusion detection.
Index Terms—Performance evaluation, classification algorithms, information entropy, mutual information, formal concept analysis
Fig. 1. Information-theoretic model of supervised multiclass classification. $K$ is the true class label, $\hat{K}$ the estimated class label. $X$ are the observations and $Y$ the transformed observations, when applicable. (a) (adapted from [18]) Channel model. The enclosed area is a black box for classifier evaluation. (b) (adapted from [11]) Entropy (above) and perplexity (below) decomposition chains for $P_{K\hat{K}}$. To the left, perplexity reduction due to learning effectiveness; to the right, perplexity increase in the output chain, related to classifier specialization.
techniques on different tasks. Moreover, the complexity of the algorithms involved only depends on the number of classes, so the framework is scalable to big databases.

We first motivate and establish our proposal for the assessment and evaluation of classifiers through their confusion matrices and classification errors (Section 2.1), and then use the rest of the paper to flesh out and show how to use this proposal. For this purpose, we recall the standard way to interpret and represent a confusion matrix (Section 2.2), and the basics of analyzing a confusion matrix by information-theoretic means: first by visualization (Section 2.3), then by assessment (Section 2.4), and finally by exploratory analysis (Section 2.5). A related-work review with a discussion (Section 2.6) completes the theoretical contribution. This methodology is then applied to different use cases (Section 3) as a ready guide for each intended use: the assessment of single classifiers on single tasks (Section 3.1), several classifiers on the same task (Section 3.2), a single classifier induction technique on different tasks (Section 3.3), and several classifier induction techniques on different tasks (Section 3.4). We finish with some Conclusions.
2 AN INFORMATION-THEORETIC FRAMEWORK FOR CLASSIFIER ASSESSMENT

2.1 A Proposal for the Information-Theoretic Assessment of Classifiers on Datasets

Consider the scheme of Fig. 1a of a multi-class classification task cast as a transmission channel that renders it amenable to information-theoretic analysis. There is a set of $S$ independent, identically distributed (iid) realizations of a random vector $X$ of (observed) variables or features, paired with as many iid realizations of a class variable $K$. The set of pairs of instances $\{(k_s, \mathbf{x}_s)\}_{s=1}^{S}$ will be called a dataset (of samples). The feature instances $X = \mathbf{x}_s$ may be further transformed to obtain instances of a random vector $Y$, through a transformation function $f: \mathcal{X} \to \mathcal{Y},\; \mathbf{x}_s \mapsto \mathbf{y}_s = f(\mathbf{x}_s)$ with desired characteristics, e.g., statistical independence among the transformed features.

For supervised classification, classifier induction is the subtask of inducing a function $\hat{k}: \mathcal{Y} \to \mathcal{K},\; \mathbf{y}_s \mapsto \hat{k}_s = \hat{k}(\mathbf{y}_s)$ that tries to estimate the original $K = k_s$ but can only obtain the estimate $\hat{K} = \hat{k}(\mathbf{y}_s)$. In this model, classification is a multi-step process: given an input label $K = k_s$, its co-indexed vector is produced as a proxy $X = \mathbf{x}_s$, perhaps the proxy is transformed into $Y = \mathbf{y}_s$, and, finally, the classifier obtains an output label $\hat{K} = \hat{k}_s$.

In this sense, such a classification model is an analogue of a (random) discrete, memoryless virtual communication channel between input labels $K$ and output labels $\hat{K}$, and such models of communication are the turf of information theory [19]. Inducing this classifier function can be the object of information theory, as in Information-Theoretic Learning [5] or Maximum Entropy Modeling [6], but that is not the focus of this paper. Rather, for assessment purposes, the basic conceptual experiment is: "presenting a true label $k_i$ to the channel to obtain an estimated label $\hat{k}_j$ from it," that is, generating data $(K = k_i, \hat{K} = \hat{k}_j)$ from a joint distribution $P_{K\hat{K}}(i, j)$. We suppose that there are $k$ true labels and $k'$ possible estimated labels, and—though not strictly necessary—that $k = k'$ for this application.

The properties of $P_{K\hat{K}}$ can be studied with the methods of signal detection theory, multidimensional scaling or cluster analysis, among others [20]. In this paper we focus on the methods of information theory [19]. For instance, it was soon recognized that the joint distribution $P_{K\hat{K}}(i, j)$ needs a correction to account for random confusions [21]. The original proposal was to compare it to the product of the marginals, effectively comparing the distribution to that obtained supposing its marginals $P_K(i)$ and $P_{\hat{K}}(j)$ were independent. This is exactly the definition of the pointwise mutual information between $K$ and $\hat{K}$ [22, Section 2.3]

$$MI_{P_{K\hat{K}}}(i, j) = \log \frac{P_{K\hat{K}}(i, j)}{P_K(i)\, P_{\hat{K}}(j)}. \qquad (1)$$

Its expected value $MI_{P_{K\hat{K}}} = \sum_{i,j} P_{K\hat{K}}(i, j)\, MI_{P_{K\hat{K}}}(i, j)$ is the mutual information between $K$ and $\hat{K}$ and has sometimes been used to quantify the amount of transinformation in classification experiments [23], [24], using a simple evaluation heuristic: the more information transmitted from $K$ to $\hat{K}$, the better the classifier.

In this paper we propound a framework for the evaluation of classifiers based on the exploration of the
transmission of information along the steps of the virtual chain depicted in Fig. 1b. As in all good exploratory practices, a number of tools should be brought to bear to sustain the conclusions, so we suggest a many-pronged approach to classifier assessment:

(1) Estimating $P_{K\hat{K}}(i, j)$ and $MI_{P_{K\hat{K}}}(i, j)$ by means of iterated experiments (Section 2.2).
(2) Visualizing an information-theoretic characterization of $P_{K\hat{K}}(i, j)$ (Section 2.3).
(3) Characterizing and ranking classifiers using $MI_{P_{K\hat{K}}}$ (Section 2.4).
(4) Exploring an estimate of the pointwise mutual information $MI_{P_{K\hat{K}}}(k_i, \hat{k}_j)$ to analyze classifier errors (Section 2.5).

We believe this combination of exploratory assessment methods can provide a more complete diagnosis of classifiers and datasets, especially when different combinations of them are used. Next we explain the concepts and tools underlying it.
2.2 Estimating $P_{K\hat{K}}$ and $MI_{K\hat{K}}$ for Classifiers

To estimate $P_{K\hat{K}}$, the results of the $S$ iid realizations are aggregated into a CM, $N_{K\hat{K}} \in \mathbb{N}^{k \times k'}$, where $N_{K\hat{K}}(i, j) = \sum_{s=1}^{S} \delta(k_i^s, \hat{k}_j^s)$—with $\delta$ the Kronecker delta over true and estimated labels. A confusion matrix is, therefore, a particular kind of contingency table for a classification process [25, p. 51], the contingency being that $\hat{k}_j$ was returned as a response to $k_i$.

For a count matrix $N_{K\hat{K}}$ call $n_{ij} = N_{K\hat{K}}(i, j)$, $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$. Note that in our setting,¹ $S = \sum_{ij} n_{ij}$. With these data, the empirical estimate or maximum-likelihood estimator of the joint probability distribution between inputs and outputs is $\hat{P}_{K\hat{K}}(i, j) = \frac{n_{ij}}{S}$, and its marginals are $\hat{P}_K(i) = \sum_j \frac{n_{ij}}{S}$ and $\hat{P}_{\hat{K}}(j) = \sum_i \frac{n_{ij}}{S}$.

In turn these estimates are plugged into (1) to give an estimate of the empirical pointwise mutual information $\widehat{MI}_{P_{K\hat{K}}}(i, j)$ and its expectation $\widehat{MI}_{P_{K\hat{K}}}$, the empirical mutual information.
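To make the estimation step concrete, the following sketch computes the maximum-likelihood estimates and the empirical (pointwise) mutual information of Equation (1) from a count matrix. It is only an illustration of the formulas above: the confusion matrix and all names in it are made up, and it is not the authors' released software.

```python
import numpy as np

def empirical_estimates(N):
    """ML estimates and empirical (pointwise) MI from a confusion matrix N (true x estimated)."""
    N = np.asarray(N, dtype=float)
    S = N.sum()                                  # total number of classification acts
    P = N / S                                    # joint estimate: P^(i, j) = n_ij / S
    p_k, p_kh = P.sum(axis=1), P.sum(axis=0)     # marginals of true and estimated labels
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(P / np.outer(p_k, p_kh))   # Eq. (1), in bits
    mi = float(np.nansum(np.where(P > 0, P * pmi, 0.0)))  # expected PMI = empirical MI
    return P, p_k, p_kh, pmi, mi

# A made-up 3-class confusion matrix: rows are true labels, columns estimated labels.
N = np.array([[40,  5,  5],
              [ 4, 42,  4],
              [ 6,  6, 38]])
P, p_k, p_kh, pmi, mi = empirical_estimates(N)
print(f"empirical mutual information: {mi:.3f} bits")
```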
2.3 Visualizing Classifiers with Entropy Triangles

For the joint probability distribution $P_{K\hat{K}}$ of any two random variables $K$ and $\hat{K}$, Valverde-Albacete et al. [7] gave a new Shannon-type entropy decomposition²

$$H_{U_K U_{\hat{K}}} = \Delta H_{P_K P_{\hat{K}}} + 2\, MI_{P_{K\hat{K}}} + VI_{P_{K\hat{K}}}, \qquad (2)$$

where $H_{P_K} = -\sum_i P_K(i) \log P_K(i)$ is the entropy of distribution $P_K(i)$, $U_K$ (respectively, $U_{\hat{K}}$) is a uniform distribution with the same support as $P_K$ (respectively, as $P_{\hat{K}}$) and $U_K U_{\hat{K}}$ is their product, so their entropies are

$$H_{U_K U_{\hat{K}}} = H_{U_K} + H_{U_{\hat{K}}}, \qquad (3)$$

also $\Delta H_{P_K} = H_{U_K} - H_{P_K}$ and $\Delta H_{P_{\hat{K}}} = H_{U_{\hat{K}}} - H_{P_{\hat{K}}}$ are the divergences of $P_K$ and $P_{\hat{K}}$ with respect to uniformity, so

$$\Delta H_{P_K P_{\hat{K}}} = \Delta H_{P_K} + \Delta H_{P_{\hat{K}}}, \qquad (4)$$

the mutual information between the random variables is $MI_{P_{K\hat{K}}}$, and the variation of information is

$$VI_{P_{K\hat{K}}} = H_{P_{K|\hat{K}}} + H_{P_{\hat{K}|K}}. \qquad (5)$$

Equation (2) can be further normalized in $H_{U_K U_{\hat{K}}}$ as

$$1 = \Delta H'_{P_K P_{\hat{K}}} + 2\, MI'_{P_{K\hat{K}}} + VI'_{P_{K\hat{K}}}, \qquad (6)$$

meaning that entropy balances are compositional data [26] that can be represented in a de Finetti or ternary diagram as the equation of the 2-simplex in normalized space $[\Delta H'_{P_K P_{\hat{K}}}, 2\, MI'_{P_{K\hat{K}}}, VI'_{P_{K\hat{K}}}]$, hence the name entropy triangle (ET). See [7] for further information on this construction.

In our framework, the performance of a particular classifier on a particular dataset characterized by $P_{K\hat{K}}$ shows as a point in an entropy triangle whose entropic components can be read off the axes (see Supplemental Materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2019.2915643, for indications about how to read an ET). Furthermore, several different classifiers on the same task are visually comparable in terms of the information they transmit from $K$ to $\hat{K}$. Due to normalization, this is even true for the same classifier on different datasets. Indeed, it is also true for comparing different classifiers on different tasks. In each of these instances, though, the interpretation becomes more nuanced and subject to provisos: possibly different cardinalities of the datasets, underlying technologies and parameters for the classifiers, etc. This makes it at the same time easier to carry out cross-comparisons of classifiers on actual datasets, and more difficult to interpret the results.

In this paper, we introduce a new result that enhances the interpretation of the triangle: every dataset has an inherent theoretical limit for performance. To define it we need to start from the so-called split entropy triangle [7]: due to (3), (4) and (5) the balance Equation (2) can be split into two equations related each to one of the marginals,

$$H_{U_K} = \Delta H_{P_K} + MI_{P_{K\hat{K}}} + H_{P_{K|\hat{K}}} \qquad H_{U_{\hat{K}}} = \Delta H_{P_{\hat{K}}} + MI_{P_{K\hat{K}}} + H_{P_{\hat{K}|K}}, \qquad (7)$$

which may be normalized in each of $H_{U_K}$ and $H_{U_{\hat{K}}}$,

$$1 = \Delta H'_{P_K} + MI'_{P_{K\hat{K}}} + H'_{P_{K|\hat{K}}} \qquad 1 = \Delta H''_{P_{\hat{K}}} + MI''_{P_{K\hat{K}}} + H''_{P_{\hat{K}|K}}.$$
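The balance of Equation (6) is straightforward to compute. Below is a minimal sketch of the three normalized ET coordinates for a joint distribution, reusing the empirical_estimates helper sketched in Section 2.2; it is an illustration only, not the authors' published ET software.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as a nonnegative vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def et_coordinates(P):
    """Normalized coordinates [DH', 2MI', VI'] of Eq. (6); they add up to 1."""
    p_k, p_kh = P.sum(axis=1), P.sum(axis=0)
    # Uniform entropies of Eq. (3); this sketch assumes full support of both marginals.
    H_Uk, H_Ukh = np.log2(P.shape[0]), np.log2(P.shape[1])
    H_k, H_kh, H_joint = entropy(p_k), entropy(p_kh), entropy(P.ravel())
    MI = H_k + H_kh - H_joint                 # mutual information
    VI = H_joint - MI                         # H_{K|K^} + H_{K^|K}, Eq. (5)
    DH = (H_Uk - H_k) + (H_Ukh - H_kh)        # divergence from uniformity, Eq. (4)
    total = H_Uk + H_Ukh                      # H_{U_K U_K^}
    return DH / total, 2 * MI / total, VI / total

coords = et_coordinates(P)                    # P from the Section 2.2 sketch
print(coords, "sum =", sum(coords))           # a point on the 2-simplex
```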
Notice that for a particular dataset, $\Delta H_{P_K}$ is a constant across classifiers. In the context of this paper, we find it useful to represent it as a line from $[\Delta H'_{P_K}, 0, 1 - \Delta H'_{P_K}]$ on the bottom side of the ET, to $[\Delta H'_{P_K}, 1 - \Delta H'_{P_K}, 0]$ on the right side of the triangle. This line constrains the amount of mutual information transferred by the classifier to this last quantity, and establishes a limit on the learning capability of any classifier on that task as represented by the mutual information.
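This limit depends only on the input marginal. The sketch below computes it (assuming, as in the earlier snippet, that $U_K$ is uniform over all $k$ classes), anticipating the NIT factor of Section 2.4; the example distribution is made up.

```python
import numpy as np

def dataset_limit(p_k):
    """Constant-DH'_{P_K} line of a task and the performance ceiling it implies."""
    k = len(p_k)
    H_Uk = np.log2(k)                           # assumes U_K uniform over all k classes
    H_k = -np.sum(p_k[p_k > 0] * np.log2(p_k[p_k > 0]))
    dh = (H_Uk - H_k) / H_Uk                    # normalized divergence DH'_{P_K}
    bottom = (dh, 0.0, 1.0 - dh)                # endpoint on the bottom side of the ET
    right = (dh, 1.0 - dh, 0.0)                 # endpoint on the right side
    mi_max = (1.0 - dh) * H_Uk                  # most MI (bits) any classifier can transfer
    nit_max = 2 ** mi_max / k                   # ceiling on the NIT factor of Section 2.4
    return bottom, right, mi_max, nit_max

print(dataset_limit(np.array([0.5, 0.3, 0.2])))  # a made-up imbalanced 3-class task
```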
2.4 Ranking Classifiers with the Entropy-Modified Accuracy and the Normalized Information Transfer

Simple figures-of-merit are sometimes preferable to complex diagrams. Given the aforementioned unsuitability of classical metrics—such as accuracy or error rate—especially in imbalanced datasets [13], we next obtain figures-of-merit particularly meaningful for our purposes, as introduced in [11].³

We may write the split entropy in (7) in multiplicative form

$$2^{H_{U_K}} = 2^{\Delta H_{P_K}}\, 2^{MI_{P_{K\hat{K}}}}\, 2^{H_{P_{K|\hat{K}}}} \qquad (8)$$
$$2^{H_{U_{\hat{K}}}} = 2^{\Delta H_{P_{\hat{K}}}}\, 2^{MI_{P_{K\hat{K}}}}\, 2^{H_{P_{\hat{K}|K}}}. \qquad (9)$$

Recall that, in language modeling [27], the perplexity of a predictive distribution $P_{X|Z}$—where $Z$ is the language-modeling context—is the apparent number of options available for prediction in case they were equally likely, $k_{P_{X|Z}} = 2^{H_{P_{X|Z}}}$, where $H_{P_{X|Z}}$ is the conditional entropy of the prediction. Now for the uniform input distribution $U_K$ entropy is maximal and therefore $k = 2^{H_{U_K}}$, and likewise $k' = 2^{H_{U_{\hat{K}}}}$, for datasets with $k$ true labels and (possibly different) $k'$ estimated labels.

Since classification acts are iid (see Section 2), the previous paragraph suggests that (8) is the decomposition of the theoretical perplexity $k$ of a classifier. To arrive at this, first call $\delta_K = 2^{\Delta H_{P_K}}$ and $\delta_{\hat{K}} = 2^{\Delta H_{P_{\hat{K}}}}$, to quantify the decrease in perplexity due to the non-uniformity of the marginals. Then, the quantity $\mu_{K\hat{K}} = 2^{MI_{P_{K\hat{K}}}}$ has the interpretation of an information transfer rate, and since $H_{P_{K|\hat{K}}}$ has the interpretation of a remanent entropy—that is, information not transferred to the output—then $k_{K|\hat{K}} = 2^{H_{P_{K|\hat{K}}}}$ can be interpreted as a remanent perplexity, the actual number of equally probable label alternatives for the classifier after the learning process. Note that the perplexity and the remanent perplexity can be dually defined for the output distribution,

$$k = \delta_K\, \mu_{K\hat{K}}\, k_{K|\hat{K}} \qquad k' = \delta_{\hat{K}}\, \mu_{K\hat{K}}\, k_{\hat{K}|K}, \qquad (10)$$

whence Fig. 1b represents both the transmission of entropy from the input distribution of true labels to the output distribution of estimated labels and the perplexity reduction throughout this process. From (10) we get $\frac{1}{k_{K|\hat{K}}} = \delta_K \frac{\mu_{K\hat{K}}}{k}$, which suggests the definition of the Entropy-Modified Accuracy (EMA),

$$a'(P_{K\hat{K}}) = \frac{1}{k_{K|\hat{K}}} = \frac{1}{2^{H_{P_{K|\hat{K}}}}}, \qquad \frac{1}{k} \le a'(P_{K\hat{K}}) \le 1, \qquad (11)$$

and the Normalized Information Transfer (NIT) factor, or rate,

$$q(P_{K\hat{K}}) = \frac{\mu_{K\hat{K}}}{k} = \frac{2^{MI_{P_{K\hat{K}}}}}{k}, \qquad \frac{1}{k} \le q(P_{K\hat{K}}) \le a'(P_{K\hat{K}}) \le 1, \qquad (12)$$

whence we get the EMA-NIT equation

$$a'(P_{K\hat{K}}) = \delta_K\, q(P_{K\hat{K}}). \qquad (13)$$

The interpretation of these quantities is clear:

- The EMA $a'(P_{K\hat{K}})$ is the expected proportion of times the classifier will guess the output class correctly.
- The NIT factor $q(P_{K\hat{K}})$ is the proportion of available information transferred from input to output.
- The $\delta_K$ is a factor that constrains how much information is available for learning.

As such, the EMA can be used to provide a ranking to compare classifiers by their performance in a task, whereas the NIT factor provides an estimate of how efficient the classifier induction process was. Likewise, when the task is balanced across classes, $P_K = U_K$, then $\Delta H_{P_K} = 0$, whence $\delta_K = 1$, so $a'(P_{K\hat{K}}) = q(P_{K\hat{K}})$ and all the (entropy-modulated) accuracy of the classifier comes from the learning process.

For the same reasons, the perplexity of the task is $k_K = 2^{H_{P_K}} = \mu_{K\hat{K}}\, k_{K|\hat{K}}$, the apparent number of equiprobable classes in the task as embodied in the input distribution, an important quantity to assess the potential performance of a classifier [11], since

$$k = \delta_K\, k_K \qquad k' = \delta_{\hat{K}}\, k_{\hat{K}}.$$

Further insights into EMA, the NIT factor and perplexity are described in [11].
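These figures-of-merit take only a few lines to compute. Below is a sketch using the entropy helper from the Section 2.3 snippet; again an illustration of Equations (10)-(13), not the released toolboxes.

```python
import numpy as np

def ema_nit(P):
    """Remanent perplexity, EMA a' and NIT factor q of Eqs. (11)-(13)."""
    k = P.shape[0]
    p_k, p_kh = P.sum(axis=1), P.sum(axis=0)
    H_k, H_kh, H_joint = entropy(p_k), entropy(p_kh), entropy(P.ravel())
    mu = 2 ** (H_k + H_kh - H_joint)        # information transfer rate, 2^MI
    delta_k = 2 ** (np.log2(k) - H_k)       # perplexity decrease from non-uniformity
    k_rem = 2 ** (H_joint - H_kh)           # remanent perplexity k_{K|K^} = 2^{H_{K|K^}}
    ema, nit = 1.0 / k_rem, mu / k          # Eqs. (11) and (12)
    assert np.isclose(ema, delta_k * nit)   # the EMA-NIT equation, Eq. (13)
    return k_rem, ema, nit

k_rem, ema, nit = ema_nit(P)                # P from the Section 2.2 sketch
print(f"remanent perplexity {k_rem:.3f}, EMA {ema:.3f}, NIT {nit:.3f}")
```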
2.5 Exploring Errors with Generalized Formal Concept Analysis

Questions often addressed about confusion matrices are:

- Is there a sensible way to cluster classification events? Are there any subclusters or a hierarchical organization of such clusters?
- Can we reorder the columns and rows so that these clusters and relations are better made evident?
- Are there any row- or column-transformations that provide insights into the underlying confusions?

We contend that Formal Concept Analysis (FCA) [28] and, specifically, a generalization of it for non-binary matrices called KFormal Concept Analysis (KFCA)⁴ [29], [30] can help answer these questions.

In FCA, $I \in 2^{k \times k'}$ is interpreted as a binary confusion matrix between a set of real classes $K$ (i.e., the domain of the possible outcomes of the random variable $K$) and a set of predicted or recognised classes $\hat{K}$ (i.e., the domain of the possible outcomes of the random variable $\hat{K}$), where $I(i, j) = 1$ (or likewise $I^t(j, i) = 1$) can be interpreted in many different ways:
- (input) class $k_i$ is recognized as (output) class $\hat{k}_j$, or class $k_i$ is classified as $\hat{k}_j$, or class $k_i$ is confused with $\hat{k}_j$, or class $\hat{k}_j$ is predicted of $k_i$,
- class $k_i$ is read as $\hat{k}_j$, class $\hat{k}_j$ is read for $k_i$,
- class $k_i$ is substituted by class $\hat{k}_j$, class $\hat{k}_j$ is substituted for class $k_i$.

The triple $\mathbb{K} = (K, \hat{K}, I)$ is called a formal context and summarizes all the available information about $K$, $\hat{K}$ and $I$. In Formal Concept Analysis one specially studies two mappings between the powersets of true and estimated classes, called the polars: given a set of true classes $A \subseteq K$, the estimated class polar $A^{\uparrow_I} = \{\hat{k}_j \in \hat{K} \mid I(i, j) = 1,\; \forall k_i \in A\}$ is the set of estimated classes that the classifier predicted from $A$, and the true class polar $B^{\downarrow_I} = \{k_i \in K \mid I(i, j) = 1,\; \forall \hat{k}_j \in B\}$ is the set of true classes that are confused with a given set of estimated classes $B \subseteq \hat{K}$. Specifically,

$$\hat{k}_j \in A^{\uparrow_I} \iff \hat{k}_j \text{ is predicted of every } k_i \in A$$
$$k_i \in B^{\downarrow_I} \iff k_i \text{ is confused with every } \hat{k}_j \in B.$$

Pairs of sets of objects $A \in 2^K$ and attributes $B \in 2^{\hat{K}}$ that map to each other, $A^{\uparrow_I} = B$ and $B^{\downarrow_I} = A$, are called formal concepts. For a concept $(A, B)$, the set of objects $A$ is called its extent while the set of attributes $B$ is its intent, and the set of formal concepts is written $\mathfrak{B}(K, \hat{K}, I)$:

$$(A, B) \in \mathfrak{B}(K, \hat{K}, I) \iff A^{\uparrow_I} = B \iff B^{\downarrow_I} = A.$$
Formal concepts $(A_1, B_1), (A_2, B_2) \in \mathfrak{B}(K, \hat{K}, I)$ are partially ordered by the inclusion (resp. reverse inclusion) of extents (resp. intents)

$$(A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 \iff B_2 \subseteq B_1. \qquad (14)$$

With the concept order, the set of formal concepts $\langle \mathfrak{B}(K, \hat{K}, I), \le \rangle$ is actually a complete lattice called the concept lattice (CL) of the formal context $\mathbb{K}$.
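A minimal sketch of the two polars on a boolean context, using plain Python sets; the incidence matrix is made up for illustration:

```python
# Rows index true classes, columns estimated classes; I[i][j] == 1 reads
# "true class k_i is confused with estimated class k^_j".
I = [[1, 1, 0],
     [0, 1, 0],
     [0, 1, 1]]
K, Kh = range(len(I)), range(len(I[0]))

def up(A):
    """Estimated-class polar: columns related to every row in A."""
    return frozenset(j for j in Kh if all(I[i][j] for i in A))

def down(B):
    """True-class polar: rows related to every column in B."""
    return frozenset(i for i in K if all(I[i][j] for j in B))

# (A, B) is a formal concept iff up(A) == B and down(B) == A; closing each
# single true class with down(up(.)) yields the object concepts of the lattice.
object_concepts = {(down(up({i})), up({i})) for i in K}
for extent, intent in object_concepts:
    print(sorted(extent), "->", sorted(intent))
```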
This technique on boolean matrices is used as a stepping stone to develop KFCA, a technique to build concept lattices from multi-valued confusion matrices. We claim that the inclusion order of confusions between true and estimated labels in a CM can be efficiently represented by concept lattices. After the pointwise mutual information defined in Equation (1), the true and estimated labels can be confused to a certain degree $\varphi$. So it makes sense to sweep over the different values of $\varphi$ observed for the data, using each as a threshold to obtain a binary matrix for each value of $\varphi$. In this manner, we obtain a sequence of binary contexts $\{I_\varphi\}$ indexed by $\varphi$, each amenable to FCA:

$$I_\varphi = \big(\widehat{MI}_{P_{K\hat{K}}}(k_i, \hat{k}_j) \ge \varphi\big).$$

A detailed description of exploring contingency matrices with KFCA is presented in [17] for the analysis of human phonetic confusions. In [31] recipes are explained to choose a judicious $\varphi$, so that it can be used as an illustration of the affordances of FCA to support data-induced scientific enquiry and discovery in Gene Expression Data analysis. An example of this is shown in Fig. 3, and a guide on how to read confusions off it can be found in Supplemental Materials, available online.
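A sketch of the $\varphi$-sweep, producing one boolean context per observed PMI level from the pmi matrix of the Section 2.2 snippet:

```python
import numpy as np

def contexts_by_phi(pmi):
    """One boolean incidence I_phi per distinct finite PMI value, highest phi first."""
    phis = np.unique(pmi[np.isfinite(pmi)])[::-1]
    return {float(phi): (pmi >= phi) for phi in phis}

for phi, I_phi in contexts_by_phi(pmi).items():   # pmi from the Section 2.2 sketch
    print(f"phi = {phi:+.2f}: {int(I_phi.sum())} confusions at or above the threshold")
```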
2.6 Related Work and Discussion

The main contribution of this paper is the introduction of an assessment framework using information-theoretic tools to analyze the performance of classifiers, grounded on the metaphor/model of an information-transmitting channel for classification tasks.

The existence of such frameworks and metaphors is capital for the use of assessment techniques, since the metaphors "enable" a way of thinking about the task, while frameworks "prescribe" a way to use metaphors to advantage. These are capital questions for the progress of science, and we believe widely-cited works like [32] are just about proposing and understanding such frameworks. Incidentally, we also adopt the suggestion that the need for evaluation arises from typical classification tasks [32, Section 9.1], which are well exemplified in the archetypal use cases to be presented in the next Section.

Note that a model for motivating information-theoretic assessment procedures is missing in [32]—although individual information measures are not. Indeed, our work could be conceived as fleshing out that framework for assessment based on a completely different model of what a good classifier is, viz. that which is capable of maximizing the mutual information between the real and predicted class labels. This is an alternate, but certainly highly correlated, criterion to that of minimizing classification errors. But in our opinion, the assumptions and heuristics flowing from our hypothesis and their entailments are all new and perspective-opening.

An initial step towards the development of this framework was implicitly taken in [17] by using an exploratory, pointwise mutual information-based technique to elicit the errors in confusion matrices with the help of formal concept analysis [28]. This was greatly expanded in [31] towards a full-fledged exploratory technique whose application possibilities somehow exceed the present ones.

Later, [7], [11] provided the theoretical groundwork for the visualization and assessment of confusion matrices based on information-theoretic concepts. In particular, [11] discusses the suitability of information-theoretic metrics on imbalanced datasets, evaluating synthetically generated multi-class classification problems with progressively increased degrees of imbalance in their classes. These ideas had been previously applied to Automatic Speech Recognition in [33].

There are alternatives to all of the tools and techniques we recommend in this paper, but most have strong objections to being used, as we outline in the following paragraphs.

A detailed confusion matrix is often presented as an output of a machine learning task, either in the form of heatmaps (see an example in Fig. 3b) or merely numerically. Also, behavioral sciences frequently choose this representation to illustrate the results of perception experiments. Those representations do not answer the questions proposed in Section 2.5 satisfactorily, since their exploration is not principled or systematic and is usually guided (and biased) by researcher intuition. The hierarchical ordering of input and output classes as interpreted by the classifier is what is revealed by a concept lattice. Its representation as an order diagram provides an entirely data-driven visualization.

In a classification setting, the optimal classifier's distribution would be diagonal, $\hat{P}_{K\hat{K}}(k_i, \hat{k}_j) = \frac{n_{i\cdot}}{S}\, \delta(i, j)$, and the
worst possible is row-constant, $\hat{P}_{K\hat{K}}(k_i, \hat{k}_j) = \frac{n_{i\cdot}}{k' S}$. But we are at a loss as to how to assess the classifier at a glance from such an estimated distribution for intermediate cases. Direct observation by human experts is usually the way to induce conclusions from these data. Sometimes this involves some sort of block-diagonalization procedure using mostly heuristic rules involving the choice of a particular metric that should be corroborated a posteriori. In [34], for example, a taxicab metric was defined in order to find an optimal ordering: the metric weighs down the off-diagonal values of the CM in proportion to their distance to the diagonal. Another representation, the directed threshold graph, is proposed in [35, p. 52]. Note that these issues are laid to rest by the use of the Entropy Triangle as a plotting technique and its interpretation (see Supplemental Materials, available online).
Custom has it that CMs are evaluated by an error-counting measure like accuracy, which has several problems [36]. For instance, [13] highlights the problem of accuracy on imbalanced sets, and [11] shows how in this case accuracy is grossly overestimated. EMA was devised to correct for this effect.

The balance Equation (2) implies that this mutual information maximization entails the minimization of the variation of information of (5), since the divergence from uniformity $\Delta H_{P_K}$ is an inherent quantity of the marginals of $P_{K\hat{K}}$ and unmodifiable by classifiers. The variation of information was found to be an important quantity for the evaluation of unsupervised clustering in [37]. There it was proven to be a true metric or measure of dissimilarity, which agrees with the intuition that the mutual information is a similarity between distributions. The mutual information maximization process entails an increase in the NIT factor parallel to an increase in EMA.

How classifiers can actually "cheat" in reporting accuracy is by artificially (optimistically) increasing the divergence from uniformity of the predicted labels, $\Delta H_{P_{\hat{K}}}$: this increases accuracy—in a non-generalizable way—at the expense of the variation of information, but leaves the mutual information unchanged, so the NIT factor and EMA are unaffected.
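This effect is easy to reproduce numerically: a majority-class predictor on an imbalanced task reports a flattering accuracy while transferring no information at all. A sketch reusing the Section 2.2 helper (the task distribution is made up):

```python
import numpy as np

p_true = np.array([0.75, 0.15, 0.10])        # a made-up imbalanced 3-class task
P_major = np.outer(p_true, [1.0, 0.0, 0.0])  # every sample predicted as the majority class
accuracy = np.trace(P_major)                 # 0.75: looks good...
_, _, _, _, mi = empirical_estimates(P_major * 1000)  # counts proportional to P_major
print(f"accuracy = {accuracy:.2f}, mutual information = {mi:.2f} bits")  # ...but MI = 0
```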
Note that the present endeavor is different from that of inducing classifiers, e.g., by mutual information maximization. The work in [38] provides the groundwork to relate mutual information maximization to error probability minimization, independent of dataset class cardinality, using Fano's inequality [22]: as a general rule, the higher the mutual information, the lower the bound on the error probability, and this seems to be aligned with our intuition. Along these lines, [39] provides a comparison of Bayesian and maximum mutual information classification issues for two-class tasks. It would be interesting to develop a systematic comparison of information-based induction schemes, e.g., MaxEnt or logistic regression, to others based on non-entropic loss functions under this framework. This is left for future work.

Some reviewers of this work expressed a concern about the algorithmic complexity of the techniques presented: we believe this is not a concern at present. On the one hand, the starting data for our analysis are the joint distributions of prediction errors as obtained from count distributions by the maximum likelihood estimator. The complexity of these assessment procedures is due to the analysis of the confusion matrices, which is very conservatively estimated as $O(k^2)$ in the number of classes $k$. Given that datasets typically contain at present at most hundreds of classes but millions—sometimes billions—of instances/samples for induction, the cost of the assessment is negligible compared to the cost of learning. On the other hand, the assessment measures and visualizations are straightforward computations of mathematical formulae from Information Theory. Their interest lies in the communication-channel metaphor and the framework introduced to make the assessment principled and systematic, and not in their algorithmic implementations—indeed, we provide four packages which have been developed by three different people yet have similar time complexity.

The size of the dataset $S$, however, has an important impact on the accuracy of the probability estimates from the error counts collected for each experiment. As a rule of thumb, we trust the probability estimators if the sampling procedure is correct and the number of classes respects the rule $k < \sqrt{S}$. Note also that the estimation of entropy is a well-known problem in the field, and off-the-shelf implementations are available for a number of major data processing platforms.
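The rule of thumb is trivial to encode, e.g., as a guard before trusting the maximum-likelihood estimates (a sketch; the function name is ours):

```python
import math

def estimates_trustworthy(num_classes: int, num_samples: int) -> bool:
    """Rule of thumb from the text: trust the ML probability estimates when k < sqrt(S)."""
    return num_classes < math.sqrt(num_samples)

print(estimates_trustworthy(10, 1000))   # True: 10 < 31.6
print(estimates_trustworthy(100, 1000))  # False: too many classes for so few samples
```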
3 EXAMPLE USE CASES AND EXPERIMENTS

We next present several different use cases for the framework presented above: these are archetypal cases to apply the techniques. Code for each different use is available on request. A full example spanning several of these cases around the KDD cup 99 on intrusion detection is presented in Supplemental Materials, available online.

3.1 Assessing a Single Classifier on a Single Task

Using the ET for single classifiers is an interesting technique when the practitioner cannot control the induction of the classifier, or there is a particular instantiation of it which is "natural", as in the case of classifiers embodied in living beings.

On such classifiers it also makes sense to use exploratory analysis to try to find systematic confusions between true and estimated labels. For this purpose, the ET and KFormal Concept Analysis have already been successfully used in the evaluation of phoneme recognition by humans [7], [40]. Here, the ET is just a tool to visualize the entropies related to the perceptual channel, while the confusion lattice is an effective tool to elicit significant errors committed by the classifier.

Note that in the absence of a model of classification by humans, or a comparison of the performance of the different perceptual modalities, it makes no sense to rank systems using EMA or the NIT factor.

Example 1. In the following examples we present the performance of humans in several perceptual tasks:

- Visual distinction of segmented numeral digits in humans. In this task from [41], the $k = 10$ seven-segment digits of digital displays were presented to testers under different conditions designed to guarantee that confusions would appear.
TABLE 1
Perceptual Confusion Matrix Analysis

Fig. 3. Two information-theoretic representations of the odorant confusions: (a) the concept lattice diagram at $\varphi = 1.78$ (stimuli are labelled in white and responses in gray) and (b) a gray-scale heatmap of its mutual information $\widehat{MI}_{K\hat{K}}$.
Fig. 4. Entropy triangle for some Weka classifiers on the UCI anneal task.
EMA. Occasionally, the second or third classifiers are better by EMA than that ranked first by accuracy [46], although in this example we would only see a re-ranking of the second and third classifiers, since most classifiers solve the problem very well.

However, it is remarkable that the lowest-ranking classifier actually did not use any information in the induction process in spite of claiming to offer an accuracy of $a(P_{K\hat{K}}) = 0.762$. Indeed, its remanent perplexity is the same as the actual perplexity, $k_{K|\hat{K}} = k_K$, entailing there is no mutual information between input and output, confirmed by the fact that $\mu_{K\hat{K}} = 1$. Note that its NIT factor is the worst possible, at $q(P_{K\hat{K}}) = \frac{1}{k}$.⁷

The entropy triangles in Fig. 4 provide a more visual assessment. For example, we can observe how ZeroR appears naturally at the bottom side, providing the right intuition that no properly trained algorithm should be below. Recall that the height amounts to mutual information transferred from input to output labels, so the rules.ZeroR classifier actually transmits no information from input to output.

The best classifiers in this population, for this task, seem to be trees.RandomForest, lazy.IBk and functions.Logistic, almost indistinctly. Note that the line at $\Delta H'_{P_K} = 0.54$ implies any classifier can at most reach a maximum $MI'_{P_{K\hat{K}}} = 1 - 0.54 = 0.46$, inducing a NIT factor of $q(P_{K\hat{K}}) = 2^{0.46 \log_2 6}/6 \approx 0.38$, due to the lack of balance in the dataset. Indeed, the best three classifiers are not far away from these values, implying that such techniques actually solve this classification task satisfactorily. We can also clearly observe in the ET how the three lower algorithms of the table employ different strategies to solve the problem from the information-theoretic point of view: while ZeroR and AdaBoostM1 minimize the error count by specialisation on the minority classes (note their positions towards the lower right-hand angle of the ET), Bayes actually obtains a fairly good NIT and EMA, close to the ones obtained by the best classifiers. If we had only paid attention to the classical accuracy metric, we would have discarded this classifier, which is actually not that bad. This can also be observed in Fig. 4b, where the colour bar to the right is proportional to NIT.

7. Of course, this is the typical behavior of the ZeroR algorithm, since it selects the majority class in the dataset and uses it to make all the predictions. In this case, this is very profitable from the point of view of classical accuracy since the database is very imbalanced. This trivial algorithm is usually employed as the minimum threshold of performance for this reason, an example of an extreme behavior that we can observe in a more moderate way in AdaBoostM1, as well.
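A quick numeric check of the figures quoted above for the anneal task (a sketch; $k = 6$ classes and $\Delta H'_{P_K} = 0.54$ are taken from Example 2):

```python
import math

k, dh_norm = 6, 0.54                      # anneal: 6 classes, normalized DH'_{P_K}
mi_max = (1 - dh_norm) * math.log2(k)     # largest transferable MI, in bits
q_max = 2 ** mi_max / k                   # best attainable NIT factor: ~0.38
q_min = 1 / k                             # NIT of an information-free classifier (ZeroR)
print(f"q_max = {q_max:.2f}, q_min = {q_min:.3f}")
```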
3.3 Assessing a Single Classifier Induction Technique on Different Tasks

This is the type of use we envision for classification algorithm research and design. The practitioner has a battery of tasks for evaluation with different class cardinalities and, if possible, with different input marginals, ranging from the uniform to highly skewed (or imbalanced) ones. The purpose of the uniform marginals is, of course, to subject the induction technique to the most stringent learning conditions.

Example 3. Consider the functions.Logistic classifier of Weka, implementing multiclass logistic regression. We obtained confusion matrices for 10 different UCI tasks and plotted them in the ET in Fig. 5.

The assessment process after classifier population induction consists basically in:

1) Using $k_K$ to assess the effective number of classes in the data. This allows us to see the real complexity of the tasks.
2) Using the ET to individually assess the classifier wrt the population of tasks. This provides a visual summary of the performance on different tasks wrt the tasks' class balance.
3) Using the NIT factor to rank classifiers. The rank provides insight into the learning capabilities of the algorithm as expressed in each of the tasks.
4) Balancing EMA versus NIT factor. EMA is the criterion for performance in solving the task, so the
Fig. 5. A study of the logistic classifier on a population of UCI tasks.

Fig. 6. A study of three UCI tasks, liver-disorders (red), breast-cancer (blue) and hepatitis (purple), with a sample of classifier technologies on the ET.
TABLE 4
Performance of Several Weka Classifiers on Some UCI Tasks

Each task group (liver-disorders | breast-cancer | hepatitis) reports, in order: $k_K$, $k_{K|\hat{K}}$, $\mu_{K\hat{K}}$, $a$, $a'$, $q$.

functions.Logistic   | 1.975 1.835 1.076 0.704 0.545 0.538 | 1.838 1.803 1.019 0.689 0.555 0.510 | 1.664 1.389 1.198 0.890 0.720 0.599
trees.RandomForest   | 1.975 1.831 1.078 0.699 0.546 0.539 | 1.838 1.150 1.598 0.969 0.870 0.799 | 1.664 1.068 1.557 0.987 0.936 0.779
meta.Bagging         | 1.975 1.846 1.070 0.696 0.542 0.535 | 1.838 1.570 1.171 0.832 0.637 0.585 | 1.664 1.485 1.121 0.865 0.674 0.560
trees.J48            | 1.975 1.860 1.062 0.687 0.538 0.531 | 1.838 1.744 1.053 0.755 0.573 0.527 | 1.664 1.313 1.267 0.923 0.762 0.634
meta.AdaBoostM1      | 1.975 1.896 1.042 0.661 0.527 0.521 | 1.838 1.726 1.065 0.755 0.579 0.532 | 1.664 1.384 1.203 0.897 0.723 0.601
rules.JRip           | 1.975 1.911 1.033 0.646 0.523 0.517 | 1.838 1.715 1.071 0.769 0.583 0.536 | 1.664 1.367 1.217 0.877 0.732 0.609
lazy.IBk             | 1.975 1.918 1.029 0.629 0.521 0.515 | 1.838 1.106 1.661 0.979 0.904 0.831 | 1.664 1.146 1.452 0.968 0.873 0.726
functions.SMO        | 1.975 1.970 1.003 0.583 0.508 0.501 | 1.838 1.720 1.068 0.762 0.581 0.534 | 1.664 1.401 1.188 0.884 0.714 0.594
rules.ZeroR          | 1.975 1.975 1.000 0.580 0.506 0.500 | 1.838 1.838 1.000 0.703 0.544 0.500 | 1.664 1.664 1.000 0.794 0.601 0.500
bayes.NaiveBayes     | 1.975 1.931 1.023 0.568 0.518 0.511 | 1.838 1.767 1.040 0.717 0.566 0.520 | 1.664 1.406 1.183 0.858 0.711 0.592

Perplexities $k_K$, $k_{K|\hat{K}}$, information transfer rate $\mu_{K\hat{K}}$, accuracy $a$, EMA $a'$, and NIT factor $q$ for some classifiers in Weka on the UCI liver-disorders, breast-cancer, and hepatitis tasks. Ranking is by accuracy on liver-disorders.
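The rows are internally consistent with the definitions of Section 2.4, which is easy to check; a sketch for the rules.ZeroR row on liver-disorders (a two-class task):

```python
k = 2                                    # liver-disorders has two classes
k_K, k_rem, mu = 1.975, 1.975, 1.000     # k_K, k_{K|K^} and mu_{KK^} from the row
assert abs(1 / k_rem - 0.506) < 5e-3     # EMA a' = 1 / k_{K|K^}, Eq. (11)
assert abs(mu / k - 0.500) < 5e-4        # NIT q = mu_{KK^} / k, Eq. (12)
assert abs(mu * k_rem - k_K) < 5e-3      # task perplexity k_K = mu * k_{K|K^}
```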
Table 4 shows that the liver-disorders task proves unsolved by any of the classifiers chosen, a result already noticed in other works and blamed on the poor quality of the features [49]. Neither are breast-cancer and hepatitis satisfactorily solved. This could have been seen in isolated triangles similar to those of Section 3.2.

A recurrent finding is that the majority rule rules.ZeroR is essentially a classifier that transmits no information, whatever the accuracy it may claim on any task. This is an inherent characteristic of that classifier rule, made the more evident with entropic diagrams; but note as well that functions.SMO has a similar behavior for liver-disorders, which is more unexpected and relevant.

With careful reading we may also notice a manifestation of the no-free-lunch theorem: the mutual information transmitted by any classifier is not consistently superior to that of every other classifier. Fixing on a particular classifier, say trees.RandomForest or lazy.IBk, you can see it in action, since the tasks were chosen, among others, for this purpose. If you could see some consistency in the relative behavior of two classifiers, this is a mirage easily dispelled by including other tasks.

3.5 Materials

The results we present in this work have been obtained with software operating over the results of the Weka framework [48] for machine learning on some of the UCI tasks:

- The ET, EMA and NIT can be calculated with publicly available software:
  - There is a Weka plugin for the Entropy Triangle, EMA and NIT.⁹
  - There is a Matlab toolbox for the Entropy Triangle.¹⁰
  - An experimental R package for the Entropy Triangle.¹¹
  - A Python module for the Entropy Triangle.¹²
- Some of the UCI tasks [47] used in this work are distributed with Weka itself and some have been downloaded from the UCI repository.¹³
- To carry out a KFormal Concept Analysis exploration of a matrix, a web service with an advanced interface is available. Confusion matrices—among other types of contingency matrices—can be analyzed.¹⁴ Also, Matlab code is available from the authors on request.
- There are many open-source resources for FCA of binary data matrices.¹⁵

4 CONCLUSIONS AND FURTHER WORK

The main contribution of this paper is a framework for classifier-on-a-dataset performance assessment and visualization based on information-theoretic principles. In order to do so, we make explicit a model of a virtual channel between the true and estimated class labels. In this model, our notion of what is a good classifier is aligned with the heuristic of mutual information maximization between real and estimated class label distributions: the ET visually favors the vertical dimension that displays mutual information, and the NIT factor actually quantifies the learning performance of classifiers. EMA, as a measure of error, is tied to an expectation obtained from "whitening" the perplexity of the dataset, first by extracting dataset entropy and then by extracting the mutual information as captured by the classifier. In this sense it is a measure derived from both mutual information and actual task perplexity.

Notice that the actual description of $K = k_s$ in terms of a vector of features $X = \mathbf{x}_s$ and the classifier that embodies the classification process are not taken into consideration for the tallying, and this is what makes an evaluation independent of the technological decisions around classifier induction possible. Therefore the model applies whether
the agent carrying out the classification is either natural, e.g., perceptual tasks, or artificial, e.g., a machine.

The assessment framework is completed with a more detailed exploration of the structure of the confusions of a single classifier: concept lattices are rendered as an adequate means for displaying clusters of confusions and their hierarchical organization, so as to provide insights into the underlying natural phenomena being analyzed or the way a machine classifier has captured the essence of the task being learned. This provides guidance for improving its performance, either in the design of the classifier or in the feature extraction procedure for its representation, in the case of machine classifiers. For human perceptual tasks, it may help in eliciting how our sensory apparatus processes stimuli. But how to make this knowledge flow back into the induction step is left for future work.

With respect to previous assessment tools on populations of classifiers with the ET, we have added in this paper the segment at constant $\Delta H_{P_K}$ in Sections 2.3 and 3.4. The segment provides a clearer view on how good classifiers are by highlighting the intersection between it and the side $VI'_{P_{K\hat{K}}} = 0$, where the best classifiers for a task should lie.

To illustrate the framework with reproducibility in mind, we have chosen publicly available datasets and classifier implementations. In particular, the former include human performance tasks such as the segmented numeral digits, Morse codes and odorant confusions, the KDD cup 99 on intrusion detection, and well-known UCI examples. For the classifiers, we have used some from contestants in the KDD cup 99 and from the pool of those readily available from Weka [48], extending the list that [32, Section 3.2] used to include trees.RandomForest and rules.ZeroR. The former because it is a quite successful, extensively used type of classifier, and the latter because it is a technique that has a quite extreme characterization by the methods proposed herein: no other classifier systematically learns less than it does. Note, however, that our aim was to make the tools collected in this paper available to the community and not really to evaluate specific classifiers or datasets.

Note also that the ET, EMA, and NIT factor are distribution-agnostic devices and measures in the sense that they deal with imbalanced and balanced data in the same way, unlike previous work that specifically corrects for imbalance [14], [15], [16]. Imbalance in the task manifests itself in a lower $k_K$—entailing an easier classification task than if it were balanced¹⁶—and a higher EMA-penalizing factor—which means a heavier correction on standard accuracy.

Still rarely—but increasingly—there are classification tasks with a high number of classes, e.g., image classification. Since the sizes of the datasets have a great impact on the adequacy of the probability estimators, this question should be revisited in future work.

The field of information theory for classification is really vast. Unlike other works which propose a measure of performance, we are proposing a framework to understand and sustain such measures. It would be a good exercise for a review paper to try to cast previously proposed measures in our framework, but this is also intended for future efforts.

Finally, many of those measures refer to non-classification tasks [37], [50] or to the measurement of multivariate information [50]. These are matters we are already trying to integrate in this framework.

16. In the sense that it is easier to obtain a higher number of correctly classified samples by concentrating the learning capabilities of the machine in the majority classes.

ACKNOWLEDGMENTS

This work was partly supported by the Spanish Ministry of Economy & Competitiveness projects TEC2014-53390-P and TEC2017-84395-P.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell Syst. Tech. J., vol. XXVII, no. 3, pp. 379–423, 1948.
[2] C. E. Shannon, "A mathematical theory of communication," The Bell Syst. Tech. J., vol. XXVII, no. 3, pp. 623–656, 1948.
[3] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, Sep. 2003.
[4] L. Brillouin, Science and Information Theory, 2nd ed. New York, NY, USA: Academic Press, 1962.
[5] J. C. Principe, Information Theoretic Learning. New York, NY, USA: Springer, 2010.
[6] E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[7] F. J. Valverde-Albacete and C. Peláez-Moreno, "Two information-theoretic tools to assess the performance of multi-class classifiers," Pattern Recognit. Lett., vol. 31, no. 12, pp. 1665–1671, 2010.
[8] M. Zhou, Z. Tian, K. Xu, X. Yu, and H. Wu, "Theoretical entropy assessment of fingerprint-based Wi-Fi localization accuracy," Expert Syst. Appl., vol. 40, no. 15, pp. 6136–6149, 2013.
[9] W. Rödder, D. Brenner, and F. Kulmann, "Entropy based evaluation of net structures—deployed in social network analysis," Expert Syst. Appl., vol. 41, no. 17, pp. 7968–7979, 2014.
[10] T. Chen, Y. Jin, X. Qiu, and X. Chen, "A hybrid fuzzy evaluation method for safety assessment of food-waste feed based on entropy and the analytic hierarchy process methods," Expert Syst. Appl., vol. 41, no. 16, pp. 7328–7337, 2014.
[11] F. J. Valverde-Albacete and C. Peláez-Moreno, "100% classification accuracy considered harmful: The normalized information transfer factor explains the accuracy paradox," PLoS One, vol. 9, no. 1, pp. 1–10, Jan. 2014.
[12] C. F. Hempelmann, U. Sakoglu, V. P. Gurupur, and S. Jampana, "An entropy-based evaluation method for knowledge bases of medical information systems," Expert Syst. Appl., vol. 46, pp. 262–273, 2016.
[13] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[14] V. García, R. A. Mollineda, and J. S. Sánchez, "A bias correction function for classification performance assessment in two-class imbalanced problems," Knowl.-Based Syst., vol. 59, pp. 66–74, Mar. 2014.
[15] V. López, A. Fernández, S. García, V. Palade, and F. Herrera, "An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics," Inf. Sci., vol. 250, pp. 113–141, Nov. 2013.
[16] N. Tomašev and D. Mladenić, "Class imbalance and the curse of minority hubs," Knowl.-Based Syst., vol. 53, pp. 157–172, 2013.
[17] C. Peláez-Moreno, A. I. García-Moral, and F. J. Valverde-Albacete, "Analyzing phonetic confusions using Formal Concept Analysis," J. Acoustical Soc. Amer., vol. 128, no. 3, pp. 1377–1390, Sep. 2010.
[18] F. J. Valverde-Albacete and C. Peláez-Moreno, "The evaluation of data sources using multivariate entropy tools," Expert Syst. Appl., vol. 78, pp. 145–157, 2017.
[19] T. M. Cover, Elements of Information Theory. Hoboken, NJ, USA: Wiley, 2006.
[20] H. N. Wright, "Characterization of olfactory dysfunction," Archives Otolaryngology, vol. 113, no. 2, pp. 163–168, Feb. 1987.
[21] D. Baxter and B. Keiser, "A speech channel evaluation divorced from talker-listener influence," IEEE Trans. Commun. Technol., vol. CT-14, no. 2, pp. 101–113, Jul. 1966.
[22] R. M. Fano, Transmission of Information: A Statistical Theory of Communication. Cambridge, MA, USA: MIT Press, Jan. 1961.
[23] M. Wang and R. Bilger, "Consonant confusions in noise: A study of perceptual features," J. Acoustical Soc. Amer., vol. 54, no. 5, pp. 1248–1266, Jan. 1973.
[24] M. Wang, C. Reed, and R. Bilger, "A comparison of the effects of filtering and sensorineural hearing loss on patterns of consonant confusions," J. Speech Hearing Res., vol. 21, pp. 5–36, Jan. 1978.
[25] B. Mirkin, Clustering for Data Mining. A Data Recovery Approach. London, U.K.: Chapman & Hall, 2005.
[26] V. Pawlowsky-Glahn, J. J. Egozcue, and R. Tolosana-Delgado, Modeling and Analysis of Compositional Data. Chichester, U.K.: Wiley, Feb. 2015.
[27] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA, USA: MIT Press, 1997.
[28] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations. Berlin, Germany: Springer, 1999.
[29] F. J. Valverde-Albacete and C. Peláez-Moreno, "Towards a generalisation of formal concept analysis for data mining purposes," in Proc. Int. Conf. Formal Concept Anal., 2006, vol. LNAI 3874, pp. 161–176.
[30] F. J. Valverde-Albacete and C. Peláez-Moreno, "Extending conceptualisation modes for generalised Formal Concept Analysis," Inf. Sci., vol. 181, pp. 1888–1909, May 2011.
[31] F. J. Valverde-Albacete, J. M. González-Calabozo, A. Peñas, and C. Peláez-Moreno, "Supporting scientific knowledge discovery with extended, generalized formal concept analysis," Expert Syst. Appl., vol. 44, pp. 198–216, 2016.
[32] N. Japkowicz and M. Shah, Evaluating Learning Algorithms: A Classification Perspective. Cambridge, U.K.: Cambridge Univ. Press, Jan. 2011.
[33] A. I. García-Moral, R. Solera-Ureña, C. Peláez-Moreno, and F. Díaz-de-María, "Data balancing for efficient training of hybrid ANN/HMM automatic speech recognition systems," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 3, pp. 468–481, Mar. 2011.
[34] A. Lovitt and J. Allen, "50 years later: Repeating Miller-Nicely 1955," in Proc. 9th Int. Conf. Spoken Lang. Process., 2006, pp. 2154–2157.
[35] B. Mirkin, Mathematical Classification and Clustering, vol. 11. Norwell, MA, USA: Kluwer Academic Publishers, 1996.
[36] A. Ben-David, "A lot of randomness is hiding in accuracy," Eng. Appl. Artif. Intell., vol. 20, no. 7, pp. 875–885, 2007.
[37] M. Meilă, "Comparing clusterings—an information based distance," J. Multivariate Anal., vol. 28, pp. 875–893, 2007.
[38] G. Brown, "A new perspective for information theoretic feature selection," in Proc. 12th Int. Conf. Artif. Intell. Statist., 2009, pp. 49–56.
[39] B.-G. Hu, "What are the differences between Bayesian classifiers and mutual-information classifiers?" IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 249–264, Feb. 2014.
[40] D. Mejía-Navarrete, A. Gallardo-Antolín, C. Peláez-Moreno, and F. J. Valverde-Albacete, "Feature extraction assessment for an acoustic-event classification task using the entropy triangle," in Proc. 12th Annu. Conf. Int. Speech Commun. Assoc., 2011, pp. 309–312.
[41] G. Keren and S. Baggen, "Recognition models of alphanumeric characters," Perception Psychophysics, vol. 29, pp. 234–246, 1981.
[42] E. Rothkopf, "A measure of stimulus similarity and errors in some paired-associate learning tasks," J. Exp. Psychology, vol. 53, no. 2, pp. 94–101, 1957.
[43] J. M. González-Calabozo, F. J. Valverde-Albacete, and C. Peláez-Moreno, "Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis," BMC Bioinf., vol. 17, no. 1, pp. 1–15, 2016.
[44] D. B. Kurtz, P. R. Sheehe, P. F. Kent, T. L. White, D. E. Hornung, and H. N. Wright, "Odorant quality perception: A metric individual differences approach," Perception Psychophysics, vol. 62, no. 5, pp. 1121–1129, Jul. 2000.
[45] L. Secundo, K. Snitz, and N. Sobel, "The perceptual logic of smell," Current Opinion Neurobiology, vol. 25, pp. 107–115, Apr. 2014.
[46] F. J. Valverde-Albacete, J. C. de Albornoz, and C. Peláez-Moreno, "A proposal for new evaluation metrics and result visualization technique for sentiment analysis tasks," in Proc. Int. Conf. Cross-Lang. Eval. Forum Eur. Lang., 2013, pp. 41–52.
[47] K. Bache and M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[48] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," ACM SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, 2009.
[49] B. Venkata Ramana, M. S. P. Babu, and N. B. Venkateswarlu, "A critical study of selected classification algorithms for liver disease diagnosis," Int. J. Database Manage. Syst., vol. 3, no. 2, pp. 101–114, May 2011.
[50] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Comput. Appl., vol. 24, no. 1, pp. 175–186, Jan. 2014.

Francisco J. Valverde-Albacete received the Eng. degree in telecommunications from the Universidad Politécnica de Madrid, in 1992, and the DEng degree in telecommunications from Universidad Carlos III de Madrid, in 2002. He has been a researcher with the Computer Science Dept., UNED, Madrid, and an associate professor with the Signal Theory and Communications Dept., Universidad Carlos III. He has also visited the University of Strathclyde, United Kingdom, the University of Trento, Italy, and the International Computer Science Institute, ICSI, Berkeley. At present, he is a researcher with the Dept. de Teoría de la Señal y de las Comunicaciones, Universidad Carlos III de Madrid, Spain. He has published more than 60 papers in applied maths, cognitive, speech and language processing, machine learning, and data mining. He is a member of the IEEE Computational Intelligence Society, the ACM, and the IEEE.

Carmen Peláez-Moreno received the telecommunication Eng. degree from the Public University of Navarre, in 1997, and the PhD degree from the University Carlos III de Madrid, in 2002. Her PhD thesis was awarded a 2002 Best Doctoral Thesis Prize from the Spanish Official Telecom. Eng. Association (COIT-AEIT). From March to Dec. 2004, she participated in the International Computer Science Institute's (ICSI, Berkeley, CA) Fellowship Program. Since Nov. 2009, she has been an associate professor in the Dept. of Signal Theory & Communications, University Carlos III of Madrid. Her research interests include speech recognition and perception, multimedia processing, machine learning, and data analysis, and she has co-authored more than 70 papers. She is a member of the IEEE Signal Processing Society and the IEEE.