Towards Better Evaluation of Multi-Target Regression Models⋆
1 Introduction
Multi-target learning refers to building machine learning models that are capable
of simultaneously predicting several target attributes, which allows the model
to capture inter-dependencies between the targets and, as a result, make better
predictions. If the target attributes are binary, the problem is referred to as multi-
label classification. Multi-dimensional classification is a more general setting
where each instance is associated with a set of non-binary labels. Multi-target
regression problems, in turn, refer to predicting multiple numerical attributes at
the same time.
Due to a large number of real-world applications, the field of multi-target
prediction is rapidly expanding. Multi-target problems often occur in ecological
modelling, bioinformatics, life sciences, e-commerce, finance, etc. Consider, for
instance, predicting several water or air quality indicators (multi-target regres-
sion) or product or text categorization (multi-label classification).
Many widely-used machine learning algorithms have been extended towards
multi-target prediction. In addition, various specialized methods have been de-
signed to tackle multi-target prediction tasks.
⋆ This research is supported by Research Foundation - Flanders (project G079416N, MERCS).
The more algorithms are proposed to solve multi-target problems, the greater the need to compare them against each other. However, no methodology for properly evaluating multi-target algorithms has been developed so far. There exist established techniques for comparing conventional, single-target models, but they are not directly applicable in the multi-target setting.
2 Common practices
Fig. 1: In the multi-target setting, one obtains several performance scores per dataset (one per target). It is not trivial to come up with a suitable statistical test to compare such multivariate data. A typical approach is to average the scores within a dataset.
The common practice is to first run a statistical test (typically the Friedman test) to check whether there are any statistically significant differences between the compared algorithms (or between different parameter settings of the same algorithm). If the answer is positive, additional post-hoc tests are performed to find out what these differences are. In addition, average-rank diagrams, which show all the compared algorithms in the order of their average ranks and indicate statistically significant differences, are often plotted to make the results of the statistical analysis easier to comprehend.
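For concreteness, the following is a minimal sketch (not the paper's code) of this procedure with SciPy; the score matrix holds made-up values, where scores[i, j] is the error of algorithm j on dataset i and lower is better.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j] = performance of algorithm j on dataset i (illustrative values only)
scores = np.array([
    [0.62, 0.58, 0.71],
    [0.45, 0.44, 0.50],
    [0.80, 0.77, 0.79],
    [0.55, 0.60, 0.58],
    [0.70, 0.66, 0.73],
])

stat, p_value = friedmanchisquare(*scores.T)      # one sample per algorithm
ranks = np.apply_along_axis(rankdata, 1, scores)  # rank algorithms within each dataset
avg_ranks = ranks.mean(axis=0)                    # basis of the average-rank diagram

print(f"Friedman chi2 = {stat:.3f}, p = {p_value:.3f}")
print("average ranks:", avg_ranks)
# If p is small, a post-hoc test (e.g. Nemenyi) is run to see which algorithms differ.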
Interestingly, however, there seems to be common uncertainty about which performance scores to run these statistical tests on. Aho et al. [1] state two options, namely: (1) averaging the per-target scores within each dataset, so that each algorithm receives a single score per dataset, or (2) treating every target as a separate problem and comparing the per-target scores directly across all datasets.
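A minimal sketch of how the scores would be arranged under each option (the dataset names and values below are placeholders, not the paper's data):

import numpy as np

# per_target[d] holds a (targets_d x algorithms) matrix of scores for dataset d
per_target = {
    "water_quality": np.random.rand(14, 3),  # 14 targets, 3 algorithms
    "air_quality":   np.random.rand(6, 3),   #  6 targets, 3 algorithms
}

# Option (1): average over targets within each dataset -> one row per dataset.
per_dataset = np.vstack([m.mean(axis=0) for m in per_target.values()])

# Option (2): treat every target as an independent "dataset" -> one row per target.
flattened = np.vstack(list(per_target.values()))

print(per_dataset.shape)  # (2, 3)
print(flattened.shape)    # (20, 3)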
3 Can we do better?
Apart from not always having a meaningful interpretation, averages are easily affected by outliers, e.g., when some target is much easier or much harder to predict than the others. Excellent performance on an easy target may compensate for overall bad performance, and vice versa. In addition, when many such targets are strongly correlated, it may appear that the model does very well (or badly) on the whole dataset, while in fact it is just one task that it did (or did not) manage to learn. Most importantly, averaging always hides a lot of information. Consider the fictional example given in Figure 2.
3 Per-target analysis (2) always finds more significant differences in the performance of the compared techniques than the per-dataset comparison (1) does. This is expected, because the statistical test is biased and overly confident in the presence of dependent observations.
Fig. 2: (a) A fictional example where three multi-target models are compared on a dataset with five targets: their aRRMSE is the same, but the target-specific performances are quite different. Per-target ranks (in brackets) help highlight the differences. (b) Visualization is key to understanding such differences; radar plots can be helpful.
There, three multi-target models are compared on a dataset with five targets. While the average scores across all targets are the same, models A, B and C perform quite differently. It is not true that all three methods are equally good: depending on the application, any one of them could be preferred.
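For reference, the aggregate score in question is typically the average relative root mean squared error (aRRMSE) over the d targets; assuming N test instances, predictions \(\hat{y}_{ij}\) and target means \(\bar{y}_j\), the usual definition is

\[
\text{aRRMSE} = \frac{1}{d}\sum_{j=1}^{d}\text{RRMSE}_j
             = \frac{1}{d}\sum_{j=1}^{d}
               \sqrt{\frac{\sum_{i=1}^{N}\left(y_{ij}-\hat{y}_{ij}\right)^{2}}
                          {\sum_{i=1}^{N}\left(y_{ij}-\bar{y}_{j}\right)^{2}}}
\]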
Since within-dataset average scores do not fully reflect the performance of
multi-target models, comparisons in terms of such aggregates are not informa-
tive, and can even be misleading.
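As Figure 2b suggests, simply plotting the per-target scores already reveals such differences. Below is a minimal matplotlib sketch; the values are made up to mimic the situation described above (all three models average to 0.60) and are not those of Figure 2.

import numpy as np
import matplotlib.pyplot as plt

targets = ["t1", "t2", "t3", "t4", "t5"]
models = {  # per-target errors (lower is better); every model averages to 0.60
    "A": [0.60, 0.60, 0.60, 0.60, 0.60],
    "B": [0.30, 0.90, 0.30, 0.90, 0.60],
    "C": [0.20, 0.20, 0.20, 0.20, 2.20],
}

angles = np.linspace(0, 2 * np.pi, len(targets), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

ax = plt.subplot(polar=True)
for name, vals in models.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=name)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(targets)
ax.legend()
plt.show()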
As has been mentioned in the previous section, the only alternative strategy
sometimes used in practice to avoid averaging is to compare target-specific scores
across all datasets. As has already been noticed by some researchers, the scores
coming from the same dataset are dependent, which violates the assumptions
of the Friedman test, commonly applied to compare these scores across the
algorithms. Thus, such an approach is not statistically sound and should not be
used in practice because the results of the test are not reliable.
Furthermore, even if a statistical test existed that took the dependencies between performance scores coming from the same dataset into account, it would only allow one to compare multi-target models in terms of their performance on a single, randomly selected task. Arguably, this is not what we want: one is
rather interested in comparing the models based on their joint performance on
a set of related targets.
One alternative is to compare the models in terms of Pareto dominance over their per-target scores: a model outranks another only if it is at least as good on every target. When no model dominates the others, as in the situation in Figure 2a, all models get the same rank. If this happens for multiple datasets, no insights can be gained from such a conservative procedure. At the same time, if some algorithm is the best in terms of Pareto rank, one can be sure that it outperforms the competitors on all targets, which is not the case when the comparison is based on aRRMSE. Every model that is best in terms of aRRMSE is Pareto-optimal, but the opposite is not true: a major improvement on one target can lead to the lowest aRRMSE even if the model's performance on the rest of the targets is worse than that of some other methods.
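As an illustration, one common way to operationalize Pareto ranks (the exact ranking scheme may differ from the one intended here) is to count, for every model, how many competitors dominate it; the error values below are placeholders.

import numpy as np

errors = np.array([   # rows = models, columns = targets (lower is better)
    [0.60, 0.60, 0.60, 0.60, 0.60],   # A
    [0.30, 0.90, 0.30, 0.90, 0.60],   # B
    [0.20, 0.20, 0.20, 0.20, 2.20],   # C
])

def dominates(a, b):
    """a dominates b if it is no worse on every target and strictly better on at least one."""
    return np.all(a <= b) and np.any(a < b)

n = len(errors)
# Pareto rank = number of models that dominate this one (0 = non-dominated).
pareto_rank = [sum(dominates(errors[j], errors[i]) for j in range(n) if j != i)
               for i in range(n)]
print(pareto_rank)  # here no model dominates another, so all models share rank 0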
Datasets used to evaluate the models are as important as the procedures used to
draw conclusions about models’ performances on them. In this section, we take
a closer look at the datasets commonly used to evaluate multi-target regression
algorithms.
Illustrating properties of multi-target methods on toy datasets, or evaluating them on synthetic datasets, is unfortunately not common. Only in [19] are the methods evaluated on a synthetic dataset, generated using a simulated two-output time series process. This synthetic dataset, however, is not constructed to highlight differences in the behavior of the compared techniques, but is rather used as an addition to the available real-world data.
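As a starting point, controlled multi-target data is easy to generate. The sketch below uses scikit-learn's make_regression; this is not the generator used in [19], and all parameter values are arbitrary.

import numpy as np
from sklearn.datasets import make_regression

# 500 samples, 10 features, 4 numeric targets driven by the same informative
# features, which induces correlation between the targets.
X, Y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       n_targets=4, noise=0.1, random_state=0)

print(X.shape, Y.shape)                       # (500, 10) (500, 4)
print(np.corrcoef(Y, rowvar=False).round(2))  # pairwise correlations between targets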
[Figure: pairwise correlation between targets (absolute values) for the benchmark datasets.]