Towards Better Evaluation of Multi-Target Regression Models⋆
1 Introduction
Multi-target learning refers to building machine learning models that are capable
of simultaneously predicting several target attributes, which allows the model
to capture inter-dependencies between the targets and, as a result, make better
predictions. If the target attributes are binary, the problem is referred to as multi-
label classification. Multi-dimensional classification is a more general setting
where each instance is associated with a set of non-binary labels. Multi-target
regression problems, in turn, refer to predicting multiple numerical attributes at
the same time.
Due to a large number of real-world applications, the field of multi-target
prediction is rapidly expanding. Multi-target problems often occur in ecological
modelling, bioinformatics, life sciences, e-commerce, finance, etc. Consider, for
instance, predicting several water or air quality indicators (multi-target regres-
sion) or product or text categorization (multi-label classification).
Many widely-used machine learning algorithms have been extended towards
multi-target prediction. In addition, various specialized methods have been de-
signed to tackle multi-target prediction tasks.
⋆ This research is supported by Research Foundation - Flanders (project G079416N, MERCS).
The more algorithms are proposed to solve multi-target problems, the greater the need to compare them against each other. However, no methodology for properly evaluating multi-target algorithms has been developed so far. There exist established techniques for comparing conventional, single-target models, but they are not directly applicable in the multi-target setting.
2 Common practices
Fig. 1: In the multi-target setting, one obtains several performance scores per dataset (one per target). It is not trivial to come up with a suitable statistical test to compare such multivariate data. A typical approach is to average the scores within a dataset.
The common practice is to first run a statistical test (typically the Friedman test) to check whether there are any statistically significant differences between the compared algorithms (or between different parameter settings of the same algorithm). If the answer is positive, additional post-hoc tests are performed to find out what these differences are. In addition, average-rank diagrams, which show all the compared algorithms in the order of their average ranks and indicate statistically significant differences, are often plotted to make the results of the statistical analysis easier to comprehend.
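For concreteness, the following is a minimal sketch (not the paper's code) of this procedure with SciPy; the score matrix holds made-up values, where scores[i, j] is the error of algorithm j on dataset i and lower is better.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j] = performance of algorithm j on dataset i (illustrative values only)
scores = np.array([
    [0.62, 0.58, 0.71],
    [0.45, 0.44, 0.50],
    [0.80, 0.77, 0.79],
    [0.55, 0.60, 0.58],
    [0.70, 0.66, 0.73],
])

stat, p_value = friedmanchisquare(*scores.T)      # one sample per algorithm
ranks = np.apply_along_axis(rankdata, 1, scores)  # rank algorithms within each dataset
avg_ranks = ranks.mean(axis=0)                    # basis of the average-rank diagram

print(f"Friedman chi2 = {stat:.3f}, p = {p_value:.3f}")
print("average ranks:", avg_ranks)
# If p is small, a post-hoc test (e.g. Nemenyi) is run to see which algorithms differ.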
Interestingly, however, there seems to be common uncertainty about which performance scores to run these statistical tests on. Aho et al. [1] state two options, namely: (1) averaging the per-target scores within each dataset, so that each algorithm receives a single score per dataset, or (2) treating every target as a separate problem and comparing the per-target scores directly across all datasets.
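A minimal sketch of how the scores would be arranged under each option (the dataset names and values below are placeholders, not the paper's data):

import numpy as np

# per_target[d] holds a (targets_d x algorithms) matrix of scores for dataset d
per_target = {
    "water_quality": np.random.rand(14, 3),  # 14 targets, 3 algorithms
    "air_quality":   np.random.rand(6, 3),   #  6 targets, 3 algorithms
}

# Option (1): average over targets within each dataset -> one row per dataset.
per_dataset = np.vstack([m.mean(axis=0) for m in per_target.values()])

# Option (2): treat every target as an independent "dataset" -> one row per target.
flattened = np.vstack(list(per_target.values()))

print(per_dataset.shape)  # (2, 3)
print(flattened.shape)    # (20, 3)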
3 Can we do better?
Apart from not always having a meaningful interpretation, averages are easily affected by outliers, e.g., when some target is much easier or much harder to predict than the others. Excellent performance on an easy target may compensate for overall bad performance, and vice versa. In addition, when many such targets are strongly correlated, it may appear that the model does very well (or badly) on the whole dataset, while in fact it is just one task that it did (or did not) manage to learn. Most importantly, averaging always hides a lot of information. Consider the fictional example given in Figure 2.
3 Per-target analysis (2) always finds more significant differences in the performance of the compared techniques than the per-dataset comparison (1) does. This is expected, because the statistical test is biased and overly confident in the presence of dependent observations.
Fig. 2: (a) A fictional example where three multi-target models are compared on a dataset with five targets: their aRRMSE is the same, but the target-specific performances are quite different. Per-target ranks (in brackets) help highlight the differences. (b) Visualization is key to understanding such differences; radar plots can be helpful.
There, three multi-target models are compared on a dataset with five targets. While the average scores across all targets are the same, models A, B and C perform quite differently. It is not true that all three methods are equally good: depending on the application, any one of them could be preferred.
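For reference, the aggregate score in question is typically the average relative root mean squared error (aRRMSE) over the d targets; assuming N test instances, predictions \(\hat{y}_{ij}\) and target means \(\bar{y}_j\), the usual definition is

\[
\text{aRRMSE} = \frac{1}{d}\sum_{j=1}^{d}\text{RRMSE}_j
             = \frac{1}{d}\sum_{j=1}^{d}
               \sqrt{\frac{\sum_{i=1}^{N}\left(y_{ij}-\hat{y}_{ij}\right)^{2}}
                          {\sum_{i=1}^{N}\left(y_{ij}-\bar{y}_{j}\right)^{2}}}
\]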
Since within-dataset average scores do not fully reflect the performance of
multi-target models, comparisons in terms of such aggregates are not informa-
tive, and can even be misleading.
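As Figure 2b suggests, simply plotting the per-target scores already reveals such differences. Below is a minimal matplotlib sketch; the values are made up to mimic the situation described above (all three models average to 0.60) and are not those of Figure 2.

import numpy as np
import matplotlib.pyplot as plt

targets = ["t1", "t2", "t3", "t4", "t5"]
models = {  # per-target errors (lower is better); every model averages to 0.60
    "A": [0.60, 0.60, 0.60, 0.60, 0.60],
    "B": [0.30, 0.90, 0.30, 0.90, 0.60],
    "C": [0.20, 0.20, 0.20, 0.20, 2.20],
}

angles = np.linspace(0, 2 * np.pi, len(targets), endpoint=False).tolist()
angles += angles[:1]  # repeat the first angle to close the polygon

ax = plt.subplot(polar=True)
for name, vals in models.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=name)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(targets)
ax.legend()
plt.show()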
As has been mentioned in the previous section, the only alternative strategy
sometimes used in practice to avoid averaging is to compare target-specific scores
across all datasets. As has already been noticed by some researchers, the scores
coming from the same dataset are dependent, which violates the assumptions
of the Friedman test, commonly applied to compare these scores across the
algorithms. Thus, such an approach is not statistically sound and should not be
used in practice because the results of the test are not reliable.
Furthermore, even if a statistical test existed that took the dependencies between performance scores coming from the same dataset into account, it would only allow one to compare multi-target models in terms of their performance on a single, randomly selected task. Arguably, this is not what we want: one is
rather interested in comparing the models based on their joint performance on
a set of related targets.
One alternative is to compare the models in terms of Pareto dominance over their per-target scores: a model outranks another only if it is at least as good on every target. When no model dominates the others, as in the situation in Figure 2a, all models get the same rank. If this happens for multiple datasets, no insights can be gained from such a conservative procedure. At the same time, if some algorithm is the best in terms of Pareto rank, one can be sure that it outperforms the competitors on all targets, which is not the case when the comparison is based on aRRMSE. Every model that is best in terms of aRRMSE is Pareto-optimal, but the opposite is not true: a major improvement on one target can lead to the lowest aRRMSE even if the model's performance on the rest of the targets is worse than that of some other methods.
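As an illustration, one common way to operationalize Pareto ranks (the exact ranking scheme may differ from the one intended here) is to count, for every model, how many competitors dominate it; the error values below are placeholders.

import numpy as np

errors = np.array([   # rows = models, columns = targets (lower is better)
    [0.60, 0.60, 0.60, 0.60, 0.60],   # A
    [0.30, 0.90, 0.30, 0.90, 0.60],   # B
    [0.20, 0.20, 0.20, 0.20, 2.20],   # C
])

def dominates(a, b):
    """a dominates b if it is no worse on every target and strictly better on at least one."""
    return np.all(a <= b) and np.any(a < b)

n = len(errors)
# Pareto rank = number of models that dominate this one (0 = non-dominated).
pareto_rank = [sum(dominates(errors[j], errors[i]) for j in range(n) if j != i)
               for i in range(n)]
print(pareto_rank)  # here no model dominates another, so all models share rank 0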
Datasets used to evaluate the models are as important as the procedures used to
draw conclusions about models’ performances on them. In this section, we take
a closer look at the datasets commonly used to evaluate multi-target regression
algorithms.
Illustrating properties of multi-target methods on toy datasets, or evaluating them on synthetic datasets, is unfortunately not common. Only in [19] are the methods evaluated on a synthetic dataset, generated using a simulated two-output time series process. This synthetic dataset, however, is not constructed to highlight differences in the behavior of the compared techniques, but is rather used as an addition to the available real-world data.
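As a starting point, controlled multi-target data is easy to generate. The sketch below uses scikit-learn's make_regression; this is not the generator used in [19], and all parameter values are arbitrary.

import numpy as np
from sklearn.datasets import make_regression

# 500 samples, 10 features, 4 numeric targets driven by the same informative
# features, which induces correlation between the targets.
X, Y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       n_targets=4, noise=0.1, random_state=0)

print(X.shape, Y.shape)                       # (500, 10) (500, 4)
print(np.corrcoef(Y, rowvar=False).round(2))  # pairwise correlations between targets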
[Figure: pairwise correlation between targets (absolute values) for the benchmark datasets.]