
Multi-Target Prediction

Krzysztof Dembczyński
Intelligent Decision Support Systems Laboratory (IDSS)
Poznań University of Technology, Poland

Discovery Science 2013, Tutorial, Singapore


Multi-target prediction

Many thanks to Willem Waegeman and Eyke Hüllermeier for collaborating on this topic and working together on this tutorial.
Multi-target prediction

• Prediction problems in which we consider more than one target variable.
Image annotation/retrieval

Target 1: cloud yes/no
Target 2: sky yes/no
Target 3: tree yes/no
...
Multi-label classification

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \{0, 1\}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    ?    ?   ...    ?
Multi-label classification

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \{0, 1\}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    1    1   ...    0
Ecology

• Prediction of the presence or absence of species, or even of the population size.
Multi-variate regression

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \mathbb{R}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1    Y2    ...   Ym
    x1    5.0   4.5   14    0.3   ...   9
    x2    2.0   2.5   15    1.1   ...   4.5
    ...
    xn    3.0   3.5   19    0.9   ...   2
    x     4.0   2.5    ?     ?    ...   ?
Multi-variate regression

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \mathbb{R}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1    Y2    ...   Ym
    x1    5.0   4.5   14    0.3   ...   9
    x2    2.0   2.5   15    1.1   ...   4.5
    ...
    xn    3.0   3.5   19    0.9   ...   2
    x     4.0   2.5   18    0.5   ...   1
Label ranking

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, where $\mathbf{y}_i$ is a ranking (permutation) of a fixed number of labels/alternatives.¹
• Predict a permutation $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(m)})$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   Ym
    x1    5.0   4.5    1    3    2
    x2    2.0   2.5    2    1    3
    ...
    xn    3.0   3.5    3    1    2
    x     4.0   2.5    ?    ?    ?

¹ E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897–1916, 2008.
Label ranking

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, where $\mathbf{y}_i$ is a ranking (permutation) of a fixed number of labels/alternatives.¹
• Predict a permutation $(y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(m)})$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   Ym
    x1    5.0   4.5    1    3    2
    x2    2.0   2.5    2    1    3
    ...
    xn    3.0   3.5    3    1    2
    x     4.0   2.5    1    2    3

¹ E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897–1916, 2008.
Multi-task learning

• Training data: $\{(\mathbf{x}_{1j}, y_{1j}), (\mathbf{x}_{2j}, y_{2j}), \ldots, (\mathbf{x}_{nj}, y_{nj})\}$, $j = 1, \ldots, m$, $y_{ij} \in \mathcal{Y} = \mathbb{R}$.
• Predict $y_j$ for a given $\mathbf{x}_j$.

          X1    X2    Y1    Y2    ...   Ym
    x1    5.0   4.5   14                9
    x2    2.0   2.5         1.1
    ...
    xn    3.0   3.5                     2
    x     4.0   2.5    ?
Multi-task learning

• Training data: $\{(\mathbf{x}_{1j}, y_{1j}), (\mathbf{x}_{2j}, y_{2j}), \ldots, (\mathbf{x}_{nj}, y_{nj})\}$, $j = 1, \ldots, m$, $y_{ij} \in \mathcal{Y} = \mathbb{R}$.
• Predict $y_j$ for a given $\mathbf{x}_j$.

          X1    X2    Y1    Y2    ...   Ym
    x1    5.0   4.5   14                9
    x2    2.0   2.5         1.1
    ...
    xn    3.0   3.5                     2
    x     4.0   2.5    1
Collaborative filtering²

• Training data: $\{(u_i, m_j, y_{ij})\}$, for some $i = 1, \ldots, n$ and $j = 1, \ldots, m$, $y_{ij} \in \mathcal{Y} = \mathbb{R}$.
• Predict $y_{ij}$ for a given $u_i$ and $m_j$.

          m1   m2   m3   ···   mm
    u1     1             ···    4
    u2     3    1        ···
    u3          2    5   ···
    ...                  ···
    un          2        ···    1

² D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992.
Dyadic prediction³

                       4     5    ···    7     8     6
                      10    14    ···    9    21    12
  instances           y1    y2    ···   ym   ym+1  ym+2
  1 1    x1           10     ?    ···    1     ?     ?
  3 5    x2            0.1        ···    0     ?
  7 0    x3            ?     ?    ···    1     ?
  1 1    ...                      ···    0     ?
  3 1    xn            0.9        ···    1     ?     ?
  2 3    xn+1          ?          ···    ?     ?
  3 1    xn+2          ?          ···    ?     ?     ?

³ A.K. Menon and C. Elkan. Predicting labels for dyadic data. Data Mining and Knowledge Discovery, 21(2), 2010.
Multi-target prediction

• Multi-target prediction: for a feature vector $\mathbf{x}$, accurately predict a vector of responses $\mathbf{y}$ using a function $h(\mathbf{x})$:
  $$\mathbf{x} = (x_1, x_2, \ldots, x_p) \;\xrightarrow{\;h(\mathbf{x})\;}\; \mathbf{y} = (y_1, y_2, \ldots, y_m)$$
• Main challenges:
  ▶ Appropriate modeling of the dependencies between the targets $y_1, y_2, \ldots, y_m$.
  ▶ A multitude of multivariate loss functions defined over the output vector: $\ell(\mathbf{y}, h(\mathbf{x}))$.
• Main question:
  ▶ Can we improve over independent models trained for each target?
• Two views:
  ▶ The individual-target view.
  ▶ The joint-target view.
The individual target view

• How can we improve the predictive accuracy of a single label by using information about other labels?
• What are the requirements for improvement?
The joint target view

• What are the specific multivariate loss functions we would like to minimize?
• How to perform minimization of such losses?
• What are the relations between the losses?
The individual and joint target view

• The individual target view:
  ▶ Goal: predict a value of $y_i$ using $\mathbf{x}$ and any available information on the other targets $y_j$.
  ▶ The problem is usually defined through univariate losses $\ell(y_i, \hat{y}_i)$.
  ▶ The problem is usually decomposable over the targets.
  ▶ The domain of $y_i$ is either continuous or nominal.
  ▶ Regularized (shrunken) models vs. independent models.
• The joint target view:
  ▶ Goal: predict a vector $\mathbf{y}$ using $\mathbf{x}$.
  ▶ Multivariate distribution of $\mathbf{y}$.
  ▶ The problem is defined through multivariate losses $\ell(\mathbf{y}, \hat{\mathbf{y}})$.
  ▶ The problem is not easily decomposable over the targets.
  ▶ The domain of $\mathbf{y}$ is usually finite, but contains a large number of elements.
  ▶ More expressive models vs. independent models.
Multi-target prediction

A spectrum of approaches, from the individual target view to the joint target view:

• Shrunken models: reduce model complexity by model sharing.
  Examples: RR, FicyReg, Curds&Whey, multi-task learning methods, kernel dependency estimation, stacking, compressed sensing, etc.
• Independent models: fit one model for every target (independently).
  Example: binary relevance in multi-label classification.
• More expressive models: introduce additional parameters or models for targets or target combinations.
  Examples: label powerset, structured SVMs, conditional random fields, probabilistic classifier chains (PCC), Max Margin Markov Networks, etc.
Target interdependences

• Marginal and conditional dependence:
  $$P(\mathbf{Y}) \neq \prod_{i=1}^m P(Y_i) \qquad\qquad P(\mathbf{Y} \mid \mathbf{x}) \neq \prod_{i=1}^m P(Y_i \mid \mathbf{x})$$
  marginal (in)dependence $\neq$ conditional (in)dependence
Target interdependences

• Model similarities:
  $$f_i(\mathbf{x}) = g_i(\mathbf{x}) + \epsilon_i\,, \quad \text{for } i = 1, \ldots, m$$
  Similarities in the structural parts $g_i(\mathbf{x})$ of the models.
Target interdependences

• Structure imposed (domain knowledge) on targets:
  ▶ Chains,
  ▶ Hierarchies,
  ▶ General graphs,
  ▶ ...
Target interdependences

• Interdependence vs. hypothesis and feature space:
  ▶ Regularization constrains the hypothesis space.
  ▶ Modeling dependencies may increase the expressiveness of the model.
  ▶ Using a more complex model on individual targets might also help.
  ▶ A comparison between independent and multi-target models is difficult in general, as they differ in many respects (e.g., complexity)!
Multivariate loss functions

• Decomposable and non-decomposable losses over examples:
  $$L = \sum_{i=1}^n \ell(\mathbf{y}_i, h(\mathbf{x}_i)) \qquad\qquad L \neq \sum_{i=1}^n \ell(\mathbf{y}_i, h(\mathbf{x}_i))$$
• Decomposable and non-decomposable losses over targets:
  $$\ell(\mathbf{y}, h(\mathbf{x})) = \sum_{i=1}^m \ell(y_i, h_i(\mathbf{x})) \qquad\qquad \ell(\mathbf{y}, h(\mathbf{x})) \neq \sum_{i=1}^m \ell(y_i, h_i(\mathbf{x}))$$
The individual target view

• Loss functions and optimal predictions
  ▶ Decomposable losses over targets.
• Learning algorithms
  ▶ Pooling.
  ▶ Stacking.
  ▶ Regularized multi-target learning.
• Problem settings
  ▶ Multi-label classification.
  ▶ Multivariate regression.
  ▶ Multi-task learning.
A starting example

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y}$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    ?    ?   ...    ?
A starting example

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y}$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    1    1   ...    0
Loss functions and optimal predictions

• We are interested in minimizing the loss for a given target $y_i$:
  $$\ell(y_i, \hat{y}_i)$$
• The loss function can also be written over all targets as:
  $$\ell(\mathbf{y}, \hat{\mathbf{y}}) = \sum_{i=1}^m \ell(y_i, \hat{y}_i)$$
• The expected loss, or risk, of model $h$ is given by:
  $$\mathbb{E}_{\mathbf{X}\mathbf{Y}}\,\ell(\mathbf{Y}, h(\mathbf{X})) = \mathbb{E}_{\mathbf{X}\mathbf{Y}} \sum_{i=1}^m \ell(Y_i, h_i(\mathbf{X})) = \sum_{i=1}^m \mathbb{E}_{\mathbf{X} Y_i}\,\ell(Y_i, h_i(\mathbf{X})).$$
• The optimal prediction minimizing the risk can thus be obtained independently for each target $y_i$.
• Can we gain by considering the other labels?
Multivariate linear regression

• Single-output prediction: learn a mapping $h: \mathcal{X} \to \mathcal{Y}$, $\mathcal{Y} = \mathbb{R}$:
  $$\underbrace{\begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}}_{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} \;\longrightarrow\; \underbrace{\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}}_{Y}$$
• When $h$ is linear: $h(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$.
• Multi-target: learn a mapping $h = (h_1, \ldots, h_m)^T : \mathcal{X} \to \mathcal{Y}$, $\mathcal{Y} = \mathbb{R}^m$:
  $$\begin{pmatrix} \mathbf{y}_1^T \\ \vdots \\ \mathbf{y}_n^T \end{pmatrix} = \begin{pmatrix} y_{11} & \cdots & y_{1m} \\ \vdots & & \vdots \\ y_{n1} & \cdots & y_{nm} \end{pmatrix}$$
• When $h$ is linear: $h(\mathbf{x}) = A^T \mathbf{x}$.
Single output regression vs. multivariate regression

• Multivariate least-squares risk:
  $$L(h, P) = \int_{\mathcal{X} \times \mathcal{Y}} \sum_{j=1}^m (y_j - h_j(\mathbf{x}))^2 \, dP(\mathbf{x}, \mathbf{y})$$
• The learning algorithm minimizes the empirical least-squares risk:
  $$\hat{A}^{OLS} = \arg\min_A \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - h_j(\mathbf{x}_i))^2.$$
• The solution for multivariate least squares is the same as for univariate least squares applied to each output independently.
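This equivalence is easy to check numerically; a minimal numpy sketch on random data (not the tutorial's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 5, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, m))

# Joint multivariate least squares: one solve for all m targets at once.
A_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Independent univariate least squares, one column of Y at a time.
A_indep = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(m)]
)

assert np.allclose(A_joint, A_indep)  # identical coefficient matrices
```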
Pooling

$$h_1(\mathbf{x}) = \llbracket x_1 + x_2 \rrbracket \qquad\qquad h_2(\mathbf{x}) = \llbracket \alpha x_1 + x_2 \rrbracket$$

• Data uniformly distributed in $[-1, 1]$.
• 10% noise added.
• Risk measured in terms of the 0/1 loss: $\ell_{0/1}(y_j, h_j(\mathbf{x})) = \llbracket y_j \neq h_j(\mathbf{x}) \rrbracket$.
Pooling

[Figure: data for Target 2 alone vs. data for Target 2 plus Target 1]

• A kind of “instance transfer”.
• The estimator will be biased, but will have reduced variance.
Pooling

• Expected generalization performance as a function of sample size (logistic regression, α = 1.5).
Pooling

[Figure: learning curves for α = 1.4, α = 1.5, and α = 2]

• The critical sample size (dashed line) depends on the model similarity, which is normally not known!
• To pool or not to pool? Or maybe pool to some degree?
James-Stein estimator

• Consider a multivariate normal distribution $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\theta}, \sigma^2 I)$.
• What is the best estimator of the mean vector $\boldsymbol{\theta}$?
• Evaluation w.r.t. MSE: $\mathbb{E}[(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^2]$.
• Single-observation maximum likelihood estimator: $\hat{\boldsymbol{\theta}}^{ML} = \mathbf{y}$.
• James-Stein estimator:⁴
  $$\hat{\boldsymbol{\theta}}^{JS} = \left(1 - \frac{(m-2)\sigma^2}{\|\mathbf{y}\|^2}\right) \mathbf{y}$$

⁴ W. James and C. Stein. Estimation with quadratic loss. In Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1, pages 361–379, 1961.
James-Stein estimator

• The James-Stein estimator outperforms the maximum likelihood estimator as soon as $m \geq 3$.
• Explanation: reducing variance by introducing bias.
• Regularization towards the origin $\mathbf{0}$.
• Regularization towards other directions is also possible:
  $$\hat{\boldsymbol{\theta}}^{JS+} = \left(1 - \frac{(m-2)\sigma^2}{\|\mathbf{y} - \mathbf{v}\|^2}\right) (\mathbf{y} - \mathbf{v}) + \mathbf{v}$$
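The effect is easy to reproduce in simulation; a small sketch with a fixed mean vector and repeated single observations:

```python
import numpy as np

rng = np.random.default_rng(42)
m, sigma, trials = 10, 1.0, 10_000
theta = rng.normal(size=m)                   # true mean vector (fixed)

mse_ml, mse_js = 0.0, 0.0
for _ in range(trials):
    y = theta + sigma * rng.normal(size=m)   # one observation y ~ N(theta, sigma^2 I)
    shrink = 1.0 - (m - 2) * sigma**2 / np.sum(y**2)
    theta_js = shrink * y                    # shrink towards the origin
    # (the "positive-part" variant JS+ would clip shrink at 0)
    mse_ml += np.sum((y - theta) ** 2)
    mse_js += np.sum((theta_js - theta) ** 2)

print(f"ML MSE: {mse_ml / trials:.3f}")      # roughly m * sigma^2
print(f"JS MSE: {mse_js / trials:.3f}")      # strictly smaller for m >= 3
```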
James-Stein estimator

• Works best when the norm of the mean vector is close to zero.⁵
• Only outperforms the maximum likelihood estimator w.r.t. the sum of squared errors over all components.
• Does not outperform the ML estimator w.r.t. the squared error of an individual component (i.e., one target).
• Forms the basis for explaining the behavior of many multi-target prediction methods.

⁵ B. Efron and C. Morris. Stein's estimation rule and its competitors – an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.
Joint target regularization

• Minimization of the empirical univariate regularized least-squares risk:
  $$\hat{\mathbf{a}}_j^{OLS}(\lambda) = \arg\min_{\mathbf{a}_j} \sum_{i=1}^n (y_{ij} - h_j(\mathbf{x}_i))^2 + \lambda \|\mathbf{a}_j\|^2.$$
• Minimization of the empirical multivariate regularized least-squares risk:
  $$\hat{A}^{OLS}(\lambda) = \arg\min_A \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - h_j(\mathbf{x}_i))^2 + \lambda \|A\|_F.$$
• Many machine learning techniques for multivariate regression and multi-task learning depart from this principle, while adopting more complex regularizers!
• Regularization incorporates bias, but reduces variance.
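A sketch of the multivariate ridge solution (using the squared Frobenius penalty, which admits a closed form): with this penalty the problem still decomposes over targets, which is exactly why the more complex regularizers mentioned above are needed to actually couple them.

```python
import numpy as np

def ridge_multi(X, Y, lam):
    """Closed-form minimizer of ||Y - XA||_F^2 + lam * ||A||_F^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(50, 4)), rng.normal(size=(50, 3))
A = ridge_multi(X, Y, lam=0.5)

# Column j of A is just the univariate ridge solution for target j:
A_cols = np.column_stack([ridge_multi(X, Y[:, [j]], 0.5) for j in range(3)])
assert np.allclose(A, A_cols)
```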
Mean-regularized multi-target learning⁶

[Diagram: parameter vectors of Targets 1–4 shrunk towards their mean]

• Simple assumption: the models for different targets are related to each other.
• Simple solution: the parameters of these models should have similar values.
• Approach: bias the parameter vectors towards their mean vector:
  $$\min_A \|Y - XA\|_F + \lambda \sum_{i=1}^m \Big\| \mathbf{a}_i - \frac{1}{m}\sum_{j=1}^m \mathbf{a}_j \Big\|^2$$
• Disadvantage: the assumption of all target models being similar might be invalid for many applications.

⁶ Evgeniou and Pontil. Regularized multi-task learning. In KDD 2004.
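A gradient-descent sketch of this objective (with the data term squared for smooth gradients; the learning rate and step count are arbitrary choices, not from the tutorial):

```python
import numpy as np

def mean_regularized_mtl(X, Y, lam=1.0, lr=1e-3, steps=2000):
    """Gradient descent on ||Y - XA||_F^2 + lam * sum_i ||a_i - mean_j a_j||^2.

    Columns of A are the per-target parameter vectors, biased towards
    their common mean.
    """
    p, m = X.shape[1], Y.shape[1]
    A = np.zeros((p, m))
    for _ in range(steps):
        grad_fit = 2 * X.T @ (X @ A - Y)
        centered = A - A.mean(axis=1, keepdims=True)  # a_i minus the mean vector
        A -= lr * (grad_fit + 2 * lam * centered)     # gradient of the penalty is 2*centered
    return A
```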
Multi-target prediction methods

• Methods that exploit the similarities between the structural parts of target models:
  $$\mathbf{y} = h(f(\mathbf{x}), \mathbf{x})\,, \qquad (1)$$
  where $f(\mathbf{x})$ is the prediction vector obtained by univariate methods, and $h(\cdot)$ are additional shrunken or regularized classifiers.
• Alternatively, a similar model can be given by:
  $$h^{-1}(\mathbf{y}, \mathbf{x}) = f(\mathbf{x})\,, \qquad (2)$$
  i.e., the output space (possibly along with the feature space) is first transformed, and univariate (regression) methods are then trained on the new output variables $h^{-1}(\mathbf{y}, \mathbf{x})$.
Stacking applied to multi-target prediction: general principle⁸

  Level 2:   h1   h2   h3   h4
  Level 1:   f1   f2   f3   f4

⁸ W. Cheng and E. Hüllermeier. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3):211–225, 2009.
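A minimal sketch of this two-level scheme with logistic regression at both levels (helper names are illustrative; a careful implementation would produce the level-1 predictions out-of-fold to avoid overfitting):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacking(X, Y):
    """Level 1: one classifier per label. Level 2: per-label classifiers on
    features augmented with all level-1 scores, so each h_i can exploit
    information about the other labels."""
    level1 = [LogisticRegression().fit(X, Y[:, i]) for i in range(Y.shape[1])]
    F = np.column_stack([c.predict_proba(X)[:, 1] for c in level1])
    X2 = np.hstack([X, F])
    level2 = [LogisticRegression().fit(X2, Y[:, i]) for i in range(Y.shape[1])]
    return level1, level2

def predict_stacking(models, X):
    level1, level2 = models
    F = np.column_stack([c.predict_proba(X)[:, 1] for c in level1])
    X2 = np.hstack([X, F])
    return np.column_stack([c.predict(X2) for c in level2])
```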
Multivariate regression methods

• Many multivariate regression methods, like C&W,⁹ reduced-rank regression (RRR),¹⁰ and FICYREG,¹¹ can be seen as a generalization of stacking:
  $$\mathbf{y} = (T^{-1} G T) A \mathbf{x}\,,$$
  where $T$ is the matrix of the $\mathbf{y}$ canonical co-ordinates (the solution of CCA), and the diagonal matrix $G$ contains the shrinkage factors for scaling the solutions of ordinary linear regression $A$.

⁹ L. Breiman and J. Friedman. Predicting multivariate responses in multiple linear regression. J. R. Stat. Soc., Ser. B, 69:3–54, 1997.
¹⁰ A. Izenman. Reduced-rank regression for the multivariate linear model. J. Multivar. Anal., 5:248–262, 1975.
¹¹ A. van der Merwe and J.V. Zidek. Multivariate regression analysis and canonical variates. Canadian Journal of Statistics, 8:27–39, 1980.
Multivariate regression methods

• Alternatively, $\mathbf{y}$ can first be transformed to the canonical co-ordinate system: $\mathbf{y}' = T\mathbf{y}$.
• Then, separate linear regressions are performed to obtain estimates $\tilde{\mathbf{y}}' = (\tilde{y}_1', \tilde{y}_2', \ldots, \tilde{y}_m')$.
• These estimates are further shrunk by the factors $g_{ii}$, obtaining $\hat{\mathbf{y}}' = G \tilde{\mathbf{y}}'$.
• Finally, the prediction is transformed back to the original output co-ordinate space: $\hat{\mathbf{y}} = T^{-1} \hat{\mathbf{y}}'$.
• Similar methods exist for multi-label classification.
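A sketch of this transform-shrink-backtransform pipeline, assuming the transformation T (e.g., the CCA solution) and the shrinkage factors g are computed elsewhere:

```python
import numpy as np

def shrunken_multivariate_regression(X, Y, T, g):
    """T: invertible (m x m) matrix of canonical co-ordinates.
    g: length-m vector of shrinkage factors in [0, 1]."""
    Yt = Y @ T.T                                   # y' = T y, row-wise
    A = np.linalg.lstsq(X, Yt, rcond=None)[0]      # separate OLS per transformed output

    def predict(Xnew):
        Yt_hat = (Xnew @ A) * g                    # shrink each co-ordinate by g_i
        return Yt_hat @ np.linalg.inv(T).T         # back to the original space
    return predict
```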
The joint target view

• Loss functions and the probabilistic view
  ▶ Relations between losses.
  ▶ How to minimize complex loss functions.
• Learning algorithms
  ▶ Reduction algorithms.
  ▶ Conditional random fields (CRFs).
  ▶ Structured support vector machines (SSVMs).
  ▶ Probabilistic classifier chains (PCCs).
• Problem settings
  ▶ Hamming and subset 0/1 loss minimization.
  ▶ Multilabel ranking.
  ▶ F-measure maximization.
A starting example

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \{0, 1\}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    ?    ?   ...    ?
A starting example

• Training data: $\{(\mathbf{x}_1, \mathbf{y}_1), (\mathbf{x}_2, \mathbf{y}_2), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$, $\mathbf{y}_i \in \mathcal{Y} = \{0, 1\}^m$.
• Predict the vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)$ for a given $\mathbf{x}$.

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
    x     4.0   2.5    1    1   ...    0
Two basic approaches

• Binary relevance: decomposes the problem into $m$ binary classification problems:
  $$(\mathbf{x}, \mathbf{y}) \longrightarrow (\mathbf{x}, y = y_i), \quad i = 1, \ldots, m$$
• Label powerset: treats each label combination as a new meta-class in multi-class classification:
  $$(\mathbf{x}, \mathbf{y}) \longrightarrow (\mathbf{x}, y = \mathrm{metaclass}(\mathbf{y}))$$

          X1    X2    Y1   Y2   ...   Ym
    x1    5.0   4.5    1    1   ...    0
    x2    2.0   2.5    0    1   ...    0
    ...
    xn    3.0   3.5    0    1   ...    1
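Both reductions are a few lines of code on top of any base classifier; a sketch with scikit-learn's logistic regression (function names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_br(X, Y):
    """Binary relevance: one binary problem per label."""
    return [LogisticRegression().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def fit_lp(X, Y):
    """Label powerset: each observed label vector becomes one meta-class."""
    combos, meta = np.unique(Y, axis=0, return_inverse=True)
    clf = LogisticRegression().fit(X, meta)    # multi-class over meta-classes
    return clf, combos

def predict_lp(model, X):
    clf, combos = model
    return combos[clf.predict(X)]              # map meta-class back to a label vector
```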
Synthetic data

• Two independent models:
  $$f_1(\mathbf{x}) = \frac{1}{2}x_1 + \frac{1}{2}x_2\,, \qquad f_2(\mathbf{x}) = \frac{1}{2}x_1 - \frac{1}{2}x_2$$
• Logistic model to get the labels:
  $$P(y_i = 1) = \frac{1}{1 + \exp(-2 f_i)}$$

[Figure: scatter plots of the two sampled labels over $[-1, 1]^2$]
Synthetic data

• Two dependent models:
  $$f_1(\mathbf{x}) = \frac{1}{2}x_1 + \frac{1}{2}x_2\,, \qquad f_2(y_1, \mathbf{x}) = y_1 + \frac{1}{2}x_1 - \frac{1}{2}x_2 - \frac{2}{3}$$
• Logistic model to get the labels:
  $$P(y_i = 1) = \frac{1}{1 + \exp(-2 f_i)}$$

[Figure: scatter plots of the two sampled labels over $[-1, 1]^2$]
Results for two performance measures

• Hamming loss: $\ell_H(\mathbf{y}, \mathbf{h}) = \frac{1}{m} \sum_{i=1}^m \llbracket y_i \neq h_i \rrbracket$.
• Subset 0/1 loss: $\ell_{0/1}(\mathbf{y}, \mathbf{h}) = \llbracket \mathbf{y} \neq \mathbf{h} \rrbracket$.

  Conditional independence
    classifier    Hamming loss    subset 0/1 loss
    BR LR         0.4232          0.6723
    LP LR         0.4232          0.6725

  Conditional dependence
    classifier    Hamming loss    subset 0/1 loss
    BR LR         0.3470          0.5499
    LP LR         0.3610          0.5146
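For reference, the two losses in code:

```python
import numpy as np

def hamming_loss(Y, H):
    return np.mean(Y != H)                   # average over examples and labels

def subset_zero_one_loss(Y, H):
    return np.mean(np.any(Y != H, axis=1))   # 1 unless the whole vector matches

Y = np.array([[1, 1, 0], [0, 1, 0]])
H = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y, H))           # 1/6
print(subset_zero_one_loss(Y, H))   # 1/2
```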
Linear + XOR synthetic data

Figure: Problem with two targets: shapes (△ vs. ◦) and colors.
Linear + XOR synthetic data

  classifier       Hamming loss         subset 0/1 loss
  BR LR            0.2399 (±.0097)      0.4751 (±.0196)
  LP LR            0.0143 (±.0020)      0.0195 (±.0011)
  BR MLRules       0.0011 (±.0002)      0.0020 (±.0003)
  Bayes Optimal    0                    0
Linear + XOR synthetic data

• BR LR uses two linear classifiers: it cannot handle the color label, which is the XOR problem.
• LP LR uses four linear classifiers to solve the 4-class problem (the four shape/color combinations): this extends the hypothesis space.
• BR MLRules uses two non-linear classifiers (based on decision rules): the XOR problem is then no problem at all.
• There is no noise in the data.
• It is easy to perform an unfair comparison.
Multi-target prediction - probabilistic view

• Data come from a distribution
  $$P(\mathbf{Y}, \mathbf{X}).$$
• Since we predict the value of $\mathbf{Y}$ for a given object $\mathbf{x}$, we are interested in the conditional distribution:
  $$P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{Y} = \mathbf{y}, \mathbf{X} = \mathbf{x})}{P(\mathbf{X} = \mathbf{x})}$$
• What is the most reasonable response $\mathbf{y}$?
  ▶ The $\mathbf{y}$ for which $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ is the largest?
  ▶ The $\mathbf{y}$ for which the $P(Y_i = y_i \mid \mathbf{X} = \mathbf{x})$ are the largest?
  ▶ ... ?
  ▶ ... ?
  ▶ ... ?
Multi-target prediction - loss minimization view

• Define your problem via minimization of a loss function $\ell(\mathbf{y}, h(\mathbf{x}))$.
• The risk (expected loss) of the prediction $h$ for a given $\mathbf{x}$ is:
  $$L_\ell(h, P \mid \mathbf{x}) = \mathbb{E}_{\mathbf{Y}|\mathbf{x}}[\ell(\mathbf{Y}, h(\mathbf{x}))] = \sum_{\mathbf{y} \in \mathcal{Y}} P(\mathbf{Y} = \mathbf{y} \mid \mathbf{x})\, \ell(\mathbf{y}, h(\mathbf{x}))$$
• The risk-minimizing model $h^*(\mathbf{x})$, the so-called Bayes classifier, is defined for a given $\mathbf{x}$ by
  $$h^*(\mathbf{x}) = \arg\min_{h(\mathbf{x})} L_\ell(h, P \mid \mathbf{x})$$
• Different formulations of loss functions are possible:
  ▶ Set-based losses.
  ▶ Ranking-based losses.
Multi-target prediction - loss minimization view

• Subset 0/1 loss: $\ell_{0/1}(\mathbf{y}, \mathbf{h}) = \llbracket \mathbf{y} \neq \mathbf{h} \rrbracket$
• Hamming loss: $\ell_H(\mathbf{y}, \mathbf{h}) = \frac{1}{m}\sum_{i=1}^m \llbracket y_i \neq h_i \rrbracket$
• F-measure-based loss: $\ell_F(\mathbf{y}, \mathbf{h}) = 1 - \frac{2\sum_{i=1}^m y_i h_i}{\sum_{i=1}^m y_i + \sum_{i=1}^m h_i}$
• Rank loss: $\ell_{rnk}(\mathbf{y}, \mathbf{h}) = w(\mathbf{y}) \sum_{y_i > y_j} \Big( \llbracket h_i < h_j \rrbracket + \frac{1}{2} \llbracket h_i = h_j \rrbracket \Big)$
• ...
Loss minimization view - main issues

• Relations between losses.
• The form of the Bayes classifiers for different losses.
• How to optimize?
  ▶ Assumptions behind learning algorithms.
  ▶ Statistical consistency and regret bounds.
  ▶ Generalization bounds.
  ▶ Computational complexity.
Relations between losses

• The loss function $\ell(\mathbf{y}, \mathbf{h})$ should fulfill some basic conditions:
  ▶ $\ell(\mathbf{y}, \mathbf{h}) = 0$ if and only if $\mathbf{y} = \mathbf{h}$.
  ▶ $\ell(\mathbf{y}, \mathbf{h})$ is maximal when $y_i \neq h_i$ for every $i = 1, \ldots, m$.
  ▶ It should be monotonically non-decreasing with respect to the number of $y_i \neq h_i$.
• In the case of deterministic data (no noise): the optimal prediction has the same form for all loss functions, and the risk of this prediction is 0.
• In the case of non-deterministic data (noise): the optimal prediction and its risk can differ across losses.
Relations between losses

• Hamming loss vs. subset 0/1 loss:¹²
  ▶ The form of the risk minimizers.
  ▶ Consistency of the risk minimizers.
  ▶ Risk bound analysis.
  ▶ Regret bound analysis.

¹² K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. On loss minimization and label dependence in multi-label classification. Machine Learning, 88:5–45, 2012.
Risk minimizers

• The risk minimizer for the Hamming loss is the marginal mode:
  $$h_i^*(\mathbf{x}) = \arg\max_{y_i \in \{0,1\}} P(Y_i = y_i \mid \mathbf{x})\,, \quad i = 1, \ldots, m,$$
  while for the subset 0/1 loss it is the joint mode:
  $$h^*(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} P(\mathbf{y} \mid \mathbf{x})\,.$$
• Marginal mode vs. joint mode:

    y        P(y)
    0000     0.30
    0111     0.17      Marginal mode: 1111
    1011     0.18      Joint mode:    0000
    1101     0.17
    1110     0.18
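Both modes of the example distribution can be computed directly:

```python
import numpy as np

# The distribution from the example: label vectors and their probabilities.
ys = np.array([[0,0,0,0], [0,1,1,1], [1,0,1,1], [1,1,0,1], [1,1,1,0]])
ps = np.array([0.30, 0.17, 0.18, 0.17, 0.18])

joint_mode = ys[np.argmax(ps)]                  # best for subset 0/1 loss
marginals = ps @ ys                             # P(Y_i = 1) for each label
marginal_mode = (marginals > 0.5).astype(int)   # best for Hamming loss

print(joint_mode)      # [0 0 0 0]
print(marginal_mode)   # [1 1 1 1]  (each marginal is just above 0.5)
```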
Consistency of risk minimizers and risk bounds

• The risk minimizers for $\ell_H$ and $\ell_{0/1}$ are equivalent,
  $$h_H^*(\mathbf{x}) = h_{0/1}^*(\mathbf{x})\,,$$
  under specific conditions, for example, when:
  ▶ The targets $Y_1, \ldots, Y_m$ are conditionally independent, i.e.,
    $$P(\mathbf{Y} \mid \mathbf{x}) = \prod_{i=1}^m P(Y_i \mid \mathbf{x})\,.$$
  ▶ The probability of the joint mode satisfies
    $$P(h_{0/1}^*(\mathbf{x}) \mid \mathbf{x}) > 0.5\,.$$
• The following bounds hold for any $P(\mathbf{Y} \mid \mathbf{x})$ and $h$:
  $$\frac{1}{m} L_{0/1}(h, P \mid \mathbf{x}) \leq L_H(h, P \mid \mathbf{x}) \leq L_{0/1}(h, P \mid \mathbf{x})$$
Regret analysis

• The previous results may suggest that one of the loss functions can be used as a proxy (surrogate) for the other:
  ▶ In some situations both risk minimizers coincide.
  ▶ One can provide mutual bounds for both loss functions.
• However, a worst-case regret analysis shows that minimization of the subset 0/1 loss may result in a large error for the Hamming loss, and vice versa.
Regret analysis

• The regret of a classifier $h$ with respect to $\ell$ is defined as:
  $$\mathrm{Reg}_\ell(h, P) = L_\ell(h, P) - L_\ell(h_\ell^*, P)\,,$$
  where $h_\ell^*$ is the Bayes classifier for the given loss $\ell$.
• The regret measures how much worse $h$ is in comparison with the optimal classifier for a given loss.
• To simplify the analysis, we will consider the conditional regret:
  $$\mathrm{Reg}_\ell(h, P \mid \mathbf{x}) = L_\ell(h, P \mid \mathbf{x}) - L_\ell(h_\ell^*, P \mid \mathbf{x})\,.$$
• We will analyze the regret of:
  ▶ the Bayes classifier for the Hamming loss, $h_H^*$,
  ▶ the Bayes classifier for the subset 0/1 loss, $h_{0/1}^*$,
  with respect to both loss functions.
• This is a somewhat unusual analysis.
Regret analysis

• The following upper bound holds:
  $$\mathrm{Reg}_{0/1}(h_H^*, P \mid \mathbf{x}) = L_{0/1}(h_H^*, P \mid \mathbf{x}) - L_{0/1}(h_{0/1}^*, P \mid \mathbf{x}) < 0.5$$
• Moreover, this bound is tight.
• Example:

    y       P(y)
    0000    0.02      Marginal mode: 0000
    0011    0.49      Joint mode:    0011 or 1100
    1100    0.49
Regret analysis

• The following upper bound holds for $m > 3$:
  $$\mathrm{Reg}_H(h_{0/1}^*, P \mid \mathbf{x}) = L_H(h_{0/1}^*, P \mid \mathbf{x}) - L_H(h_H^*, P \mid \mathbf{x}) < \frac{m-2}{m+2}$$
• Moreover, this bound is tight.
• Example:

    y       P(y)
    0000    0.170
    0111    0.166     Marginal mode: 1111
    1011    0.166     Joint mode:    0000
    1101    0.166
    1110    0.166
    1111    0.166
Relations between losses

• Summary:
  ▶ The risk minimizers of the Hamming and subset 0/1 losses are different: marginal mode vs. joint mode.
  ▶ Under specific conditions, the two risk minimizers are equivalent.
  ▶ The risks of these loss functions mutually upper-bound each other.
  ▶ Minimization of the subset 0/1 loss may cause a high regret for the Hamming loss, and vice versa.
Relations between losses

• Both are commonly used.
• Hamming loss:
  ▶ Not too many labels.
  ▶ Well-balanced labels.
  ▶ Application: gene function prediction.
• Subset 0/1 loss:
  ▶ Very restrictive.
  ▶ Small number of labels.
  ▶ Low-noise problems.
  ▶ Application: prediction of the diseases of a patient.
BR vs. LP

• What does the above analysis change in the interpretation of the results of the starting examples?
  ▶ BR trains an independent classifier for each label:
    • Does BR assume label independence?
    • Is it consistent for any loss function?
    • What is its complexity?
  ▶ LP treats each label combination as a new meta-class in multi-class classification:
    • What are the assumptions behind LP?
    • Is it consistent for any loss function?
    • What is its complexity?
BR vs. LP

• Binary relevance (BR)
  ▶ BR is consistent for the Hamming loss without any additional assumption on label (in)dependence.
  ▶ If this were not true, then we could not optimally solve binary classification problems!
  ▶ For other losses, one should probably make additional assumptions:
    • For the subset 0/1 loss: label independence, a high probability of the joint mode (> 0.5), . . .
  ▶ Learning and inference are linear in $m$ (and faster algorithms exist).
BR vs. LP

• Label powerset (LP)
  ▶ LP is consistent for the subset 0/1 loss.
  ▶ In its basic formulation it is not consistent for the Hamming loss.
  ▶ However, if used with a probabilistic multi-class classifier, it estimates the joint conditional distribution for a given $\mathbf{x}$: inference for any loss is then possible.
  ▶ Similarly, by reducing to cost-sensitive multi-class classification, LP can be used with almost any loss function.
  ▶ LP may gain from the implicit expansion of the feature or hypothesis space.
  ▶ Unfortunately, learning and inference are basically exponential in $m$ (although this complexity is constrained by the number of training examples).
Algorithmic approaches for multivariate losses

• The loss functions, like the Hamming loss or the subset 0/1 loss, often referred to as task losses, are usually neither convex nor differentiable.
• Therefore, learning is a hard optimization problem.
• Two approaches try to make this task easier:
  ▶ Reduction.
  ▶ Structured loss minimization.
• Two phases in solving multi-target prediction problems:
  ▶ Learning: estimate the parameters of the scoring function $f(\mathbf{x}, \mathbf{y})$.
  ▶ Inference: use the scoring function $f(\mathbf{x}, \mathbf{y})$ to classify new instances by finding the best $\mathbf{y}$ for a given $\mathbf{x}$.
Reduction

• Reduce the original problem into problems of a simpler type, for which efficient algorithmic solutions are available.
• Reduction to one problem or to a sequence of problems.
• Plug-in rule classifiers.
• BR and LP, already discussed, are reductions.

Pipeline: training data $\{(\mathbf{x}, \mathbf{y})\}_{i=1}^n$ → transformation $(\mathbf{x}, \mathbf{y}) \to (\mathbf{x}', \mathbf{y}')$ → learning: $\min \tilde{\ell}(\mathbf{y}', \mathbf{x}', f)$ → scoring function $f(\mathbf{x}', \mathbf{y}')$ → inference: $\mathbf{x} \mapsto \hat{\mathbf{y}}$
Structured loss minimization

• Replace the task loss by a surrogate loss that is easier to cope with.
• The surrogate loss is typically a differentiable approximation of the task loss or a convex upper bound of it.

Pipeline: training data $\{(\mathbf{x}, \mathbf{y})\}_{i=1}^n$ → learning: $\min \tilde{\ell}(\mathbf{y}, \mathbf{x}, f)$ → scoring function $f(\mathbf{x}, \mathbf{y})$ → inference: $\mathbf{x} \mapsto \hat{\mathbf{y}}$
Statistical consistency

• Analysis of algorithms in terms of their infinite-sample performance.¹³
• We say that a proxy loss $\tilde{\ell}$ is consistent with the task loss $\ell$ when the following holds:
  $$\mathrm{Reg}_{\tilde{\ell}}(h, P) \to 0 \;\Rightarrow\; \mathrm{Reg}_\ell(h, P) \to 0\,.$$
• The definition covers both structured loss minimization and reduction algorithms:
  ▶ Structured loss minimization: $\tilde{\ell}$ = the surrogate loss.
  ▶ Reduction: $\tilde{\ell}$ = the loss used in the reduced problem.
• We already know: the Hamming loss is not a consistent proxy for the subset 0/1 loss, and vice versa.

¹³ A. Tewari and P.L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007–1025, 2007.
D. McAllester and J. Keshet. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, pages 2205–2212, 2011.
W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199-200:22–44, 2013.
Algorithms

• Conditional random fields (CRFs)
• Structured support vector machines (SSVMs)
• Probabilistic classifier chains (PCCs)
Conditional random fields

• Conditional random fields (CRFs) extend logistic regression.¹⁴
• CRFs model the conditional joint distribution of $\mathbf{Y}$ by:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(f(\mathbf{x}, \mathbf{y}))$$
• $f(\mathbf{x}, \mathbf{y})$ is a scoring function that models the fit between $\mathbf{y}$ and $\mathbf{x}$.
• $Z(\mathbf{x})$ is a normalization constant:
  $$Z(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \exp(f(\mathbf{x}, \mathbf{y}))$$

¹⁴ John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
Conditional random fields

• The negative log-loss is used as a surrogate:
  $$\ell_{\log}(\mathbf{y}, \mathbf{x}, f) = -\log P(\mathbf{y} \mid \mathbf{x}) = \log\Big(\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(f(\mathbf{x}, \mathbf{y}'))\Big) - f(\mathbf{x}, \mathbf{y})$$
• Regularized log-likelihood optimization:
  $$\min_f \frac{1}{n} \sum_{i=1}^n \ell_{\log}(\mathbf{y}_i, \mathbf{x}_i, f) + \lambda J(f)$$
• Inference for a new instance $\mathbf{x}$:
  $$h(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} P(\mathbf{y} \mid \mathbf{x})$$
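For small m, CRF inference can be done by brute-force enumeration over $\{0,1\}^m$; a purely illustrative sketch (the scoring function here is a toy assumption, not from the tutorial):

```python
import numpy as np
from itertools import product

def crf_predict(f, x, m):
    """Exact CRF inference by enumeration; feasible only for small m,
    since Z(x) sums over all 2^m label combinations."""
    ys = [np.array(y) for y in product([0, 1], repeat=m)]
    scores = np.array([f(x, y) for y in ys])
    probs = np.exp(scores - scores.max())   # softmax, numerically stabilized
    probs /= probs.sum()                    # P(y | x) = exp(f(x, y)) / Z(x)
    return ys[int(np.argmax(probs))], probs

# Toy scoring function: linear per-label terms plus a pairwise agreement bonus.
f = lambda x, y: float(x @ y) + 0.5 * float(y[0] == y[1])
y_hat, p = crf_predict(f, np.array([1.0, -0.5]), m=2)
```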
Conditional random fields

• Similar to LP, but with an internal structure of the classes and a scoring function $f(\mathbf{x}, \mathbf{y})$.
• A convex optimization problem, but depending on the structure of $f(\mathbf{x}, \mathbf{y})$ its solution can be hard.
• Similarly, the inference (also known as the decoding problem) is hard in the general case.
• Tailored for the subset 0/1 loss (estimation of the joint mode).
• Different forms for $f(\mathbf{x}, \mathbf{y})$ are possible.
Conditional random fields

• Let $f(\mathbf{x}, \mathbf{y})$ be defined as:
  $$f(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^m f_i(\mathbf{x}, y_i)$$
• In this case, we have:
  $$P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp(f(\mathbf{x}, \mathbf{y}))}{\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(f(\mathbf{x}, \mathbf{y}'))} = \frac{\exp(\sum_{i=1}^m f_i(\mathbf{x}, y_i))}{\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(\sum_{i=1}^m f_i(\mathbf{x}, y_i'))} = \frac{\prod_{i=1}^m \exp(f_i(\mathbf{x}, y_i))}{\sum_{\mathbf{y}' \in \mathcal{Y}} \prod_{i=1}^m \exp(f_i(\mathbf{x}, y_i'))} = \prod_{i=1}^m \frac{\exp(f_i(\mathbf{x}, y_i))}{\sum_{y_i'} \exp(f_i(\mathbf{x}, y_i'))} = \prod_{i=1}^m P(y_i \mid \mathbf{x})$$
• Optimal for the Hamming loss!
• The structure of $f(\mathbf{x}, \mathbf{y})$ is connected to the loss function.
Conditional random fields

• A different form of $f(\mathbf{x}, \mathbf{y})$:
  $$f(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^m f_i(\mathbf{x}, y_i) + \sum_{y_k, y_l} f_{k,l}(y_k, y_l)$$
• Models pairwise interactions, . . . but in the conditional sense:
  ▶ Assume that $\mathbf{x}$ is not given:
    $$P(\mathbf{y}) = \frac{\exp\big(\sum_i f_i(y_i) + \sum_{y_k, y_l} f_{k,l}(y_k, y_l)\big)}{\sum_{\mathbf{y} \in \mathcal{Y}} \exp\big(\sum_i f_i(y_i) + \sum_{y_k, y_l} f_{k,l}(y_k, y_l)\big)}$$
  ▶ This models a prior joint distribution over the labels!
  ▶ The prior cannot easily be factorized into marginal probabilities.
• Should work better for the subset 0/1 loss than for the Hamming loss.
Structured loss minimization

• CRFs do not directly take the task loss into account.
• We would like to have a method that can be used with any loss . . .
• Structured support vector machines (SSVMs) extend the idea of large-margin classifiers to structured output prediction problems.¹⁵

¹⁵ Y. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
Structured support vector machines

• SSVMs use, similarly to CRFs, a scoring function $f(\mathbf{x}, \mathbf{y})$.
• They minimize the structured hinge loss:
  $$\tilde{\ell}_h(\mathbf{y}, \mathbf{x}, f) = \max_{\mathbf{y}' \in \mathcal{Y}} \{\ell(\mathbf{y}, \mathbf{y}') + f(\mathbf{x}, \mathbf{y}')\} - f(\mathbf{x}, \mathbf{y})\,.$$
• The task loss $\ell(\mathbf{y}, \mathbf{y}')$ is used for margin rescaling.
• Regularized optimization problem:
  $$\min_f \frac{1}{n} \sum_{i=1}^n \tilde{\ell}_h(\mathbf{y}_i, \mathbf{x}_i, f) + \lambda J(f)$$
• Predict according to:
  $$h(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})\,.$$
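A sketch of the margin-rescaled structured hinge loss, again by brute-force enumeration (real SSVM solvers replace the enumeration with a loss-augmented inference routine; the scoring function below is a toy assumption):

```python
import numpy as np
from itertools import product

def structured_hinge(f, x, y, loss, m):
    """max_{y'} { loss(y, y') + f(x, y') } - f(x, y), over y' in {0,1}^m."""
    best = -np.inf
    for y_prime in product([0, 1], repeat=m):
        y_prime = np.array(y_prime)
        best = max(best, loss(y, y_prime) + f(x, y_prime))
    return best - f(x, y)

hamming = lambda y, yp: float(np.mean(y != yp))
f = lambda x, y: float(x @ (2 * y - 1))      # toy linear scoring function
val = structured_hinge(f, np.array([0.3, -0.2]), np.array([1, 0]), hamming, m=2)
```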
Structured support vector machines

• A convex optimization problem with linear constraints.
• An exponential number of constraints −→ cutting-plane algorithms.
• The arg max problem is hard for general structures.
• Different forms for $f(\mathbf{x}, \mathbf{y})$ are possible.
Structured support vector machines

• Let f(x, y) be defined as:

$$f(x, y) = \sum_{i=1}^m f_i(x, y_i)$$

• Let us use it with the Hamming loss:

$$\begin{aligned}
\tilde{\ell}_h(y, x, f) &= \max_{y' \in \mathcal{Y}} \left\{ \ell_H(y, y') + f(x, y') \right\} - f(x, y) \\
&= \max_{y' \in \mathcal{Y}} \left\{ \sum_i \llbracket y_i \neq y_i' \rrbracket + \sum_i f_i(x, y_i') \right\} - \sum_i f_i(x, y_i) \\
&= \sum_i \left( \max_{y_i'} \left\{ \llbracket y_i \neq y_i' \rrbracket + f_i(x, y_i') \right\} - f_i(x, y_i) \right)
\end{aligned}$$

• The structured hinge loss decomposes into a hinge loss for each label.16
• Consistent for the Hamming loss.

16 B. Hariharan, L. Zelnik-Manor, S.V.N. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In ICML. Omnipress, 2010
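The decomposition can be verified numerically. The sketch below (made-up potentials again) computes the structured hinge loss once via the max over all 2^m label vectors and once as a sum of independent per-label maxima.

```python
import itertools
import numpy as np

f = np.array([[0.2, 1.1], [0.7, -0.3], [-0.5, 0.9]])  # made-up potentials
m, y = f.shape[0], (1, 0, 1)
score = lambda y2: sum(f[i, y2[i]] for i in range(m))

# Joint form: one max over all 2^m label vectors.
joint = max(sum(a != b for a, b in zip(y, y2)) + score(y2)
            for y2 in itertools.product([0, 1], repeat=m)) - score(y)

# Decomposed form: an independent max for each label.
per_label = sum(max((y[i] != v) + f[i, v] for v in (0, 1)) - f[i, y[i]]
                for i in range(m))

assert np.isclose(joint, per_label)  # max of a sum of independent terms = sum of maxes
```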
Structured support vector machines

• The form of f(x, y) that models pairwise interactions:

$$f(x, y) = \sum_{i=1}^m f_i(x, y_i) + \sum_{k,l} f_{k,l}(y_k, y_l)$$

• How important is the pairwise interaction part for different task losses?
• For a general form of f(x, y), SSVMs are inconsistent for the Hamming loss.17
• There are more results of this type.18

17 W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199-200:22–44, 2013
18 A. Tewari and P.L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007–1025, 2007; D. McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007
Structured support vector machines

Table: SSVMs with pairwise term19 vs. BR with LR20.

Dataset   SSVM (best)   BR LR
Scene     0.101±.003    0.102±.003
Yeast     0.202±.005    0.199±.005
Synth1    0.069±.001    0.067±.002
Synth2    0.058±.001    0.084±.001

• There is almost no difference between the two algorithms.

19 Thomas Finley and Thorsten Joachims. Training structural SVMs when exact inference is intractable. In ICML. Omnipress, 2008
20 K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. An analysis of chaining in multi-label classification. In ECAI, 2012
SSVMs vs. CRFs

• SSVMs and CRFs are quite similar to each other:

$$\tilde{\ell}_{\log}(y, x, f) = \log\left( \sum_{y' \in \mathcal{Y}} \exp(f(x, y')) \right) - f(x, y)$$

$$\tilde{\ell}_h(y, x, f) = \max_{y' \in \mathcal{Y}} \left\{ \ell(y, y') + f(x, y') \right\} - f(x, y)$$

• The main differences are:
  I max vs. soft-max
  I margin vs. no margin
• Many works on incorporating a margin in CRFs.21

21 P. Pletscher, C.S. Ong, and J.M. Buhmann. Entropy and margin maximization for structured output learning. In ECML/PKDD. Springer, 2010; Q. Shi, M. Reid, and T. Caetano. Hybrid model of conditional random field and support vector machine. In Workshop at NIPS, 2009; K. Gimpel and N. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In HLT, pages 733–736, 2010
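The max/soft-max contrast is easy to see side by side. In the following sketch (made-up potentials, brute-force sums over Y), the CRF surrogate replaces the SSVM's hard, loss-rescaled max with a log-sum-exp and no margin.

```python
import itertools
import numpy as np

f = np.array([[0.2, 1.1], [0.7, -0.3], [-0.5, 0.9]])  # made-up potentials
m, y = f.shape[0], (1, 0, 1)
ys = list(itertools.product([0, 1], repeat=m))
score = lambda y2: sum(f[i, y2[i]] for i in range(m))
hamming = lambda a, b: sum(u != v for u, v in zip(a, b))

# CRF surrogate: soft-max (log-sum-exp) over Y, no margin term.
crf_loss = np.log(sum(np.exp(score(y2)) for y2 in ys)) - score(y)

# SSVM surrogate: hard max over Y, margin rescaled by the task loss.
ssvm_loss = max(hamming(y, y2) + score(y2) for y2 in ys) - score(y)

print(crf_loss, ssvm_loss)
```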
Probabilistic classifier chains

• Probabilistic classifier chains (PCCs)22 estimate, similarly to CRFs, the joint conditional distribution P(Y | x).
• Their idea is to repeatedly apply the product rule of probability:

$$P(Y = y \mid x) = \prod_{i=1}^m P(Y_i = y_i \mid x, y_1, \ldots, y_{i-1})$$

• They follow a reduction to a sequence of subproblems:

$$(x, y) \;\longrightarrow\; \left(x' = (x, y_1, \ldots, y_{i-1}),\; y' = y_i\right), \quad i = 1, \ldots, m$$

• An additional advantage is that one can easily sample from the estimated distribution.

22 J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning Journal, 85:333–359, 2011; K. Dembczyński, W. Cheng, and E. Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279–286. Omnipress, 2010
Probabilistic classifier chains

• Learning of PCCs relies on constructing probabilistic classifiers for estimating

$$P(Y_i = y_i \mid x, y_1, \ldots, y_{i-1}),$$

independently for each i = 1, . . . , m.
• One can use scoring functions f_i(x′, y_i) together with a logistic transformation.
• With linear models, the overall scoring function takes the form:

$$f(x, y) = \sum_{i=1}^m f_i(x, y_i) + \sum_{k,l} f_{k,l}(y_k, y_l)$$
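A minimal training sketch of this reduction, assuming scikit-learn's LogisticRegression as the base learner and a fixed label order; the toy data are random and only illustrate the construction x′ = (x, y1, ..., y_{i-1}).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pcc(X, Y):
    """One probabilistic classifier per label, conditioned on x and all previous labels."""
    models = []
    for i in range(Y.shape[1]):
        Xi = np.hstack([X, Y[:, :i]])              # x' = (x, y_1, ..., y_{i-1})
        models.append(LogisticRegression().fit(Xi, Y[:, i]))
    return models

def predict_proba_chain(models, x, y_prefix):
    """P(Y_i = 1 | x, y_1, ..., y_{i-1}) for i = len(y_prefix) + 1."""
    i = len(y_prefix)
    xi = np.hstack([x, y_prefix]).reshape(1, -1)
    return models[i].predict_proba(xi)[0, 1]

rng = np.random.default_rng(0)                     # toy data: 5 features, 3 labels
X = rng.normal(size=(100, 5))
Y = (rng.uniform(size=(100, 3)) < 0.5).astype(int)
models = train_pcc(X, Y)
print(predict_proba_chain(models, X[0], Y[0, :2]))
```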
Probabilistic classifier chains

• Inference relies on exploiting the probability tree built by PCC:

x
├── y1 = 0   P(y1 = 0 | x) = 0.4
│   ├── y2 = 0   P(y2 = 0 | y1 = 0, x) = 0.0   →   P(y = (0, 0) | x) = 0.00
│   └── y2 = 1   P(y2 = 1 | y1 = 0, x) = 1.0   →   P(y = (0, 1) | x) = 0.40
└── y1 = 1   P(y1 = 1 | x) = 0.6
    ├── y2 = 0   P(y2 = 0 | y1 = 1, x) = 0.4   →   P(y = (1, 0) | x) = 0.24
    └── y2 = 1   P(y2 = 1 | y1 = 1, x) = 0.6   →   P(y = (1, 1) | x) = 0.36

• For subset 0/1 loss one needs to find h(x) = arg max_{y∈Y} P(y | x).
• Greedy and approximate search techniques with guarantees exist.23
• Other losses: compute the prediction on a sample from P(Y | x).
• Sampling can be easily performed by using the probability tree.

23 K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. An analysis of chaining in multi-label classification. In ECAI, 2012; A. Kumar, S. Vembu, A.K. Menon, and C. Elkan. Beam search algorithms for multilabel learning. In Machine Learning, 2013
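The tree above can be queried directly. The sketch below hard-codes its conditional probabilities, contrasts exhaustive mode search (the Bayes-optimal answer for subset 0/1 loss) with greedy search, which in this very example misses the mode, and then samples from P(Y | x) by walking the tree.

```python
import itertools
import numpy as np

# Conditional probabilities of the tree above:
# p1[v] = P(y1 = v | x); p2[u][v] = P(y2 = v | y1 = u, x).
p1 = {0: 0.4, 1: 0.6}
p2 = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.4, 1: 0.6}}
joint = lambda y: p1[y[0]] * p2[y[0]][y[1]]

# Exhaustive inference for subset 0/1 loss: the joint mode.
mode = max(itertools.product([0, 1], repeat=2), key=joint)
print(mode, joint(mode))                        # (0, 1) with probability 0.40

# Greedy inference takes the locally best branch: y1 = 1 (0.6), then y2 = 1 (0.6),
# ending in (1, 1) with probability 0.36, so it misses the mode in this example.
y1 = max((0, 1), key=lambda v: p1[v])
y2 = max((0, 1), key=lambda v: p2[y1][v])
print((y1, y2), joint((y1, y2)))

# Sampling from P(Y | x) by walking the tree label by label.
rng = np.random.default_rng(0)
def sample():
    a = int(rng.uniform() < p1[1])
    b = int(rng.uniform() < p2[a][1])
    return (a, b)
print([sample() for _ in range(5)])
```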
Probabilistic classifier chains

Table: PCC vs. SSVMs on Hamming loss and PCC vs. BR on subset 0/1 loss.

            Hamming loss              subset 0/1 loss
Dataset     PCC         SSVM (best)   PCC         BR
Scene       0.104±.004  0.101±.003    0.385±.014  0.509±.014
Yeast       0.203±.005  0.202±.005    0.761±.014  0.842±.012
Synth1      0.067±.001  0.069±.001    0.239±.006  0.240±.006
Synth2      0.000±.000  0.058±.001    0.000±.000  0.832±.004
Reuters     0.060±.002  0.045±.001    0.598±.009  0.689±.008
Mediamill   0.172±.001  0.182±.001    0.885±.003  0.902±.003
Multilabel ranking

Multi-label classification

politics        0
economy         0
business        0
sport           1
tennis          1
soccer          0
show-business   0
celebrities     1
...
England         1
USA             1
Poland          1
Lithuania       0
Multilabel ranking

Multilabel ranking

tennis
sport
England
Poland
USA
...
politics
Multilabel ranking

• Ranking loss:

$$\ell_{\mathrm{rnk}}(y, h) = w(y) \sum_{(i,j)\,:\,y_i > y_j} \left( \llbracket h_i(x) < h_j(x) \rrbracket + \frac{1}{2} \llbracket h_i(x) = h_j(x) \rrbracket \right),$$

where w(y) < w_max is a weight function.

X1    X2    Y1   Y2   ...   Ym
4.0   2.5   1    0          0
h2 > h1 > . . . > hm

• The weight function w(y) is usually used to normalize the range of the rank loss to [0, 1]:

$$w(y) = \frac{1}{n_+ n_-},$$

i.e., it is equal to the inverse of the total number of pairwise comparisons between relevant and irrelevant labels.
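For concreteness, a small sketch of the normalized rank loss; the default weight is the 1/(n+ n−) normalization discussed above, and the example vectors are made up.

```python
def rank_loss(y, h, w=None):
    """Weighted rank loss: mis-ordered pairs count 1, ties count 1/2.
    w defaults to the 1 / (n+ * n-) normalization."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [j for j, v in enumerate(y) if v == 0]
    if w is None:
        w = 1.0 / (len(pos) * len(neg))
    loss = sum(1.0 if h[i] < h[j] else 0.5 if h[i] == h[j] else 0.0
               for i in pos for j in neg)
    return w * loss

# y = (1, 0, 0, 1); one of the four relevant/irrelevant pairs is mis-ordered.
print(rank_loss([1, 0, 0, 1], [2.5, 3.0, 0.1, 3.5]))  # 0.25
```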
Pairwise surrogate losses

• The most intuitive approach is to use pairwise convex surrogate losses of the form

$$\tilde{\ell}_\phi(y, h) = w(y) \sum_{(i,j)\,:\,y_i > y_j} \phi(h_i - h_j),$$

where φ is
  I an exponential function (BoosTexter)24: φ(f) = e^{−f},
  I a logistic function (LLLR)25: φ(f) = log(1 + e^{−f}),
  I or a hinge function (RankSVM)26: φ(f) = max(0, 1 − f).

24 R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000
25 O. Dekel, Ch. Manning, and Y. Singer. Log-linear models for label ranking. In NIPS. MIT Press, 2004
26 A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, pages 681–687, 2001
Multilabel ranking

• This approach is, however, inconsistent for the most commonly used convex surrogates.27
• A consistent classifier can, however, be obtained by using univariate loss functions.28

27 J. Duchi, L. Mackey, and M. Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010; W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199-200:22–44, 2013
28 K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012
Reduction to weighted binary relevance

• The Bayes ranker can be obtained by sorting labels according to:

$$\Delta_i^1 = \sum_{y\,:\,y_i = 1} w(y) P(y \mid x).$$

• For w(y) ≡ 1, ∆_i^u reduces to the marginal probability P(Y_i = u | x).
• The solution can be obtained with BR, or with its weighted variant in the general case.
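A toy computation of the ∆_i^1 statistics and the induced ranking, with a made-up joint distribution P(y | x) over three labels and the unweighted case w(y) ≡ 1.

```python
import itertools

# Made-up joint conditional P(y | x) over m = 3 labels (sums to 1).
probs = [0.05, 0.30, 0.05, 0.10, 0.05, 0.25, 0.05, 0.15]
P = dict(zip(itertools.product([0, 1], repeat=3), probs))
w = lambda y: 1.0   # unweighted case: Delta_i^1 is the marginal P(Y_i = 1 | x)

delta1 = [sum(w(y) * p for y, p in P.items() if y[i] == 1) for i in range(3)]
ranking = sorted(range(3), key=lambda i: -delta1[i])
print(delta1)    # [0.5, 0.35, 0.8]
print(ranking)   # label 2 ranked first, then 0, then 1
```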
Reduction to weighted binary relevance

• Consider the sum of univariate (weighted) losses:

$$\tilde{\ell}_{\exp}(y, h) = w(y) \sum_{i=1}^m e^{-(2y_i - 1) h_i},$$

$$\tilde{\ell}_{\log}(y, h) = w(y) \sum_{i=1}^m \log\left(1 + e^{-(2y_i - 1) h_i}\right).$$

• The risk minimizer of these losses is:

$$h_i^*(x) = \frac{1}{c} \log \frac{\Delta_i^1}{\Delta_i^0} = \frac{1}{c} \log \frac{\Delta_i^1}{W - \Delta_i^1},$$

which is a strictly increasing transformation of ∆_i^1, where

$$W = \mathbb{E}[w(Y) \mid x] = \sum_y w(y) P(y \mid x).$$
Reduction to weighted binary relevance

• Vertical reduction: solving m independent classification problems.
• Standard algorithms, like AdaBoost and logistic regression, can be adapted to this setting.
• AdaBoost.MH follows this approach for w ≡ 1.29
• Besides its simplicity and efficiency, this approach is consistent (regret bounds have also been derived).30

29 R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000
30 K. Dembczyński, W. Kotłowski, and E. Hüllermeier. Consistent multilabel ranking through univariate losses. In ICML, 2012
Weighted binary relevance

[Figure: WBR-LR vs. LLLR: rank loss as a function of the number of learning examples (250 to 16000), with the Bayes risk shown for reference. Left: independent data. Right: dependent data.]

• Label independence: the methods perform more or less on par.
• Label dependence: WBR shows small but consistent improvements.
Benchmark data

Table: WBR-AdaBoost vs. AdaBoost.MR (left) and WBR-LR vs. LLLR (right).

dataset     AB.MR    WBR-AB    LLLR     WBR-LR
image       0.2081   0.2041    0.2047   0.2065
emotions    0.1703   0.1699    0.1743   0.1657
scene       0.0720   0.0792    0.0861   0.0793
yeast       0.2072   0.1820    0.1728   0.1736
mediamill   0.0665   0.0609    0.0614   0.0472

• WBR is at least competitive with state-of-the-art algorithms defined on pairwise surrogates.
Maximization of the F-measure

• Applications: information retrieval, document tagging, and NLP.
• JRS 2012 Data Mining Competition: indexing documents from the MEDLINE or PubMed Central databases with concepts from the Medical Subject Headings ontology.
Maximization of the F-measure

• The Fβ-measure-based loss function (Fβ-loss):

$$\ell_{F_\beta}(y, h(x)) = 1 - F_\beta(y, h(x)) = 1 - \frac{(1 + \beta^2) \sum_{i=1}^m y_i h_i(x)}{\beta^2 \sum_{i=1}^m y_i + \sum_{i=1}^m h_i(x)} \;\in\; [0, 1].$$

• Provides a better balance between relevant and irrelevant labels.
• However, it is not easy to optimize.
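A direct transcription of the Fβ-loss for binary vectors. The convention for the all-zero case (loss 0 when both y and h are empty) is one common choice, not something fixed by the slide.

```python
def f_beta_loss(y, h, beta=1.0):
    """1 - F_beta for binary truth y and prediction h; F is taken as 1 if both are empty."""
    num = (1 + beta ** 2) * sum(a * b for a, b in zip(y, h))
    den = beta ** 2 * sum(y) + sum(h)
    return 0.0 if den == 0 else 1.0 - num / den

print(f_beta_loss([1, 1, 0, 0], [1, 0, 1, 0]))  # F1 = 0.5, so the loss is 0.5
```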
SSVMs for Fβ-based loss

• SSVMs can be used to minimize the Fβ-based loss.
• Rescale the margin by ℓ_{F_β}(y, y′).
• Two algorithms:31

RML (no label interactions):
$$f(y, x) = \sum_{i=1}^m f_i(y_i, x)$$
Quadratic learning and linear prediction.

SML (submodular interactions):
$$f(y, x) = \sum_{i=1}^m f_i(y_i, x) + \sum_{k,l} f_{k,l}(y_k, y_l)$$
More complex (graph-cut and approximate algorithms).

• Both are inconsistent.

31 J. Petterson and T. S. Caetano. Reverse multi-label learning. In NIPS, pages 1912–1920, 2010; J. Petterson and T. S. Caetano. Submodular multi-label learning. In NIPS, pages 1512–1520, 2011
Plug-in rule approach

• Plug estimates of the required parameters into the Bayes classifier:

$$h^* = \arg\min_{h \in \mathcal{Y}} \mathbb{E}\left[\ell_{F_\beta}(Y, h)\right] = \arg\max_{h \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} P(y)\, \frac{(1 + \beta^2) \sum_{i=1}^m y_i h_i}{\beta^2 \sum_{i=1}^m y_i + \sum_{i=1}^m h_i}$$

• There is no closed-form solution for this optimization problem.
• The problem cannot be solved naively by brute-force search:
  I It would require checking all 2^m possible label combinations,
  I summing over 2^m elements to compute each expected value,
  I and estimating 2^m parameters (the full distribution P(y)).
Plug-in rule approach

• Approximation needed? Not really. The exact solution is tractable!

LFP:
  I Assumes label independence.
  I Linear number of parameters: P(y_i = 1).
  I Inference based on dynamic programming.32
  I Reduction to LR for each label.

EFP:
  I No assumptions.
  I Quadratic number of parameters: P(y_i = 1, s = Σ_i y_i).
  I Inference based on matrix multiplication and top-k selection.33
  I Reduction to multinomial LR for each label.

• EFP is consistent.34

32 N. Ye, K. Chai, W. Lee, and H. Chieu. Optimizing F-measures: a tale of two approaches. In ICML, 2012
33 K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. An exact algorithm for F-measure maximization. In NIPS, volume 25, 2011
34 K. Dembczyński, A. Jachnik, W. Kotłowski, W. Waegeman, and E. Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In ICML, 2013
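To make the plug-in idea tangible, here is a brute-force sketch of the quantity LFP computes: the expected-F1-optimal prediction under label independence, obtained by exhausting both predictions and outcomes. This is only a sanity check for tiny m (the point of LFP's dynamic program and EFP's exact algorithm is precisely to avoid this 2^m × 2^m enumeration), and the marginals are made up.

```python
import itertools
import numpy as np

def f1(y, h):
    s = sum(y) + sum(h)
    return 1.0 if s == 0 else 2.0 * sum(a * b for a, b in zip(y, h)) / s

def brute_force_f_maximizer(p):
    """Expected-F1-optimal h under label independence, by exhaustive search (tiny m only)."""
    m = len(p)
    ys = list(itertools.product([0, 1], repeat=m))
    prob = [np.prod([p[i] if y[i] else 1 - p[i] for i in range(m)]) for y in ys]
    return max(ys, key=lambda h: sum(pr * f1(y, h) for y, pr in zip(ys, prob)))

# Made-up marginals: the optimizer predicts (1, 1, 0), although thresholding
# the marginals at 0.5 would predict only the first label.
print(brute_force_f_maximizer([0.7, 0.4, 0.1]))
```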
Maximization of the F-measure

[Figure: F1-measure (%) of EFP, LFP, RML, SML, and BR on six benchmark datasets: image, scene, yeast, medical, Enron, and Mediamill.]
Challenges

• We did not discuss:
  I Label ranking problems.
  I Hierarchical multi-label classification.
  I Structured output prediction problems.
  I . . .
• Main challenges:
  I Learning and inference algorithms for any task loss and output structure.
  I Consistency of the algorithms.
  I Large-scale datasets: number of instances, features, and labels.
Conclusions

• Take-away message:
  I Two main challenges: loss minimization and target dependence.
  I Two views: the individual-target and the joint-target view.
  I The individual-target view: joint target regularization.
  I The joint-target view: structured loss minimization and reduction.
  I Proper modeling of target dependence for different loss functions.
  I Be careful with empirical evaluations.
  I Independent models can perform quite well.

Many thanks to Eyke and Willem for collaboration on this tutorial and to Arek for help in preparing the slides.

This project is partially supported by the Foundation of Polish Science under the Homing Plus programme, co-financed by the European Regional Development Fund.