
Multi-Task Learning as Multi-Objective Optimization

Ozan Sener, Intel Labs
Vladlen Koltun, Intel Labs

Abstract
In multi-task learning, multiple tasks are solved jointly, sharing inductive bias
between them. Multi-task learning is inherently a multi-objective problem because
different tasks may conflict, necessitating a trade-off. A common compromise is to
optimize a proxy objective that minimizes a weighted linear combination of per-
task losses. However, this workaround is only valid when the tasks do not compete,
which is rarely the case. In this paper, we explicitly cast multi-task learning as
multi-objective optimization, with the overall objective of finding a Pareto optimal
solution. To this end, we use algorithms developed in the gradient-based multi-
objective optimization literature. These algorithms are not directly applicable to
large-scale learning problems since they scale poorly with the dimensionality of
the gradients and the number of tasks. We therefore propose an upper bound
for the multi-objective loss and show that it can be optimized efficiently. We
further prove that optimizing this upper bound yields a Pareto optimal solution
under realistic assumptions. We apply our method to a variety of multi-task
deep learning problems including digit classification, scene understanding (joint
semantic segmentation, instance segmentation, and depth estimation), and multi-
label classification. Our method produces higher-performing models than recent
multi-task learning formulations or per-task training.

1 Introduction
One of the most surprising results in statistics is Stein’s paradox. Stein (1956) showed that it is better
to estimate the means of three or more Gaussian random variables using samples from all of them
rather than estimating them separately, even when the Gaussians are independent. Stein’s paradox
was an early motivation for multi-task learning (MTL) (Caruana, 1997), a learning paradigm in which
data from multiple tasks is used with the hope of obtaining superior performance over learning each task
independently. Potential advantages of MTL go beyond the direct implications of Stein’s paradox,
since even seemingly unrelated real-world tasks have strong dependencies due to the shared processes
that give rise to the data. For example, although autonomous driving and object manipulation are
seemingly unrelated, the underlying data is governed by the same laws of optics, material properties,
and dynamics. This motivates the use of multiple tasks as an inductive bias in learning systems.
A typical MTL system is given a collection of input points and sets of targets for various tasks per
point. A common way to set up the inductive bias across tasks is to design a parametrized hypothesis
class that shares some parameters across tasks. Typically, these parameters are learned by solving an
optimization problem that minimizes a weighted sum of the empirical risk for each task. However,
the linear-combination formulation is only sensible when there is a parameter set that is effective
across all tasks. In other words, minimization of a weighted sum of empirical risk is only valid if
tasks are not competing, which is rarely the case. MTL with conflicting objectives requires modeling
of the trade-off between tasks, which is beyond what a linear combination achieves.
An alternative objective for MTL is finding solutions that are not dominated by any others. Such
solutions are said to be Pareto optimal. In this paper, we cast the objective of MTL in terms of finding
Pareto optimal solutions.

The problem of finding Pareto optimal solutions given multiple criteria is called multi-objective
optimization. A variety of algorithms for multi-objective optimization exist. One such approach
is the multiple-gradient descent algorithm (MGDA), which uses gradient-based optimization and
provably converges to a point on the Pareto set (Désidéri, 2012). MGDA is well-suited for multi-task
learning with deep networks. It can use the gradients of each task and solve an optimization problem
to decide on an update over the shared parameters. However, there are two technical problems that
hinder the applicability of MGDA on a large scale. (i) The underlying optimization problem does
not scale gracefully to high-dimensional gradients, which arise naturally in deep networks. (ii) The
algorithm requires explicit computation of gradients per task, which results in linear scaling of the
number of backward passes and roughly multiplies the training time by the number of tasks.
In this paper, we develop a Frank-Wolfe-based optimizer that scales to high-dimensional problems.
Furthermore, we provide an upper bound for the MGDA optimization objective and show that it can
be computed via a single backward pass without explicit task-specific gradients, thus making the
computational overhead of the method negligible. We prove that using our upper bound yields a Pareto
optimal solution under realistic assumptions. The result is an exact algorithm for multi-objective
optimization of deep networks with negligible computational overhead.
We empirically evaluate the presented method on three different problems. First, we perform an
extensive evaluation on multi-digit classification with MultiMNIST (Sabour et al., 2017). Second, we
cast multi-label classification as MTL and conduct experiments with the CelebA dataset (Liu et al.,
2015b). Lastly, we apply the presented method to scene understanding; specifically, we perform
joint semantic segmentation, instance segmentation, and depth estimation on the Cityscapes dataset
(Cordts et al., 2016). The number of tasks in our evaluation varies from 2 to 40. Our method clearly
outperforms all baselines.

2 Related Work

Multi-task learning. We summarize the work most closely related to ours and refer the interested
reader to reviews by Ruder (2017) and Zhou et al. (2011b) for additional background. Multi-task
learning (MTL) is typically conducted via hard or soft parameter sharing. In hard parameter sharing,
a subset of the parameters is shared between tasks while other parameters are task-specific. In soft
parameter sharing, all parameters are task-specific but they are jointly constrained via Bayesian
priors (Xue et al., 2007; Bakker and Heskes, 2003) or a joint dictionary (Argyriou et al., 2007; Long
and Wang, 2015; Yang and Hospedales, 2016; Ruder, 2017). We focus on hard parameter sharing
with gradient-based optimization, following the success of deep MTL in computer vision (Bilen and
Vedaldi, 2016; Misra et al., 2016; Rudd et al., 2016; Yang and Hospedales, 2016; Kokkinos, 2017;
Zamir et al., 2018), natural language processing (Collobert and Weston, 2008; Dong et al., 2015; Liu
et al., 2015a; Luong et al., 2015; Hashimoto et al., 2017), speech processing (Huang et al., 2013;
Seltzer and Droppo, 2013; Huang et al., 2015), and even seemingly unrelated domains over multiple
modalities (Kaiser et al., 2017).
Baxter (2000) theoretically analyzes the MTL problem as an interaction between individual learners and
a meta-algorithm. Each learner is responsible for one task and a meta-algorithm decides how the
shared parameters are updated. All aforementioned MTL algorithms use weighted summation as the
meta-algorithm. Meta-algorithms that go beyond weighted summation have also been explored. Li
et al. (2014) consider the case where each individual learner is based on kernel learning and utilize
multi-objective optimization. Zhang and Yeung (2010) consider the case where each learner is a
linear model and use a task affinity matrix. Zhou et al. (2011a) and Bagherjeiran et al. (2005) use
the assumption that tasks share a dictionary and develop an expectation-maximization-like meta-
algorithm. de Miranda et al. (2012) and Zhou et al. (2017b) use swarm optimization. None of these
methods apply to gradient-based learning of high-capacity models such as modern deep networks.
Kendall et al. (2018) and Chen et al. (2018) propose heuristics based on uncertainty and gradient
magnitudes, respectively, and apply their methods to convolutional neural networks. Another recent
work uses multi-agent reinforcement learning (Rosenbaum et al., 2017).
Multi-objective optimization. Multi-objective optimization addresses the problem of optimizing a
set of possibly contrasting objectives. We recommend Miettinen (1998) and Ehrgott (2005) for surveys
of this field. Of particular relevance to our work is gradient-based multi-objective optimization, as
developed by Fliege and Svaiter (2000), Schäffler et al. (2002), and Désidéri (2012). These methods
use multi-objective Karush-Kuhn-Tucker (KKT) conditions (Kuhn and Tucker, 1951) and find a
descent direction that decreases all objectives. This approach was extended to stochastic gradient
descent by Peitz and Dellnitz (2018) and Poirion et al. (2017). In machine learning, these methods
have been applied to multi-agent learning (Ghosh et al., 2013; Pirotta and Restelli, 2016; Parisi
et al., 2014), kernel learning (Li et al., 2014), sequential decision making (Roijers et al., 2013), and
Bayesian optimization (Shah and Ghahramani, 2016; Hernández-Lobato et al., 2016). Our work
applies gradient-based multi-objective optimization to multi-task learning.

3 Multi-Task Learning as Multi-Objective Optimization


Consider a multi-task learning (MTL) problem over an input space $\mathcal{X}$ and a collection of task spaces $\{\mathcal{Y}^t\}_{t \in [T]}$, such that a large dataset of i.i.d. data points $\{x_i, y_i^1, \ldots, y_i^T\}_{i \in [N]}$ is given, where $T$ is the number of tasks, $N$ is the number of data points, and $y_i^t$ is the label of the $t$-th task for the $i$-th data point.¹ We further consider a parametric hypothesis class per task, $f^t(x; \theta^{sh}, \theta^t) : \mathcal{X} \to \mathcal{Y}^t$, such that some parameters ($\theta^{sh}$) are shared between tasks and some ($\theta^t$) are task-specific. We also consider task-specific loss functions $\mathcal{L}^t(\cdot, \cdot) : \mathcal{Y}^t \times \mathcal{Y}^t \to \mathbb{R}^+$.
Although many hypothesis classes and loss functions have been proposed in the MTL literature, they
generally yield the following empirical risk minimization formulation:
$$\min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \; \sum_{t=1}^{T} c^t \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \tag{1}$$

for some static or dynamically computed weights $c^t$ per task, where $\hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$ is the empirical loss of task $t$, defined as $\hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \triangleq \frac{1}{N} \sum_i \mathcal{L}\big(f^t(x_i; \theta^{sh}, \theta^t), y_i^t\big)$.
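
For reference, the weighted-sum formulation (1) is what most deep MTL code implements directly. Below is a minimal sketch of one training step in PyTorch, with hypothetical `shared`, `heads`, and `criteria` modules and static weights `c` (none of these names come from the paper):

```python
# A sketch of one training step for the weighted-sum objective (1),
# assuming `shared` is the trunk shared across tasks, `heads[t]` and
# `criteria[t]` are the hypothesis and loss of task t, and `c[t]` are
# fixed scalar weights (heuristics such as Kendall et al. (2018) or
# GradNorm instead recompute c during training).
def weighted_sum_step(shared, heads, criteria, c, x, ys, optimizer):
    z = shared(x)                                   # shared representation
    losses = [criteria[t](heads[t](z), ys[t]) for t in range(len(heads))]
    total = sum(c[t] * losses[t] for t in range(len(losses)))  # Eq. (1)
    optimizer.zero_grad()
    total.backward()                                # one backward pass
    optimizer.step()
    return [loss.item() for loss in losses]
```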
Although the weighted summation formulation (1) is intuitively appealing, it typically either requires an expensive grid search over various scalings or the use of a heuristic (Kendall et al., 2018; Chen et al., 2018). A basic justification for scaling is that it is not possible to define global optimality in the MTL setting. Consider two sets of solutions $\theta$ and $\bar\theta$ such that $\hat{\mathcal{L}}^{t_1}(\theta^{sh}, \theta^{t_1}) < \hat{\mathcal{L}}^{t_1}(\bar\theta^{sh}, \bar\theta^{t_1})$ and $\hat{\mathcal{L}}^{t_2}(\theta^{sh}, \theta^{t_2}) > \hat{\mathcal{L}}^{t_2}(\bar\theta^{sh}, \bar\theta^{t_2})$ for some tasks $t_1$ and $t_2$. In other words, solution $\theta$ is better for task $t_1$ whereas $\bar\theta$ is better for $t_2$. It is not possible to compare these two solutions without a pairwise importance of tasks, which is typically not available.
Alternatively, MTL can be formulated as multi-objective optimization: optimizing a collection of possibly conflicting objectives. This is the approach we take. We specify the multi-objective optimization formulation of MTL using a vector-valued loss $\mathbf{L}$:

$$\min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \mathbf{L}(\theta^{sh}, \theta^1, \ldots, \theta^T) = \min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \big( \hat{\mathcal{L}}^1(\theta^{sh}, \theta^1), \ldots, \hat{\mathcal{L}}^T(\theta^{sh}, \theta^T) \big)^\intercal. \tag{2}$$

The goal of multi-objective optimization is achieving Pareto optimality.


Definition 1 (Pareto optimality for MTL)
(a) A solution $\theta$ dominates a solution $\bar\theta$ if $\hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \le \hat{\mathcal{L}}^t(\bar\theta^{sh}, \bar\theta^t)$ for all tasks $t$ and $\mathbf{L}(\theta^{sh}, \theta^1, \ldots, \theta^T) \ne \mathbf{L}(\bar\theta^{sh}, \bar\theta^1, \ldots, \bar\theta^T)$.
(b) A solution $\theta^\star$ is called Pareto optimal if there exists no solution $\theta$ that dominates $\theta^\star$.

The set of Pareto optimal solutions is called the Pareto set ($\mathcal{P}_\theta$) and its image is called the Pareto front ($\mathcal{P}_{\mathbf{L}} = \{\mathbf{L}(\theta)\}_{\theta \in \mathcal{P}_\theta}$). In this paper, we focus on gradient-based multi-objective optimization due to its direct relevance to gradient-based MTL.
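
For concreteness, the dominance relation of Definition 1(a), applied to vectors of empirical task losses, amounts to the following check (a small illustrative helper, not part of the paper's method):

```python
def dominates(losses_a, losses_b):
    """True if loss vector a dominates b (Definition 1a): a is no worse
    on every task and the two loss vectors are not identical, i.e. a is
    strictly better on at least one task."""
    no_worse = all(a <= b for a, b in zip(losses_a, losses_b))
    strictly_better = any(a < b for a, b in zip(losses_a, losses_b))
    return no_worse and strictly_better
```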
In the rest of this section, we first summarize in Section 3.1 how multi-objective optimization can be
performed with gradient descent. Then, we suggest in Section 3.2 a practical algorithm for performing
multi-objective optimization over very large parameter spaces. Finally, in Section 3.3 we propose an
efficient solution for multi-objective optimization designed directly for high-capacity deep networks.
Our method scales to very large models and a high number of tasks with negligible overhead.
¹ This definition can be extended to the partially-labelled case by extending $\mathcal{Y}^t$ with a null label.
3.1 Multiple Gradient Descent Algorithm

As in the single-objective case, multi-objective optimization can be solved to local optimality via
gradient descent. In this section, we summarize one such approach, called the multiple gradient
descent algorithm (MGDA) (Désidéri, 2012). MGDA leverages the Karush-Kuhn-Tucker (KKT)
conditions, which are necessary for optimality (Fliege and Svaiter, 2000; Schäffler et al., 2002;
Désidéri, 2012). We now state the KKT conditions for both task-specific and shared parameters:
• There exist $\alpha^1, \ldots, \alpha^T \ge 0$ such that $\sum_{t=1}^{T} \alpha^t = 1$ and $\sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) = 0$

• For all tasks $t$, $\nabla_{\theta^t} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) = 0$


Any solution that satisfies these conditions is called a Pareto stationary point. Although every Pareto optimal point is Pareto stationary, the reverse may not be true. Consider the optimization problem

$$\min_{\alpha^1, \ldots, \alpha^T} \left\{ \left\| \sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \right\|_2^2 \;\middle|\; \sum_{t=1}^{T} \alpha^t = 1, \;\; \alpha^t \ge 0 \;\; \forall t \right\} \tag{3}$$

Désidéri (2012) showed that either the solution to this optimization problem is 0 and the resulting point satisfies the KKT conditions, or the solution gives a descent direction that improves all tasks. Hence, the resulting MTL algorithm would be gradient descent on the task-specific parameters followed by solving (3) and applying the solution $\sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$ as a gradient update to the shared parameters. We discuss how to solve (3) for an arbitrary model in Section 3.2 and present an efficient solution when the underlying model is an encoder-decoder in Section 3.3.

3.2 Solving the Optimization Problem

The optimization problem defined in (3) is equivalent to finding a minimum-norm point in the
convex hull of the set of input points. This problem arises naturally in computational geometry: it is
equivalent to finding the closest point within a convex hull to a given query point. It has been studied
extensively (Makimoto et al., 1994; Wolfe, 1976; Sekitani and Yamamoto, 1993). Although many
algorithms have been proposed, they do not apply in our setting because the assumptions they make
do not hold. Algorithms proposed in the computational geometry literature address the problem of
finding minimum-norm points in the convex hull of a large number of points in a low-dimensional
space (typically of dimensionality 2 or 3). In our setting, the number of points is the number of tasks
and is typically low; in contrast, the dimensionality is the number of shared parameters and can be
in the millions. We therefore use a different approach based on convex optimization, since (3) is a
convex quadratic problem with linear constraints.
Before we tackle the general case, let us consider the case of two tasks. The optimization problem can be defined as $\min_{\alpha \in [0,1]} \|\alpha \nabla_{\theta^{sh}} \hat{\mathcal{L}}^1(\theta^{sh}, \theta^1) + (1-\alpha) \nabla_{\theta^{sh}} \hat{\mathcal{L}}^2(\theta^{sh}, \theta^2)\|_2^2$, which is a one-dimensional quadratic function of $\alpha$ with an analytical solution:

$$\hat{\alpha} = \left[ \frac{\big(\nabla_{\theta^{sh}} \hat{\mathcal{L}}^2(\theta^{sh}, \theta^2) - \nabla_{\theta^{sh}} \hat{\mathcal{L}}^1(\theta^{sh}, \theta^1)\big)^\intercal \, \nabla_{\theta^{sh}} \hat{\mathcal{L}}^2(\theta^{sh}, \theta^2)}{\big\|\nabla_{\theta^{sh}} \hat{\mathcal{L}}^1(\theta^{sh}, \theta^1) - \nabla_{\theta^{sh}} \hat{\mathcal{L}}^2(\theta^{sh}, \theta^2)\big\|_2^2} \right]_{+,\sqcap} \tag{4}$$

where $[\cdot]_{+,\sqcap}$ represents clipping to $[0,1]$, i.e. $[a]_{+,\sqcap} = \max(\min(a, 1), 0)$. We further visualize this solution in Figure 1. Although this is only applicable when $T = 2$, it enables efficient application of the Frank-Wolfe algorithm (Jaggi, 2013) since the line search can be solved analytically. Hence, we use Frank-Wolfe to solve the constrained optimization problem, using (4) as a subroutine for the line search. We give all the update equations for the Frank-Wolfe solver in Algorithm 2.

Algorithm 1: $\min_{\gamma \in [0,1]} \|\gamma \theta + (1 - \gamma)\bar\theta\|_2^2$

1: if $\theta^\intercal \bar\theta \ge \theta^\intercal \theta$ then
2:   $\gamma = 1$
3: else if $\theta^\intercal \bar\theta \ge \bar\theta^\intercal \bar\theta$ then
4:   $\gamma = 0$
5: else
6:   $\gamma = \frac{(\bar\theta - \theta)^\intercal \bar\theta}{\|\theta - \bar\theta\|_2^2}$
7: end if

[Figure 1 omitted: visualisation of the min-norm point in the convex hull of two points ($\min_{\gamma \in [0,1]} \|\gamma \theta + (1 - \gamma)\bar\theta\|_2^2$). As the geometry suggests, the solution is either an edge case or a perpendicular vector.]
Algorithm 2: Update Equations for MTL

1: for $t = 1$ to $T$ do
2:   $\theta^t = \theta^t - \eta \nabla_{\theta^t} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$  ▷ Gradient descent on task-specific parameters
3: end for
4: $\alpha^1, \ldots, \alpha^T = \textsc{FrankWolfeSolver}(\theta)$  ▷ Solve (3) to find a common descent direction
5: $\theta^{sh} = \theta^{sh} - \eta \sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$  ▷ Gradient descent on shared parameters

6: procedure $\textsc{FrankWolfeSolver}(\theta)$
7:   Initialize $\boldsymbol{\alpha} = (\alpha^1, \ldots, \alpha^T) = (\frac{1}{T}, \ldots, \frac{1}{T})$
8:   Precompute $\mathbf{M}$ s.t. $\mathbf{M}_{i,j} = \big(\nabla_{\theta^{sh}} \hat{\mathcal{L}}^i(\theta^{sh}, \theta^i)\big)^\intercal \big(\nabla_{\theta^{sh}} \hat{\mathcal{L}}^j(\theta^{sh}, \theta^j)\big)$
9:   repeat
10:     $\hat{t} = \arg\min_r \sum_t \alpha^t \mathbf{M}_{rt}$
11:     $\hat{\gamma} = \arg\min_{\gamma} \big((1-\gamma)\boldsymbol{\alpha} + \gamma \mathbf{e}_{\hat{t}}\big)^\intercal \mathbf{M} \big((1-\gamma)\boldsymbol{\alpha} + \gamma \mathbf{e}_{\hat{t}}\big)$  ▷ Using Algorithm 1
12:     $\boldsymbol{\alpha} = (1 - \hat{\gamma})\boldsymbol{\alpha} + \hat{\gamma} \mathbf{e}_{\hat{t}}$
13:   until $\hat{\gamma} \sim 0$ or Number of Iterations Limit
14:   return $\alpha^1, \ldots, \alpha^T$
15: end procedure
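
The FrankWolfeSolver procedure can be sketched compactly in PyTorch. This is a sketch under our interface assumptions: the per-task gradients arrive as a list of flattened tensors, the Gram matrix $\mathbf{M}$ is precomputed once, and the line search of line 11 applies the closed form of Algorithm 1 expressed through $\mathbf{M}$:

```python
import torch

def frank_wolfe_alphas(grads, max_iter=250, tol=1e-5):
    """Sketch of FrankWolfeSolver (Algorithm 2, lines 6-15). grads is a
    list of T flat gradient tensors (w.r.t. the shared parameters, or
    w.r.t. the representations for MGDA-UB). Returns alpha of shape (T,)."""
    T = len(grads)
    G = torch.stack(grads)                  # (T, d)
    M = G @ G.t()                           # Gram matrix: M[i,j] = g_i . g_j
    alpha = torch.full((T,), 1.0 / T)       # line 7: uniform initialization
    for _ in range(max_iter):
        t_hat = torch.argmin(M @ alpha)     # line 10: best vertex
        # Line 11: line search between p = sum_t alpha_t g_t and the
        # vertex q = g_t_hat, in closed form (Algorithm 1), using only
        # Gram-matrix entries.
        pp = alpha @ M @ alpha              # <p, p>
        pq = M[t_hat] @ alpha               # <p, q>
        qq = M[t_hat, t_hat]                # <q, q>
        if pq >= qq:
            gamma = 1.0                     # min-norm point is the vertex q
        elif pq >= pp:
            gamma = 0.0                     # current point already optimal
        else:
            gamma = ((pp - pq) / (pp - 2 * pq + qq)).item()
        alpha = (1 - gamma) * alpha         # line 12
        alpha[t_hat] += gamma
        if gamma < tol:                     # line 13: gamma ~ 0
            break
    return alpha
```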

3.3 Efficient Optimization for Encoder-Decoder Architectures

The MTL update described in Algorithm 2 is applicable to any problem that uses optimization
based on gradient descent. Our experiments also suggest that the Frank-Wolfe solver is efficient and
accurate as it typically converges in a modest number of iterations with negligible effect on training
time. However, the algorithm we described needs to compute $\nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$ for each task $t$, which
requires a backward pass over the shared parameters for each task. Hence, the resulting gradient
computation would be the forward pass followed by T backward passes. Considering the fact that
computation of the backward pass is typically more expensive than the forward pass, this results in
linear scaling of the training time and can be prohibitive for problems with more than a few tasks.
We now propose an efficient method that optimizes an upper bound of the objective and requires only
a single backward pass. We further show that optimizing this upper bound yields a Pareto optimal
solution under realistic assumptions. The architectures we address conjoin a shared representation
function with task-specific decision functions. This class of architectures covers most of the existing
deep MTL models and can be formally defined by constraining the hypothesis class as
$$f^t(x; \theta^{sh}, \theta^t) = \big(f^t(\cdot\,; \theta^t) \circ g(\cdot\,; \theta^{sh})\big)(x) = f^t\big(g(x; \theta^{sh}); \theta^t\big) \tag{5}$$

where $g$ is the representation function shared by all tasks and the $f^t$ are the task-specific functions that take this representation as input. If we denote the representations as $\mathbf{Z} = \big(\mathbf{z}_1, \ldots, \mathbf{z}_N\big)$, where $\mathbf{z}_i = g(x_i; \theta^{sh})$, we can state the following upper bound as a direct consequence of the chain rule:
$$\left\| \sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \right\|_2^2 \;\le\; \left\| \frac{\partial \mathbf{Z}}{\partial \theta^{sh}} \right\|_2^2 \left\| \sum_{t=1}^{T} \alpha^t \nabla_{\mathbf{Z}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \right\|_2^2 \tag{6}$$

where $\big\|\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}\big\|_2$ is the matrix norm of the Jacobian of $\mathbf{Z}$ with respect to $\theta^{sh}$. Two desirable properties of this upper bound are that (i) $\nabla_{\mathbf{Z}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$ can be computed in a single backward pass for all tasks and (ii) $\big\|\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}\big\|_2^2$ is not a function of $\alpha^1, \ldots, \alpha^T$, hence it can be removed when it is used as an optimization objective. We replace the $\big\|\sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)\big\|_2^2$ term with the upper bound we have just derived in order to obtain the approximate optimization problem, and drop the $\big\|\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}\big\|_2^2$ term since it does not affect the optimization. The resulting optimization problem is

$$\min_{\alpha^1, \ldots, \alpha^T} \left\{ \left\| \sum_{t=1}^{T} \alpha^t \nabla_{\mathbf{Z}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) \right\|_2^2 \;\middle|\; \sum_{t=1}^{T} \alpha^t = 1, \;\; \alpha^t \ge 0 \;\; \forall t \right\} \tag{MGDA-UB}$$

We refer to this problem as MGDA-UB (Multiple Gradient Descent Algorithm – Upper Bound).
In practice, MGDA-UB corresponds to using the gradients of the task losses with respect to the
representations instead of the shared parameters. We use Algorithm 2 with only this change as the
final method.
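
One way to realize this single change in PyTorch is sketched below, reusing the `frank_wolfe_alphas` sketch above. `encoder`, `heads`, and `criteria` are hypothetical modules; by linearity of the chain rule, backpropagating the $\alpha$-weighted combination of representation gradients through the encoder applies exactly $\sum_t \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t$ (Algorithm 2, line 5):

```python
import torch

def mgda_ub_step(encoder, heads, criteria, x, ys, optimizer):
    """Sketch of one MGDA-UB update for an encoder-decoder model. Task
    gradients are taken w.r.t. the shared representation Z, so only the
    small task heads are traversed T times; the encoder sees a single
    backward pass."""
    optimizer.zero_grad()
    z = encoder(x)
    z_det = z.detach().requires_grad_(True)     # cut the graph at Z
    grads = []
    for head, crit, y in zip(heads, criteria, ys):
        loss = crit(head(z_det), y)
        # grad of this task's loss w.r.t. Z, for the Frank-Wolfe solver
        g, = torch.autograd.grad(loss, z_det, retain_graph=True)
        loss.backward()         # task-specific parameter grads (Alg. 2, l. 2)
        grads.append(g.flatten())
    alpha = frank_wolfe_alphas(grads)           # solve (MGDA-UB)
    weighted = sum(a * g for a, g in zip(alpha, grads)).view_as(z)
    z.backward(gradient=weighted)   # one backward through encoder (l. 5)
    optimizer.step()
```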
Although MGDA-UB is an approximation of the original optimization problem, we now state a
theorem that shows that our method produces a Pareto optimal solution under mild assumptions. The
proof is given in the supplement.
Theorem 1. Assume $\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}$ is full-rank. If $\alpha^{1,\ldots,T}$ is the solution of MGDA-UB, one of the following is true:

(a) $\sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t) = 0$ and the current parameters are Pareto stationary.

(b) $\sum_{t=1}^{T} \alpha^t \nabla_{\theta^{sh}} \hat{\mathcal{L}}^t(\theta^{sh}, \theta^t)$ is a descent direction that decreases all objectives.

This result follows from the fact that, as long as $\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}$ is full-rank, optimizing the upper bound corresponds to minimizing the norm of the convex combination of the gradients using the Mahalanobis norm defined by $\big(\frac{\partial \mathbf{Z}}{\partial \theta^{sh}}\big)^\intercal \frac{\partial \mathbf{Z}}{\partial \theta^{sh}}$. The non-singularity assumption is reasonable as singularity implies
that tasks are linearly related and a trade-off is not necessary. In summary, our method provably finds
a Pareto stationary point with negligible computational overhead and can be applied to any deep
multi-objective problem with an encoder-decoder model.

4 Experiments
We evaluate the presented MTL method on a number of problems. First, we use MultiMNIST
(Sabour et al., 2017), an MTL adaptation of MNIST (LeCun et al., 1998). Next, we tackle multi-label
classification on the CelebA dataset (Liu et al., 2015b) by considering each label as a distinct binary
classification task. These problems include both classification and regression, with the number of
tasks ranging from 2 to 40. Finally, we experiment with scene understanding, jointly tackling the tasks
of semantic segmentation, instance segmentation, and depth estimation on the Cityscapes dataset
(Cordts et al., 2016). We discuss each experiment separately in the following subsections.
The baselines we consider are (i) uniform scaling: minimizing a uniformly weighted sum of loss functions $\frac{1}{T} \sum_t \mathcal{L}^t$; (ii) single task: solving tasks independently; (iii) grid search: exhaustively trying various values from $\{c^t \in [0, 1] \mid \sum_t c^t = 1\}$ and optimizing for $\frac{1}{T} \sum_t c^t \mathcal{L}^t$; (iv) Kendall et al. (2018): using the uncertainty weighting proposed by Kendall et al. (2018); and (v) GradNorm: using the normalization proposed by Chen et al. (2018).
(v) GradNorm: using the normalization proposed by Chen et al. (2018).

4.1 MultiMNIST

Our initial experiments are on MultiMNIST, an MTL version of the MNIST dataset (Sabour et al.,
2017). In order to convert digit classification into a multi-task problem, Sabour et al. (2017) overlaid
multiple images together. We use a similar construction. For each image, a different one is chosen uniformly at random. Then one of these images is placed at the top-left and the other at the
bottom-right. The resulting tasks are: classifying the digit on the top-left (task-L) and classifying
the digit on the bottom-right (task-R). We use 60K examples and directly apply existing single-task
MNIST models. The MultiMNIST dataset is illustrated in the supplement.
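
Our reading of this construction, as a sketch (the 36-pixel canvas and corner placement are illustrative assumptions; exact details are deferred to the supplement):

```python
import numpy as np

def multimnist_pair(img_a, img_b, canvas=36):
    """Overlay two 28x28 MNIST digits on a larger canvas: img_a at the
    top-left (task-L), img_b at the bottom-right (task-R); overlapping
    pixels are merged by taking the maximum intensity."""
    out = np.zeros((canvas, canvas), dtype=img_a.dtype)
    out[:28, :28] = img_a                                  # task-L digit
    off = canvas - 28
    out[off:, off:] = np.maximum(out[off:, off:], img_b)   # task-R digit
    return out  # labels: (label of img_a, label of img_b)
```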
We use the LeNet architecture (LeCun et al., 1998). We treat all layers except the last as the representation function $g$ and put two fully-connected layers as task-specific functions (see the supplement for details). We visualize the performance profile as a scatter plot of accuracies on task-L and task-R in Figure 3, and list the results in Table 3.

[Figure 2 omitted: radar charts of percentage error per attribute on CelebA (Liu et al., 2015b), comparing Single Task, Uniform Scaling, Kendall et al. 2018, GradNorm, and Ours. Lower is better. Attributes are divided into two sets for legibility: easy on the left, hard on the right.]
In this setup, any static scaling results in lower accuracy than solving each task separately (the single-
task baseline). The two tasks appear to compete for model capacity, since an increase in the accuracy of one task results in a decrease in the accuracy of the other. Uncertainty weighting (Kendall et al.,
2018) and GradNorm (Chen et al., 2018) find solutions that are slightly better than grid search but
distinctly worse than the single-task baseline. In contrast, our method finds a solution that efficiently
utilizes the model capacity and yields accuracies that are as good as the single-task solutions. This
experiment demonstrates the effectiveness of our method as well as the necessity of treating MTL as
multi-objective optimization. Even after a large hyper-parameter search, no scaling of the tasks approaches the effectiveness of our method.

4.2 Multi-Label Classification

Next, we tackle multi-label classification. Given a set of attributes, multi-label classification calls for deciding whether each attribute holds for the input. We use the CelebA dataset (Liu et al., 2015b), which includes 200K face images annotated with 40 attributes. Each attribute gives rise to a binary classification task, and we cast this as a 40-way MTL problem. We use ResNet-18 (He et al., 2016) without the final layer as a shared representation function, and attach a linear layer for each attribute (see the supplement for further details).

We plot the resulting error for each binary classification task as a radar chart in Figure 2. The average over them is listed in Table 1. We skip grid search since it is not feasible over 40 tasks. Although uniform scaling is the norm in the multi-label classification literature, single-task performance is significantly better. Our method outperforms the baselines on a significant majority of tasks and achieves comparable performance on the rest. This experiment also shows that our method remains effective when the number of tasks is high.

Table 1: Mean error per category of MTL algorithms in multi-label classification on CelebA (Liu et al., 2015b).

                     Average error
Single task          8.77
Uniform scaling      9.62
Kendall et al. 2018  9.53
GradNorm             8.44
Ours                 8.25

4.3 Scene Understanding

Table 2: Effect of the MGDA-UB approximation. We report the final accuracies as well as training times for our method with and without the approximation.

                     Scene understanding (3 tasks)                      Multi-label (40 tasks)
                     Training   Segmentation   Instance     Disparity   Training      Average
                     time       mIoU [%]       error [px]   error [px]  time (hour)   error
Ours (w/o approx.)   38.6       66.13          10.28        2.59        429.9         8.33
Ours                 23.3       66.63          10.25        2.54        16.1          8.25

To evaluate our method in a more realistic setting, we use scene understanding. Given an RGB image, we solve three tasks: semantic segmentation (assigning pixel-level class labels), instance segmentation (assigning pixel-level instance labels), and monocular depth estimation (estimating continuous disparity per pixel). We follow the experimental procedure of Kendall et al. (2018) and use an encoder-decoder architecture. The encoder is based on ResNet-50 (He et al., 2016) and is shared by all three tasks. The decoders are task-specific and are based on the pyramid pooling module (Zhao et al., 2017) (see the supplement for further implementation details).
Since the output space of instance segmentation is unconstrained (the number of instances is not
known in advance), we use a proxy problem as in Kendall et al. (2018). For each pixel, we estimate
the location of the center of mass of the instance that encompasses the pixel. These center votes
can then be clustered to extract the instances. In our experiments, we directly report the MSE in
the proxy task. Figure 4 shows the performance profile for each pair of tasks, although we perform
all experiments on all three tasks jointly. The pairwise performance profiles shown in Figure 4 are
simply 2D projections of the three-dimensional profile, presented this way for legibility. The results
are also listed in Table 4.
MTL outperforms single-task accuracy, indicating that the tasks cooperate and help each other. Our
method outperforms all baselines on all tasks.

4.4 Role of the Approximation

In order to understand the role of the approximation proposed in Section 3.3, we compare the final
performance and training time of our algorithm with and without the presented approximation in
Table 2 (runtime measured on a single Titan Xp GPU). For a small number of tasks (3 for scene
understanding), training time is reduced by 40%. For the multi-label classification experiment (40
tasks), the presented approximation accelerates learning by a factor of 25.
On the accuracy side, we expect both methods to perform similarly as long as the full-rank assumption
is satisfied. As expected, the accuracy of both methods is very similar. Somewhat surprisingly, our
approximation results in slightly improved accuracy in all experiments. While counter-intuitive at
first, we hypothesize that this is related to the use of SGD in the learning algorithm. Stability analysis
in convex optimization suggests that if gradients are computed with an error $\hat\nabla_\theta \mathcal{L}^t = \nabla_\theta \mathcal{L}^t + \mathbf{e}^t$ (where $\theta$ corresponds to $\theta^{sh}$ in (3), as opposed to $\mathbf{Z}$ in the approximate problem (MGDA-UB)), the error in the solution is bounded as $\|\hat{\boldsymbol\alpha} - \boldsymbol\alpha\|_2 \le O(\max_t \|\mathbf{e}^t\|_2)$. Considering the fact that the gradients are computed over the full parameter set (millions of dimensions) for the original problem and over a smaller space for the approximation (batch size times representation size, which is in the thousands), the dimension of the error vector is significantly higher in the original problem. We expect the $\ell_2$ norm
of such a random vector to grow with its dimension.
In summary, our quantitative analysis of the approximation suggests that (i) the approximation does
not cause an accuracy drop and (ii) by solving an equivalent problem in a lower-dimensional space,
our method achieves both better computational efficiency and higher stability.

5 Conclusion
We described an approach to multi-task learning. Our approach is based on multi-objective optimiza-
tion. In order to apply multi-objective optimization to MTL, we described an efficient algorithm as
well as specific approximations that yielded a deep MTL algorithm with almost no computational
overhead. Our experiments indicate that the resulting algorithm is effective for a wide range of
multi-task scenarios.

[Figure 3 omitted: MultiMNIST accuracy profile, a scatter plot of the obtained accuracy in detecting the left and right digits for all baselines. The grid-search results suggest that the tasks compete for model capacity. Our method is the only one that finds a solution that is as good as training a dedicated model for each task. Top-right is better.]

Table 3: Performance of MTL algorithms on MultiMNIST. Single-task baselines solve tasks separately, with dedicated models, but are shown in the same row for clarity.

                     Left digit     Right digit
                     accuracy [%]   accuracy [%]
Single task          97.23          95.90
Uniform scaling      96.46          94.99
Kendall et al. 2018  96.47          95.29
GradNorm             96.27          94.84
Ours                 97.26          95.90

[Figure 4 omitted: Cityscapes performance profile. We plot the performance of all baselines for the tasks of semantic segmentation, instance segmentation, and depth estimation. We use mIoU for semantic segmentation, error of per-pixel regression (normalized to image size) for instance segmentation, and disparity error for depth estimation. To convert errors to performance measures, we use $1 - \text{instance error}$ and $1/\text{disparity error}$. We plot 2D projections of the performance profile for each pair of tasks. Although we plot pairwise projections for visualization, each point in the plots solves all tasks. Top-right is better.]

Table 4: Performance of MTL algorithms in joint semantic segmentation, instance segmentation, and depth estimation on Cityscapes. Single-task baselines solve tasks separately but are shown in the same row for clarity.

                     Segmentation   Instance     Disparity
                     mIoU [%]       error [px]   error [px]
Single task          60.68          11.34        2.78
Uniform scaling      54.59          10.38        2.96
Kendall et al. 2018  64.21          11.54        2.65
GradNorm             64.81          11.31        2.57
Ours                 66.63          10.25        2.54
References
A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.

A. Bagherjeiran, R. Vilalta, and C. F. Eick. Content-based image retrieval through a multi-agent meta-learning
framework. In International Conference on Tools with Artificial Intelligence, 2005.

B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. JMLR, 4:83–99, 2003.

J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In NIPS, 2016.

R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich. GradNorm: Gradient normalization for adaptive loss
balancing in deep multitask networks. In ICML, 2018.

R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with
multitask learning. In ICML, 2008.

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele.
The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

P. B. C. de Miranda, R. B. C. Prudêncio, A. C. P. L. F. de Carvalho, and C. Soares. Combining a multi-objective optimization approach with meta-learning for SVM parameter selection. In International Conference on Systems, Man, and Cybernetics, 2012.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image
database. In CVPR, 2009.

J.-A. Désidéri. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus
Mathematique, 350(5):313–318, 2012.

D. Dong, H. Wu, W. He, D. Yu, and H. Wang. Multi-task learning for multiple language translation. In ACL,
2015.

M. Ehrgott. Multicriteria Optimization (2. ed.). Springer, 2005.

J. Fliege and B. F. Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of
Operations Research, 51(3):479–494, 2000.

S. Ghosh, C. Lovell, and S. R. Gunn. Towards Pareto descent directions in sampling experts for multiple tasks in
an on-line learning paradigm. In AAAI Spring Symposium: Lifelong Machine Learning, 2013.

K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. A joint many-task model: Growing a neural network for
multiple NLP tasks. In EMNLP, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

D. Hernández-Lobato, J. M. Hernández-Lobato, A. Shah, and R. P. Adams. Predictive entropy search for multi-objective Bayesian optimization. In ICML, 2016.

J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong. Cross-language knowledge transfer using multilingual deep
neural network with shared hidden layers. In ICASSP, 2013.

Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee. Rapid adaptation for deep neural networks
through multi-task learning. In Interspeech, 2015.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.

L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn
them all. arXiv:1706.05137, 2017.

A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and
semantics. In CVPR, 2018.

I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision
using diverse datasets and limited memory. In CVPR, 2017.

H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, Calif., 1951. University of California Press.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
C. Li, M. Georgiopoulos, and G. C. Anagnostopoulos. Pareto-path multi-task multiple kernel learning.
arXiv:1404.3190, 2014.
X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y.-Y. Wang. Representation learning using multi-task deep neural
networks for semantic classification and information retrieval. In NAACL HLT, 2015a.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015b.
M. Long and J. Wang. Learning multiple tasks with deep relationship networks. arXiv:1506.02117, 2015.
M.-T. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence learning.
arXiv:1511.06114, 2015.
N. Makimoto, I. Nakagawa, and A. Tamura. An efficient algorithm for finding the minimum norm point in the
convex hull of a finite point set in the plane. Operations Research Letters, 16(1):33–40, 1994.
K. Miettinen. Nonlinear Multiobjective Optimization. Springer, 1998.
I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR,
2016.
S. Parisi, M. Pirotta, N. Smacchia, L. Bascetta, and M. Restelli. Policy gradient approaches for multi-objective
sequential decision making. In IJCNN, 2014.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer.
Automatic differentiation in PyTorch. In NIPS Workshops, 2017.
S. Peitz and M. Dellnitz. Gradient-based multiobjective optimization with uncertainties. In NEO, 2018.
M. Pirotta and M. Restelli. Inverse reinforcement learning through policy gradient minimization. In AAAI, 2016.
F. Poirion, Q. Mercier, and J. Désidéri. Descent algorithm for nonsmooth stochastic multiobjective optimization.
Computational Optimization and Applications, 68(2):317–331, 2017.
D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making.
Journal of Artificial Intelligence Research, 48:67–113, 2013.
C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for
multi-task learning. arXiv:1711.01239, 2017.
E. M. Rudd, M. Günther, and T. E. Boult. MOON: A mixed objective optimization network for the recognition
of facial attributes. In ECCV, 2016.
S. Ruder. An overview of multi-task learning in deep neural networks. arXiv:1706.05098, 2017.
S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
S. Schäffler, R. Schultz, and K. Weinzierl. Stochastic method for the solution of unconstrained vector optimization
problems. Journal of Optimization Theory and Applications, 114(1):209–222, 2002.
K. Sekitani and Y. Yamamoto. A recursive algorithm for finding the minimum norm point in a polytope and a
pair of closest points in two polytopes. Mathematical Programming, 61(1-3):233–249, 1993.
M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. In
ICASSP, 2013.
A. Shah and Z. Ghahramani. Pareto frontier learning with expensive correlated objectives. In ICML, 2016.
C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical
report, Stanford University, US, 1956.
P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with dirichlet process
priors. JMLR, 8:35–63, 2007.

Y. Yang and T. M. Hospedales. Trace norm regularised deep multi-task learning. arXiv:1606.04038, 2016.
A. R. Zamir, A. Sax, W. B. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task
transfer learning. In CVPR, 2018.

Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI,
2010.

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In
CVPR, 2017a.
D. Zhou, J. Wang, B. Jiang, H. Guo, and Y. Li. Multi-task multi-view learning based on cooperative multi-
objective optimization. IEEE Access, 2017b.

J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In NIPS, 2011a.
J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-task learning via structural regularization. Arizona State University,
2011b.

