Jacobian Descent For Multi-Objective Optimization

Pierre Quinton∗                     Valérian Rey∗
LTHI, EPFL                          Independent
[email protected]                 [email protected]

arXiv:2406.16232v1 [cs.LG] 23 Jun 2024

Abstract
Many optimization problems are inherently multi-objective. To address them, we
formalize Jacobian descent (JD), a direct generalization of gradient descent for
vector-valued functions. Each step of this algorithm relies on a Jacobian matrix
consisting of one gradient per objective. The aggregator, responsible for reducing
this matrix into an update vector, characterizes JD. While the multi-task learning
literature already contains a variety of aggregators, they often lack some natural
properties. In particular, the update should not conflict with any objective and
should scale proportionally to the norm of each gradient. We propose a new
aggregator specifically designed to satisfy these properties. Emphasizing conflict between
objectives, we then highlight direct applications for our methods. Most notably,
we introduce instance-wise risk minimization (IWRM), a learning paradigm in
which the loss of each training example is considered a separate objective. On
simple image classification tasks, IWRM exhibits promising results compared to
the direct minimization of the average loss. The performance of our aggregator in
those experiments also corroborates our theoretical findings. Lastly, as speed is the
main limitation of JD, we provide a path towards a more efficient implementation.

1 Introduction
The field of multi-objective optimization studies the minimization of vector-valued objective
functions [1–4]. In deep learning, a widespread approach to train a model with multiple objectives is to combine
those into a scalar loss function minimized by stochastic gradient descent. While this method is
simple, it comes at the expense of potentially degrading some individual objectives. Without prior
knowledge of their relative importance, this is undesirable.
Early works have attempted to extend gradient descent (GD) to consider several objectives simul-
taneously, and thus several gradients [5, 6]. Essentially, [5, 6] propose some heuristic to prevent
the degradation of any individual objective. Several other works have built upon this method, ana-
lyzing its convergence properties or extending it to a stochastic setting [7–9]. Later, this has been
applied to multi-task learning to tackle conflict between tasks, illustrated by contradicting gradient
directions [10]. Many studies have followed, proposing various other algorithms for the training
of multi-task models [11–17]. They commonly rely on an aggregator that maps a collection of
task-specific gradients (a Jacobian matrix) to a shared parameter update.
We propose to unify all such methods under the Jacobian descent (JD) algorithm, specified by
an aggregator.1 The goal of this algorithm is to minimize a differentiable vector-valued function
f : Rn → Rm iteratively without relying on a scalarization of the objective. Besides, we introduce
a novel stochastic variant of JD that enables the training of neural networks with a large number
of objectives. This unlocks a particularly interesting perspective: considering the minimization of
instance-wise loss vectors rather than the usual minimization of the average training loss. As this

∗ Equal contribution.
1 Our library enabling JD with PyTorch is available at https://github.com/TorchJD/torchjd

Preprint. Under review.


paradigm is a direct generalization of the well-known empirical risk minimization (ERM) [18], we
name it instance-wise risk minimization (IWRM).
With the JD formulation, we can ponder the properties that aggregators should satisfy. The need for
a new aggregator emerges from the realization that none of the existing methods is both robust to
conflict and proportionally affected by individual gradient magnitudes.
Our contributions are organized as follows: In Section 2, we formalize the JD algorithm and its
stochastic variants. We then introduce three critical aggregator properties and define AUPGrad to
satisfy them. In the smooth convex case, we show convergence of JD with AUPGrad to the Pareto front.
We present applications for JD and aggregators in Section 3, emphasizing the IWRM paradigm. We
then discuss existing aggregators and analyze their properties in Section 4. In Section 5, we report
experiments with IWRM optimized with stochastic JD with various aggregators. Lastly, we address
computational efficiency in Section 6, giving a path towards an efficient implementation.

2 Theoretical foundation
A suitable partial order between vectors must be considered to enable multi-objective optimization.
Let ≤ be the relation defined, for any pair of vectors u and v in Rm, as u ≤ v whenever ui ≤ vi for
all coordinates i. Similarly, < is the relation defined by u < v whenever ui < vi for all coordinates
i. Finally, u ⪇ v indicates that both u ≤ v and u ≠ v hold. Throughout this paper, ∥ · ∥ and ∥ · ∥F
denote the Euclidean vector norm and the Frobenius matrix norm, respectively.

2.1 Jacobian descent

In the following, we introduce Jacobian descent, a natural extension of gradient descent supporting
the optimization of vector-valued functions.
Suppose that f : Rn → Rm is continuously differentiable. Let Jf(x) ∈ Rm×n be the Jacobian
matrix of f at x, i.e. [Jf(x)]ij = ∂fi(x)/∂xj. If a given x ∈ Rn is updated with some y ∈ Rn,
Taylor's theorem yields

    f(x + y) = f(x) + Jf(x) · y + o(∥y∥),    (1)

where o(∥y∥) means that lim_{∥y∥→0} (f(x + y) − f(x) − Jf(x) · y) / ∥y∥ = 0. The term f(x) + Jf(x) · y is the
first-order Taylor approximation of f(x + y). Since it depends on y only through Jf(x) · y, it is
sensible to make y a function of Jf(x). A mapping A : Rm×n → Rn reducing such a matrix into a
vector is called an aggregator. For any J ∈ Rm×n, A(J) is called the aggregation of J by A. To
minimize f, the update is then selected as y = −ηA(Jf(x)), where η is an appropriate step size.
When m = 1, the Jacobian consists of a single gradient. In GD, the update is thus simply −η∇f(x),
i.e. the aggregator is the identity. A minimal version of GD is given in Algorithm 1. When m > 1,
the choice of A is non-trivial and is one of the main subjects of the present work. An elementary
version of JD parameterized by some aggregator A is provided in Algorithm 2.

Algorithm 1: Gradient descent
  Input: x ∈ Rn, 0 < η, T ∈ N
  for t ← 1 to T do
      x ← x − η∇f(x)
  Output: x

Algorithm 2: Jacobian descent with aggregator A
  Input: A : Rm×n → Rn, x ∈ Rn, 0 < η, T ∈ N
  for t ← 1 to T do
      x ← x − ηA(Jf(x))
  Output: x

Note that most gradient-based optimization algorithms, e.g. Adam [19], can similarly be extended to
the multi-objective setting by substituting the gradient with an aggregation of the Jacobian.
In some settings, the exact computation of the update can be prohibitively slow or even intractable.
When dealing with a single objective, the gradient ∇f (x) can be substituted with an estimation. This
is known as stochastic gradient descent (SGD). More generally, stochastic Jacobian descent (SJD)
relies on estimates of the aggregation of the Jacobian. We highlight two straightforward methods
for this. First, we can compute a stochastic estimation of the Jacobian and aggregate it instead of
the true Jacobian. We call this stochastically estimated Jacobian descent (SEJD). Moreover, we can
aggregate a matrix whose rows are a random subset of the rows of the true Jacobian. We refer to this
as stochastic sub-Jacobian descent (SSJD). This novel approach enables multi-objective optimization
with a very large number of objectives.
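For concreteness, the following minimal sketch (our own illustration in NumPy, not the interface of the torchjd library) implements Algorithm 2 and the SSJD variant just described; `jacobian` and `aggregate` are assumed to be user-supplied functions returning Jf(x) and A(J), respectively.

```python
import numpy as np

def jacobian_descent(jacobian, aggregate, x, lr=0.01, steps=100):
    """Algorithm 2: repeatedly update x with the aggregation of the full Jacobian."""
    for _ in range(steps):
        J = jacobian(x)              # shape (m, n): one gradient (row) per objective
        x = x - lr * aggregate(J)    # the aggregator maps the Jacobian to an update direction
    return x

def stochastic_sub_jacobian_descent(jacobian, aggregate, x, lr=0.01, steps=100,
                                    batch_size=32, seed=0):
    """SSJD: aggregate a matrix whose rows are a random subset of the rows of the Jacobian.

    For clarity the full Jacobian is computed and then subsampled; in practice only the
    selected rows would be computed.
    """
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        J = jacobian(x)
        rows = rng.choice(J.shape[0], size=min(batch_size, J.shape[0]), replace=False)
        x = x - lr * aggregate(J[rows])
    return x

def mean_aggregator(J):
    """Averaging the rows recovers GD (or SGD, in the stochastic case) on the average objective."""
    return J.mean(axis=0)
```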

2.2 Desired properties for aggregators

An inherent challenge of multi-objective optimization is to manage conflicting objectives [10–12].


Note that substituting the update y = −ηA(Jf(x)) into the first-order Taylor approximation
f(x) + Jf(x) · y yields f(x) − ηJf(x) · A(Jf(x)). In particular, if 0 ≤ Jf(x) · A(Jf(x)),
then no coordinate of the approximation of f will increase. A pair of vectors x, y ∈ Rn is said
to conflict if x⊤y < 0. Hence, if any row of Jf(x) conflicts with A(Jf(x)), the corresponding
coordinate of f will locally increase. When minimizing f, avoiding conflict between the aggregation
and any gradient is thus desirable. In GD, ∇f(x) indeed does not conflict with itself. For aggregators
of JD, this motivates the first property.
Definition 1 (Non-conflicting). Let A : Rm×n → Rn be an aggregator. If for all J ∈ Rm×n,
0 ≤ J · A(J), then A is said to be non-conflicting.

For any collection of vectors C ⊆ Rn, the dual cone of C is {x ∈ Rn : ∀y ∈ C, 0 ≤ x⊤y} [20].
Notice that an aggregator A is non-conflicting if and only if for any J, A(J) is in the dual cone of
the rows of J.
In a step of GD, the update scales proportionally to the gradient norm. Small gradients will thus lead
to small updates, and conversely, large gradients will lead to large updates. Likewise, it would be
coherent that the rows of the Jacobian also contribute to the aggregation proportionally to their norm.
Scaling each row of a matrix J ∈ Rm×n by the corresponding element of a vector c ∈ Rm gives
diag(c) · J. This insight can then be formalized as the following property.
Definition 2 (Linear under scaling). Let A : Rm×n → Rn be an aggregator. If for all J ∈ Rm×n,
the mapping from any 0 < c ∈ Rm to A(diag(c) · J) is linear in c, then A is said to be linear under
scaling.
Finally, (1) highlights that the precision of the first-order Taylor approximation f(x) + Jf(x) · y
improves as ∥y∥ decreases. For any candidate update y, its projection y′ onto the row space of
Jf(x) is such that Jf(x) · y′ = Jf(x) · y and ∥y′∥ ≤ ∥y∥. This projection thus preserves the
value of the approximation while improving its precision. It is therefore sensible to select y directly
in the span of the rows of Jf(x), i.e. to have a vector of weights w ∈ Rm such that y = Jf(x)⊤ · w.
This holds by design in GD and yields the last desired property for aggregators of JD.
Definition 3 (Weighted). Let A : Rm×n → Rn be an aggregator. If for all J ∈ Rm×n, there exists
w ∈ Rm such that A(J) = J⊤ · w, then A is said to be weighted.
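As a small illustration of Definition 1 (our own check, not part of the paper's code), one can verify numerically whether an aggregation lies in the dual cone of the rows of J; the example matrix below is the one used in Appendix B.1 to show that averaging is not non-conflicting.

```python
import numpy as np

def is_non_conflicting_update(J, update, tol=1e-9):
    """True iff 0 <= J @ update, i.e. the update is in the dual cone of the rows of J."""
    return bool(np.all(J @ update >= -tol))

J = np.array([[-2.0],
              [ 4.0]])
mean_update = J.mean(axis=0)                      # A_Mean(J) = [1.0]
print(is_non_conflicting_update(J, mean_update))  # False: the mean conflicts with the first row
```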

2.3 Unconflicting projection of gradients

We now define the unconflicting projection of gradients aggregator AUPGrad , specifically designed to
be non-conflicting, linear under scaling and weighted. In essence, it projects each gradient onto the
dual cone of the rows of the Jacobian and averages the results, as illustrated in Figure 1a.
For any J ∈ Rm×n and x ∈ Rn, the projection of x onto the dual cone of the rows of J is

    πJ(x) = arg min_{y ∈ Rn : 0 ≤ Jy} ∥y − x∥².    (2)

Denoting by ei ∈ Rm the ith standard basis vector, J⊤ei is the ith row of J. AUPGrad is defined as

    AUPGrad(J) = (1/m) Σ_{i=1}^{m} πJ(J⊤ei).    (3)

Since the dual cone is convex, any combination of its elements with positive coefficients remains in it.
In particular, AUPGrad (J) is always in the dual cone of the rows of J. AUPGrad is thus non-conflicting.
Note that if no pair of gradients conflicts, AUPGrad simply averages the rows of the Jacobian.
Since πJ is a projection onto a closed convex cone, if x ∈ Rn and 0 < a ∈ R, then πJ(a · x) =
a · πJ(x). By (3), AUPGrad is thus linear under scaling.

Figure 1: Aggregation of J = [g1 g2]⊤ ∈ R2×2 by four different aggregators. The dual cone of
{g1, g2} is represented in green.
(a) AUPGrad(J) (ours): AUPGrad projects g1 and g2 onto the dual cone and averages the results.
(b) AMean(J), AMGDA(J) and ADualProj(J): the mean AMean(J) = (1/2)(g1 + g2) conflicts with g1.
ADualProj projects this mean onto the dual cone, so it lies on its frontier. AMGDA(J) is less aligned
with g2 because of its larger norm.

When n is large, the projection in (2) is prohibitively expensive to compute. An alternative but
equivalent approach is to use its dual formulation, which is independent of n.
Proposition 1. Let J ∈ Rm×n. For any u ∈ Rm, πJ(J⊤u) = J⊤w with

    w ∈ arg min_{v ∈ Rm : u ≤ v} v⊤JJ⊤v.    (4)

Proof. See Appendix A.2.

The problem defined in (4) can be solved efficiently using a quadratic programming solver, such as
those bundled in qpsolvers [21]. For any i ∈ [m], let wi be given by (4) when substituting u with
ei. Then, by Proposition 1,

    AUPGrad(J) = J⊤ ((1/m) Σ_{i=1}^{m} wi).    (5)
This provides an efficient implementation of AUPGrad and proves that it is weighted. AUPGrad can also
be easily extended to incorporate a vector of preferences by replacing the averaging in (3) and (5) by
a weighted sum with positive weights. This extension remains non-conflicting, linear under scaling
and weighted.
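As an illustration of (4) and (5), the sketch below computes AUPGrad(J) by solving one small quadratic program per row with qpsolvers. The function name, the ridge added to the Gramian for numerical stability, and the solver choice are our own assumptions, not the reference implementation.

```python
import numpy as np
from qpsolvers import solve_qp

def upgrad(J, ridge=1e-9, solver="quadprog"):
    """A_UPGrad(J) = J^T (1/m) sum_i w_i, where each w_i solves (4) with u = e_i."""
    m = J.shape[0]
    gramian = J @ J.T + ridge * np.eye(m)   # JJ^T, slightly regularized so the QP is well-posed
    weights = np.zeros(m)
    for i in range(m):
        e_i = np.zeros(m)
        e_i[i] = 1.0
        # Dual formulation (4): minimize v^T (JJ^T) v subject to e_i <= v.
        w_i = solve_qp(P=2.0 * gramian, q=np.zeros(m), lb=e_i, solver=solver)
        weights += w_i / m
    return J.T @ weights                    # weighted aggregation, as in (5)

J = np.array([[1.0, 3.0],
              [1.0, -3.0]])                 # two conflicting gradients
print(upgrad(J))                            # the result lies in the dual cone of the rows of J
```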

2.4 Convergence

We now prove the convergence of JD with AUPGrad when minimizing f : Rn → Rm under standard
assumptions. If for a given x ∈ Rn, there exists no y ∈ Rn such that f(y) ⪇ f(x), then x is said to
be Pareto optimal. The set X∗ ⊆ Rn of Pareto optimal points is called the Pareto set, and its image
f(X∗) is called the Pareto front.
Whenever f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) holds for any pair of vectors x, y ∈ Rn
and any λ ∈ [0, 1], f is said to be ≤-convex. Moreover, f is said to be β-smooth whenever
∥Jf(x) − Jf(y)∥F ≤ β∥x − y∥ holds for any pair of vectors x, y ∈ Rn.
Theorem 1. Let f : Rn → Rm be a β-smooth and ≤-convex function. Suppose that the Pareto
front f(X∗) is bounded and that for any x ∈ Rn, there is x∗ ∈ X∗ such that f(x∗) ≤ f(x).² Let
x0 ∈ Rn, and for all t ∈ N, xt+1 = xt − ηAUPGrad(Jf(xt)), with η = 1/(β√m). Let wt be the weights
defining AUPGrad(Jf(xt)) as per (5), i.e. AUPGrad(Jf(xt)) = Jf(xt)⊤ · wt. If wt is bounded, then
f(xt) converges to f(x∗) for some x∗ ∈ X∗. In other words, f(xt) converges to the Pareto front.

Proof. See Appendix A.3.


² This condition is a generalization to the case m ≥ 1 of the condition that there exists a minimizer x∗ ∈ Rn.

Empirical observations suggest that wt converges to some w∗ ∈ Rm such that both 0 < w∗
and Jf(x∗)⊤w∗ = 0. This suggests that the boundedness of wt could be relaxed or even removed
from the set of assumptions of Theorem 1. This would demonstrate that JD with AUPGrad and an
appropriate step size converges to the Pareto front in the smooth convex case.

3 Applications
Instance-wise risk minimization In machine learning, we generally have access to a training set
consisting of m examples. The goal of empirical risk minimization (ERM) [18] is simply to minimize
the average loss over the whole training set. More generally, instance-wise risk minimization (IWRM)
considers the loss associated with each training example as a distinct objective. Formally, if x ∈ Rn
are the parameters of the model and fi (x) is the loss associated to the ith example, the respective
objective functions of ERM and IWRM are:
    (Empirical risk)        f̄(x) = (1/m) Σ_{i=1}^{m} fi(x)    (6)

    (Instance-wise risk)    f(x) = [f1(x) f2(x) · · · fm(x)]    (7)

Naively using GD for ERM is inefficient in most practical cases. A prevalent alternative is to use
SGD or one of its variants. Similarly, using JD for IWRM is typically intractable. Indeed, it would
require computing a Jacobian matrix with one row per training example at each iteration. In contrast,
we can use the Jacobian of a random batch of training example losses. Since it consists of a subset of
the rows of the full Jacobian, this approach is a form of stochastic sub-Jacobian descent, as introduced
in Section 2.1. Note that IWRM can be extended to cases where each fi is a vector-valued function.
The objective would then be the concatenation of the losses of all examples.
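A minimal sketch of one SSJD step for IWRM (our own illustrative PyTorch code, independent of the torchjd library): the per-example losses of a batch form the rows of a sub-Jacobian, which is then reduced by an aggregator; `model`, `loss_fn` (with reduction="none") and `aggregate` are assumed to be provided.

```python
import torch

def iwrm_ssjd_step(model, loss_fn, inputs, targets, aggregate, lr=0.01):
    """One SSJD step for IWRM: one gradient per training example in the batch."""
    params = [p for p in model.parameters() if p.requires_grad]
    losses = loss_fn(model(inputs), targets)        # shape (batch_size,): one loss per example
    rows = []
    for loss in losses:                             # each example contributes one row of the sub-Jacobian
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    sub_jacobian = torch.stack(rows)                # shape (batch_size, n_parameters)
    update = aggregate(sub_jacobian)                # e.g. the row mean recovers SGD on the average loss
    with torch.no_grad():                           # apply the shared parameter update
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * update[offset:offset + n].view_as(p)
            offset += n
```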

Multi-task learning In multi-task learning, a single model is trained to perform several related
tasks simultaneously, leveraging shared representations to improve overall performance [22]. At its
core, multi-task learning is a multi-objective optimization problem [10], making it a straightforward
application for Jacobian descent. Yet, conflict between tasks is often too limited to justify the overhead
of computing all task-specific gradients, i.e. the whole Jacobian [23, 24]. In such cases, a practical
approach is to minimize some linear scalarization of the objectives using an SGD-based method.
Nevertheless, we believe that a setting with inherent conflict between tasks naturally prescribes
Jacobian descent with a non-conflicting aggregator. We analyze several related works applied to
multi-task learning in Section 4.

Adversarial training In adversarial domain adaptation, the feature extractor of a model is trained
with two conflicting objectives: The features should be helpful for the main task and should be
unable to discriminate the domain of the input [25]. Likewise, in adversarial fairness, the feature
extractor is trained to both minimize the predictability of sensitive attributes, such as race or gender,
and maximize the performance on the main task [26]. Combining the corresponding gradients with
a non-conflicting aggregator could enhance the optimization of such methods. We believe that the
training of generative adversarial networks [27] could be similarly formulated as a multi-objective
optimization problem. The generator and discriminator could then be jointly optimized with JD.

Momentum-based optimization In gradient-based single-objective optimization, several methods
use some form of gradient momentum to improve their convergence speed [28]. Essentially, their
updates consider an exponential moving average of past gradients rather than just the last one. An
appealing idea is to modify those algorithms to make them combine the gradient and the momentum
with some aggregator, such as AUPGrad , rather than summing them. This would apply to many popular
optimizers, like SGD with Nesterov momentum [29], Adam [19], AdamW [30] and NAdam [31].
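A minimal sketch of this idea (our own, hypothetical variant of SGD with momentum, not an existing optimizer): the current gradient and the decayed momentum are stacked into a two-row matrix and combined by an aggregator instead of being summed.

```python
import numpy as np

def momentum_step_with_aggregator(x, grad, buf, aggregate, lr=0.01, beta=0.9):
    """SGD-with-momentum step in which the sum grad + beta * buf is replaced by an aggregation."""
    J = np.stack([grad, beta * buf])    # two rows: current gradient and decayed momentum
    update = aggregate(J)               # summing the rows would recover the usual momentum update
    buf = beta * buf + grad             # keep the standard momentum buffer for the next step
    return x - lr * update, buf
```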

Distributed optimization In a distributed data-parallel setting with multiple machines or multiple
GPUs, model updates are computed in parallel. This can be viewed as multi-objective optimization
with one objective per data share. Rather than the typical averaging, a specialized aggregator, such as
AUPGrad , could thus combine the model updates. This consideration can even be extended to federated
learning, in which multiple entities participate in the training of a common model from their own
private data by sharing model updates [32]. In this setting, as security is one of the main challenges,
the non-conflicting property of the aggregator could be interesting or even necessary.

4 Existing aggregators

In the context of multi-task learning, several works have proposed iterative optimization algorithms
based on the combination of task-specific gradients [10–16]. Aggregators can be extracted from
these methods, thereby formulating them as variants of JD. More specifically, since the gradients are
stochastically estimated from batches of data, these are cases of what we call SEJD. In the following,
we briefly present the most prominent aggregators and summarize their properties in Table 1. We
also consider AMean , which simply averages the rows of the Jacobian. Their formal definitions are
provided in Appendix B. Some of them are also illustrated in Figure 1b.
ARGW aggregates the matrix using a random vector of weights [14]. AMGDA gives the aggregation
that maximizes the smallest improvement [5, 6, 10]. ACAGrad maximizes the smallest improvement
in a ball around the average gradient whose radius is parameterized by c ∈ [0, 1[ [12]. APCGrad
projects each gradient onto the orthogonal hyperplane of other gradients in case of conflict, iteratively
and in a random order [11]. It is, however, only non-conflicting when m ĺ 2, in which case
APCGrad = m · AUPGrad . IMTL-G is a method to balance some gradients with impartiality [13].
It is only defined for linearly independent gradients, but we generalize it as a formal aggregator,
denoted AIMTL-G , in Appendix B.6. Aligned-MTL orthonormalizes the Jacobian and weights its rows
according to some preferences [16]. We denote by AAligned-MTL this method with uniform preferences.
ANash-MTL aggregates Jacobians by finding the Nash equilibrium between task-specific gradients [15].
Lastly, the GradDrop layer [17] defines a custom backward pass that combines gradients with respect
to some internal activation. The corresponding aggregator, denoted AGradDrop , randomly drops out
some gradient coordinates based on their sign and sums the remaining ones.
In the context of continual learning, to limit forgetting, an idea is to project the gradient onto the dual
cone of gradients computed with past examples [33]. This idea can be translated into an aggregator
that projects the mean gradient onto the dual cone of the rows of the Jacobian. We name this ADualProj .
Several other works consider the gradients to be noisy when making their theoretical analysis [34–38].
Their solutions for combining gradients are typically stateful. Unifying their formulations would thus
require a more complex variant of Jacobian descent.
In the federated learning setting, several aggregators have been proposed to combine the model
updates while being robust to adversaries [39–42]. We do not study them here as their desired
properties mainly revolve around security.

Table 1: Properties satisfied for any number of objectives. Proofs are provided in Appendix B.

Ref.          Aggregator     Non-conflicting   Linear under scaling   Weighted
—             AMean          ✗                 ✓                      ✓
[5, 6, 10]    AMGDA          ✓                 ✗                      ✓
[33]          ADualProj      ✓                 ✗                      ✓
[11]          APCGrad        ✗                 ✓                      ✓
[17]          AGradDrop      ✗                 ✗                      ✗
[13]          AIMTL-G        ✗                 ✗                      ✓
[12]          ACAGrad        ✗                 ✗                      ✓
[14]          ARGW           ✗                 ✓                      ✓
[15]          ANash-MTL      ✓                 ✗                      ✓
[16]          AAligned-MTL   ✗                 ✗                      ✓
(ours)        AUPGrad        ✓                 ✓                      ✓

5 Experiments

In the following, we present empirical results for instance-wise risk minimization on some simple
image classification datasets. IWRM is performed by stochastic sub-Jacobian descent, as described
in Section 3. A key consideration is that when the aggregator is AMean , this approach becomes
equivalent to empirical risk minimization with SGD. Thus, it constitutes a baseline for comparison.
We train convolutional neural networks on subsets of SVHN [43], CIFAR-10 [44], EuroSAT [45],
MNIST [46], Fashion-MNIST [47] and Kuzushiji-MNIST [48]. To make the comparisons as fair as
possible, we have tuned the learning rate very precisely for each aggregator, as explained in detail
in Appendix C.1. To keep computational costs reasonable, we have thus limited the size of each
training dataset to 1024 images. This also enables re-running the same experiments several times
on different subsets and with different seeds to gain confidence in our results. Note that this is
strictly an optimization problem: we are not studying the generalization of the model, which would
be captured by some performance metric on a test set. Other experimental settings, such as the
network architectures and the total computational budget used to run our experiments, are given in
Appendix C. Figure 2 reports the main results on SVHN and CIFAR-10, two of the datasets exhibiting
the most substantial performance gap. Results on the other datasets and aggregators are reported in
Appendix D.1. They also demonstrate a significant performance gap.

[Figure 2 shows four panels: (a) SVHN: training loss; (b) SVHN: update similarity to the SGD update;
(c) CIFAR-10: training loss; (d) CIFAR-10: update similarity to the SGD update. The compared
aggregators are Mean (SGD), UPGrad (ours), MGDA, Aligned-MTL, PCGrad and DualProj; the plots show
the categorical cross-entropy and the cosine similarity against the iteration.]

Figure 2: Optimization metrics obtained with IWRM with 1024 training examples and a batch size of
32, averaged over 8 random runs. The shaded area around each curve shows the estimated standard
error of the mean over the 8 runs. Curves are smoothed for readability. Best viewed in color.

Here, we compare the aggregators in terms of their average loss over the training set. This is precisely
the goal of ERM. For this reason, it is quite surprising to see that the direct optimization of this
objective, as performed by AMean, can be outperformed by some other aggregators. In particular,
AUPGrad, and to some extent ADualProj, provide improvements on all datasets. Figures 2b and 2d
show the similarity between the update of each aggregator and the update given by AMean . As
explained in Section 2.3, if there is no conflict among gradients, AUPGrad becomes equivalent to
AMean . The similarity curve of AUPGrad thus suggests substantial conflict between the individual
gradients, especially early into the training. Our interpretation is that at the beginning, the gradients
of hard examples are dominated by those of easier examples if averaged. However, with AUPGrad ,
the updates will positively affect hard examples even during the early training phase. Since fitting
such examples is more complex and time-consuming, it is beneficial to consider them early on.
Another possibility is that the projections of AUPGrad improve its stability compared to AMean , making
it favor a learning rate giving larger updates, thus enabling a faster convergence. The sub-optimal
performance of AMGDA in this setting can be attributed to its sensitivity to small gradients. If any row
of the Jacobian approaches zero, the aggregation by AMGDA will also approach zero. Note that an
advantage of linearity under scaling is to explicitly prevent this from happening.
Overall, these experiments demonstrate a high potential for the IWRM paradigm and confirm the
validity of JD, and more specifically of SSJD, as multi-objective optimization algorithms. More
generally, they question the soundness of the ubiquitous loss scalarization in deep learning, especially
when conflict is significant. Besides, the superiority of AUPGrad in such a simple setting confirms
our theoretical findings on the desired properties of aggregators. Consequently, using AUPGrad would
likely be beneficial as an out-of-the-box replacement whenever some gradients are averaged or
summed. An interesting consideration is that with SGD, increasing the batch size decreases the
variance, but with SSJD combined with AUPGrad , the effect is non-trivial since it also narrows the
dual cone. Additional results obtained when varying the batch size or when updating the parameters
with the Adam optimizer are available in Appendices D.2 and D.3, respectively.
It is important to note that an iteration of SSJD takes more time than an iteration of SGD. The exact
runtime of SSJD depends on many factors, such as the aggregator, the parallelization ability of the
device on which the Jacobians are computed, and the implementation. We provide an empirical
comparison of the speed of SGD and SSJD with different aggregators in Appendix E. Besides, we
address computational efficiency concerns in the next section.

6 Towards an efficient implementation


When the number of objectives is dominated by the number of parameters of the model, the main
overhead of JD comes from the computation of a Jacobian matrix rather than a single gradient. We
give several paths towards faster algorithms.

Grouping objectives Linear scalarization of a multi-objective optimization problem casts it as
single-objective. This idea can be relaxed by considering k different scalarizations of the objectives,
with k < m [49]. In the context of multi-task learning, several studies have already worked on
task relationships and groupings [50–54]. We believe combining objectives with generally aligned
gradients would be preferable.

Gramian-based JD In the following, we show how to make a step of JD without even having
to compute the Jacobian. For any J ∈ Rm×n , the matrix G = JJ ⊤ is called the Gramian of
J and is positive semi-definite. Let Mm ⊆ Rm×m be the set of positive semi-definite matrices.
The Gramian of the Jacobian, denoted Gf (x) = Jf (x)Jf (x)⊤ ∈ Mm , captures the relations –
including conflicts – between all pairs of gradients.
Whenever A is a weighted aggregator, the update of JD is y = −ηJf (x)⊤ w for some vector of
weights w ∈ Rm . Substituting this into the Taylor approximation of (1) gives
    f(x + y) = f(x) − ηGf(x) · w + o(η √(w⊤Gf(x)w)).    (8)

This expression only depends on the Jacobian through its Gramian. It is thus sensible to focus on aggre-
gators whose weights are only function of the Gramian. Denoting this function as W : Mm → Rm ,
those aggregators are such that A(J) = J ⊤ · W(G). Remarkably, all weighted aggregators of Table 1
can be expressed in this form. For such aggregators, substitution and linearity of differentiation³ then
yield

    A(Jf(x)) = ∇(W(Gf(x))⊤ · f)(x).    (9)

³ For any x ∈ Rn and any w ∈ Rm, Jf(x)⊤w = ∇(w⊤f)(x).
After computing W(Gf (x)), a step of JD would thus only require the backpropagation of a scalar
function. The computational cost of applying W depends on the aggregator and is often dominated
by the cost of computing the Gramian. We now outline a promising alternative algorithm for the
latter.
Similarly to the backpropagation algorithm, the chain rule can be leveraged to propagate the Gramian.
Let g : Rn → Rk and f : Rk → Rm , then for any x ∈ Rn ,
G(f ◦ g)(x) = Jf (g(x)) · Gg(x) · Jf (g(x))⊤ . (10)
Besides, when the function has multiple inputs, the Gramian can be computed as a sum of individual
Gramians. Let f : Rn1+···+nk → Rm and x = [x1⊤ · · · xk⊤]⊤. We can write Jf(x) as the
concatenation of Jacobians [J1f(x) · · · Jkf(x)], where Jif(x) is the Jacobian of f with respect
to xi evaluated at x. For any i ∈ [k], let Gif(x) = Jif(x)Jif(x)⊤. Then

    Gf(x1, . . . , xk) = Σ_{i=1}^{k} Gif(x1, . . . , xk).    (11)

When a function is made of compositions and concatenations of simple functions, those two rules
enable efficient computation of the Gramian. In particular, let f : Rn × Rℓ → Rm be some layer of
a sequential neural network. Given an input x ∈ Rn and some parameter p ∈ Rℓ, the output is given by
y = f (x, p). If G ∈ Mn is some Gramian associated to x, then forwarding G through f yields
J1 f (x, p) · G · J1 f (x, p)⊤ + G2 f (x, p). (12)
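As a numerical sketch of (10)-(12) (our own illustration; the expression G2f(x, p) = (x⊤x + 1) Im for a linear layer y = Wx + b is our derivation, not taken from Appendix F), the Gramian forwarded through a linear layer matches the Gramian of the full Jacobian of the composition.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, in_dim, out_dim = 4, 3, 2

A = rng.standard_normal((in_dim, p_dim))   # upstream map g(z) = A z, so Jg(z) = A
W = rng.standard_normal((out_dim, in_dim)) # linear layer f(x, W, b) = W x + b
z = rng.standard_normal(p_dim)
x = A @ z                                  # layer input

# Rule (12): forward the Gramian of x through the layer.
G_in = A @ A.T                                                  # Gg(z) = Jg Jg^T
G_fwd = W @ G_in @ W.T + (x @ x + 1.0) * np.eye(out_dim)        # J1f . G . J1f^T + G2f

# Reference: Gramian of the full Jacobian of (z, W, b) -> W g(z) + b, via rules (10) and (11).
J_z = W @ A                                # Jacobian w.r.t. z (chain rule)
J_W = np.kron(np.eye(out_dim), x)          # Jacobian w.r.t. vec(W), row-major
J_b = np.eye(out_dim)                      # Jacobian w.r.t. b
J_full = np.hstack([J_z, J_W, J_b])
print(np.allclose(G_fwd, J_full @ J_full.T))   # True: forwarding reproduces the full Gramian
```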

Appendix F contains examples of Gramian forwarding for some usual layers. It is worth noting
that the matrices of (12) can be very sparse and structured. For this reason, representing them as
arrays can be inefficient. Finding the optimal representations in all practical cases and devising a full
algorithm for an optimized computation of the Gramian remains an open challenge.
Still, these considerations have the potential to drastically reduce the computation time of JD and
SSJD for some architectures. This would make our methods more extensively applicable and would
thus be a significant step towards unlocking multi-objective optimization as a practical paradigm.

7 Conclusion
In this paper, we have defined the JD algorithm, characterized by an aggregator. Moreover, we
have formulated desired aggregator properties and proposed AUPGrad to alleviate the weaknesses of
existing methods. Then, we have presented several possible applications for JD and aggregators.
In particular, we have introduced IWRM, a novel learning paradigm minimizing the vector of per-
instance losses. An empirical assessment of IWRM on some usual image classification datasets
shows promising results and hints that it could be used as a general-purpose learning paradigm. As
speed is the main limitation of JD, we have given a path towards an efficient implementation. We
think that a fast algorithm to compute the Gramian of the Jacobian could unlock the full potential of
JD. An important takeaway from our experiments is that considering an optimization problem in its
entire dimensionality can be beneficial to its resolution. We thus hope that our work enables more
multi-objective optimization problems to be considered as such.

Limitations and future directions Our experimentation has some important limitations. First, we
only evaluate JD on the minimization of the IWRM objective. It would be essential to develop proper
benchmarks to compare aggregators on a wide variety of problems. Ideally, such problems should
involve substantially conflicting objectives, e.g. multi-task learning with inherently competing or
even adversarial tasks. Then, we have limited our scope to the comparison of optimization speeds,
disregarding generalization. While this simplifies the experiments and makes the comparison precise
and rigorous, optimization and generalization are sometimes intertwined. We thus believe that future
works should focus on both aspects.

Acknowledgments

We would like to express our sincere thanks to Scott Pesme, Emre Telatar, Matthieu Buot de l’Épine,
Adrien Vandenbroucque, Alix Jeannerot and Damian Dudzicz for their careful and thorough review.
The many insightful discussions that we shared with them were essential to this project.

References
[1] Y. Sawaragi, H. Nakayama, and T. Tanino, Theory of Multiobjective Optimization. Elsevier,
1985.
[2] M. Ehrgott, Multicriteria Optimization. Springer Science & Business Media, 2005.
[3] J. Branke, Multiobjective Optimization: Interactive and Evolutionary Approaches. Springer
Science & Business Media, 2008.
[4] K. Deb, K. Sindhya, and J. Hakanen, “Multi-objective optimization,” in Decision sciences, CRC
Press, 2016.
[5] J. Fliege and B. F. Svaiter, “Steepest descent methods for multicriteria optimization,” Mathemat-
ical Methods of Operations Research, 2000.
[6] J.-A. Désidéri, “Multiple-gradient descent algorithm (MGDA) for multiobjective optimization,”
Comptes Rendus Mathematique, 2012.
[7] J. Fliege, A. I. F. Vaz, and L. N. Vicente, “Complexity of gradient descent for multiobjective
optimization,” Optimization Methods and Software, 2019.
[8] F. Poirion, Q. Mercier, and J.-A. Désidéri, “Descent algorithm for nonsmooth stochastic multi-
objective optimization,” Computational Optimization and Applications, 2017.
[9] Q. Mercier, F. Poirion, and J.-A. Désidéri, “A stochastic multiple gradient descent algorithm,”
European Journal of Operational Research, 2018.
[10] O. Sener and V. Koltun, “Multi-task learning as multi-objective optimization,” Advances in
Neural Information Processing Systems, 2018.
[11] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task
learning,” in Advances in Neural Information Processing Systems, 2020.
[12] B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu, “Conflict-averse gradient descent for multi-task
learning,” in Advances in Neural Information Processing Systems, 2021.
[13] L. Liu, Y. Li, Z. Kuang, J.-H. Xue, Y. Chen, W. Yang, Q. Liao, and W. Zhang, “Towards
impartial multi-task learning,” in International Conference on Learning Representations, 2021.
[14] B. Lin, F. Ye, Y. Zhang, and I. W. Tsang, “Reasonable effectiveness of random weighting: A
litmus test for multi-task learning,” arXiv preprint arXiv:2111.10603, 2021.
[15] A. Navon, A. Shamsian, I. Achituve, H. Maron, K. Kawaguchi, G. Chechik, and E. Fetaya,
“Multi-task learning as a bargaining game,” in International Conference on Machine Learning,
2022.
[16] D. Senushkin, N. Patakin, A. Kuznetsov, and A. Konushin, “Independent component alignment
for multi-task learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2023.
[17] Z. Chen, J. Ngiam, Y. Huang, T. Luong, H. Kretzschmar, Y. Chai, and D. Anguelov, “Just pick
a sign: Optimizing deep multitask models with gradient sign dropout,” in Advances in Neural
Information Processing Systems, 2020.
[18] V. N. Vapnik, The Nature of Statistical learning theory. Wiley New York, 1995.

[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.
[20] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge university press, 2004.
[21] S. Caron, D. Arnström, S. Bonagiri, A. Dechaume, N. Flowers, A. Heins, T. Ishikawa,
D. Kenefake, G. Mazzamuto, D. Meoli, B. O’Donoghue, A. A. Oppenheimer, A. Pandala,
J. J. Quiroz Omaña, N. Rontsis, P. Shah, S. St-Jean, N. Vitucci, S. Wolfers, F. Yang, @bdel-
haisse, @MeindertHH, @rimaddo, @urob, and @shaoanlu, “qpsolvers: Quadratic Programming
Solvers in Python,” 2024.
[22] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint
arXiv:1706.05098, 2017.
[23] V. Kurin, A. De Palma, I. Kostrikov, S. Whiteson, and P. K. Mudigonda, “In defense of
the unitary scalarization for deep multi-task learning,” in Advances in Neural Information
Processing Systems, 2022.
[24] D. Xin, B. Ghorbani, J. Gilmer, A. Garg, and O. Firat, “Do current multi-task optimization
methods in deep learning even help?,” in Advances in Neural Information Processing Systems,
2022.
[25] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and
V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of machine learning
research, 2016.
[26] T. Adel, I. Valera, Z. Ghahramani, and A. Weller, “One-network adversarial fairness,” in AAAI
Conference on Artificial Intelligence, 2019.
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing
Systems, 2014.
[28] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR
computational mathematics and mathematical physics, 1964.
[29] Y. Nesterov, “A method of solving a convex programming problem with convergence rate
O(1/k²)," Proceedings of the USSR Academy of Sciences, 1983.
[30] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Confer-
ence on Learning Representations, 2019.
[31] T. Dozat, “Incorporating Nesterov momentum into Adam,” in International Conference on
Learning Representations Workshop, 2016.
[32] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz,
Z. Charles, G. Cormode, R. Cummings, et al., “Advances and open problems in federated
learning,” Foundations and Trends® in Machine Learning, 2021.
[33] D. Lopez-Paz and M. A. Ranzato, “Gradient episodic memory for continual learning,” in
Advances in Neural Information Processing Systems, 2017.
[34] S. Liu and L. N. Vicente, “The stochastic multi-gradient algorithm for multi-objective opti-
mization and its application to supervised machine learning,” Annals of Operations Research,
2021.
[35] S. Zhou, W. Zhang, J. Jiang, W. Zhong, J. Gu, and W. Zhu, “On the convergence of stochas-
tic multi-objective gradient manipulation and beyond,” in Advances in Neural Information
Processing Systems, 2022.
[36] H. D. Fernando, H. Shen, M. Liu, S. Chaudhury, K. Murugesan, and T. Chen, “Mitigating
gradient bias in multi-objective learning: A provably convergent approach,” in International
Conference on Learning Representations, 2022.

[37] L. Chen, H. Fernando, Y. Ying, and T. Chen, “Three-way trade-off in multi-objective learn-
ing: Optimization, generalization and conflict-avoidance,” in Advances in Neural Information
Processing Systems, 2024.
[38] P. Xiao, H. Ban, and K. Ji, “Direction-oriented multi-objective learning: Simple and provable
stochastic algorithms,” in Advances in Neural Information Processing Systems, 2024.
[39] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries:
Byzantine tolerant gradient descent,” in Advances in Neural Information Processing Systems,
2017.
[40] R. Guerraoui, S. Rouault, et al., “The hidden vulnerability of distributed learning in byzantium,”
in International Conference on Machine Learning, 2018.
[41] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings:
Byzantine gradient descent,” in Proceedings of the ACM on Measurement and Analysis of
Computing Systems, 2017.
[42] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Byzantine-robust distributed learning: Towards
optimal statistical rates,” in International Conference on Machine Learning, 2018.
[43] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. Ng, et al., “Reading digits in
natural images with unsupervised feature learning,” in NIPS workshop on deep learning and
unsupervised feature learning, 2011.
[44] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images,” 2009.
[45] P. Helber, B. Bischke, A. Dengel, and D. Borth, “EuroSAT: A novel dataset and deep learning
benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, 2019.
[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” in Proceedings of the IEEE, 1998.
[47] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
[48] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, “Deep learning
for classical japanese literature,” in NeurIPS Workshop on Machine Learning for Creativity and
Design, 2018.
[49] S. Dempe, G. Eichfelder, and J. Fliege, “On the effects of combining objectives in multi-
objective optimization,” Mathematical Methods of Operations Research, 2015.
[50] Z. Kang, K. Grauman, and F. Sha, “Learning with whom to share in multi-task feature learning,”
in International Conference on Machine Learning, 2011.
[51] A. Kumar and H. Daumé III, “Learning task grouping and overlap in multi-task learning,” in
International Conference on Machine Learning, 2012.
[52] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentan-
gling task transfer learning,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2018.
[53] C. Fifty, E. Amid, Z. Zhao, T. Yu, R. Anil, and C. Finn, “Efficiently identifying task groupings
for multi-task learning,” in Advances in Neural Information Processing Systems, 2021.
[54] J. Shen, C. Wang, Z. Xiao, N. Van Noord, and M. Worring, “GO4Align: Group optimization for
multi-task alignment,” arXiv preprint arXiv:2404.06486, 2024.
[55] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, et al., “PyTorch: An imperative style, high-performance deep
learning library,” in Advances in Neural Information Processing Systems, 2019.

[56] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by
exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[57] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing
internal covariate shift,” in International Conference on Machine Learning, 2015.

A Proofs
A.1 Supplementary theoretical results

Recall that a function f : Rn → Rm is ≤-convex if for all x, y ∈ Rn and any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
Lemma 1. If f : Rn → Rm is a continuously differentiable ≤-convex function, then for any pair of
vectors x, y ∈ Rn, Jf(x)(y − x) ≤ f(y) − f(x).

Proof.
    Jf(x)(y − x) = lim_{λ→0+} (f(x + λ(y − x)) − f(x)) / λ    (differentiation)
                 ≤ lim_{λ→0+} (f(x) + λ(f(y) − f(x)) − f(x)) / λ    (≤-convexity)
                 = f(y) − f(x),
which concludes the proof. □
Lemma 2. Let J ∈ Rm×n, let u ∈ Rm and let x ∈ Rn, then
    u⊤Jx ≤ ∥u∥ · ∥J∥F · ∥x∥.

Proof. Let Ji be the ith row of J, then
    (u⊤Jx)² ≤ ∥u∥² · ∥Jx∥²    (Cauchy-Schwarz inequality)
            = ∥u∥² · Σ_{i=1}^{m} (Ji⊤x)²
            ≤ ∥u∥² · Σ_{i=1}^{m} ∥Ji∥² · ∥x∥²    (Cauchy-Schwarz inequality)
            = ∥u∥² · ∥J∥F² · ∥x∥²,
which concludes the proof. □

Recall that a function f : Rn → Rm is β-smooth if for all x, y ∈ Rn,

    ∥Jf(x) − Jf(y)∥F ≤ β∥x − y∥.    (13)

Lemma 3. Let f : Rn → Rm be β-smooth, then for any w ∈ Rm and any x, y ∈ Rn,

    w⊤(f(x) − f(y) − Jf(y)(x − y)) ≤ (β/2) ∥w∥ · ∥x − y∥².    (14)

Proof.
    w⊤(f(x) − f(y) − Jf(y)(x − y))
        = w⊤(∫₀¹ Jf(y + t(x − y))(x − y) dt − Jf(y)(x − y))    (fundamental theorem of calculus)
        = ∫₀¹ w⊤(Jf(y + t(x − y)) − Jf(y))(x − y) dt
        ≤ ∫₀¹ ∥w∥ · ∥Jf(y + t(x − y)) − Jf(y)∥F · ∥x − y∥ dt    (Lemma 2)
        ≤ ∫₀¹ ∥w∥ · βt · ∥x − y∥² dt    (β-smoothness (13))
        = (β/2) ∥w∥ · ∥x − y∥²,
which concludes the proof. □

A.2 Proposition 1

Proposition 1. Let J ∈ Rm×n. For any u ∈ Rm, πJ(J⊤u) = J⊤w with

    w ∈ arg min_{v ∈ Rm : u ≤ v} v⊤JJ⊤v.    (4)

Proof. This is a direct consequence of Lemma 4. □

Lemma 4. Let J ∈ Rm×n, G = JJ⊤, u ∈ Rm. For any w ∈ Rm satisfying

    u ≤ w    (15a)
    0 ≤ Gw    (15b)
    u⊤Gw = w⊤Gw    (15c)

we have πJ(J⊤u) = J⊤w. Such a w is the solution to

    w ∈ arg min_{u ≤ v} v⊤Gv.

Proof. Note that

    πJ(J⊤u) = arg min_{x ∈ Rn : 0 ≤ Jx} (1/2)∥x − J⊤u∥²

is a convex program. Consequently, the KKT conditions are both necessary and sufficient. The
Lagrangian is given by L(x, v) = (1/2)∥x − J⊤u∥² − v⊤Jx. The KKT conditions are then given by

    ∇x L(x, v) = 0,    0 ≤ v,    0 ≤ Jx,    0 = v⊤Jx,

which, since ∇x L(x, v) = x − J⊤(u + v), is equivalent to

    x = J⊤(u + v),    0 ≤ v,    0 ≤ G(u + v),    0 = v⊤G(u + v),

and hence to

    x = J⊤(u + v),    u ≤ u + v,    0 ≤ G(u + v),    u⊤G(u + v) = (u + v)⊤G(u + v).

The simple change of variable w = u + v finishes the proof of the first part.
Since x = J⊤(u + v), the Wolfe dual program of πJ(J⊤u) gives

    w ∈ u + arg max_{v ∈ Rm : 0 ≤ v} L(J⊤(u + v), v)
      = u + arg max_{v ∈ Rm : 0 ≤ v} (1/2)∥J⊤v∥² − v⊤JJ⊤(u + v)
      = u + arg max_{v ∈ Rm : 0 ≤ v} −(1/2)v⊤Gv − v⊤Gu
      = u + arg min_{v ∈ Rm : u ≤ u+v} (1/2)(u + v)⊤G(u + v)
      = arg min_{v′ ∈ Rm : u ≤ v′} (1/2)v′⊤Gv′,

which concludes the proof. □

A.3 Theorem 1

Theorem 1. Let f : Rn → Rm be a β-smooth and ≤-convex function. Suppose that the Pareto
front f(X∗) is bounded and that for any x ∈ Rn, there is x∗ ∈ X∗ such that f(x∗) ≤ f(x). Let
x0 ∈ Rn, and for all t ∈ N, xt+1 = xt − ηAUPGrad(Jf(xt)), with η = 1/(β√m). Let wt be the weights
defining AUPGrad(Jf(xt)) as per (5), i.e. AUPGrad(Jf(xt)) = Jf(xt)⊤ · wt. If wt is bounded, then
f(xt) converges to f(x∗) for some x∗ ∈ X∗. In other words, f(xt) converges to the Pareto front.

To prove the theorem we will need Lemmas 5, 6 and 7 below.

Lemma 5. Let J ∈ Rm×n and w = (1/m) Σ_{i=1}^{m} wi be the weights defining AUPGrad(J) as per (5).
Let, as usual, G = JJ⊤, then

    w⊤Gw ≤ 1⊤Gw.

Proof. Observe that if, for any u, v ∈ Rm, ⟨u, v⟩ = u⊤Gv, then ⟨·, ·⟩ is an inner product. In this
Hilbert space, the Cauchy-Schwarz inequality reads as

    (u⊤Gv)² = ⟨u, v⟩² ≤ ⟨u, u⟩ · ⟨v, v⟩ = u⊤Gu · v⊤Gv.

Therefore

    w⊤Gw = (1/m²) Σ_{i,j} wi⊤Gwj
         ≤ (1/m²) Σ_{i,j} √(wi⊤Gwi) · √(wj⊤Gwj)    (Cauchy-Schwarz inequality)
         = ((1/m) Σ_i √(wi⊤Gwi))²
         ≤ (1/m) Σ_i (√(wi⊤Gwi))²    (Jensen's inequality)
         = (1/m) Σ_i wi⊤Gwi    (G positive semi-definite)
         = (1/m) Σ_i ei⊤Gwi    (Lemma 4, (15c))
         ≤ (1/m) Σ_i 1⊤Gwi    (Lemma 4, (15b), and ei ≤ 1)
         = 1⊤Gw,

which concludes the proof. □

Lemma 6. Under the assumptions of Theorem 1, for any w ∈ Rm,

    w⊤(f(xt+1) − f(xt)) ≤ (∥w∥/(β√m)) · ((1/(2√m))1 − w/∥w∥)⊤ Gt wt.

Proof. For all t ≥ 0, let Jt = Jf(xt), Gt = JtJt⊤. Then xt+1 = xt − ηAUPGrad(Jt) = xt − ηJt⊤wt.
Therefore

    w⊤(f(xt+1) − f(xt))
        ≤ −ηw⊤JtJt⊤wt + (βη²/2) ∥w∥ · ∥Jt⊤wt∥²    (Lemma 3)
        = −(1/(β√m)) w⊤Gtwt + (1/(2βm)) ∥w∥ · wt⊤Gtwt    (η = 1/(β√m))
        ≤ −(1/(β√m)) w⊤Gtwt + (1/(2βm)) ∥w∥ · 1⊤Gtwt    (Lemma 5)
        = (∥w∥/(β√m)) · ((1/(2√m))1 − w/∥w∥)⊤ Gtwt,

which concludes the proof. □

Lemma 7. Under the assumptions of Theorem 1, if x∗ ∈ X∗ is such that 1⊤f(x∗) ≤ 1⊤f(xt) for
all t, then

    (1/T) Σ_{t=0}^{T−1} wt⊤(f(xt) − f(x∗)) ≤ (1/T) (1⊤(f(x0) − f(x∗)) + (β√m/2) ∥x0 − x∗∥²).    (16)

Proof. We first bound, for any t ≥ 0, 1⊤(f(xt+1) − f(xt)) as follows:

    1⊤(f(xt+1) − f(xt))
        ≤ −(1/(2β√m)) · 1⊤Gtwt    (Lemma 6 with w = 1)
        ≤ −(1/(2β√m)) · wt⊤Gtwt.    (Lemma 5)

Summing this over t = 0, . . . , T − 1 yields

    (1/(2β√m)) Σ_{t=0}^{T−1} wt⊤Gtwt
        ≤ Σ_{t=0}^{T−1} 1⊤(f(xt) − f(xt+1))
        = 1⊤(f(x0) − f(xT))    (telescoping sum)
        ≤ 1⊤(f(x0) − f(x∗)).    (assumption 1⊤f(x∗) ≤ 1⊤f(xT))    (17)

Since 0 ≤ wt,

    wt⊤(f(xt) − f(x∗))
        ≤ wt⊤Jt(xt − x∗)    (Lemma 1)
        = (1/η) (xt − xt+1)⊤(xt − x∗)    (xt+1 = xt − ηJt⊤wt)
        = (1/(2η)) (∥xt − xt+1∥² + ∥xt − x∗∥² − ∥xt+1 − x∗∥²)    (parallelogram law)
        = (1/(2β√m)) wt⊤Gtwt + (β√m/2) (∥xt − x∗∥² − ∥xt+1 − x∗∥²).    (η = 1/(β√m))

Summing this over t = 0, . . . , T − 1 yields

    Σ_{t=0}^{T−1} wt⊤(f(xt) − f(x∗))
        ≤ (1/(2β√m)) Σ_{t=0}^{T−1} wt⊤Gtwt + (β√m/2) (∥x0 − x∗∥² − ∥xT − x∗∥²)    (telescoping sum)
        ≤ (1/(2β√m)) Σ_{t=0}^{T−1} wt⊤Gtwt + (β√m/2) ∥x0 − x∗∥²
        ≤ 1⊤(f(x0) − f(x∗)) + (β√m/2) ∥x0 − x∗∥².    (by (17))

Scaling down this inequality by T yields

    (1/T) Σ_{t=0}^{T−1} wt⊤(f(xt) − f(x∗)) ≤ (1/T) (1⊤(f(x0) − f(x∗)) + (β√m/2) ∥x0 − x∗∥²),

which concludes the proof. □

We are now ready to prove Theorem 1.

Proof. For all t ≥ 0, let Jt = Jf(xt), Gt = JtJt⊤. Then

    xt+1 = xt − ηAUPGrad(Jt) = xt − ηJt⊤wt.

Substituting w = 1 in the term (1/(2√m))1 − w/∥w∥ of Lemma 6 yields

    (1/(2√m))1 − 1/∥1∥ = (1/(2√m))1 − (1/√m)1 = −(1/(2√m))1 < 0.

Therefore there exists some ε > 0 such that any w ∈ Rm with ∥1 − w∥ < ε satisfies (1/(2√m))1 < w/∥w∥.
Denote by Bε(1) = {w ∈ Rm : ∥1 − w∥ < ε}, i.e. for all w ∈ Bε(1), (1/(2√m))1 < w/∥w∥. By the
non-conflicting property of AUPGrad, 0 ≤ Gtwt and therefore for all w ∈ Bε(1),

    w⊤(f(xt+1) − f(xt))
        ≤ (∥w∥/(β√m)) · ((1/(2√m))1 − w/∥w∥)⊤ Gtwt    (Lemma 6)
        ≤ 0.

Since w⊤f(xt) is bounded and non-increasing, it converges. Since Bε(1) contains a basis of Rm,
f(xt) converges to some f∗ ∈ Rm. By assumption on f, there exists x∗ in the Pareto set such that
f(x∗) ≤ f∗.
We now prove that f(x∗) = f∗. Since f(x∗) ≤ f∗, it is sufficient to show that 1⊤(f∗ − f(x∗)) ≤ 0.
First, the additional assumption of Lemma 7 applies since 1⊤f(xt) decreases to 1⊤f∗, which is
larger than 1⊤f(x∗). Therefore

    1⊤(f∗ − f(x∗))
        ≤ ((1/T) Σ_{t=0}^{T−1} wt)⊤ (f∗ − f(x∗))    (f(x∗) ≤ f∗ and 1 ≤ wt by (15a))
        = (1/T) Σ_{t=0}^{T−1} wt⊤(f∗ − f(xt) + f(xt) − f(x∗))
        = (1/T) (Σ_{t=0}^{T−1} wt⊤(f∗ − f(xt)) + Σ_{t=0}^{T−1} wt⊤(f(xt) − f(x∗)))
        ≤ (1/T) (Σ_{t=0}^{T−1} wt⊤(f∗ − f(xt)) + 1⊤(f(x0) − f(x∗)) + (β√m/2) ∥x0 − x∗∥²).    (Lemma 7)

Taking the limit as T → ∞, we get

    1⊤(f∗ − f(x∗))
        ≤ lim_{T→∞} (1/T) Σ_{t=0}^{T−1} wt⊤(f∗ − f(xt))
        ≤ lim_{T→∞} (1/T) Σ_{t=0}^{T−1} ∥wt∥ · ∥f∗ − f(xt)∥    (Cauchy-Schwarz inequality)
        = 0,    (wt bounded and f(xt) → f∗)

which concludes the proof. □

B Properties of existing aggregators
In the following, we prove the properties of the aggregators from Table 1. Some aggregators, e.g.
ARGW , AGradDrop and APCGrad , are non-deterministic and are thus not technically functions but rather
random variables whose distribution depends on the matrix J ∈ Rm×n to aggregate. Still, the
properties of Section 2.2 can be easily adapted to a random setting. If A is a random aggregator,
then for any J, A(J) is a random vector in Rn . The aggregator is non-conflicting if A(J) is in the
dual cone of the rows of J with probability 1. It is linear under scaling if for all J ∈ Rm×n , there is
a – possibly random – matrix J̃ ∈ Rm×n, such that for all 0 < c ∈ Rm, A(diag(c) · J) = J̃⊤ · c.
Finally, A is weighted if for any J ∈ Rm×n there is, with probability 1, some weighting w ∈ Rm
such that A(J) = J ⊤ · w.

B.1 Mean
AMean simply averages the rows of the input matrix, i.e. for all J ∈ Rm×n, AMean(J) = (1/m) J⊤ · 1.
It is weighted with constant weighting equal to (1/m) 1. For any c ∈ Rm, AMean(diag(c) · J) =
(1/m) J⊤ · c, which is linear in c. AMean is therefore linear under scaling. AMean([−2 4]⊤) = [1],
which conflicts with [−2], so AMean is not non-conflicting.

B.2 MGDA

The optimization algorithm presented in [6], called MGDA, is tied to a particular method for
aggregating the gradients. We thus refer to this aggregator as AMGDA . The dual problem of this
method was also introduced independently in [5]. We show the equivalence between the two solutions
to make the analysis of AMGDA easier.
Let J ∈ Rm×n . The aggregation described in [6] is defined by the weighting that is a solution to
    arg min_{0 ≤ w : 1⊤w = 1} ∥J⊤w∥².    (18)

By construction, AMGDA is thus weighted.
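For illustration, the weighting in (18) can be obtained with a generic QP solver (our own sketch using qpsolvers; the solver choice and the small ridge are assumptions, and this is not the reference MGDA implementation):

```python
import numpy as np
from qpsolvers import solve_qp

def mgda_weights(J, ridge=1e-9, solver="quadprog"):
    """Solve (18): minimize ||J^T w||^2 subject to 0 <= w and 1^T w = 1."""
    m = J.shape[0]
    gramian = J @ J.T + ridge * np.eye(m)
    return solve_qp(
        P=2.0 * gramian, q=np.zeros(m),
        A=np.ones((1, m)), b=np.array([1.0]),   # simplex constraint 1^T w = 1
        lb=np.zeros(m),                         # 0 <= w
        solver=solver,
    )

J = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])   # rows (2, 0), (0, 2) and (a, a) with a = 0.5
w = mgda_weights(J)
print(J.T @ w)   # A_MGDA(J) is approximately [0.5, 0.5], matching the example below with a = 0.5
```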


In Equation (3) of [5], the following problem is studied
    min_{α ∈ R, x ∈ Rn : Jx ≤ α1} α + (1/2)∥x∥².    (19)

We show that the problems in (18) and (19) are dual to each other. Furthermore, the duality gap is
null since this is a convex problem.
The Lagrangian of the problem in (19) is given by L(α, x, µ) = α + (1/2)∥x∥² − µ⊤(α1 − Jx).
Differentiating w.r.t. α and x gives respectively 1 − µ⊤1 and x + J⊤µ. The dual problem is obtained
by setting those two to 0 and then maximizing the Lagrangian on 0 ≤ µ and α, i.e.

    arg max_{α, 0 ≤ µ : µ⊤1 = 1} α + (1/2)∥J⊤µ∥² − µ⊤(α1 + JJ⊤µ)
    = arg max_{α, 0 ≤ µ : µ⊤1 = 1} α + (1/2)∥J⊤µ∥² − αµ⊤1 − µ⊤JJ⊤µ
    = arg min_{0 ≤ µ : µ⊤1 = 1} (1/2)∥J⊤µ∥².

Therefore, (18) and (19) are equivalent, with x = −J ⊤ w.


Observe that since in (19), α = 0 and x = 0 is feasible, the objective is non-positive and therefore
α ≤ 0. Substituting x = −J⊤w in J · x ≤ α1 ≤ 0 yields 0 ≤ JJ⊤w, i.e. 0 ≤ J · AMGDA(J), so
AMGDA is non-conflicting.

With J ∈ R3×2 whose rows are (2, 0), (0, 2) and (a, a): if 0 ≤ a ≤ 1, AMGDA(J) = [a a]⊤. However,
if a ≥ 1, AMGDA(J) = [1 1]⊤. This is not affine in a, so AMGDA is not linear under scaling. In
particular, if any row of J is 0, AMGDA(J) = 0. This implies that the optimization will stop whenever
one objective has converged.

B.3 DualProj

The projection of a gradient of interest onto a dual cone was first described in [33]. When this
gradient is the average of the rows of the Jacobian, we call this aggregator ADualProj. Formally,
ADualProj(J) = (1/m) · πJ(J⊤ · 1). By construction, ADualProj is thus non-conflicting. By Proposition 1,
ADualProj(J) = (1/m) J⊤ · w, with w ∈ arg min_{1 ≤ v} v⊤JJ⊤v. ADualProj is thus weighted.
With J ∈ R2×2 whose rows are (2, 0) and (−2a, 2a): if a ≥ 1, ADualProj(J) = [0 a]⊤. However, if
0.5 ≤ a ≤ 1, ADualProj(J) = [1 − a  a]⊤. This is not affine in a, so ADualProj is not linear under scaling.

B.4 PCGrad

APCGrad is described in [11]. It projects each gradient onto the orthogonal hyperplane of other
gradients in case of conflict with them, iteratively and in random order. When m ≤ 2, APCGrad is
deterministic and is such that APCGrad = m · AUPGrad. Therefore, in this case, it satisfies all three
properties. When m > 2, APCGrad is non-deterministic, so APCGrad(J) is a random vector.
For any index i ∈ [m], let gi = J⊤ · ei and let P(i) be distributed uniformly on the permutations
of the elements in [m] \ {i}. For instance, if m = 3, P(2) = [1 3] with probability 0.5 and
P(2) = [3 1] with probability 0.5. The iterative projection of APCGrad is then defined recursively
as:

    gPC_i,1 = gi
    jk = P(i)k
    gPC_i,k+1 = gPC_i,k − 1{gPC_i,k · gjk < 0} (gPC_i,k · gjk / ∥gjk∥²) gjk    (20)

We noticed that an equivalent formulation to the conditional projection of (20) is the projection onto
the dual cone of {gjk}:

    gPC_i,k+1 = π_{gjk⊤}(gPC_i,k)    (21)

Finally, the aggregation is given by APCGrad(J) = Σ_{i=1}^{m} gPC_i,m.
For all i, gPC_i,m is always a linear combination of rows of J. APCGrad is thus weighted.
With J ∈ R3×2 whose rows are (1, 0), (0, 1) and (−0.5, −1), the only non-conflicting direction is 0.
However, APCGrad(J) is uniform over the set {[0.4 0.2]⊤, [0.8 0.2]⊤, [0.4 −0.2]⊤, [0.8 −0.2]⊤}.
APCGrad is thus not non-conflicting. Note that here, E[APCGrad(J)] = [0.6 0]⊤, so APCGrad is not
non-conflicting in expectation either.
To show that APCGrad is linear under scaling, let 0 < c ∈ Rm, g′i = ci gi, g′PC_i,1 = g′i and
g′PC_i,k+1 = π_{g′jk⊤}(g′PC_i,k). We show by induction that g′PC_i,k = ci gPC_i,k.
The base case is given by g′PC_i,1 = g′i = ci gi = ci gPC_i,1.
Then, assuming the induction hypothesis g′PC_i,k = ci gPC_i,k, we show g′PC_i,k+1 = ci gPC_i,k+1:

    g′PC_i,k+1 = π_{cjk gjk⊤}(ci gPC_i,k)    (induction hypothesis)
               = ci π_{gjk⊤}(gPC_i,k)    (0 < ci and 0 < cjk)
               = ci gPC_i,k+1.    (by (21))

Therefore APCGrad(diag(c) · J) = Σ_{i=1}^{m} ci gPC_i,m, so APCGrad is linear under scaling.

B.5 GradDrop

The aggregator used by the GradDrop layer, which we denote AGradDrop, is described in [17]. It is
non-deterministic, so AGradDrop (J) is a random vector. Given J ∈ R^{m×n}, let |J| ∈ R^{m×n} be the
element-wise absolute value of J. Let P ∈ R^n be such that $P = \frac{1}{2}\left(1 + \frac{J^\top \cdot 1}{|J|^\top \cdot 1}\right)$, where the division
is element-wise. Each coordinate i ∈ [n] is independently assigned to the set A with probability P_i
and to the set B otherwise. The aggregation at coordinate i ∈ A is given by the sum of all positive
J_{ji}, for j ∈ [m]. The aggregation at coordinate i ∈ B is given by the sum of all negative J_{ji}, for
j ∈ [m].
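For illustration, this aggregation can be sketched as follows. This is our own illustrative code applied to the parameter Jacobian; the original GradDrop layer operates on an internal activation.

```python
import numpy as np

def graddrop_aggregate(J: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    col_sum = J.sum(axis=0)                   # J^T 1
    abs_sum = np.abs(J).sum(axis=0)           # |J|^T 1
    ratio = np.divide(col_sum, abs_sum, out=np.zeros_like(col_sum), where=abs_sum > 0)
    P = 0.5 * (1.0 + ratio)                   # probability of a coordinate going to A
    in_A = rng.random(J.shape[1]) < P
    positive_part = np.where(J > 0, J, 0.0).sum(axis=0)  # sum of positive entries
    negative_part = np.where(J < 0, J, 0.0).sum(axis=0)  # sum of negative entries
    return np.where(in_A, positive_part, negative_part)

rng = np.random.default_rng(0)
print(graddrop_aggregate(np.array([[-2.0], [1.0]]), rng))  # [-2] w.p. 2/3, [1] w.p. 1/3
```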
 
If J = [−2 1]^⊤, then P = [1/3]. Therefore, P[AGradDrop (J) = [−2]] = 2/3 and P[AGradDrop (J) = [1]] = 1/3. AGradDrop is thus not non-conflicting (even in expectation).

AGradDrop is not linear under scaling. Indeed, suppose that $J = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$; then P = (1/2) · 1 and the aggregation is one of the four vectors [±1 ±1]^⊤ with equal probability. Scaling the first row of J by 2 yields $J = \begin{bmatrix} 2 & -2 \\ -1 & 1 \end{bmatrix}$ and P = [2/3 1/3]^⊤, which cannot lead to a uniform distribution over four elements.

With $J = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$, the span of J does not include [1 1]^⊤ nor [−1 −1]^⊤. Therefore, AGradDrop is not weighted.

B.6 IMTL-G

In [13], the authors describe a method to impartially balance gradients by weighting them. Let g_i be
the i'th row of J and let u_i = g_i / ∥g_i∥. Then, they want to find a combination g = Σ_{i=1}^m α_i g_i such that
g^⊤ u_i is equal for all i. Let U = [u_1 − u_2 . . . u_1 − u_m]^⊤ and D = [g_1 − g_2 . . . g_1 − g_m]^⊤. If
α_{2:m} = [α_2 . . . α_m]^⊤, then α_{2:m} = (U D^⊤)^{−1} U · g_1 and α_1 = 1 − Σ_{i=2}^m α_i. Notice that this
is defined only when the gradients are linearly independent. Strictly speaking, this is thus not an
aggregator since it can only be computed on matrices of rank m. To generalize to the case where J has
rank less than m, we provide what we believe to be a more natural formulation, which is equivalent
in the case where J has rank m. Let J ∈ R^{m×n} and let J′ ∈ R^{m×n} be the row-normalized version of J.
Reformulating the desired property that g^⊤ u_i is constant over i, in our framework, we solve for the
following objective. Let w ∈ R^m be such that 1^⊤ w = 1 and J′ J^⊤ w ∝ 1. The aggregated vector
is then J^⊤ w. If J′ J^⊤ is full rank, then w ∝ (J′ J^⊤)^{−1} 1. Letting d ∈ R^m be the vector of norms
of the rows of J, this can be rewritten as w ∝ (JJ^⊤)^{−1} d. When JJ^⊤ is not full rank, using the
Moore-Penrose inverse instead of the usual inverse seems to be a reasonable generalization. The
vector of weights is then given by w ∝ (JJ^⊤)^† d. Finally, whenever 1^⊤ (JJ^⊤)^† d = 0, we set w = 0.
Throughout this paper, AIMTL-G refers to this generalization.
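A sketch of this generalization (illustrative only, not the official implementation):

```python
import numpy as np

def imtl_g_aggregate(J: np.ndarray) -> np.ndarray:
    d = np.linalg.norm(J, axis=1)          # vector of row norms of J
    w = np.linalg.pinv(J @ J.T) @ d        # w proportional to (J J^T)^+ d
    s = w.sum()                            # scale so that 1^T w = 1
    if np.isclose(s, 0.0):
        return np.zeros(J.shape[1])
    return J.T @ (w / s)

J = np.array([[1.0], [-1.0], [-1.0]])
print(imtl_g_aggregate(J))  # [-3.]; J times this is [-3, 3, 3], conflicting with row 1
```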
Clearly, AIMTL-G is weighted. We now show that AIMTL-G is not non-conflicting. If J = [1 −1 −1]^⊤, then d = [1 1 1]^⊤. Also

$$
JJ^\top = \begin{bmatrix} 1 & -1 & -1 \\ -1 & 1 & 1 \\ -1 & 1 & 1 \end{bmatrix}
$$

and therefore (JJ^⊤)^† · d = (1/9)[−1 1 1]^⊤. Note that in order to ensure 1^⊤ w = 1, we have to scale by 1^⊤ (JJ^⊤)^† · d = 1/9. Therefore w = [−1 1 1]^⊤, AIMTL-G (J) = J^⊤ w = [−3] and J · AIMTL-G (J) = [−3 3 3]^⊤. AIMTL-G is thus not non-conflicting.
It should be noted that when J has rank m, AIMTL-G seems to be non-conflicting. Thus, it would
be possible to define a different non-conflicting generalization, for instance by returning 0 whenever
the matrix is not of full rank.
AIMTL-G is not linear under scaling. If 0 < c ∈ R^m, then the weights associated to diag(c) · J are
proportional to (diag(c) · JJ^⊤ · diag(c))^† · diag(c) · d = diag(c)^† · (JJ^⊤)^† · d. The normalization
value is therefore $\frac{1}{1^\top \operatorname{diag}(c)^\dagger (JJ^\top)^\dagger d}$. Therefore

$$
\begin{aligned}
A_{\text{IMTL-G}}(\operatorname{diag}(c) \cdot J) &= \frac{J^\top \cdot \operatorname{diag}(c) \cdot \operatorname{diag}(c)^\dagger \cdot (JJ^\top)^\dagger \cdot d}{1^\top \operatorname{diag}(c)^\dagger \cdot (JJ^\top)^\dagger \cdot d} \\
&= \frac{J^\top (JJ^\top)^\dagger \cdot d}{1^\top \operatorname{diag}(c)^\dagger \cdot (JJ^\top)^\dagger \cdot d}
\end{aligned}
$$

This is not linear in c, so AIMTL-G is not linear under scaling.

B.7 CAGrad

ACAGrad is described in [12]. It is parameterized by c ∈ [0, 1[. If c = 0, it is equivalent to AMean.
Therefore, we restrict our analysis to the case c > 0. For any J ∈ R^{m×n}, let ḡ be the average gradient
(1/m) J^⊤ · 1. The aggregated vector is defined as the solution d ∈ R^n of the following optimization
problem

$$
\arg\max_{\substack{d \in \mathbb{R}^n:\\ \lVert d - \bar g\rVert \leq c\lVert \bar g\rVert}} \; \min_{i \in [m]} e_i^\top J d
$$

 
For 0 < c < 1, ACAGrad can have conflict. If $J = \begin{bmatrix} 2 & 0 \\ -2a-2 & 2 \end{bmatrix}$, then ḡ = [−a 1]^⊤ and
∥ḡ∥ = √(a² + 1). Observe that any d ∈ R^n satisfying the constraint ∥d − ḡ∥ ≤ c∥ḡ∥ has first
coordinate at most −a + c√(a² + 1). Suppose that a is such that −a + c√(a² + 1) < 0, i.e. $\sqrt{\frac{c^2}{1-c^2}} < a$.
For such an a, any feasible d has a negative first coordinate, making it conflict with the first row of J.
ACAGrad is thus not non-conflicting.
Note that if we generalize to the case c ≥ 1, as suggested in the original paper, then d = 0 becomes
feasible, which yields min_{i∈[m]} e_i^⊤ Jd = 0. Therefore the optimal d is such that 0 ≤ min_{i∈[m]} e_i^⊤ Jd,
i.e. 0 ≤ Jd. With c ≥ 1, ACAGrad would thus be non-conflicting.
In [12], they formulate ACAGrad using its dual formulation. If w ∈ R^m is a solution to

$$
\min_{\substack{0 \leq w:\\ 1^\top w = 1}} \; 1^\top JJ^\top w + c \cdot \lVert J^\top 1\rVert \cdot \lVert J^\top w\rVert
$$

then the aggregation is $\frac{1}{m} J^\top \left(1 + \frac{c\lVert J^\top 1\rVert}{\lVert J^\top w\rVert}\, w\right)$. Therefore, ACAGrad is weighted.
 
ACAGrad is not linear under scaling. We sketch a proof of this. Suppose that $J = \begin{bmatrix} 2 & 0 \\ 0 & 2a \end{bmatrix}$. Note that
ḡ = [1 a]^⊤ and ∥ḡ∥ = √(1 + a²). One can show that the constraint ∥d − ḡ∥ ≤ c∥ḡ∥ needs to be
satisfied with equality since, otherwise, we can scale d to make the objective larger. Substituting J
in min_{i∈[m]} e_i^⊤ Jd yields 2 min(d_1, a d_2). For any a such that c√(1 + a²) + 1 < a², it can be shown
that the optimal d is such that d_1 < a d_2. In that case the inner minimum over i is 2 d_1 and, to satisfy
∥d − ḡ∥ = c∥ḡ∥, the KKT conditions over the Lagrangian yield d − ḡ ∝ ∇_d d_1 = [1 0]^⊤. This
yields $d = \begin{bmatrix} c\lVert \bar g\rVert + 1 \\ a \end{bmatrix} = \begin{bmatrix} c\sqrt{1+a^2} + 1 \\ a \end{bmatrix}$. This is not affine in a; therefore, ACAGrad is not linear
under scaling.

B.8 RGW

ARGW is defined in [14] as the weighted sum of the rows of the input matrix, with a random weighting.
The weighting is obtained by sampling m i.i.d. normally distributed random variables and applying a
softmax. Formally, ARGW (J) = J ⊤ · σ(W ), with W ∼ N (0, I). By design, ARGW is thus weighted.
 
When J = [1 −1]^⊤, the only non-conflicting solution is 0. However, P[ARGW (J) = 0] = 0. ARGW is thus not non-conflicting.
ARGW (diag(c) · J) = (diag(c) · J)⊤ · σ(W ) = J ⊤ · diag(c) · σ(W ) = J ⊤ · diag(σ(W )) · c.
Therefore, ARGW is linear under scaling in distribution.

B.9 Nash-MTL

Nash-MTL is described in [15]. Unfortunately, we were not able to verify the proof of Claim 3.1,
and we believe that the official implementation of Nash-MTL may mismatch the desired objective by
which it is defined. Therefore, we only analyze the initial objective even though our experiments for
this aggregator are conducted with the official implementation.
Let J ∈ R^{m×n} and ε > 0. Let also B_ε = {d ∈ R^n : ∥d∥ ≤ ε, 0 ≤ Jd}. ANash-MTL is then defined as

$$
A_{\text{Nash-MTL}}(J) \in \arg\max_{d \in B_\varepsilon} \sum_{i=1}^{m} \log(e_i^\top J d)
$$

By the constraint, ANash-MTL is non-conflicting. If an aggregator A is linear under scaling, it should
be the case that A(aJ) = aA(J) for any scalar a > 0 and any J ∈ R^{m×n}. However, log(a e_i^⊤ Jd) =
log(e_i^⊤ Jd) + log(a). This means that scaling by a scalar does not impact the aggregation. Since this is
not the trivial 0 aggregator, ANash-MTL is not linear under scaling.
If d′ is the projection of d onto the span of the rows of J, then ∥d′∥ ≤ ∥d∥ and Jd = Jd′. Hence,
d′ is feasible and has the same objective value. Therefore, without loss of generality,
ANash-MTL is weighted.

B.10 Aligned-MTL

The Aligned-MTL method for balancing the Jacobian is described in [16]. For simplicity, we fix
the vector of preferences to (1/m) · 1, but the proofs can be adapted to any non-trivial vector. Given
J ∈ R^{m×n}, compute the eigendecomposition of G = JJ^⊤ = VΣ²V^⊤. Let Σ† be the diagonal
matrix whose non-zero elements are the inverses of the corresponding non-zero diagonal elements of Σ.
Let σ_min = min_{i∈[m], Σ_ii≠0} Σ_ii. Then the aggregation by AAligned-MTL is equal to (1/m) J^⊤ · w, with
w = σ_min · V Σ† V^⊤ · 1. It follows immediately that AAligned-MTL is weighted.
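A sketch of this computation (illustrative only, not the official implementation; it assumes J ≠ 0):

```python
import numpy as np

def aligned_mtl_aggregate(J: np.ndarray) -> np.ndarray:
    m = J.shape[0]
    eigvals, V = np.linalg.eigh(J @ J.T)            # G = V diag(eigvals) V^T
    sigma = np.sqrt(np.clip(eigvals, 0.0, None))    # singular values of J
    nonzero = sigma > 1e-12
    sigma_inv = np.zeros_like(sigma)
    sigma_inv[nonzero] = 1.0 / sigma[nonzero]       # diagonal of Sigma^dagger
    sigma_min = sigma[nonzero].min()
    w = sigma_min * V @ (sigma_inv * (V.T @ np.ones(m)))
    return (1.0 / m) * J.T @ w

J = np.array([[1.0, 0.0], [0.0, 0.5]])
print(aligned_mtl_aggregate(J))   # [0.25, 0.25], i.e. (a/2) * 1 for a = 0.5
```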
If the SVD of J is VΣU^⊤, then J^⊤w = σ_min U P V^⊤ 1, with P = Σ†Σ a diagonal projection matrix with 1s corresponding to non-zero elements of Σ and 0s everywhere else. Further, J · AAligned-MTL (J) = (σ_min/m) · VΣV^⊤ 1. Taking $V = \frac{1}{2}\begin{bmatrix} \sqrt{3} & 1 \\ -1 & \sqrt{3} \end{bmatrix}$ and $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$ yields $J \cdot A_{\text{Aligned-MTL}}(J) = \frac{1}{2} \cdot V\Sigma V^\top 1 = \frac{1}{8}\begin{bmatrix} 3-\sqrt{3} & 1-\sqrt{3} \end{bmatrix}^\top$, which is not non-negative. AAligned-MTL is thus not non-conflicting.
 
If $J = \begin{bmatrix} 1 & 0 \\ 0 & a \end{bmatrix}$, then U = V = I and Σ = J. For 0 < a ≤ 1, σ_min = a, therefore AAligned-MTL (J) = (a/2) · 1. For 1 < a, σ_min = 1 and therefore AAligned-MTL (J) = (1/2) · 1, which makes AAligned-MTL not linear under scaling.

C Experimental settings
For all of our experiments, we used PyTorch [55]. We have developed an open-source library⁴ on
top of it to enable Jacobian descent easily. This library is designed to be reusable for many use
cases beyond the experiments presented in our work. To keep them separate from the library, the experiments
were conducted in a different code repository⁵, mainly using PyTorch and our library.

C.1 Learning rate selection

The learning rate has a major impact on the speed of optimization. To make the comparisons
as fair as possible, we always show the results corresponding to the best learning rate. We have
selected the area under the loss curve as the criterion to compare learning rates. This choice
is arbitrary but seems to work well in practice: a lower area under the loss curve means that the
optimization is fast (quick loss decrease) and stable (few bumps in the loss curve). Concretely, for
each random rerun and for each aggregator, we first try 22 learning rates from 10^−5 to 10^2, increasing
by a factor 10^{1/3} every time. The two best learning rates from this range then define a refined range of
plausible good learning rates, going from the smaller of those two multiplied by 10^{−1/3} to the larger
of those two multiplied by 10^{1/3}. This margin makes it unlikely for the best learning rate to lie outside
the refined range. After this, 50 learning rates from the refined range are tried. These learning rates
are evenly spaced in the exponent domain. The one with the best area under the loss curve is then
selected and presented in the plots. Note that for simplicity, we always used a constant learning
rate, i.e. no learning rate scheduler was used.
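As an illustration, the two grids can be generated as follows. This is a sketch; the function names are ours, and the training plus area-under-the-loss-curve evaluation are assumed to happen elsewhere.

```python
import numpy as np

def coarse_grid() -> np.ndarray:
    # 22 learning rates from 1e-5 to 1e2, each 10^(1/3) times the previous one.
    return 10.0 ** (np.arange(22) / 3 - 5)

def refined_grid(best_lr: float, second_best_lr: float, num: int = 50) -> np.ndarray:
    # 50 learning rates, evenly spaced in the exponent domain, spanning the two best
    # coarse learning rates with a margin of 10^(1/3) on each side.
    low = min(best_lr, second_best_lr) * 10 ** (-1 / 3)
    high = max(best_lr, second_best_lr) * 10 ** (1 / 3)
    return np.logspace(np.log10(low), np.log10(high), num)
```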
We think that this approach has the advantage of being simple and precise, thus giving
trustworthy results. However, it requires a total of 72 trainings for each aggregator, each random
rerun and each dataset. For this reason, we opted to work on small subsets of the original
datasets.

C.2 Random reruns and standard error of the mean

To get an idea of confidence in our results, every experiment is performed 8 times on a different seed
and on a different subset, of size 1024, of the training dataset. The seed used for run i ∈ [8] is always
simply set to i. Because each random rerun includes the full learning rate selection method described
in Appendix C.1, it is sensible to consider the 8 sets of results as i.i.d. For each point of both the loss
curves and the cosine similarity curves, we thus compute the estimated standard error of the mean
with the usual formula $\frac{1}{\sqrt{8}}\sqrt{\frac{\sum_{i=1}^{8}(v_i - \bar{v})^2}{8-1}}$, where $v_i$ is the value of a point of the curve for random
rerun i, and $\bar{v}$ is the average value of this point over the 8 runs.
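In code, for the 8 values of one curve point, this is simply:

```python
import numpy as np

def standard_error_of_mean(v: np.ndarray) -> float:
    n = len(v)                                          # n = 8 random reruns
    sample_var = np.sum((v - v.mean()) ** 2) / (n - 1)  # unbiased variance estimate
    return np.sqrt(sample_var) / np.sqrt(n)
```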

C.3 Model architectures

In all experiments, the models are simple convolutional neural networks. All convolutions always
have a stride of 1x1, a kernel size of 3x3, a learnable bias and no padding. All linear layers always
have a learnable bias. The activation function is the exponential linear unit [56]. The full architectures
are given in Tables 2, 3, 4 and 5. Note that these architectures have been fixed arbitrarily, i.e. they
were not optimized through some hyper-parameter selection. The weights of the model have been
initialized with the default initialization scheme of PyTorch.

C.4 Optimizer

For all experiments except those described in Appendix D.3, we always use the basic SGD optimizer
of PyTorch, without any regularization or momentum. Note that here, SGD refers to the PyTorch
optimizer that updates the parameters of the model in the opposite direction of the gradient, which, in
our case, is replaced by the aggregation of the Jacobian matrix. In the rest of this paper, SGD refers
to the whole stochastic gradient descent algorithm. In the experiments of Appendix D.3, we use Adam
instead, to study its interactions with JD.
⁴ Available at https://github.com/TorchJD/torchjd
⁵ Available at https://github.com/ValerianRey/jde

Table 2: Architecture used for SVHN
Conv2d (3 input channels, 16 output channels, 1 group), ELU
Conv2d (16 input channels, 32 output channels, 16 groups)
MaxPool2d (stride of 2x2, kernel size of 2x2), ELU
Conv2d (32 input channels, 32 output channels, 32 groups)
MaxPool2d (stride of 3x3, kernel size of 3x3), ELU, Flatten
Linear (512 input features, 64 output features), ELU
Linear (64 input features, 10 outputs)

Table 3: Architecture used for CIFAR-10


Conv2d (3 input channels, 32 output channels, 1 group), ELU
Conv2d (32 input channels, 64 output channels, 32 groups)
MaxPool2d (stride of 2x2, kernel size of 2x2), ELU
Conv2d (64 input channels, 64 output channels, 64 groups)
MaxPool2d (stride of 3x3, kernel size of 3x3), ELU, Flatten
Linear (1024 input features, 128 output features), ELU
Linear (128 input features, 10 outputs)

Table 4: Architecture used for EuroSAT


Conv2d (3 input channels, 32 output channels, 1 group)
MaxPool2d (stride of 2x2, kernel size of 2x2), ELU
Conv2d (32 input channels, 64 output channels, 32 groups)
MaxPool2d (stride of 2x2, kernel size of 2x2), ELU
Conv2d (64 input channels, 64 output channels, 64 groups)
MaxPool2d (stride of 3x3, kernel size of 3x3), ELU, Flatten
Linear (1024 input features, 128 output features), ELU
Linear (128 input features, 10 outputs)

Table 5: Architecture used for MNIST, Fashion-MNIST and Kuzushiji-MNIST


Conv2d (1 input channel, 32 output channels, 1 group), ELU
Conv2d (32 input channels, 64 output channels, 1 group)
MaxPool2d (stride of 2x2, kernel size of 2x2), ELU
Conv2d (64 input channels, 64 output channels, 1 group)
MaxPool2d (stride of 3x3, kernel size of 3x3), ELU, Flatten
Linear (576 input features, 128 output features), ELU
Linear (128 input features, 10 outputs)

C.5 Loss function

The loss function is always the usual cross-entropy, with the default parameters of PyTorch.

C.6 Preprocessing

The inputs are always normalized per-channel based on the mean and standard deviation computed
on the entire training split of the dataset.
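A sketch of this preprocessing, assuming the images are stored as a float tensor of shape (N, C, H, W):

```python
import torch

def normalize(images: torch.Tensor, train_images: torch.Tensor) -> torch.Tensor:
    # Per-channel mean and standard deviation computed on the entire training split.
    mean = train_images.mean(dim=(0, 2, 3), keepdim=True)
    std = train_images.std(dim=(0, 2, 3), keepdim=True)
    return (images - mean) / std
```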

C.7 Iterations and computational budget

The numbers of epochs and the corresponding numbers of iterations for all datasets are provided in
Table 6, along with the number of NVIDIA L4 GPU-hours required to run all 72 learning rates for
the 11 aggregators on a single seed. The total computational budget to run the main experiments on
8 seeds was thus around 760 GPU-hours. Additionally, we have used a total of about 100 GPU-hours
for the experiments varying the batch size and using Adam, and about 200 more GPU-hours were
used for early investigations.

Table 6: Numbers of epochs, iterations and GPU-hours for each dataset


Dataset Epochs Iterations GPU-Hours
SVHN 25 800 17
CIFAR-10 20 640 15
EuroSAT 30 960 32
MNIST 8 256 6
Fashion-MNIST 25 800 17
Kuzushiji-MNIST 10 320 8

D Additional experimental results
In this appendix, we provide additional experimental results about IWRM.

D.1 All datasets and all aggregators

Figures 3, 4, 5, 6, 7 and 8 show the full results of the experiments described in Section 5 on SVHN,
CIFAR-10, EuroSAT, MNIST, Fashion-MNIST and Kuzushiji-MNIST, respectively. For readability,
the results are displayed on three different plots for each dataset. We always show AUPGrad and AMean
for reference. The exact experimental settings are described in Appendix C.
It should be noted that some of these aggregators were not developed as general-purpose aggregators,
but mostly for the use case of multi-task learning, with one gradient per task. Our experiments present
a more challenging setting than multi-task learning optimization, because conflict between rows of
the Jacobian is typically higher. Besides, for some aggregators, e.g. AGradDrop and AIMTL-G, it was
advised to aggregate the gradients w.r.t. an internal activation (such as the last shared
representation) rather than w.r.t. the parameters of the model [13, 17]. To enable comparison, we
instead always aggregated the Jacobian w.r.t. all parameters.
We can see that AUPGrad provides a significant improvement over AMean on all datasets. Moreover, the
performance gaps seem to be linked to the difficulty of the dataset, which suggests that experimenting
with harder tasks is a promising future direction. The intrinsic randomness of ARGW and AGradDrop
reduces the train set performance, but it could positively impact the generalization, which we do not
study here. We suspect the disappointing results of ANash-MTL to be caused by issues in the official
implementation that we used, leading to instability.

D.2 Varying the batch size

Figure 9 shows the results on CIFAR-10 with AUPGrad when varying the batch size from 4 to 64.
Concretely, because we are using SSJD, this makes the number of rows of the sub-Jacobian aggregated
at each step vary from 4 to 64. Recall that IWRM with SSJD and AMean is equivalent to ERM with
SGD. We see that with a small batch size, in this setting, AUPGrad becomes very similar to AMean .
This is not surprising, since with a batch size of 1, both would be equivalent. Conversely, a larger
batch size increases the gap between AUPGrad and AMean . Since the projections of AUPGrad are onto
the dual cone of more rows, each step becomes non-conflicting with respect to more of the original
1024 objectives, pushing even further the benefits of the non-conflicting property. In other words,
increasing the batch size refines the dual cone, thereby improving the quality of the projections. It
would be interesting to study more theoretically the impact of batch size in this setting.

D.3 Compatibility with Adam

Figure 10 gives the results on CIFAR-10 and SVHN when using Adam rather than the SGD optimizer.
Concretely, this corresponds to the Adam algorithm in which the gradient is replaced by the aggregation of the Jacobian. The learning rate is still tuned as described in Appendix C.1, but the other
hyperparameters of Adam are fixed to the default values of PyTorch, i.e. β1 = 0.9, β2 = 0.999 and
ϵ = 10−8 . Because optimization with Adam is faster, the number of epochs for SVHN and CIFAR-10
is reduced to 20 and 15, respectively. While the performance gap is smaller with this optimizer, it is
still significant and suggests that our methods are beneficial with other optimizers than the simple
SGD. Note that this analysis is fairly superficial. The thorough investigation of the interplay between
aggregators and momentum-based optimizers is a compelling future research direction.

[Figure 3: SVHN results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 4: CIFAR-10 results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 5: EuroSAT results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 6: MNIST results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 7: Fashion-MNIST results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 8: Kuzushiji-MNIST results. Three pairs of panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update over the training iterations. Each pair compares Mean (SGD) and UPGrad (ours) with, respectively, MGDA, Aligned-MTL, PCGrad and DualProj; RGW and GradDrop; and IMTL-G, Nash-MTL and CAGrad (c=0.5).]

[Figure 9: CIFAR-10 results with different batch sizes (BS). The number of epochs is always 20, so the number of iterations varies. For BS = 4, 16 and 64, the panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update, comparing Mean (SGD) and UPGrad (ours).]

[Figure 10: Results with the Adam optimizer. For SVHN and CIFAR-10, the panels show the training loss (categorical cross-entropy) and the cosine similarity of the update to the SGD update, comparing Mean and UPGrad (ours).]

E Computation time
In this appendix, we compare the computation time of SGD with that of SSJD for all the aggregators
that we experimented with. Since we used the same architecture for MNIST, Fashion-MNIST and
Kuzushiji-MNIST, we only report the results for one of them. It is important to clarify all of the
factors that affect this computation time. First, the batch size affects the number of rows in the
Jacobian to aggregate. Increasing the batch size thus requires more GPU memory and the aggregation
of a taller matrix. Then, some aggregators, e.g. ANash-MTL and AMGDA , seem to greatly increase the
run time. When the aggregation is the bottleneck, a faster implementation will be necessary to make
them usable in practice. Lastly, the current implementation of our library for JD with PyTorch is
still fairly inefficient in terms of memory management, which in turn limits how well the GPU can
parallelize. Therefore, these results only give a rough indication of the current computation times;
they are in no way final.

Table 7: Time required in seconds for one epoch of training with SGD and different instances of
SSJD, on an NVIDIA L4 GPU. The batch size is always 32.
Method SVHN CIFAR-10 EuroSAT MNIST
SGD 0.79 0.50 0.81 0.47
SSJD + AMean 1.41 1.76 2.93 1.64
SSJD + AMGDA 5.50 5.22 6.91 5.22
SSJD + ADualProj 1.51 1.88 3.02 1.76
SSJD + APCGrad 2.78 3.13 4.18 3.01
SSJD + AGradDrop 1.57 1.90 3.06 1.78
SSJD + AIMTL-G 1.48 1.79 2.94 1.69
SSJD + ACAGrad 1.93 2.26 3.42 2.17
SSJD + ARGW 1.42 1.76 2.89 1.73
SSJD + ANash-MTL 7.88 8.12 9.33 7.91
SSJD + AAligned-MTL 1.53 1.98 2.97 1.71
SSJD + AUPGrad 1.80 2.01 3.21 1.90

F Forward propagation of the Gramian through some typical layers
In this appendix, we study the forward propagation of the Gramian, as per (12), through some typical
neural network layers. We start by noting that some computations can be saved when the function
has a block structure.
Let g : R^n × R^ℓ → R^m be a function that maps an input x ∈ R^n and some parameter p ∈ R^ℓ to some
output g(x, p). Applying g independently in parallel yields a function f : R^{n×k} × R^ℓ → R^{m×k},
where k is the number of blocks.⁶ For an input X ∈ R^{n×k} and a parameter p ∈ R^ℓ,

$$
f(X, p) = \begin{bmatrix} | & & | \\ g(X \cdot e_1, p) & \cdots & g(X \cdot e_k, p) \\ | & & | \end{bmatrix}.
$$

⁶ The block size k could be, among others, the batch size or the number of elements of an activation.

In that case, forwarding the Gramian requires the matrices

$$
J_1 f(X, p) = \begin{bmatrix}
J_1 g(X \cdot e_1, p) & 0 & \cdots & 0 \\
0 & J_1 g(X \cdot e_2, p) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & J_1 g(X \cdot e_k, p)
\end{bmatrix}
$$

$$
G_2 f(X, p) = \begin{bmatrix}
J_1 J_1^\top & J_1 J_2^\top & \cdots & J_1 J_k^\top \\
J_2 J_1^\top & J_2 J_2^\top & \cdots & J_2 J_k^\top \\
\vdots & \vdots & \ddots & \vdots \\
J_k J_1^\top & J_k J_2^\top & \cdots & J_k J_k^\top
\end{bmatrix},
$$

where J_i = J_2 g(X · e_i, p).
Because of the block-diagonal structure of J_1 f(X, p), for a given G ∈ M_{nk}, the term
J_1 f(X, p) · G · J_1 f(X, p)^⊤ can be computed in O(k²(mn² + nm²)) operations rather than
the naive O(k³(mn² + nm²)). Appendices F.1, F.2 and F.3 provide considerations about Gramian
forwarding through some usual layers that have a particular block structure. For parameterized layers,
G_2 f(X, p) can sometimes be precomputed efficiently, depending on the layer. More details are given
in Appendix F.4 for a fully connected layer.

F.1 Activation functions

Activation functions generally consider each element of the activation independently. We thus have
m = n = 1, ℓ = 0 and k equal to the number of elements of the activation multiplied by the batch
size. The matrix J_1 f(X) is thus diagonal. The computational complexity of forwarding a Gramian
through such a function is then O(k²). This applies to most usual activation functions, such as ReLU,
ELU, sigmoid, and tanh.
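Concretely, since J_1 f(X) is diagonal, forwarding a Gramian through such a layer reduces to a row and column rescaling. A sketch under the notation of this appendix:

```python
import numpy as np

def forward_gramian_elementwise(G: np.ndarray, d: np.ndarray) -> np.ndarray:
    # d holds the element-wise derivatives of the activation at the current input,
    # i.e. the diagonal of J_1 f(X); the update diag(d) G diag(d) costs O(k^2).
    return d[:, None] * G * d[None, :]
```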

F.2 Softmax layer

In the softmax layer, the group size k is the batch size. Moreover, m = n and ℓ = 0. The
computational complexity of forwarding a Gramian is thus O(k²m³).

F.3 BatchNorm layer

The batch normalization layer [57], in its simplest form (without the learnable affine transformation),
normalizes its input on the batch dimension. Here k is the input size, ℓ = 0, and m = n is the batch
size. In that case, the computational complexity of forwarding a Gramian is O(k²m³).

F.4 Fully connected layer

In a fully connected layer, g : R^n × R^{m×n} × R^m → R^m is given by g(x, W, b) = Wx + b, and f
has a k-blockwise structure with k the batch size. If X ∈ R^{n×k} and G ∈ R^{nk×nk}, then forwarding
those through the layer yields

$$
J_1 f(X, W, b) \cdot G \cdot J_1 f(X, W, b)^\top + G_2 f(X, W, b) + G_3 f(X, W, b)
$$

with

$$
J_1 f(X, W, b) = \begin{bmatrix}
W & 0 & \cdots & 0 \\
0 & W & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & W
\end{bmatrix}
= I_k \otimes W,
$$

where ⊗ is the Kronecker product and I_k the identity matrix of dimension k. In principle,
J_1 f(X, W, b) · G · J_1 f(X, W, b)^⊤ can be computed efficiently.

The Gramians with respect to the parameters are also rather simple. Let x_i = X e_i; then

$$
G_2 f(X, W, b) = \begin{bmatrix}
x_1^\top x_1 I_m & x_1^\top x_2 I_m & \cdots & x_1^\top x_k I_m \\
x_2^\top x_1 I_m & x_2^\top x_2 I_m & \cdots & x_2^\top x_k I_m \\
\vdots & \vdots & \ddots & \vdots \\
x_k^\top x_1 I_m & x_k^\top x_2 I_m & \cdots & x_k^\top x_k I_m
\end{bmatrix}
= (X^\top X) \otimes I_m
$$

$$
G_3 f(X, W, b) = \begin{bmatrix}
I_m & I_m & \cdots & I_m \\
I_m & I_m & \cdots & I_m \\
\vdots & \vdots & \ddots & \vdots \\
I_m & I_m & \cdots & I_m
\end{bmatrix}
= 1_k \otimes I_m,
$$

with 1_k the all-ones matrix of dimension k × k.
Therefore, (12) rewrites as

$$
(I_k \otimes W) \cdot G \cdot (I_k \otimes W)^\top + (X^\top X + 1_k) \otimes I_m.
$$

This illustrates that one should think of G as the quadratic mapping J → JGJ ⊤ , and similarly of J
as the linear mapping G → JGJ ⊤ . Representing those mappings with arrays would be sub-optimal.
Taking advantage of their structure (in this case, the Kronecker product of two matrices) would be
beneficial.
It is worth noting that even though (X^⊤X + 1_k) ⊗ I_m encompasses the dependencies of the Gramian
on the parameters, it does not require any computation over the parameter dimension. If the number
of parameters is large, this can provide a substantial improvement over the naive computation of the
Gramian via the Jacobian.
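The identity G_2 f(X, W, b) = (X^⊤X) ⊗ I_m can be checked numerically; the sketch below assumes a column-major vec(W) convention and is purely illustrative.

```python
import numpy as np

m, n, k = 3, 4, 2
X = np.random.default_rng(0).standard_normal((n, k))

# For one column x_i, the Jacobian of W x_i with respect to vec(W) is x_i^T (x) I_m;
# stacking the k outputs gives the Jacobian of the layer outputs w.r.t. the weights.
J_W = np.vstack([np.kron(X[:, i], np.eye(m)) for i in range(k)])
assert np.allclose(J_W @ J_W.T, np.kron(X.T @ X, np.eye(m)))
```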
