Lecture Notes: Large-Scale and Distributed Optimization
Clément W. Royer
The purpose of these lectures is to review optimization frameworks that make it possible to deploy the methods
seen in other parts of the course at scale. The learning goals are as follows:
• Understand the main principle behind coordinate descent methods, and their appeal in a
distributed environment.
Notations
• Scalars (i.e. reals) are denoted by lowercase letters: a, b, c, α, β, γ.
• The set of natural numbers (nonnegative integers) is denoted by N; the set of integers is
denoted by Z.
• The set of real numbers is denoted by R. Our notations for the subset of nonnegative real
numbers and the set of positive real numbers are R+ and R++ , respectively.
• The notation Rd is used for the set of vectors with d ∈ N real components; although we may
not explicitly indicate it in the rest of these notes, we always assume that d ≥ 1.
• We use Rn×d to denote the set of real rectangular matrices with n rows and d columns, where
n and d will always be assumed to be at least 1. If n = d, Rd×d refers to the set of square
matrices of size d.
• Given a matrix A ∈ Rn×d , Aij refers to the coefficient from the i-th row and the j-th column of
A: the diagonal of A is given by the coefficients Aii . Provided this notation is not ambiguous,
we use the notations A, [A_ij]_{1≤i≤n, 1≤j≤d} and [A_ij] interchangeably.
• For every d ≥ 1, Id refers to the identity matrix in Rd×d (with 1s on the diagonal and 0s
elsewhere).
Chapter 1
Introduction
In these lectures, we dive into an increasingly important area of focus in optimization methods for
data science. As we witness a growth in both the model complexity (i.e. the number of parameters)
and the amount of data available (i.e. the size of the dataset), standard optimization techniques
may suffer from the curse of dimensionality and their performance may deteriorate as dimensions
grow. The goal of these notes is to present some algorithmic ideas that can reduce the impact of
large dimensions, either in terms of parameters or data points.
Chapter 2 focuses on a setting in which the number of parameters to optimize over is extremely
large, and describes how coordinate descent approaches can be used in this setting. Chapter 3
considers the case in which the data itself is not available in a centralized fashion. Using tools from
duality theory, we describe efficient algorithms that can exploit this particular problem structure
in that setting.
Chapter 2
Coordinate descent methods
In this chapter, we address the treatment of large-scale optimization problems, where the number of
parameters to be optimized over is extremely large. In general, due to the curse of dimensionality,
the difficulty of the problem increases with the dimension, simply because there are more variables to
consider. However, on structured problems such as those arising in data science, there often exists a
low-dimensional or separable structure that allows for optimization steps to be taken over a subset of
variables. This is the underlying idea of coordinate descent methods, which regained interest
in the early 2000s due to their applicability in certain data science settings.
Throughout this chapter, we consider the unconstrained problem
minimize_{x∈Rd} f(x), (2.1.1)
where f ∈ C1(Rd). The idea of coordinate descent methods consists of taking a gradient step with
respect to a single decision variable at every iteration. To this end, we observe that for every x ∈ Rd ,
the gradient of f at x can be decomposed as
∇f(x) = ∑_{j=1}^{d} ∇_j f(x) e_j,
where ∇_j denotes the partial derivative with respect to the j-th variable of the function f (that
is, the j-th coordinate of ∇f) and e_j ∈ Rd is the j-th vector of the canonical basis of Rd.
The coordinate descent approach replaces the full gradient by a step along a coordinate gradient, as
formalized in Algorithm 1.
The variants of coordinate descent are mainly identified by the way they select the coordinate
sequence {jk }. There exist numerous rules for choosing the coordinate index, among which:
• Cyclic: Select the indices by cycling over {1, . . . , d} in that order. After d iterations, all indices
have been selected.
• Randomized cyclic: Cycle through a random ordering of {1, . . . , d}, which changes every d steps.
Algorithm 1: Coordinate descent.
Initialization: x0 ∈ Rd.
for k = 0, 1, 2, . . . do
1. Select a coordinate index jk ∈ {1, . . . , d}.
2. Choose a stepsize αk > 0.
3. Set
x_{k+1} = x_k − α_k ∇_{j_k} f(x_k) e_{j_k}. (2.1.2)
end
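To make Algorithm 1 concrete, here is a minimal Python sketch (ours, not from the original notes); the function grad_j, the constant stepsize alpha and the selection rule are placeholders to be supplied for a given problem.

import numpy as np

def coordinate_descent(grad_j, x0, alpha, n_iters, rule="random", seed=0):
    """Minimal sketch of Algorithm 1. grad_j(x, j) returns the partial
    derivative (nabla_j f)(x); alpha is a constant stepsize (e.g. 1/Lmax)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    order = np.arange(d)
    for k in range(n_iters):
        if rule == "cyclic":
            j = k % d
        elif rule == "randomized_cyclic":
            if k % d == 0:
                rng.shuffle(order)       # new random ordering every d steps
            j = order[k % d]
        else:                            # uniform random coordinate
            j = int(rng.integers(d))
        x[j] -= alpha * grad_j(x, j)     # update (2.1.2) along coordinate jk
    return x

# Example: f(x) = 0.5 * ||x||^2, for which (nabla_j f)(x) = x[j].
x_min = coordinate_descent(lambda x, j: x[j], x0=np.ones(5), alpha=0.5, n_iters=100)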
Block coordinate descent Rather than using a single index, it is possible to select a subset of the
coordinates (called a “block” in the literature). The k-th iteration of such a block coordinate descent
algorithm is thus
x_{k+1} = x_k − α_k ∑_{j∈B_k} ∇_j f(x_k) e_j, (2.1.3)
where B_k ⊂ {1, . . . , d}. A block typically consists of coordinates drawn without replacement, as in
the short sketch below.
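Reusing the names from the previous sketch (x, d, alpha and grad_j, together with a generator rng and a block size block_size, all assumed defined), a block update (2.1.3) evaluates all selected partial derivatives at the same point before updating:

# One block coordinate descent update (2.1.3): draw |Bk| indices
# without replacement and evaluate all partials at the current xk.
B = rng.choice(d, size=block_size, replace=False)
g = np.array([grad_j(x, j) for j in B])
x[B] -= alpha * g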
Remark 2.1.1 As described in the lab session, it is possible to combine randomized coordinate
descent with Nesterov’s acceleration technique to obtain improved theoretical guarantees. However,
this raises implementation issues that may limit the practical interest of coordinate descent
approaches.
The coordinate-wise decomposition of the gradient described above can thus be viewed as defining
a finite sum of d gradient components that can be sampled using any distribution that one
would use in a stochastic gradient setting. Note that, unlike in a stochastic gradient framework, any
coordinate descent step (even a randomized one) uses a descent direction.
Consider a finite-sum problem of the form
minimize_{x∈Rd} (1/n) ∑_{i=1}^{n} f_i(x), f_i(x) := ℓ_i(a_i^T x), (2.1.4)
where a_i^T x is a linear model of the data vector a_i, and ℓ_i : R → R is a convex loss function specific to
the i-th data point (such as ℓ_i(h) = ½(h − y_i)² for linear least squares). If the number of data points
is large, it is natural to think of applying stochastic gradient to this problem. Another approach
consists in considering an equivalent formulation of (2.1.4) through (Fenchel) duality, given by
maximize_{v∈Rn} g(v) := −(1/n) ∑_{i=1}^{n} f_i^*(v_i), (2.1.5)
where for any convex function φ : Rm → R, the convex conjugate function φ∗ is defined by
φ∗(z) := sup_{x∈Rm} { z^T x − φ(x) }.
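As a worked example (not spelled out in the original notes): for the least-squares loss ℓ_i(h) = ½(h − y_i)², the supremum of z h − ½(h − y_i)² over h ∈ R is attained at h = y_i + z, which gives
ℓ_i^*(z) = ½ z² + y_i z.
This is the conjugate of the scalar loss that arises in the dual of a linear least-squares problem.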
The so-called dual problem (2.1.5) has a finite-sum, separable form. It can thus be tackled using
(dual) coordinate ascent, the maximization counterpart of coordinate descent: the iteration of
this method is given by
v_{k+1} = v_k + α_k ∇_{i_k} g(v_k) e_{i_k}, (2.1.6)
leading to updating the iterate one coordinate at a time. Under appropriate assumptions on the
problem, the iteration (2.1.6) is equivalent to the original stochastic gradient iteration, with
x_k = (1/(λn)) ∑_{i=1}^{n} [v_k]_i a_i.
For this reason, stochastic gradient is sometimes viewed as applying coordinate
ascent to the dual problem.
Note that if f is separable, i.e. f(x) = ∑_{j=1}^{d} f_j([x]_j), and all f_j's are C1, then f is also C1 and we have
∇_j f(x) = f_j′([x]_j) for every j ∈ {1, . . . , d} and x ∈ Rd,
which means that a coordinate descent update can be computed while only accessing one coordinate.
The separability property is a strong one: it essentially states that a function can be
minimized independently with respect to each of its variables. It is more common for a
function to be partially separable in the following sense.
Definition 2.1.2 A function f : Rd → R is called partially separable if there exist J functions {f_j}_{j=1}^{J}
and J subsets of {1, . . . , d} denoted by B_1, . . . , B_J such that
i) For any x ∈ Rd,
f(x) = ∑_{j=1}^{J} f_j([x]_{B_j}),
where [x]_{B_j} ∈ R^{|B_j|} denotes the vector formed by the coordinates of x belonging to B_j.
The notion of partial separability is only interesting whenever the different sets {Bj }Jj=1 do not
intersect too much, the ideal case being when they form a partition of {1, . . . , d}.
Example 2.1.1 (Regularized least squares with sparse data) Given A ∈ Rn×d with sparse rows
and y ∈ Rn , consider the problem
minimize_{x∈Rd} f(x) := (1/(2n)) ‖Ax − y‖² + λ ∑_{j=1}^{d} [x]_j²,
where ‖Ax − y‖² = ∑_{i=1}^{n} (a_i^T x − y_i)².
Apply Algorithm 1 to this problem. At any iteration k, if we move along the j_k-th coordinate, the
partial derivative under consideration is
∇_{j_k} f(x_k) = (1/n) a_{j_k}^T (Ax_k − y) + 2λ [x_k]_{j_k},
where a_{j_k} here denotes the j_k-th column of A.
By storing the vector Ax_k (equivalently, the residual Ax_k − y) and updating it after each coordinate
step, the calculation of ∇_{j_k} f(x_k) can be made much cheaper when a_{j_k} is sparse, to the point
that the cost of a coordinate descent iteration is of the order of the number of nonzero elements
in a_{j_k}, as illustrated in the sketch below.
Proximal coordinate descent Consider a proximal gradient subproblem arising from minimizing
f (x) + λΩ(x) where Ω is partially separable. The subproblem at iteration k will be of the form
minimize_{x∈Rd} f(x_k) + ∇f(x_k)^T (x − x_k) + (1/(2α_k)) ‖x − x_k‖² + λ Ω(x).
Both the linear term and the proximal term are separable functions, since they can be written as
sums of one-dimensional functions. Since Ω is partially separable by assumption, this subproblem
can be tackled by coordinate descent techniques, and this is one popular way of solving numerous
proximal subproblems.
Another possibility consists of replacing the proximal gradient step by a proximal coordinate
descent step. For simplicity, we describe the procedure in the context of a separable regularization
function Ω(x) = ‖x‖₁. At iteration k, the next iterate is computed by first drawing a coordinate
index j_k, then performing the update
z_k ∈ argmin_{z∈R} ∇_{j_k} f(x_k)(z − [x_k]_{j_k}) + (1/(2α_k)) (z − [x_k]_{j_k})² + λ|z|.
This problem has a closed-form solution, and thus the next iterate can be computed as
x_{k+1} = x_k + (z_k − [x_k]_{j_k}) e_{j_k}.
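The closed form is given by soft-thresholding; here is a sketch (ours, with hypothetical names) of the resulting update:

import numpy as np

def soft_threshold(t, tau):
    """Proximal operator of tau * |.| : sign(t) * max(|t| - tau, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def proximal_cd_step(x, j, g_j, alpha, lam):
    """One proximal coordinate descent step for Omega(x) = ||x||_1,
    where g_j is the partial derivative (nabla_j f)(x_k)."""
    z = soft_threshold(x[j] - alpha * g_j, alpha * lam)
    x_next = x.copy()
    x_next[j] = z            # i.e. x_{k+1} = x_k + (z_k - [x_k]_j) e_j
    return x_next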
2.2 Convergence results
A classical counterexample (due to Powell) exhibits a nonconvex C1 function of three variables for
which cyclic coordinate descent fails: starting from (1, −1, 1) and using the best stepsize possible
(i.e. the one that minimizes the function in the chosen direction) at every iteration, the method
never converges to a minimum.
Despite this negative result, it is possible to provide guarantees on coordinate descent methods
under appropriate assumptions. In particular, a linear rate of convergence can be obtained for coor-
dinate descent methods on strongly convex problems: we provide below the assumptions necessary
to arrive at such a result.
Assumption 2.2.1 The objective function f in (2.1.1) is C1 and µ-strongly convex, with f∗ =
min_{x∈Rd} f(x). Moreover, for every j = 1, . . . , d, the partial derivative ∇_j f is L_j-Lipschitz continu-
ous, i.e.
∀x ∈ Rd, ∀h ∈ R, |∇_j f(x + h e_j) − ∇_j f(x)| ≤ L_j |h|. (2.2.1)
We let L_max = max_{1≤j≤d} L_j.
Theorem 2.2.1 Suppose that Assumption 2.2.1 holds, and that Algorithm 1 is applied to prob-
lem (2.1.1) with α_k = 1/L_max for all k and j_k being drawn uniformly at random in {1, . . . , d}. Then,
for any K ∈ N, we have
E[f(x_K) − f∗] ≤ (1 − µ/(dL_max))^K (f(x_0) − f∗). (2.2.2)
Suppose that f ∈ C_L^{1,1}. Then L ≤ dL_max, and a gradient descent iteration with stepsize 1/L
satisfies
f(x_K) − f∗ ≤ (1 − µ/L)^K (f(x_0) − f∗)
under the same assumptions as those of Theorem 2.2.1. This result is better than (2.2.2) in terms
of iterations; however, the cost of an iteration of gradient descent can be d times higher than that
of a coordinate descent iteration. Similarly to the reasoning we did for stochastic gradient methods,
convergence rates must be assessed in terms of the relevant cost metrics, as the calculation below
illustrates.
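To make this explicit, here is a back-of-the-envelope comparison (ours, not from the original notes): d iterations of coordinate descent cost roughly as much as one gradient descent iteration, and over such an "epoch" the bound (2.2.2) contracts by
(1 − µ/(dL_max))^d ≈ exp(−µ/L_max),
whereas one gradient descent iteration contracts the gap by (1 − µ/L) ≈ exp(−µ/L). Since L_max ≤ L ≤ dL_max, the coordinate descent factor is never worse for the same amount of work, and its exponent is up to d times better when L ≈ dL_max.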
Remark 2.2.1 Other results have been established in the convex and nonconvex settings, under
additional assumptions. In all cases, properties on the partial derivatives are required.
We end this section with a result for cyclic coordinate descent, which illustrates that this method
has worse guarantees than its randomized counterpart in general.
Theorem 2.2.2 Suppose that Assumption 2.2.1 holds, that f ∈ C_L^{1,1}, and that Algorithm 1 is applied
to problem (2.1.1) with α_k = 1/L_max for all k and j_k being drawn in a cyclic fashion in {1, . . . , d}.
Then, for any K ∈ N, we have
f(x_K) − f∗ ≤ (1 − µ/(2L_max(1 + d L²/L_max²)))^{K/d} (f(x_0) − f∗). (2.2.3)
2.3 Coordinate descent in parallel environments
Synchronous coordinate descent A typical setup for coordinate descent is that of several proces-
sors sharing a memory. In very high dimensions, the iterate x_k will be stored in this memory, and
processors can only access (blocks of) coordinates to perform updates. A synchronous coordinate
descent approach then consists of assigning coordinates (or blocks thereof) to various processors, so
that each of them performs an update of the form
[x_{k+1}]_j = [x_k]_j − α_k ∇_j f(x_k) for every coordinate j in the block assigned to that processor.
A synchronization step guarantees that all updates are accounted for prior to moving to iteration
k + 1. This step is essential if the function is not separable, but may result in significant idle time
for processors as they wait for all updates to be completed.
Theorem 2.3.1 Suppose that f is convex and C 1 with a unique minimizer x∗ . Suppose further that
Algorithm 2 is run under the following assumptions:
ii) For every processor, every coordinate of x̂k is updated infinitely often;
Algorithm 2: Asynchronous coordinate descent (loop of a given processor).
for k = 0, 1, 2, . . . do
1. Read the current coordinate values from the shared memory.
2. Select a coordinate index jk ∈ {1, . . . , d}.
3. Set
x_{k+1} = x_k − α_k ∇_{j_k} f(x̂_k) e_{j_k}, (2.3.1)
where x̂_k concatenates the last values of the coordinates available to the processor.
4. Update k = k + 1.
end
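As a toy illustration of the access pattern (ours, not the HOGWILD! code; CPython threads do not run this pure-Python loop truly in parallel, so the sketch only mimics the unsynchronized reads and writes):

import threading
import numpy as np

def async_cd(grad_j, x, alpha, n_updates, n_threads=4):
    """Sketch of asynchronous coordinate descent: each worker repeatedly
    draws a coordinate, evaluates the partial derivative at a possibly
    stale snapshot of the shared iterate, and writes without locking."""
    d = x.size

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(n_updates):
            j = int(rng.integers(d))
            x_hat = x.copy()                   # may mix old and new values
            x[j] -= alpha * grad_j(x_hat, j)   # unsynchronized write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x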
Remark 2.3.1 The asynchronous coordinate descent paradigm was used in conjunction with stochas-
tic gradient in the HOGWILD! algorithm, which won the NeurIPS Test of Time Award in 2020.
2.4 Conclusion
Large-scale problems have always pushed optimization algorithms to their limits, and have led to
reconsidering certain algorithms in light of their applicability to large-scale settings. Coordinate
descent methods are a perfect example of classical techniques that regained popularity because
of their efficiency in data science settings. On some instances, randomized coordinate descent
techniques bear a close connection with stochastic gradient methods. More broadly, coordinate
descent methods are quite efficient on high-dimensional problems that have a separable structure.
Finally, the use of coordinate descent methods in parallel environments has also contributed to their
revival in optimization.
Chapter 3
Distributed optimization and duality
In this chapter, we describe the theoretical insights behind distributed optimization formulations, in
which several agents collaborate to solve an optimization problem. This paradigm can be modeled
using a linearly constrained optimization formulation, and handling such formulations requires dedi-
cated algorithms. We first set the mathematical foundations of these methods via a brief introduction
to duality, then present our algorithms of interest.
3.1 Duality for linearly constrained problems
Throughout this chapter, we consider the linearly constrained problem
minimize_{x∈Rd} f(x) subject to Ax = b, (3.1.1)
where A ∈ Rm×d and b ∈ Rm. For simplicity, we will assume that the feasible set {x ∈ Rd | Ax = b}
is not empty.
Duality theory handles constrained formulations by reformulating the problem as an
unconstrained optimization problem. We present the theoretical arguments for the special case of
problem (3.1.1), which yields a much simpler analysis.
The Lagrangian function combines the objective function and the constraints,
L(x, y) := f(x) + y^T (Ax − b), (3.1.2)
and allows us to restate the original problem as an unconstrained one, called the primal problem:
minimize_{x∈Rd} max_{y∈Rm} L(x, y). (3.1.3)
In our case, the solutions of the primal problem (3.1.3) are identical to those of the original
problem (3.1.1). The difficulty of solving problem (3.1.3) lies in the definition of its objective
function as the optimal value of a maximization problem.
Exchanging the order of minimization and maximization yields the dual problem
maximize_{y∈Rm} min_{x∈Rd} L(x, y), (3.1.4)
where the function y ↦ min_{x∈Rd} L(x, y) is called the dual function of the problem.
Unlike the primal problem, the dual problem is always a concave maximization problem (i.e. the
opposite of the dual function is convex), which facilitates its resolution by standard optimization
techniques. The goal is
then to solve the dual problem in order to get the solution of the primal problem, thanks to properties
such as the one below.
Assumption 3.1.1 We suppose that strong duality holds between problem (3.1.1) and its dual,
that is,
min_{x∈Rd} max_{y∈Rm} L(x, y) = max_{y∈Rm} min_{x∈Rd} L(x, y).
A sufficient condition for Assumption 3.1.1 is that f be convex, but this is not necessary.
3.2 Dual ascent and augmented Lagrangian methods
Under Assumption 3.1.1, a natural approach consists of applying (sub)gradient ascent to the dual
problem. The resulting dual ascent iteration is
x_{k+1} ∈ argmin_{x∈Rd} L(x, y_k)
y_{k+1} = y_k + α_k (Ax_{k+1} − b), (3.2.1)
where α_k > 0 is a stepsize for the dual ascent step, and Ax_{k+1} − b is a subgradient for the dual
function y ↦ min_{x∈Rd} L(x, y) at y_k.
Definition 3.2.1 The augmented Lagrangian of problem (3.1.1) is the function defined on Rd × Rm × R++
by
L_a(x, y; λ) := f(x) + y^T (Ax − b) + (λ/2) ‖Ax − b‖². (3.2.2)
Augmented Lagrangians thus form a family of functions parameterized by λ > 0, which put more
emphasis on the constraint violation as λ grows.
The augmented Lagrangian algorithm, also called method of multipliers, performs the following
iteration:
x_{k+1} ∈ argmin_{x∈Rd} L_a(x, y_k; λ)
y_{k+1} = y_k + λ (Ax_{k+1} − b). (3.2.3)
In this algorithm, λ is held constant and plays the role of a constant stepsize; many more sophisticated
choices of both the augmented Lagrangian function and the stepsizes have been proposed. In general, the
advantages of augmented Lagrangian techniques are that the subproblems defining x_{k+1} become
easier to solve (thanks to regularization) and that the overall guarantees on the primal-dual pair are
stronger.
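For illustration (our example, not from the notes), take f(x) = ½‖x − x0‖², i.e. projecting a point x0 onto {Ax = b}; the x-update of (3.2.3) then has a closed form:

import numpy as np

def method_of_multipliers(A, b, x0, lam=10.0, n_iters=50):
    """Sketch of iteration (3.2.3) for f(x) = 0.5 * ||x - x0||^2.
    Setting the gradient of La to zero gives the x-update
    (I + lam * A.T A) x = x0 - A.T y + lam * A.T b."""
    m, d = A.shape
    y = np.zeros(m)
    H = np.eye(d) + lam * A.T @ A
    for _ in range(n_iters):
        x = np.linalg.solve(H, x0 - A.T @ y + lam * (A.T @ b))
        y = y + lam * (A @ x - b)        # dual update with stepsize lam
    return x, y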
3.3 Splitting methods
3.3.1 Dual decomposition
Consider now a problem whose variables split into two blocks u and v:
minimize_{u∈Rd1, v∈Rd2} f(u) + g(v) subject to Au + Bv = c, (3.3.1)
where A ∈ Rm×d1, B ∈ Rm×d2 and c ∈ Rm. Given the particular structure (sometimes called
splitting) between the variables u and v, one may want to update those variables separately rather
than gathering them into a single vector x and performing a single update.
This idea is precisely that of dual decomposition. At iteration k, the dual decomposition method
applied to problem (3.3.1) computes
u_{k+1} ∈ argmin_{u∈Rd1} L(u, v_k, y_k)
v_{k+1} ∈ argmin_{v∈Rd2} L(u_k, v, y_k) (3.3.2)
y_{k+1} = y_k + α_k (Au_{k+1} + Bv_{k+1} − c),
where α_k > 0 and L(u, v, y) := f(u) + g(v) + y^T (Au + Bv − c) denotes the Lagrangian of
problem (3.3.1). Interestingly, the calculations for u_{k+1} and v_{k+1} are completely independent, and
can be carried out in parallel, as in the sketch below. This observation and its practical realization
have led to successful applications of dual decomposition in several fields.
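A sketch (ours) with f(u) = ½‖u − u0‖² and g(v) = ½‖v − v0‖², for which both argmins of (3.3.2) are explicit; the two updates could run on different processors:

import numpy as np

def dual_decomposition(A, B, c, u0, v0, alpha=0.05, n_iters=500):
    """Sketch of iteration (3.3.2). Since L(u, v, y) is separable in
    (u, v), the two argmins are independent and can run in parallel."""
    y = np.zeros(c.size)
    for _ in range(n_iters):
        u = u0 - A.T @ y                        # argmin_u L(u, v_k, y_k)
        v = v0 - B.T @ y                        # argmin_v L(u_k, v, y_k)
        y = y + alpha * (A @ u + B @ v - c)     # dual ascent step
    return u, v, y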
3.3.2 ADMM
The Alternating Direction Method of Multipliers, or ADMM, is an increasingly popular varia-
tion on the augmented Lagrangian paradigm that bears some connection with coordinate descent
approaches, in that it splits the problem into two sets of variables.
Recall problem (3.3.1) above. For any λ > 0, the augmented Lagrangian of problem (3.3.1) has
the form
L_a(u, v, y; λ) = f(u) + g(v) + y^T (Au + Bv − c) + (λ/2) ‖Au + Bv − c‖².
The ADMM iteration exploits the structure of the problem by updating u and v separately, in an
alternating fashion. Starting from (u_k, v_k, y_k), the ADMM counterpart to iteration (3.2.3) is
u_{k+1} ∈ argmin_{u∈Rd1} L_a(u, v_k, y_k; λ)
v_{k+1} ∈ argmin_{v∈Rd2} L_a(u_{k+1}, v, y_k; λ) (3.3.3)
y_{k+1} = y_k + λ (Au_{k+1} + Bv_{k+1} − c).
The first two updates of the iteration (3.3.3) cannot be run in parallel, but the philosophy is slightly
different from that of the dual decomposition method. Indeed, in ADMM, it is common that solving
for a subset of the variables is much easier than solving for all variables at once (see our example
in the next section). The ADMM framework makes it possible to exploit this property within an
augmented Lagrangian method, as the sketch below illustrates on a classical problem.
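A classical instantiation (our sketch, in the spirit of [1]): the lasso problem min_u ½‖Mu − y‖² + τ‖u‖₁, rewritten with f(u) = ½‖Mu − y‖², g(v) = τ‖v‖₁ and the constraint u − v = 0 (so A = I, B = −I, c = 0 in (3.3.1)); both subproblems of (3.3.3) are then cheap:

import numpy as np

def admm_lasso(M, y, tau, lam=1.0, n_iters=100):
    """Sketch of iteration (3.3.3) for the lasso splitting above."""
    n, d = M.shape
    v = np.zeros(d)
    w = np.zeros(d)                      # multiplier (y_k in (3.3.3))
    H = M.T @ M + lam * np.eye(d)        # could be factored once for speed
    for _ in range(n_iters):
        u = np.linalg.solve(H, M.T @ y - w + lam * v)           # u-update
        t = u + w / lam
        v = np.sign(t) * np.maximum(np.abs(t) - tau / lam, 0)   # soft-threshold
        w = w + lam * (u - v)                                   # multiplier update
    return v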
Remark 3.3.1 The idea of splitting the objective and the constraints across two groups of variables
can be extended to as many groups of variables as needed, depending on the structure of the
problem.
To end this section, we briefly mention that there exist convergence results for ADMM-type
frameworks, typically under convexity assumptions on the problem [1]. A typical result consists of
showing that
‖Au_k + Bv_k − c‖ → 0,
f(u_k) + g(v_k) → min_{u,v} { f(u) + g(v) : Au + Bv = c },
y_k → y∗,
where y∗ is a solution of the dual problem.
Generalization The idea behind the formulation (3.4.1) can be extended to the case of data spread
over a network, represented by a graph G = (V, E): every vertex s ∈ V of the graph represents an
agent, while every edge (s, s′) ∈ E represents a channel of communication between two agents in the
graph. Letting x^(s) ∈ Rd and f^(s) : Rd → R represent the parameter copy and objective function
for agent s ∈ V, respectively, the consensus optimization problem can be written as:
minimize_{ {x^(s)}_{s∈V} ∈ (Rd)^{|V|} } ∑_{s∈V} f^(s)(x^(s))
subject to x^(s) = x^(s′) ∀(s, s′) ∈ E. (3.4.2)
When the graph is fully connected, i.e. all agents communicate, this problem reduces to an uncon-
strained problem. In general, however, the solutions of this problem are much more difficult to identify,
and one must balance minimizing the objective with satisfying the so-called consensus constraints.
A related approach, the decentralized gradient method, applies to the minimization of a finite sum
∑_{i=1}^{m} f_i(x) of agents' objectives, without consensus constraints but with an implicit graph
structure (V, E) connecting the m agents.
Given a matrix W ∈ Rm×m that is doubly stochastic (i.e. with nonnegative coefficients such that
every row and every column sums to 1) and satisfies [W]_{ij} ≠ 0 if and only if i = j or (i, j) ∈ E,
the k-th iteration of the decentralized gradient method at agent i reads
x^(i)_{k+1} = ∑_{j=1}^{m} [W]_{ij} x^(j)_k − α_k ∇f_i(x^(i)_k). (3.4.3)
The iteration (3.4.3) thus combines a gradient step for agent i together with a so-called mixing step
(or consensus step) in which the current value of the iterate for agent i is combined with that of its
neighbors. This framework is steadily gaining popularity in the machine learning community.
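A small sketch (ours, not from the original notes) simulating all agents in a single loop; grads is a list of per-agent gradient functions, W a doubly stochastic mixing matrix respecting the graph, and row i of X holds agent i's copy x^(i):

import numpy as np

def decentralized_gradient(grads, W, X0, alpha, n_iters):
    """Sketch of iteration (3.4.3): a mixing step with the matrix W
    followed by a local gradient step at each agent."""
    X = np.array(X0, dtype=float)
    m = len(grads)
    for _ in range(n_iters):
        G = np.stack([grads[i](X[i]) for i in range(m)])   # local gradients
        X = W @ X - alpha * G   # row i: sum_j W_ij x^(j) - alpha * grad f_i
    return X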
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends in Machine
Learning, 3:1–122, 2010.
[3] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge University Press, 2022.