Lecture Notes: Large-Scale and Distributed Optimization
Clément W. Royer
The purpose of these lectures is to review optimization frameworks that make it possible to deploy the methods
seen in other parts of the course at scale. The learning goals are as follows:
• Understand the main principle behind coordinate descent methods, and their appeal in a
distributed environment.
Notations
• Scalars (i.e. reals) are denoted by lowercase letters: a, b, c, α, β, γ.
• The set of natural numbers (nonnegative integers) is denoted by N; the set of integers is
denoted by Z.
• The set of real numbers is denoted by R. Our notations for the subset of nonnegative real
numbers and the set of positive real numbers are R+ and R++ , respectively.
• The notation Rd is used for the set of vectors with d ∈ N real components; although we may
not explicitly indicate it in the rest of these notes, we always assume that d ≥ 1.
• We use Rn×d to denote the set of real rectangular matrices with n rows and d columns, where
n and d will always be assumed to be at least 1. If n = d, Rd×d refers to the set of square
matrices of size d.
• Given a matrix A ∈ Rn×d , Aij refers to the coefficient from the i-th row and the j-th column of
A: the diagonal of A is given by the coefficients Aii . Provided this notation is not ambiguous,
we use the notations A, [A_ij]_{1≤i≤n, 1≤j≤d} and [A_ij] interchangeably.
• For every d ≥ 1, Id refers to the identity matrix in Rd×d (with 1s on the diagonal and 0s
elsewhere).
Chapter 1
Introduction
In these lectures, we dive into an increasingly important area of focus in optimization methods for
data science. As we witness a growth in both the model complexity (i.e. the number of parameters)
and the amount of data available (i.e. the size of the dataset), standard optimization techniques
may suffer from the curse of dimensionality and their performance may deteriorate as dimensions
grow. The goal of these notes is to present some algorithmic ideas that can reduce the impact of
large dimensions, either in terms of parameters or data points.
Chapter 2 focuses on a setting in which the number of parameters to optimize over is extremely
large, and describes how coordinate descent approaches can be used in this setting. Chapter 3
considers the case in which the data itself is not available in a centralized fashion. Using tools from
duality theory, we describe efficient algorithms that can exploit this particular problem structure
in that setting.
Chapter 2
Coordinate descent methods
In this chapter, we address the treatment of large-scale optimization problems, where the number of
parameters to be optimized over is extremely large. In general, due to the curse of dimensionality,
the difficulty of the problem increases with the dimension, simply because there are more variables to
consider. However, on structured problems such as those arising in data science, there often exists a
low-dimensional or separable structure that allows for optimization steps to be taken over a subset of
variables. This is the underlying idea of coordinate descent methods, which regained interest
in the early 2000s due to their applicability in certain data science settings.
Throughout this chapter, we consider the unconstrained problem
minimize_{x∈Rd} f(x), (2.1.1)
where f ∈ C1(Rd). The idea of coordinate descent methods consists of taking a gradient step with
respect to a single decision variable at every iteration. To this end, we observe that for every x ∈ Rd ,
the gradient of f at x can be decomposed as
∇f(x) = ∑_{j=1}^{d} ∇_j f(x) e_j,
where ∇_j denotes the partial derivative with respect to the j-th variable of the function f (that
is, the j-th coordinate of ∇f) and e_j ∈ Rd is the j-th vector of the canonical basis of Rd.
The coordinate descent approach replaces the full gradient by a step along a coordinate gradient, as
formalized in Algorithm 1.
The variants of coordinate descent are mainly identified by the way they select the coordinate
sequence {jk }. There exist numerous rules for choosing the coordinate index, among which:
• Cyclic: Select the indices by cycling over {1, . . . , d} in that order. After d iterations, all indices
have been selected.
• Randomized cyclic: Cycle through a random ordering of {1, . . . , d}, which changes every d steps.
Algorithm 1: Coordinate descent.
Initialization: x0 ∈ Rd.
for k = 0, 1, 2, . . . do
1. Select a coordinate index jk ∈ {1, . . . , d}.
2. Choose a stepsize αk > 0.
3. Set
x_{k+1} = x_k − α_k ∇_{j_k} f(x_k) e_{j_k}. (2.1.2)
end
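To make Algorithm 1 concrete, here is a minimal Python sketch (ours, not from the original notes); the function grad_j, the constant stepsize alpha and the selection rule are placeholders to be supplied for a given problem.

import numpy as np

def coordinate_descent(grad_j, x0, alpha, n_iters, rule="random", seed=0):
    """Minimal sketch of Algorithm 1. grad_j(x, j) returns the partial
    derivative (nabla_j f)(x); alpha is a constant stepsize (e.g. 1/Lmax)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    order = np.arange(d)
    for k in range(n_iters):
        if rule == "cyclic":
            j = k % d
        elif rule == "randomized_cyclic":
            if k % d == 0:
                rng.shuffle(order)       # new random ordering every d steps
            j = order[k % d]
        else:                            # uniform random coordinate
            j = int(rng.integers(d))
        x[j] -= alpha * grad_j(x, j)     # update (2.1.2) along coordinate jk
    return x

# Example: f(x) = 0.5 * ||x||^2, for which (nabla_j f)(x) = x[j].
x_min = coordinate_descent(lambda x, j: x[j], x0=np.ones(5), alpha=0.5, n_iters=100)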
Block coordinate descent Rather than using a single index, it is possible to select a subset of the
coordinates (called a “block” in the literature). The k-th iteration of such a block coordinate descent
algorithm is thus
x_{k+1} = x_k − α_k ∑_{j∈B_k} ∇_j f(x_k) e_j, (2.1.3)
where B_k ⊂ {1, . . . , d}. A block typically consists of coordinates drawn without replacement, as in
the short sketch below.
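Reusing the names from the previous sketch (x, d, alpha and grad_j, together with a generator rng and a block size block_size, all assumed defined), a block update (2.1.3) evaluates all selected partial derivatives at the same point before updating:

# One block coordinate descent update (2.1.3): draw |Bk| indices
# without replacement and evaluate all partials at the current xk.
B = rng.choice(d, size=block_size, replace=False)
g = np.array([grad_j(x, j) for j in B])
x[B] -= alpha * g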
Remark 2.1.1 As described in the lab session, it is possible to combine randomized coordinate
descent with Nesterov’s acceleration technique to obtain improved theoretical guarantees. However,
this raises implementation issues that may limit the practical interest of coordinate descent
approaches.
The coordinate-wise decomposition of the gradient described above can thus be viewed as defining
a finite sum of d gradient components that can be sampled using any distribution that one
would use in a stochastic gradient setting. Note that, unlike in a stochastic gradient framework, any
coordinate descent step (even a randomized one) uses a descent direction.
Consider a finite-sum problem of the form
minimize_{x∈Rd} (1/n) ∑_{i=1}^{n} f_i(x), f_i(x) := ℓ_i(a_i^T x), (2.1.4)
where a_i^T x is a linear model of the data vector a_i, and ℓ_i : R → R is a convex loss function specific to
the i-th data point (such as ℓ_i(h) = ½(h − y_i)² for linear least squares). If the number of data points
is large, it is natural to think of applying stochastic gradient to this problem. Another approach
consists in considering an equivalent formulation of (2.1.4) through (Fenchel) duality, given by
maximize_{v∈Rn} g(v) := −(1/n) ∑_{i=1}^{n} f_i^*(v_i), (2.1.5)
where for any convex function φ : Rm → R, the convex conjugate function φ∗ is defined by
φ∗(z) := sup_{x∈Rm} { z^T x − φ(x) }.
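As a worked example (not spelled out in the original notes): for the least-squares loss ℓ_i(h) = ½(h − y_i)², the supremum of z h − ½(h − y_i)² over h ∈ R is attained at h = y_i + z, which gives
ℓ_i^*(z) = ½ z² + y_i z.
This is the conjugate of the scalar loss that arises in the dual of a linear least-squares problem.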
The so-called dual problem (2.1.5) has a finite-sum, separable form. It can thus be tackled using
(dual) coordinate ascent, the maximization counterpart of coordinate descent: the iteration of
this method is given by
v_{k+1} = v_k + α_k ∇_{i_k} g(v_k) e_{i_k}, (2.1.6)
leading to updating the iterate one coordinate at a time. Under appropriate assumptions on the
problem, the iteration (2.1.6) is equivalent to the original stochastic gradient iteration, with
x_k = (1/(λn)) ∑_{i=1}^{n} [v_k]_i a_i.
For this reason, stochastic gradient is sometimes viewed as applying coordinate
ascent to the dual problem.
Note that if f is separable, i.e. f(x) = ∑_{j=1}^{d} f_j([x]_j), and all f_j's are C1, then f is also C1 and we have
∇_j f(x) = f_j′([x]_j) for every j ∈ {1, . . . , d} and x ∈ Rd,
which means that a coordinate descent update can be computed while only accessing one coordinate.
The separability property is a strong one: it essentially states that a function can be
minimized independently with respect to each of its variables. It is more common for a
function to be partially separable in the following sense.
Definition 2.1.2 A function f : Rd → R is called partially separable if there exist J functions {f_j}_{j=1}^{J}
and J subsets of {1, . . . , d} denoted by B_1, . . . , B_J such that
i) For any x ∈ Rd,
f(x) = ∑_{j=1}^{J} f_j([x]_{B_j}),
where [x]_{B_j} ∈ R^{|B_j|} denotes the vector formed by the coordinates of x belonging to B_j.
The notion of partial separability is only interesting whenever the different sets {Bj }Jj=1 do not
intersect too much, the ideal case being when they form a partition of {1, . . . , d}.
Example 2.1.1 (Regularized least squares with sparse data) Given A ∈ Rn×d with sparse rows
and y ∈ Rn , consider the problem
minimize_{x∈Rd} f(x) := (1/(2n)) ‖Ax − y‖² + λ ∑_{j=1}^{d} [x]_j²,
where ‖Ax − y‖² = ∑_{i=1}^{n} (a_i^T x − y_i)².
Apply Algorithm 1 to this problem. At any iteration k, if we move along the j_k-th coordinate, the
partial derivative under consideration is
∇_{j_k} f(x_k) = (1/n) a_{j_k}^T (Ax_k − y) + 2λ [x_k]_{j_k},
where a_{j_k} here denotes the j_k-th column of A.
By storing the vector Ax_k (equivalently, the residual Ax_k − y) and updating it after each coordinate
step, the calculation of ∇_{j_k} f(x_k) can be made much cheaper when a_{j_k} is sparse, to the point
that the cost of a coordinate descent iteration is of the order of the number of nonzero elements
in a_{j_k}, as illustrated in the sketch below.
Proximal coordinate descent Consider a proximal gradient subproblem arising from minimizing
f (x) + λΩ(x) where Ω is partially separable. The subproblem at iteration k will be of the form
minimize_{x∈Rd} f(x_k) + ∇f(x_k)^T (x − x_k) + (1/(2α_k)) ‖x − x_k‖² + λ Ω(x).
Both the linear term and the proximal term are separable functions, since they can be written as
sums of one-dimensional functions. Since Ω is partially separable by assumption, this subproblem
can be tackled by coordinate descent techniques, and this is one popular way of solving numerous
proximal subproblems.
Another possibility consists of replacing the proximal gradient step by a proximal coordinate
descent step. For simplicity, we describe the procedure in the context of a separable regularization
function Ω(x) = ‖x‖₁. At iteration k, the next iterate is computed by first drawing a coordinate
index j_k, then performing the update
z_k ∈ argmin_{z∈R} ∇_{j_k} f(x_k)(z − [x_k]_{j_k}) + (1/(2α_k)) (z − [x_k]_{j_k})² + λ|z|.
This problem has a closed-form solution, and thus the next iterate can be computed as
x_{k+1} = x_k + (z_k − [x_k]_{j_k}) e_{j_k}.
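The closed form is given by soft-thresholding; here is a sketch (ours, with hypothetical names) of the resulting update:

import numpy as np

def soft_threshold(t, tau):
    """Proximal operator of tau * |.| : sign(t) * max(|t| - tau, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def proximal_cd_step(x, j, g_j, alpha, lam):
    """One proximal coordinate descent step for Omega(x) = ||x||_1,
    where g_j is the partial derivative (nabla_j f)(x_k)."""
    z = soft_threshold(x[j] - alpha * g_j, alpha * lam)
    x_next = x.copy()
    x_next[j] = z            # i.e. x_{k+1} = x_k + (z_k - [x_k]_j) e_j
    return x_next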
2.2 Convergence results
A classical counterexample (due to Powell) exhibits a nonconvex C1 function of three variables for
which cyclic coordinate descent fails: starting from (1, −1, 1) and using the best stepsize possible
(i.e. the one that minimizes the function in the chosen direction) at every iteration, the method
never converges to a minimum.
Despite this negative result, it is possible to provide guarantees on coordinate descent methods
under appropriate assumptions. In particular, a linear rate of convergence can be obtained for coor-
dinate descent methods on strongly convex problems: we provide below the assumptions necessary
to arrive at such a result.
Assumption 2.2.1 The objective function f in (2.1.1) is C1 and µ-strongly convex, with f∗ =
min_{x∈Rd} f(x). Moreover, for every j = 1, . . . , d, the partial derivative ∇_j f is L_j-Lipschitz continu-
ous, i.e.
∀x ∈ Rd, ∀h ∈ R, |∇_j f(x + h e_j) − ∇_j f(x)| ≤ L_j |h|. (2.2.1)
We let L_max = max_{1≤j≤d} L_j.
Theorem 2.2.1 Suppose that Assumption 2.2.1 holds, and that Algorithm 1 is applied to prob-
lem (2.1.1) with α_k = 1/L_max for all k and j_k being drawn uniformly at random in {1, . . . , d}. Then,
for any K ∈ N, we have
E[f(x_K) − f∗] ≤ (1 − µ/(dL_max))^K (f(x_0) − f∗). (2.2.2)
Suppose that f ∈ C_L^{1,1}. Then L ≤ dL_max, and a gradient descent iteration with stepsize 1/L
satisfies
f(x_K) − f∗ ≤ (1 − µ/L)^K (f(x_0) − f∗)
under the same assumptions as those of Theorem 2.2.1. This result is better than (2.2.2) in terms
of iterations; however, the cost of an iteration of gradient descent can be d times higher than that
of a coordinate descent iteration. Similarly to the reasoning we did for stochastic gradient methods,
convergence rates must be assessed in terms of the relevant cost metrics, as the calculation below
illustrates.
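To make this explicit, here is a back-of-the-envelope comparison (ours, not from the original notes): d iterations of coordinate descent cost roughly as much as one gradient descent iteration, and over such an "epoch" the bound (2.2.2) contracts by
(1 − µ/(dL_max))^d ≈ exp(−µ/L_max),
whereas one gradient descent iteration contracts the gap by (1 − µ/L) ≈ exp(−µ/L). Since L_max ≤ L ≤ dL_max, the coordinate descent factor is never worse for the same amount of work, and its exponent is up to d times better when L ≈ dL_max.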
Remark 2.2.1 Other results have been established in the convex and nonconvex settings, under
additional assumptions. In all cases, properties on the partial derivatives are required.
We end this section with a result for cyclic coordinate descent, which illustrates that this method
has worse guarantees than its randomized counterpart in general.
Theorem 2.2.2 Suppose that Assumption 2.2.1 holds, that f ∈ C_L^{1,1}, and that Algorithm 1 is applied
to problem (2.1.1) with α_k = 1/L_max for all k and j_k being drawn in a cyclic fashion in {1, . . . , d}.
Then, for any K ∈ N, we have
f(x_K) − f∗ ≤ (1 − µ/(2L_max(1 + d L²/L_max²)))^{K/d} (f(x_0) − f∗). (2.2.3)
2.3 Coordinate descent in parallel environments
Synchronous coordinate descent A typical setup for coordinate descent is that of several proces-
sors sharing a memory. In very high dimensions, the iterate x_k will be stored in this memory, and
processors can only access (blocks of) coordinates to perform updates. A synchronous coordinate
descent approach then consists of assigning coordinates (or blocks thereof) to various processors, so
that each of them performs an update of the form
[x_{k+1}]_j = [x_k]_j − α_k ∇_j f(x_k) for every coordinate j in the block assigned to that processor.
A synchronization step guarantees that all updates are accounted for prior to moving to iteration
k + 1. This step is essential if the function is not separable, but may result in significant idle time
for processors as they wait for all updates to be completed.
Theorem 2.3.1 Suppose that f is convex and C 1 with a unique minimizer x∗ . Suppose further that
Algorithm 2 is run under the following assumptions:
ii) For every processor, every coordinate of x̂k is updated infinitely often;
Algorithm 2: Asynchronous coordinate descent (loop of a given processor).
for k = 0, 1, 2, . . . do
1. Read the current coordinate values from the shared memory.
2. Select a coordinate index jk ∈ {1, . . . , d}.
3. Set
x_{k+1} = x_k − α_k ∇_{j_k} f(x̂_k) e_{j_k}, (2.3.1)
where x̂_k concatenates the last values of the coordinates available to the processor.
4. Update k = k + 1.
end
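As a toy illustration of the access pattern (ours, not the HOGWILD! code; CPython threads do not run this pure-Python loop truly in parallel, so the sketch only mimics the unsynchronized reads and writes):

import threading
import numpy as np

def async_cd(grad_j, x, alpha, n_updates, n_threads=4):
    """Sketch of asynchronous coordinate descent: each worker repeatedly
    draws a coordinate, evaluates the partial derivative at a possibly
    stale snapshot of the shared iterate, and writes without locking."""
    d = x.size

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(n_updates):
            j = int(rng.integers(d))
            x_hat = x.copy()                   # may mix old and new values
            x[j] -= alpha * grad_j(x_hat, j)   # unsynchronized write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x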
Remark 2.3.1 The asynchronous coordinate descent paradigm was used in conjunction with stochas-
tic gradient in the HOGWILD! algorithm, which won the NeurIPS Test of Time Award in 2020.
2.4 Conclusion
Large-scale problems have always pushed optimization algorithms to their limits, and have led to
reconsidering certain algorithms in light of their applicability to large-scale settings. Coordinate
descent methods are a perfect example of classical techniques that regained popularity because
of their efficiency in data science settings. On some instances, randomized coordinate descent
techniques bear a close connection with stochastic gradient methods. More broadly, coordinate
descent methods are quite efficient on high-dimensional problems that have a separable structure.
Finally, the use of coordinate descent methods in parallel environments has also contributed to their
revival in optimization.
Chapter 3
Distributed optimization and duality
In this chapter, we describe the theoretical insights behind distributed optimization formulations, in
which several agents collaborate to solve an optimization problem. This paradigm can be modeled
using a linearly constrained optimization formulation, and handling such formulations requires dedi-
cated algorithms. We first set the mathematical foundations of these methods via a brief introduction
to duality, then present our algorithms of interest.
3.1 Duality for linearly constrained problems
Throughout this chapter, we consider the linearly constrained problem
minimize_{x∈Rd} f(x) subject to Ax = b, (3.1.1)
where A ∈ Rm×d and b ∈ Rm. For simplicity, we will assume that the feasible set {x ∈ Rd | Ax = b}
is not empty.
Duality theory handles constrained formulations by reformulating the problem as an
unconstrained optimization problem. We present the theoretical arguments for the special case of
problem (3.1.1), which yields a much simpler analysis.
The Lagrangian function combines the objective function and the constraints,
L(x, y) := f(x) + y^T (Ax − b), (3.1.2)
and allows us to restate the original problem as an unconstrained one, called the primal problem:
minimize_{x∈Rd} max_{y∈Rm} L(x, y). (3.1.3)
In our case, the solutions of the primal problem (3.1.3) are identical to those of the original
problem (3.1.1). The difficulty of solving problem (3.1.3) lies in the definition of its objective
function as the optimal value of a maximization problem.
Exchanging the order of minimization and maximization yields the dual problem
maximize_{y∈Rm} min_{x∈Rd} L(x, y), (3.1.4)
where the function y ↦ min_{x∈Rd} L(x, y) is called the dual function of the problem.
Unlike the primal problem, the dual problem is always a concave maximization problem (i.e. the
opposite of the dual function is convex), which facilitates its resolution by standard optimization
techniques. The goal is
then to solve the dual problem in order to get the solution of the primal problem, thanks to properties
such as the one below.
Assumption 3.1.1 We suppose that strong duality holds between problem (3.1.1) and its dual,
that is,
min_{x∈Rd} max_{y∈Rm} L(x, y) = max_{y∈Rm} min_{x∈Rd} L(x, y).
A sufficient condition for Assumption 3.1.1 is that f be convex, but this is not necessary.
3.2 Dual ascent and augmented Lagrangian methods
Under Assumption 3.1.1, a natural approach consists of applying (sub)gradient ascent to the dual
problem. The resulting dual ascent iteration is
x_{k+1} ∈ argmin_{x∈Rd} L(x, y_k)
y_{k+1} = y_k + α_k (Ax_{k+1} − b), (3.2.1)
where α_k > 0 is a stepsize for the dual ascent step, and Ax_{k+1} − b is a subgradient for the dual
function y ↦ min_{x∈Rd} L(x, y) at y_k.
Definition 3.2.1 The augmented Lagrangian of problem (3.1.1) is the function defined on Rd × Rm × R++
by
L_a(x, y; λ) := f(x) + y^T (Ax − b) + (λ/2) ‖Ax − b‖². (3.2.2)
Augmented Lagrangians thus form a family of functions parameterized by λ > 0, which put more
emphasis on the constraint violation as λ grows.
The augmented Lagrangian algorithm, also called method of multipliers, performs the following
iteration:
x_{k+1} ∈ argmin_{x∈Rd} L_a(x, y_k; λ)
y_{k+1} = y_k + λ (Ax_{k+1} − b). (3.2.3)
In this algorithm, λ is held constant and plays the role of a constant stepsize; many more sophisticated
choices of both the augmented Lagrangian function and the stepsizes have been proposed. In general, the
advantages of augmented Lagrangian techniques are that the subproblems defining x_{k+1} become
easier to solve (thanks to regularization) and that the overall guarantees on the primal-dual pair are
stronger.
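For illustration (our example, not from the notes), take f(x) = ½‖x − x0‖², i.e. projecting a point x0 onto {Ax = b}; the x-update of (3.2.3) then has a closed form:

import numpy as np

def method_of_multipliers(A, b, x0, lam=10.0, n_iters=50):
    """Sketch of iteration (3.2.3) for f(x) = 0.5 * ||x - x0||^2.
    Setting the gradient of La to zero gives the x-update
    (I + lam * A.T A) x = x0 - A.T y + lam * A.T b."""
    m, d = A.shape
    y = np.zeros(m)
    H = np.eye(d) + lam * A.T @ A
    for _ in range(n_iters):
        x = np.linalg.solve(H, x0 - A.T @ y + lam * (A.T @ b))
        y = y + lam * (A @ x - b)        # dual update with stepsize lam
    return x, y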
3.3 Splitting methods
3.3.1 Dual decomposition
Consider now a problem whose variables split into two blocks u and v:
minimize_{u∈Rd1, v∈Rd2} f(u) + g(v) subject to Au + Bv = c, (3.3.1)
where A ∈ Rm×d1, B ∈ Rm×d2 and c ∈ Rm. Given the particular structure (sometimes called
splitting) between the variables u and v, one may want to update those variables separately rather
than gathering them into a single vector x and performing a single update.
This idea is precisely that of dual decomposition. At iteration k, the dual decomposition method
applied to problem (3.3.1) computes
u_{k+1} ∈ argmin_{u∈Rd1} L(u, v_k, y_k)
v_{k+1} ∈ argmin_{v∈Rd2} L(u_k, v, y_k) (3.3.2)
y_{k+1} = y_k + α_k (Au_{k+1} + Bv_{k+1} − c),
where α_k > 0 and L(u, v, y) := f(u) + g(v) + y^T (Au + Bv − c) denotes the Lagrangian of
problem (3.3.1). Interestingly, the calculations for u_{k+1} and v_{k+1} are completely independent, and
can be carried out in parallel, as in the sketch below. This observation and its practical realization
have led to successful applications of dual decomposition in several fields.
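A sketch (ours) with f(u) = ½‖u − u0‖² and g(v) = ½‖v − v0‖², for which both argmins of (3.3.2) are explicit; the two updates could run on different processors:

import numpy as np

def dual_decomposition(A, B, c, u0, v0, alpha=0.05, n_iters=500):
    """Sketch of iteration (3.3.2). Since L(u, v, y) is separable in
    (u, v), the two argmins are independent and can run in parallel."""
    y = np.zeros(c.size)
    for _ in range(n_iters):
        u = u0 - A.T @ y                        # argmin_u L(u, v_k, y_k)
        v = v0 - B.T @ y                        # argmin_v L(u_k, v, y_k)
        y = y + alpha * (A @ u + B @ v - c)     # dual ascent step
    return u, v, y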
3.3.2 ADMM
The Alternating Direction Method of Multipliers, or ADMM, is an increasingly popular varia-
tion on the augmented Lagrangian paradigm that bears some connection with coordinate descent
approaches, in that it splits the problem into two sets of variables.
Recall problem (3.3.1) above. For any λ > 0, the augmented Lagrangian of problem (3.3.1) has
the form
L_a(u, v, y; λ) = f(u) + g(v) + y^T (Au + Bv − c) + (λ/2) ‖Au + Bv − c‖².
The ADMM iteration exploits the structure of the problem by updating u and v separately, in an
alternating fashion. Starting from (u_k, v_k, y_k), the ADMM counterpart to iteration (3.2.3) is
u_{k+1} ∈ argmin_{u∈Rd1} L_a(u, v_k, y_k; λ)
v_{k+1} ∈ argmin_{v∈Rd2} L_a(u_{k+1}, v, y_k; λ) (3.3.3)
y_{k+1} = y_k + λ (Au_{k+1} + Bv_{k+1} − c).
The first two updates of the iteration (3.3.3) cannot be run in parallel, but the philosophy is slightly
different from that of the dual decomposition method. Indeed, in ADMM, it is common that solving
for a subset of the variables is much easier than solving for all variables at once (see our example
in the next section). The ADMM framework makes it possible to exploit this property within an
augmented Lagrangian method, as the sketch below illustrates on a classical problem.
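A classical instantiation (our sketch, in the spirit of [1]): the lasso problem min_u ½‖Mu − y‖² + τ‖u‖₁, rewritten with f(u) = ½‖Mu − y‖², g(v) = τ‖v‖₁ and the constraint u − v = 0 (so A = I, B = −I, c = 0 in (3.3.1)); both subproblems of (3.3.3) are then cheap:

import numpy as np

def admm_lasso(M, y, tau, lam=1.0, n_iters=100):
    """Sketch of iteration (3.3.3) for the lasso splitting above."""
    n, d = M.shape
    v = np.zeros(d)
    w = np.zeros(d)                      # multiplier (y_k in (3.3.3))
    H = M.T @ M + lam * np.eye(d)        # could be factored once for speed
    for _ in range(n_iters):
        u = np.linalg.solve(H, M.T @ y - w + lam * v)           # u-update
        t = u + w / lam
        v = np.sign(t) * np.maximum(np.abs(t) - tau / lam, 0)   # soft-threshold
        w = w + lam * (u - v)                                   # multiplier update
    return v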
Remark 3.3.1 The idea of splitting the objective and the constraints across two groups of variables
can be extended to as many groups of variables as needed, depending on the structure of the
problem.
To end this section, we briefly mention that there exist convergence results for ADMM-type
frameworks, typically under convexity assumptions on the problem [1]. A typical result consists of
showing that
‖Au_k + Bv_k − c‖ → 0,
f(u_k) + g(v_k) → min_{u,v} { f(u) + g(v) : Au + Bv = c },
y_k → y∗,
where y∗ is a solution of the dual problem.
Generalization The idea behind the formulation (3.4.1) can be extended to the case of data spread
over a network, represented by a graph G = (V, E): every vertex s ∈ V of the graph represents an
agent, while every edge (s, s′) ∈ E represents a channel of communication between two agents in the
graph. Letting x^(s) ∈ Rd and f^(s) : Rd → R represent the parameter copy and objective function
for agent s ∈ V, respectively, the consensus optimization problem can be written as:
minimize_{ {x^(s)}_{s∈V} ∈ (Rd)^{|V|} } ∑_{s∈V} f^(s)(x^(s))
subject to x^(s) = x^(s′) ∀(s, s′) ∈ E. (3.4.2)
When the graph is fully connected, i.e. all agents communicate, this problem reduces to an uncon-
strained problem. In general, however, the solutions of this problem are much more difficult to identify,
and one must balance minimizing the objective with satisfying the so-called consensus constraints.
A related approach, the decentralized gradient method, applies to the minimization of a finite sum
∑_{i=1}^{m} f_i(x) of agents' objectives, without consensus constraints but with an implicit graph
structure (V, E) connecting the m agents.
Given a matrix W ∈ Rm×m that is doubly stochastic (i.e. with nonnegative coefficients such that
every row and every column sums to 1) and satisfies [W]_{ij} ≠ 0 if and only if i = j or (i, j) ∈ E,
the k-th iteration of the decentralized gradient method at agent i reads
x^(i)_{k+1} = ∑_{j=1}^{m} [W]_{ij} x^(j)_k − α_k ∇f_i(x^(i)_k). (3.4.3)
The iteration (3.4.3) thus combines a gradient step for agent i together with a so-called mixing step
(or consensus step) in which the current value of the iterate for agent i is combined with that of its
neighbors. This framework is steadily gaining popularity in the machine learning community.
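A small sketch (ours, not from the original notes) simulating all agents in a single loop; grads is a list of per-agent gradient functions, W a doubly stochastic mixing matrix respecting the graph, and row i of X holds agent i's copy x^(i):

import numpy as np

def decentralized_gradient(grads, W, X0, alpha, n_iters):
    """Sketch of iteration (3.4.3): a mixing step with the matrix W
    followed by a local gradient step at each agent."""
    X = np.array(X0, dtype=float)
    m = len(grads)
    for _ in range(n_iters):
        G = np.stack([grads[i](X[i]) for i in range(m)])   # local gradients
        X = W @ X - alpha * G   # row i: sum_j W_ij x^(j) - alpha * grad f_i
    return X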
[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends in Machine
Learning, 3:1–122, 2010.
[3] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge University Press, 2022.