
Lecture notes on large-scale and distributed optimization

Clément W. Royer

M2 IASD & M2 MASH - 2022/2023


• The latest version of these notes can be found at:
https://www.lamsade.dauphine.fr/~croyer/ensdocs/LSD/LectureNotesOML-LSD.pdf.

• Comments, typos, etc, can be sent to [email protected].

• Major updates of the document

– 2022.12.11: Full version of the notes released.


– 2022.11.27: First version of the notes.

Foreword

The purpose of these lectures is to review optimization frameworks that allow one to deploy the methods
seen in other parts of the course at scale. The learning goals are as follows:

• Understand the main principle behind coordinate descent methods, and their interest in a
distributed environment.

• Apply duality to design algorithms for constrained and distributed optimization.

• Propose algorithms tailored to a decentralized optimization setting.


Notations
• Scalars (i.e. reals) are denoted by lowercase letters: a, b, c, α, β, γ.

• Vectors are denoted by bold lowercase letters: a, b, c, α, β, γ.

• Matrices are denoted by bold uppercase letters: A, B, C.

• Sets are denoted by bold uppercase cursive letters: A, B, C.

• The set of natural numbers (nonnegative integers) is denoted by N; the set of integers is
denoted by Z.

• The set of real numbers is denoted by R. Our notations for the subset of nonnegative real
numbers and the set of positive real numbers are R+ and R++ , respectively.

• The notation Rd is used for the set of vectors with d ∈ N real components; although we may
not explicitly indicate it in the rest of these notes, we always assume that d ≥ 1.

• A vector x ∈ Rd is thought of as a column vector, with xi ∈ R denoting its i-th coordinate in
the canonical basis of Rd. We thus write $x = [x_1 \cdots x_d]^T$ or, in compact form, $x = [x_i]_{1 \le i \le d}$.

• Given a column vector x ∈ Rd, the corresponding row vector is denoted by xT, so that
$x^T = [x_1 \cdots x_d]$ and $[x^T]^T = x$. The scalar product between two vectors in Rd is defined as
$x^T y = y^T x = \sum_{i=1}^{d} x_i y_i$.

• The Euclidean norm of a vector x ∈ Rd is defined by $\|x\| = \sqrt{x^T x}$.

• We use Rn×d to denote the set of real rectangular matrices with n rows and d columns, where
n and d will always be assumed to be at least 1. If n = d, Rd×d refers to the set of square
matrices of size d.

• We identify a matrix in Rd×1 with its corresponding column vector in Rd .

• Given a matrix A ∈ Rn×d, Aij refers to the coefficient from the i-th row and the j-th column of
A; the diagonal of A is given by the coefficients Aii. Provided this notation is not ambiguous,
we use the notations $A$, $[A_{ij}]_{1 \le i \le n,\, 1 \le j \le d}$ and $[A_{ij}]$ interchangeably.

• For every d ≥ 1, Id refers to the identity matrix in Rd×d (with 1s on the diagonal and 0s
elsewhere).
Contents

1 Introduction

2 Coordinate descent methods
  2.1 Basics of coordinate descent
    2.1.1 Algorithm
    2.1.2 Coordinate descent and stochastic gradient
    2.1.3 Separability and proximal coordinate descent
  2.2 Theoretical guarantees of coordinate descent methods
  2.3 Coordinate descent in parallel and distributed environments
  2.4 Conclusion

3 Distributed and constrained optimization
  3.1 Linear constraints and dual problem
  3.2 Dual algorithms
    3.2.1 Dual ascent
    3.2.2 Augmented Lagrangian
  3.3 Dual methods and decomposition
    3.3.1 Dual decomposition
    3.3.2 ADMM
  3.4 Decentralized optimization
  3.5 Conclusion of Chapter 3
Chapter 1

Introduction

In these lectures, we dive into an increasingly important area of focus in optimization methods for
data science. As we witness a growth in both the model complexity (i.e. the number of parameters)
and the amount of data available (i.e. the size of the dataset), standard optimization techniques
may suffer from the curse of dimensionality and their performance may deteriorate as dimensions
grow. The goal of these notes is to present some algorithmic ideas that can reduce the impact of
large dimensions, either in terms of parameters or data points.
Chapter 2 focuses on a setting in which the number of parameters to optimize over is extremely
large, and describes how coordinate descent approaches can be used in this setting. Chapter 3
considers the case in which the data itself is not available in a centralized fashion. Using tools from
duality theory, efficient algorithms that can exploit this particular problem structure can be applied
in that setting.

Chapter 2

Coordinate descent methods

In this chapter, we address the treatment of large-scale optimization problems, where the number of
parameters to be optimized over is extremely large. In general, due to the curse of dimensionality,
the difficulty of the problem increases with the dimension, simply because there are more variables to
consider. However, on structured problems such as those arising in data science, there often exists a
low-dimensional or separable structure that allows for optimization steps to be taken over a subset of
variables. This is the underlying idea of coordinate descent methods, which have regained interest
in the early 2000s due to their applicability in certain data science settings.

2.1 Basics of coordinate descent


2.1.1 Algorithm
Consider the unconstrained optimization problem

$$\min_{x \in \mathbb{R}^d} \ f(x), \qquad (2.1.1)$$

where f ∈ C1(Rd). The idea of coordinate descent methods consists in taking a gradient step with
respect to a single decision variable at every iteration. To this end, we observe that for every x ∈ Rd,
the gradient of f at x can be decomposed as
$$\nabla f(x) = \sum_{j=1}^{d} \nabla_j f(x)\, e_j,$$

where ∇j denotes the partial derivative with respect to the j-th variable of the function f (that
is, the jth coordinate of ∇f) and ej ∈ Rd is the jth coordinate vector of the canonical basis in Rd.
The coordinate descent approach replaces the full gradient by a step along a coordinate gradient, as
formalized in Algorithm 1.
The variants of coordinate descent are mainly identified by the way they select the coordinate
sequence {jk }. There exist numerous rules for choosing the coordinate index, among which:
• Cyclic: Select the indices by cycling over {1, . . . , d} in that order. After d iterations, all indices
have been selected.

• Randomized cyclic: Cycle through a random ordering of {1, . . . , d}, that changes every d steps.


Algorithm 1: Coordinate descent method.


Initialization: x0 ∈ Rd .
for k = 0, 1, ... do
1. Select a coordinate index jk ∈ {1, . . . , d}.

2. Compute a steplength αk > 0.

3. Set
$$x_{k+1} = x_k - \alpha_k \nabla_{j_k} f(x_k)\, e_{j_k}. \qquad (2.1.2)$$

end

• Randomized: Draw jk at random in {1, . . . , d} at every iteration.


The last two strategies are those for which the strongest results can be obtained.
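
To make Algorithm 1 concrete, here is a minimal sketch of the randomized variant in Python, applied to a linear least-squares objective; the function names, the quadratic objective and the fixed stepsize 1/Lmax (cf. Theorem 2.2.1 below) are illustrative choices, not part of the algorithm itself.

    import numpy as np

    def coordinate_descent(grad_j, d, x0, alpha, n_iters, seed=0):
        # Randomized coordinate descent (Algorithm 1): at each iteration,
        # draw j_k uniformly at random and take a step along coordinate j_k.
        rng = np.random.default_rng(seed)
        x = x0.copy()
        for _ in range(n_iters):
            j = rng.integers(d)                # randomized selection rule
            x[j] -= alpha * grad_j(x, j)       # x_{k+1} = x_k - alpha * grad_j f(x_k) e_j
        return x

    # Illustrative use on f(x) = 0.5 ||Ax - y||^2, where grad_j f(x) = A[:, j]^T (Ax - y)
    rng = np.random.default_rng(42)
    A, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
    Lmax = max(np.sum(A**2, axis=0))           # coordinate-wise constants L_j = ||A[:, j]||^2
    x = coordinate_descent(lambda x, j: A[:, j] @ (A @ x - y),
                           10, np.zeros(10), 1.0 / Lmax, 5000)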

Block coordinate descent Rather than using a single index, it is possible to select a subset of the
coordinates (called a “block” in the literature). The kth iteration of such a block coordinate descent
algorithm is
$$x_{k+1} = x_k - \alpha_k \sum_{j \in B_k} \nabla_j f(x_k)\, e_j, \qquad (2.1.3)$$
where $B_k \subset \{1, \ldots, d\}$. A block typically consists of coordinates drawn without replacement.

Remark 2.1.1 As described in the lab session, it is possible to combine randomized coordinate
descent with Nesterov’s acceleration technique to yield improved theoretical guarantees. However,
this raises implementation issues that may limit the practical interest of coordinate descent
approaches.

2.1.2 Coordinate descent and stochastic gradient


Our description of coordinate descent, and particularly the randomized variant, is reminiscent of the
stochastic gradient algorithm. In fact, randomized coordinate descent can be viewed as a special
case of stochastic gradient, in which the formula
$$\nabla f(x) = \frac{1}{d} \sum_{j=1}^{d} d\, \nabla_j f(x)\, e_j$$

is used to define a finite sum with d gradients that can be sampled using any distribution that one
would use in a stochastic gradient setting. Note that, unlike in a stochastic gradient framework, any
coordinate descent step (even randomized ones) uses a descent direction.
Consider a finite-sum problem of the form
$$\min_{x \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad f_i(x) := \ell_i(a_i^T x), \qquad (2.1.4)$$

where $a_i^T x$ is a linear model of the data vector $a_i$, and $\ell_i : \mathbb{R} \to \mathbb{R}$ is a convex loss function specific to
the ith data point (such as $\ell_i(h) = \frac{1}{2}(h - y_i)^2$ for linear least squares). If the number of data points
is large, it is natural to think of applying stochastic gradient to this problem. Another approach
consists in considering an equivalent formulation of (2.1.4) through (Fenchel) duality, given by
$$\max_{v \in \mathbb{R}^n} \ g(v) := -\frac{1}{n} \sum_{i=1}^{n} f_i^*(v_i), \qquad (2.1.5)$$
where for any convex function $\phi : \mathbb{R}^m \to \mathbb{R}$, the convex conjugate function $\phi^*$ is defined by
$$\phi^*(a) = \sup_{b \in \mathbb{R}^m} \left\{ a^T b - \phi(b) \right\}.$$

The so-called dual problem (2.1.5) has a finite-sum, separable form. It can thus be tackled using
(dual) coordinate ascent, the counterpart of coordinate descent for maximization: the iteration of
this method is given by
$$v_{k+1} = v_k + \alpha_k \nabla_{i_k} g(v_k)\, e_{i_k}, \qquad (2.1.6)$$
leading to updating the iterate one coordinate at a time. Under appropriate assumptions on the
problem, the iteration (2.1.6) is equivalent to the original stochastic gradient iteration, with
$x_k = \frac{1}{\lambda n} \sum_{i=1}^{n} [v_k]_i\, a_i$. For this reason, stochastic gradient is sometimes viewed as applying coordinate
ascent to the dual problem.
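
As an illustration of this duality, the conjugate of the linear least-squares loss mentioned above can be computed in closed form (a standard calculation, included here as a worked example):
$$\ell_i(h) = \frac{1}{2}(h - y_i)^2 \quad \Longrightarrow \quad \ell_i^*(a) = \sup_{b \in \mathbb{R}} \left\{ ab - \frac{1}{2}(b - y_i)^2 \right\} = \frac{1}{2} a^2 + a y_i,$$
the supremum being attained at $b = a + y_i$. Each term of the dual objective in (2.1.5) is then a concave quadratic in a single dual variable $v_i$, so every coordinate ascent step (2.1.6) only involves one-dimensional computations.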

2.1.3 Separability and proximal coordinate descent


Although coordinate descent methods seem cheaper to apply than gradient-type methods, their cost
actually depends on that of computing partial derivatives. Indeed, if it turns out that all coordinates
of the gradient depend on all coordinates of the input vector, then there will essentially be no value
in using coordinate updates.
Fortunately, it is quite common for optimization problems to exhibit a structure amenable to
coordinate descent. This structure is rigorously defined below.

Definition 2.1.1 A function f : Rd → R is called separable if there exist d functions
f1, . . . , fd : R → R such that
$$\forall x \in \mathbb{R}^d, \quad f(x) = \sum_{j=1}^{d} f_j([x]_j).$$

Note that if f is separable and all fj's are C1, then f is also C1 and we have
$$\forall x \in \mathbb{R}^d, \ \forall j = 1, \ldots, d, \quad \nabla_j f(x) = f_j'([x]_j),$$

which means that a coordinate descent update can be computed while only accessing one coordinate.
The separability property is a strong one: it essentially states that the function can be
minimized independently with respect to each of its variables. It is actually more frequent for a
function to be partially separable in the following sense.

Definition 2.1.2 A function f : Rd → R is called partially separable if there exist J functions
{fj}Jj=1 and J subsets of {1, . . . , d} denoted by B1, . . . , BJ such that

i) For any x ∈ Rd,
$$f(x) = \sum_{j=1}^{J} f_j\left([x]_{B_j}\right),$$
where $[x]_{B_j} \in \mathbb{R}^{|B_j|}$ denotes the vector formed by the coordinates of x belonging to Bj;

ii) $\bigcup_{j=1}^{J} B_j = \{1, \ldots, d\}$.

The notion of partial separability is only interesting whenever the different sets $\{B_j\}_{j=1}^{J}$ do not
intersect too much, the ideal case being when they form a partition of {1, . . . , d}.

Example 2.1.1 (Regularized least squares with sparse data) Given A ∈ Rn×d with sparse
columns a1, . . . , ad and y ∈ Rn, consider the problem
$$\min_{x \in \mathbb{R}^d} \ f(x) := \frac{1}{2n} \|Ax - y\|^2 + \lambda \sum_{j=1}^{d} [x]_j^2.$$
Apply Algorithm 1 to this problem. For any iteration k, if we move along the jk th coordinate, the
partial derivative under consideration is
$$\nabla_{j_k} f(x_k) = \frac{1}{n} a_{j_k}^T (A x_k - y) + 2\lambda\, [x_k]_{j_k}.$$
By storing and updating the residual vector Axk across all iterations, the calculation of $\nabla_{j_k} f(x_k)$
can be greatly reduced when $a_{j_k}$ is sparse, to the point that the cost of a coordinate descent iteration
is of the order of the number of nonzero elements in $a_{j_k}$.
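
A sketch of this residual-caching trick, assuming scipy's CSC sparse format so that each column $a_j$ can be accessed in time proportional to its number of nonzeros (the function and parameter names are illustrative):

    import numpy as np
    import scipy.sparse as sp

    def ridge_cd(A, y, lam, alpha, n_iters, seed=0):
        # Randomized coordinate descent on f(x) = ||Ax - y||^2/(2n) + lam * sum_j x_j^2,
        # caching the residual r = Ax - y so each step costs O(nnz(a_j)).
        rng = np.random.default_rng(seed)
        A = sp.csc_matrix(A)                  # CSC format: cheap access to columns
        n, d = A.shape
        x, r = np.zeros(d), -y.astype(float)  # r = Ax - y with x = 0
        for _ in range(n_iters):
            j = rng.integers(d)
            rows = A.indices[A.indptr[j]:A.indptr[j + 1]]  # nonzero rows of column a_j
            vals = A.data[A.indptr[j]:A.indptr[j + 1]]
            g = vals @ r[rows] / n + 2 * lam * x[j]        # nabla_j f(x_k)
            x[j] -= alpha * g
            r[rows] -= alpha * g * vals                    # keep r = Ax - y in sync
        return x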

More broadly, any regularized problem of the form
$$\min_{x \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \tilde{f}_i(a_i^T x) + \sum_{j=1}^{d} \Omega([x]_j), \qquad (2.1.7)$$
where $\tilde{f}_i : \mathbb{R} \to \mathbb{R}$ is (possibly) data-dependent, $a_i \in \mathbb{R}^d$ is a sparse data vector, and $\Omega : \mathbb{R} \to \mathbb{R}$ is
a regularization function applied componentwise to the vector x, is partially separable.

Proximal coordinate descent Consider a proximal gradient subproblem arising from minimizing
f (x) + λΩ(x) where Ω is partially separable. The subproblem at iteration k will be of the form

$$\min_{x \in \mathbb{R}^d} \ f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2\alpha_k} \|x - x_k\|^2 + \lambda \Omega(x).$$

Both the linear term and the proximal term are separable functions, since they can be written as
sums of one-dimensional functions. Since Ω is partially separable by assumption, this subproblem
can be tackled by coordinate descent techniques, and this is one popular way of solving numerous
proximal subproblems.
Another possibility consists in replacing the proximal gradient step by a proximal coordinate
descent step. For simplicity, we describe the procedure in the context of the separable regularization
function $\Omega(x) = \|x\|_1$. At iteration k, the next iterate is computed by first drawing a coordinate
index jk, then performing the update
$$z_k \in \operatorname{argmin}_{z \in \mathbb{R}} \ \nabla_{j_k} f(x_k)(z - [x_k]_{j_k}) + \frac{1}{2\alpha_k} (z - [x_k]_{j_k})^2 + \lambda |z|.$$
This problem has a closed-form solution (given by the soft-thresholding operator), and thus the next
iterate can be computed as
$$x_{k+1} = x_k + (z_k - [x_k]_{j_k})\, e_{j_k}.$$
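
Below is a sketch of this proximal coordinate update, assuming the soft-thresholding closed form for the one-dimensional subproblem above (an exercise-style illustration rather than a definitive implementation):

    import numpy as np

    def soft_threshold(t, tau):
        # Closed-form solution of min_z 0.5*(z - t)^2 + tau*|z|.
        return np.sign(t) * max(abs(t) - tau, 0.0)

    def prox_cd_step(x, j, grad_j, alpha, lam):
        # One proximal coordinate descent step for f(x) + lam*||x||_1.
        # grad_j is the partial derivative nabla_j f(x_k); the minimizer of the
        # one-dimensional subproblem is z_k = soft_threshold([x]_j - alpha*grad_j, alpha*lam).
        z = soft_threshold(x[j] - alpha * grad_j, alpha * lam)
        x_new = x.copy()
        x_new[j] = z          # x_{k+1} = x_k + (z_k - [x_k]_j) e_j
        return x_new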

2.2 Theoretical guarantees of coordinate descent methods


A famous 3-dimensional example designed by M. J. D. Powell in 1973 shows that coordinate descent
methods do not necessarily converge. This example partly explains why coordinate descent methods
fell out of favor in optimization for several decades.
Example 2.2.1 Consider the function f : R3 → R defined by
$$f(x_1, x_2, x_3) = -(x_1 x_2 + x_2 x_3 + x_1 x_3) + \sum_{j=1}^{3} \max\{|x_j| - 1, 0\}^2.$$
This function is nonconvex and C1. If we apply cyclic coordinate descent starting from (1, −1, 1) and
using the best stepsize possible (i.e. the one that minimizes the function in the chosen direction) at
every iteration, the method will never converge to a minimum.
Despite this negative result, it is possible to provide guarantees on coordinate descent methods
under appropriate assumptions. In particular, a linear rate of convergence can be obtained for
coordinate descent methods on strongly convex problems: we provide below the necessary assumptions
to arrive at such a result.
Assumption 2.2.1 The objective function f in (2.1.1) is C1 and µ-strongly convex, with
$f^* = \min_{x \in \mathbb{R}^d} f(x)$. Moreover, for every j = 1, . . . , d, the partial derivative ∇j f is Lj-Lipschitz
continuous, i.e.
$$\forall x \in \mathbb{R}^d, \ \forall h \in \mathbb{R}, \quad |\nabla_j f(x + h e_j) - \nabla_j f(x)| \le L_j |h|. \qquad (2.2.1)$$
We let $L_{\max} = \max_{1 \le j \le d} L_j$.
Theorem 2.2.1 Suppose that Assumption 2.2.1 holds, and that Algorithm 1 is applied to
problem (2.1.1) with $\alpha_k = \frac{1}{L_{\max}}$ for all k and jk being drawn uniformly at random in {1, . . . , d}.
Then, for any K ∈ N, we have
$$\mathbb{E}\left[f(x_K) - f^*\right] \le \left(1 - \frac{\mu}{d\, L_{\max}}\right)^{K} \left(f(x_0) - f^*\right). \qquad (2.2.2)$$
Suppose that $f \in C_L^{1,1}$. Then $L \le d\, L_{\max}$, and a gradient descent iteration with stepsize $\frac{1}{L}$
satisfies
$$f(x_K) - f^* \le \left(1 - \frac{\mu}{L}\right)^{K} \left(f(x_0) - f^*\right)$$
under the same assumptions as those of Theorem 2.2.1. This result is better than (2.2.2) in terms
of iterations; however, the cost of one iteration of gradient descent can be d times higher than that
of a coordinate descent iteration. Similarly to the reasoning we did for stochastic gradient methods,
convergence rates must be thought of in terms of relevant cost metrics.
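
To make this comparison concrete, one can compare the decrease obtained for the same amount of work, namely d coordinate descent steps against one full gradient step (a back-of-the-envelope calculation based on the two rates above):
$$\left(1 - \frac{\mu}{d L_{\max}}\right)^{d} \approx \exp\left(-\frac{\mu}{L_{\max}}\right) \qquad \text{versus} \qquad 1 - \frac{\mu}{L} \approx \exp\left(-\frac{\mu}{L}\right).$$
Since $L_{\max} \le L \le d\, L_{\max}$, the two factors match when $L \approx L_{\max}$, while coordinate descent can be up to d times faster per unit of work when $L \approx d\, L_{\max}$.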

Remark 2.2.1 Other results have been established in the convex and nonconvex settings, under
additional assumptions. In all cases, properties on the partial derivatives are required.

We end this section with a result for cyclic coordinate descent, which illustrates that this method
has worse guarantees than its randomized counterpart in general.

Theorem 2.2.2 Suppose that Assumption 2.2.1 holds, that $f \in C_L^{1,1}$ and that Algorithm 1 is applied
to problem (2.1.1) with $\alpha_k = \frac{1}{L_{\max}}$ for all k and jk being drawn in a cyclic fashion in {1, . . . , d}.
Then, for any K ∈ N, we have
$$f(x_K) - f^* \le \left(1 - \frac{\mu}{2 L_{\max}\left(1 + d\, L^2 / L_{\max}^2\right)}\right)^{K/d} \left(f(x_0) - f^*\right). \qquad (2.2.3)$$

2.3 Coordinate descent in parallel and distributed environments


Coordinate descent techniques are quite prominent in parallel optimization algorithms. In this setting,
several cores cooperate to solve problem (2.1.1): each core can then run its own coordinate
descent method and all cores update the same shared iterate vector. The most efficient parallel
coordinate descent techniques perform these iterations in an asynchronous fashion, which does not
prevent convergence guarantees for this framework.

Synchronous coordinate descent A typical setup for coordinate descent is that of several proces-
sors sharing a memory. In very high dimensions, the iterate xk will be stored on this memory, and
processors can only access (blocks of) coordinates to perform updates. A synchronous coordinate
descent approach then consists in assigning coordinates (or blocks thereof) to various processors, so
that each of them performs an update of the form

$$[x_k]_j \leftarrow [x_k]_j - \alpha_k \nabla_j f(x_k).$$

A synchronization step guarantees that all updates are accounted for prior to moving to iteration
k + 1. This step is essential if the function is not separable, but may result in significant idle time
for processors as they wait for all updates to be completed.

Asynchronous coordinate descent In an asynchronous framework, every processor actually per-


forms a run of coordinate descent, almost independently from the other processors. As a result, a
processor may reason on an iterate that was read from memory then updated by other processors.
The resulting method is described in Algorithm 2.
Algorithm 2: Asynchronous coordinate descent method.

Initialization: x0 ∈ Rd, shared iteration counter k = 0.
for all processors do
1. Select a coordinate index jk ∈ {1, . . . , d}.

2. Compute a steplength αk > 0.

3. Set
$$x_{k+1} = x_k - \alpha_k \nabla_{j_k} f(\hat{x}_k)\, e_{j_k}, \qquad (2.3.1)$$
where x̂k concatenates the last values of the coordinates available to the processor.

4. Update k = k + 1.

end

In general, one cannot obtain a convergence rate for this method; however, asymptotic convergence
to a solution can be established under certain conditions.

Theorem 2.3.1 Suppose that f is convex and C1 with a unique minimizer x*. Suppose further that
Algorithm 2 is run under the following assumptions:

i) Every coordinate of xk is updated infinitely often;

ii) For every processor, every coordinate of x̂k is updated infinitely often;

iii) αk = α > 0, where α satisfies
$$\forall x \in \mathbb{R}^d, \quad \|x - \alpha \nabla f(x) - x^*\|_\infty \le \eta \|x - x^*\|_\infty$$
for some η ∈ (0, 1).

Then, the sequence of iterates {xk} of Algorithm 2 converges to x* as k → ∞.

Remark 2.3.1 The asynchronous coordinate descent paradigm was used in conjunction with stochastic
gradient in the HOGWILD! algorithm, which won the NeurIPS test of time award in 2020.
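
For illustration, below is a minimal HOGWILD-style toy in Python: several threads share the iterate and update coordinates without any locking, each step reading a possibly stale copy x̂ of the iterate. This is only a sketch of the lock-free principle on an assumed least-squares objective, not the actual HOGWILD! implementation.

    import threading
    import numpy as np

    # Shared data: f(x) = 0.5 * ||Ax - y||^2 (illustrative choice)
    rng = np.random.default_rng(0)
    A, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
    x = np.zeros(20)                       # shared iterate, updated without locks
    alpha = 1.0 / max(np.sum(A**2, axis=0))

    def worker(n_steps, seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(n_steps):
            j = local_rng.integers(20)
            x_hat = x.copy()               # read of the shared iterate (may be stale)
            g = A[:, j] @ (A @ x_hat - y)  # nabla_j f(x_hat)
            x[j] -= alpha * g              # lock-free coordinate write

    threads = [threading.Thread(target=worker, args=(2000, s)) for s in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("final objective:", 0.5 * np.linalg.norm(A @ x - y)**2)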

2.4 Conclusion
Large-scale problems have always pushed optimization algorithms to their limits, and have led to
reconsidering certain algorithms in light of their applicability to large-scale settings. Coordinate
descent methods are a perfect example of classical techniques that regained popularity because
of their efficiency in data science settings. On some instances, randomized coordinate descent
techniques bear a close connection with stochastic gradient methods. More broadly, coordinate
descent methods are quite efficient on large-dimensional problems that have a separable structure.
Finally, the use of coordinate descent methods in parallel environments has also contributed to their
revival in optimization.
Chapter 3

Distributed and constrained optimization

In this chapter, we describe the theoretical insights behind distributed optimization formulations, in
which several agents collaborate to solve an optimization problem. This paradigm can be modeled
using a linearly constrained optimization formulation, and handling such formulations requires dedi-
cated algorithms. We first set the mathematical foundations of these methods via a brief introduction
to duality, then present our algorithms of interest.

3.1 Linear constraints and dual problem


Consider the following optimization problem with linear equality constraints:

$$\min_{x \in \mathbb{R}^d} \ f(x) \quad \text{subject to} \quad Ax = b, \qquad (3.1.1)$$

where A ∈ Rm×d and b ∈ Rm . For simplicity, we will assume that the feasible set {x ∈ Rd | Ax = b}
is not empty.
Duality theory handles constrained formulations by reformulating the problem as an
unconstrained optimization problem. We present the theoretical arguments for the special case of
problem (3.1.1), which yields a much simpler analysis.

Definition 3.1.1 The Lagrangian function of problem (3.1.1) is given by
$$L(x, y) := f(x) + y^T (Ax - b). \qquad (3.1.2)$$

The Lagrangian function combines the objective function and the constraints, and allows one to
restate the original problem as an unconstrained one, called the primal problem:
$$\min_{x \in \mathbb{R}^d} \ \max_{y \in \mathbb{R}^m} \ L(x, y). \qquad (3.1.3)$$
The solutions of the primal problem (3.1.3) are identical to those of problem (3.1.1) in our case. The
difficulty of solving problem (3.1.3) lies in the definition of its objective function as the optimal value
of a maximization problem.


Definition 3.1.2 The dual problem of (3.1.1) is the maximization problem
$$\max_{y \in \mathbb{R}^m} \ \min_{x \in \mathbb{R}^d} \ L(x, y), \qquad (3.1.4)$$
where the function $y \mapsto \min_{x \in \mathbb{R}^d} L(x, y)$ is called the dual function of the problem.

Unlike the primal problem, the dual problem is always concave (i.e. the opposite of the dual
function is convex), which facilitates its resolution by standard optimization techniques. The goal is
then to solve the dual problem in order to get the solution of the primal problem, thanks to properties
such as the one below.

Assumption 3.1.1 We suppose that strong duality holds between problem (3.1.1) and its dual,
that is,
$$\min_{x \in \mathbb{R}^d} \ \max_{y \in \mathbb{R}^m} \ L(x, y) = \max_{y \in \mathbb{R}^m} \ \min_{x \in \mathbb{R}^d} \ L(x, y).$$

A sufficient condition for Assumption 3.1.1 is that f be convex, but this is not necessary.

3.2 Dual algorithms


We are now concerned with solving the dual problem (3.1.4), and we will present three methods for
this purpose.

3.2.1 Dual ascent


The dual ascent method is implicitly a subgradient method applied to the dual problem (which we
recall is a maximization problem). At every iteration, it starts from a primal-dual pair (xk, yk)
and performs the following iteration:
$$\begin{cases} x_{k+1} \in \operatorname{argmin}_{x \in \mathbb{R}^d} L(x, y_k) \\ y_{k+1} = y_k + \alpha_k (A x_{k+1} - b), \end{cases} \qquad (3.2.1)$$
where αk > 0 is a stepsize for the dual ascent step, and $A x_{k+1} - b$ is a subgradient of the dual
function $y \mapsto \min_{x \in \mathbb{R}^d} L(x, y)$ at yk.
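
As an illustration, a minimal sketch of dual ascent for an equality-constrained quadratic program, where the inner minimization has a closed form; the quadratic data P, q and the stepsize are illustrative assumptions.

    import numpy as np

    def dual_ascent(P, q, A, b, alpha, n_iters):
        # Dual ascent (3.2.1) for min 0.5 x^T P x + q^T x s.t. Ax = b, P positive definite.
        # Here argmin_x L(x, y) = -P^{-1}(q + A^T y) is available in closed form.
        y = np.zeros(A.shape[0])
        for _ in range(n_iters):
            x = np.linalg.solve(P, -(q + A.T @ y))   # primal minimization step
            y = y + alpha * (A @ x - b)              # (sub)gradient ascent on the dual
        return x, y

    # Illustrative data
    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    P = M @ M.T + 5 * np.eye(5)                      # positive definite
    q, A, b = rng.standard_normal(5), rng.standard_normal((2, 5)), rng.standard_normal(2)
    x, y = dual_ascent(P, q, A, b, alpha=0.1, n_iters=500)
    print("constraint violation:", np.linalg.norm(A @ x - b))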

3.2.2 Augmented Lagrangian


The dual ascent method generally has weak convergence guarantees. For this reason, the optimization
literature has introduced other frameworks based on a regularized version of the Lagrangian function.

Definition 3.2.1 The augmented Lagrangian of problem (3.1.1) is the function defined on
Rd × Rm × R++ by
$$L_a(x, y; \lambda) := f(x) + y^T (Ax - b) + \frac{\lambda}{2} \|Ax - b\|^2. \qquad (3.2.2)$$

Augmented Lagrangians thus are a family of functions parameterized by λ > 0, that put more
emphasis on the constraint violation as λ grows.

The augmented Lagrangian algorithm, also called the method of multipliers, performs the following
iteration:
$$\begin{cases} x_{k+1} \in \operatorname{argmin}_{x \in \mathbb{R}^d} L_a(x, y_k; \lambda) \\ y_{k+1} = y_k + \lambda (A x_{k+1} - b). \end{cases} \qquad (3.2.3)$$
In this algorithm, λ is constant and used as a constant stepsize; many more sophisticated choices
of both the augmented Lagrangian function and the stepsizes have been proposed. In general, the
advantages of augmented Lagrangian techniques are that the subproblems defining xk+1 become
easier to solve (thanks to regularization) and that the overall guarantees on the primal-dual pair are
stronger.
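
Continuing the same illustrative quadratic example, here is a sketch of the method of multipliers: compared to dual ascent, only the x-update changes, since the minimization is now over the augmented Lagrangian.

    import numpy as np

    def method_of_multipliers(P, q, A, b, lam, n_iters):
        # Method of multipliers (3.2.3) for min 0.5 x^T P x + q^T x s.t. Ax = b.
        # Setting the gradient of L_a(., y; lam) to zero gives the linear system
        # (P + lam * A^T A) x = -q - A^T y + lam * A^T b.
        y = np.zeros(A.shape[0])
        H = P + lam * A.T @ A                 # Hessian of the augmented Lagrangian
        for _ in range(n_iters):
            x = np.linalg.solve(H, -q - A.T @ y + lam * A.T @ b)
            y = y + lam * (A @ x - b)         # multiplier update with stepsize lam
        return x, y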

3.3 Dual methods and decomposition


A key idea in modern optimization, that has resulted in numerous theoretical and numerical improve-
ments, consists in exploiting the structure of a given problem as much as possible. The underlying
idea is that a decomposition of a large, complex problem can lead to many smaller and simpler
(sub)problems that will be easier and cheaper to solve than the original one. We describe below how
this idea can be carried out in the context of dual algorithms.

3.3.1 Dual decomposition


Suppose that we consider a linearly constrained problem with a separable form:

minimizeu∈Rd1 ,v∈Rd2 f (u) + g(v)
(3.3.1)
subject to Au + Bv = c,

where A ∈ Rd1 ×m , B ∈ Rd2 ×m and c ∈ Rm . Given the particular structure (sometimes called
splitting) between the variables u and v, one may want to update those variables separately rather
than gathering them into a single vector x and performing a single update.
This idea is precisely that of dual decomposition. At iteration k, the dual decomposition method
applied to problem (3.3.1) computes
$$\begin{cases} u_{k+1} \in \operatorname{argmin}_{u \in \mathbb{R}^{d_1}} L(u, v_k, y_k) \\ v_{k+1} \in \operatorname{argmin}_{v \in \mathbb{R}^{d_2}} L(u_k, v, y_k) \\ y_{k+1} = y_k + \alpha_k (A u_{k+1} + B v_{k+1} - c), \end{cases} \qquad (3.3.2)$$
where αk > 0. Interestingly, the calculations for uk+1 and vk+1 are completely independent, and
can be carried out in parallel. This observation and its practical realization have led to successful
applications of dual decomposition in several fields.

3.3.2 ADMM
The Alternating Direction Method of Multipliers, or ADMM, is an increasingly popular variation
on the augmented Lagrangian paradigm that bears some connection with coordinate descent
approaches, in that it splits the problem into two sets of variables.
Recall problem (3.3.1) above. For any λ > 0, the augmented Lagrangian of problem (3.3.1) has
the form
$$L_a(u, v, y; \lambda) = f(u) + g(v) + y^T (Au + Bv - c) + \frac{\lambda}{2} \|Au + Bv - c\|^2.$$
The ADMM iteration exploits the separable nature of the problem by updating the variables u and v
alternately. Starting from (uk, vk, yk), the ADMM counterpart to iteration (3.2.3) is
$$\begin{cases} u_{k+1} \in \operatorname{argmin}_{u \in \mathbb{R}^{d_1}} L_a(u, v_k, y_k; \lambda) \\ v_{k+1} \in \operatorname{argmin}_{v \in \mathbb{R}^{d_2}} L_a(u_{k+1}, v, y_k; \lambda) \\ y_{k+1} = y_k + \lambda (A u_{k+1} + B v_{k+1} - c). \end{cases} \qquad (3.3.3)$$

The first two updates of the iteration (3.3.3) cannot be run in parallel, but the philosophy is slightly
different from that of the dual decomposition method. Indeed, in ADMM, it is common that solving
for a subset of the variables will be much easier than solving for all variables at once (see our example
in the next section). The ADMM framework gives the possibility to exploit this property within an
augmented Lagrangian method.

Remark 3.3.1 The idea of splitting the objective and the constraints across two groups of variables
can be generalized to any number of groups of variables, depending on the structure of the
problem.
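
As an illustration, here is a sketch of ADMM for the lasso problem $\min_x \frac{1}{2}\|Dx - y\|^2 + \tau \|x\|_1$, using the splitting u = x, v = z with constraint x − z = 0 (i.e. A = I, B = −I, c = 0 in (3.3.1)); the data names are illustrative, and the z-update reuses the soft-thresholding formula from Chapter 2.

    import numpy as np

    def soft_threshold(t, tau):
        return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

    def admm_lasso(D, y, tau, lam, n_iters):
        # ADMM (3.3.3) on min 0.5||Dx - y||^2 + tau*||z||_1 s.t. x - z = 0.
        #   x-update: (D^T D + lam*I) x = D^T y - w + lam*z   (quadratic subproblem)
        #   z-update: z = soft_threshold(x + w/lam, tau/lam)  (separable subproblem)
        #   w-update: w = w + lam*(x - z)                     (multiplier step)
        n, d = D.shape
        x, z, w = np.zeros(d), np.zeros(d), np.zeros(d)
        H = D.T @ D + lam * np.eye(d)
        for _ in range(n_iters):
            x = np.linalg.solve(H, D.T @ y - w + lam * z)
            z = soft_threshold(x + w / lam, tau / lam)
            w = w + lam * (x - z)
        return z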

To end this section, we briefly mention that there exist convergence results for ADMM-type
frameworks, typically under convexity assumptions on the problem [1]. A typical result consists in
showing that
$$\|A u_k + B v_k - c\| \to 0, \qquad f(u_k) + g(v_k) \to \min_{u,v}\ f(u) + g(v), \qquad y_k \to y^*,$$
where y* is a solution of the dual problem.

3.4 Decentralized optimization


We end this chapter by describing an increasingly common setup in optimization over large datasets,
often termed consensus optimization or decentralized optimization. In this setup, we consider a
dataset that is split across m entities called agents. Every agent uses its own data to train a certain
learning model parameterized by a vector in Rd . To this end, each agent not only has its own
function f (i) , but also its own copy of the model parameters x(i) . The optimization problem at hand
considers a master iterate x, and attempts to reach consensus between all the agents. This leads to
the following formulation:
$$\begin{array}{ll} \displaystyle \min_{x,\, x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^d} & \displaystyle \sum_{i=1}^{m} f^{(i)}(x^{(i)}) \\ \text{subject to} & x = x^{(i)} \quad \forall i = 1, \ldots, m. \end{array} \qquad (3.4.1)$$

This problem is a proxy for $\min_{x \in \mathbb{R}^d} \sum_{i=1}^{m} f^{(i)}(x)$, but the latter problem cannot be solved by
a single agent since every agent has exclusive access to its data by design. The formulation (3.4.1)
models the fact that all agents are involved in computing x by acting on x(i). It is possible to apply
ADMM to problem (3.4.1) by setting
$$u = \begin{bmatrix} x^{(1)} \\ \vdots \\ x^{(m)} \end{bmatrix} \in \mathbb{R}^{md}, \qquad v = x \in \mathbb{R}^d.$$

Generalization The idea behind the formulation (3.4.1) can be extended to the case of data spread
over a network, represented by a graph G = (V, E): every vertex s ∈ V of the graph represents an
agent, while every edge (s, s′) ∈ E represents a channel of communication between two agents in the
graph. Letting x(s) ∈ Rd and f(s) : Rd → R represent the parameter copy and objective function
for agent s ∈ V, respectively, the consensus optimization problem can be written as:
$$\begin{array}{ll} \displaystyle \min_{\{x^{(s)}\}_{s \in \mathcal{V}} \in (\mathbb{R}^d)^{|\mathcal{V}|}} & \displaystyle \sum_{s \in \mathcal{V}} f^{(s)}(x^{(s)}) \\ \text{subject to} & x^{(s)} = x^{(s')} \quad \forall (s, s') \in \mathcal{E}. \end{array} \qquad (3.4.2)$$
When the graph is fully connected, i.e. all agents communicate, this problem reduces to an
unconstrained problem. However, in general, the solutions of this problem are much more difficult to
identify, and one must balance minimizing the objective with satisfying the so-called consensus
constraints.

Decentralized gradient methods To wrap up this chapter, we describe an increasingly popular


class of algorithms that extends gradient descent to the decentralized setting. The decentralized
gradient framework is designed for problems of the form
$$\min_{x^{(1)}, \ldots, x^{(m)} \in \mathbb{R}^d} \ \sum_{i=1}^{m} f_i(x^{(i)}),$$

without consensus constraints but with an implicit graph structure (V, E) connecting the agents.
Given a matrix W ∈ Rm×m that is doubly stochastic (i.e. with nonnegative coefficients such that
every row and every column sums to 1) and satisfies [W]ij ≠ 0 if and only if i = j or (i, j) ∈ E,
the kth iteration of the decentralized gradient method at agent i reads
$$x_{k+1}^{(i)} = \sum_{j=1}^{m} [W]_{ij}\, x_k^{(j)} - \alpha_k \nabla f_i(x_k^{(i)}). \qquad (3.4.3)$$

The iteration (3.4.3) thus combines a gradient step for agent i together with a so-called mixing step
(or consensus step) in which the current value of the iterate for agent i is combined with that of its
neighbors. This framework is steadily gaining popularity in the machine learning community.
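
Below is a minimal simulation of iteration (3.4.3) for m agents on a ring graph, with a standard doubly stochastic mixing matrix that averages each agent with its two neighbors; the quadratic local objectives are an illustrative assumption.

    import numpy as np

    m, d = 8, 5
    rng = np.random.default_rng(0)
    # Illustrative local objectives f_i(x) = 0.5 ||x - c_i||^2, whose common
    # minimizer is the average of the targets c_i.
    C = rng.standard_normal((m, d))

    # Doubly stochastic mixing matrix for a ring graph: each agent averages
    # itself with its two neighbors (weights 1/2, 1/4, 1/4).
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 0.5
        W[i, (i - 1) % m] = 0.25
        W[i, (i + 1) % m] = 0.25

    X = np.zeros((m, d))              # row i holds agent i's copy x^{(i)}
    alpha = 0.1
    for k in range(500):
        grads = X - C                 # gradient of f_i at x^{(i)}
        X = W @ X - alpha * grads     # mixing step + local gradient step (3.4.3)

    print("consensus error:", np.max(np.abs(X - X.mean(axis=0))))
    print("distance to minimizer:", np.linalg.norm(X.mean(axis=0) - C.mean(axis=0)))

With a constant stepsize, the agents reach approximate consensus in a neighborhood of the common minimizer; diminishing stepsizes are typically used to obtain exact convergence.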

3.5 Conclusion of Chapter 3


In modern data science tasks, the amount of data available requires distributed storage, and possibly
agents cooperating in order to solve the optimization problem at hand. Linearly constrained formula-
tions can capture this behavior, and such constraints can be handled in an efficient manner using dual
variables. Augmented Lagrangian techniques are among the most popular methods in this category,
and these methods can be further specialized to account for structure in the problem. The ADMM
framework has emerged as one of the most interesting formulations used to split the calculation
into (presumably) cheaper subproblems. It is also very well suited for operating in a distributed or
decentralized environment.
Bibliography

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends in Machine
Learning, 3:1–122, 2010.

[2] S. J. Wright. Coordinate descent algorithms. Math. Program., 151:3–34, 2015.

[3] S. J. Wright and B. Recht. Optimization for Data Analysis. Cambridge University Press, 2022.
