Wireless Networks

Distributed Optimization in Networked Systems: Algorithms and Applications
Wireless Networks
Series Editor
Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
The purpose of Springer’s Wireless Networks book series is to establish the state
of the art and set the course for future research and development in wireless
communication networks. The scope of this series includes not only all aspects
of wireless networks (including cellular networks, WiFi, sensor networks, and
vehicular networks), but related areas such as cloud computing and big data.
The series serves as a central source of references for wireless networks research
and development. It aims to publish thorough and cohesive overviews on specific
topics in wireless networks, as well as works that are larger in scope than survey
articles and that contain more detailed background information. The series also
provides coverage of advanced and timely topics worthy of monographs, contributed
volumes, textbooks and handbooks.
** Indexing: Wireless Networks is indexed in EBSCO databases and DBLP **
Qingguo Lü • Xiaofeng Liao • Huaqing Li • Shaojiang Deng • Shanfu Gao

Distributed Optimization in Networked Systems
Algorithms and Applications

Qingguo Lü
College of Computer Science, Chongqing University, Chongqing, China

Xiaofeng Liao
College of Computer Science, Chongqing University, Chongqing, China

Shanfu Gao
College of Computer Science, Chongqing University, Chongqing, China
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
To My Family
Q. Lü
To my family
X. Liao
To my family
H. Li
To my family
S. Deng
To my family
S. Gao
Preface
In recent years, the Internet of Things (IoT) and big data have been interconnected
to a wide and deep extent through the sensing, computing, communication, and
control of intelligent information. Networked systems are playing an increasingly
important role in the interconnected information environment, profoundly affecting
computer science, artificial intelligence, and other related fields. The core of such systems, which are composed of many nodes, is to accomplish global goals efficiently through mutual collaboration while each node makes its own decisions based on different preferences. In this way, they can solve large-scale complex problems that are difficult for individual nodes to handle, with strong resistance to interference and strong environmental adaptability. In addition, such systems require participating nodes to access only
their own local information. This may be due to the consideration of security and
privacy issues in the network, or simply because the network is too large, making
the aggregation of global information to a central node practically impossible or
very inefficient. Currently, as a hot research topic with wide applicability and great
application value across multiple disciplines, distributed optimization of networked
systems has laid an important foundation for promoting and leading the frontier
development in computer science and artificial intelligence. However, networked
systems cover a large number of intelligent devices (nodes), and the network
environment is often dynamic and changing, making it extremely hard to optimize
and analyze them. It is problematic for existing theories and methods to effectively
address the new needs and challenges of optimization brought about by the rapid
development of technologies related to networked systems. Hence, it is urgent to
develop new theories and methods of distributed optimization over networks.
This monograph thoroughly studies the analysis and synthesis of distributed unconstrained optimization, distributed constrained optimization, distributed nonsmooth optimization, distributed online optimization, and distributed economic dispatch in smart grids, over undirected, directed, and time-varying networks, covering consensus control protocols, the gradient tracking technique, event-triggered communication strategies, Nesterov and heavy-ball accelerated mechanisms, the variance-reduction technique, differential privacy strategies, gradient descent algorithms, accelerated algorithms, stochastic gradient algorithms, and online algorithms.
This book was supported in part by the Natural Science Foundation of Chongqing
under Grant CSTB2022NSCQ-MSX1627, in part by the Chongqing Postdoctoral
Science Foundation under Grant 2021XM1006, in part by the China Postdoctoral
Science Foundation under Grant 2021M700588, in part by the National Natural
Science Foundation of China under Grant 62173278, in part by the Science and
Technology Research Program of Chongqing Municipal Education Commission
under Grant KJQN202100228, in part by the project of Key Laboratory of Industrial
Internet of Things & Networked Control, Ministry of Education under Grant
2021FF09, in part by the project funded by Hubei Province Key Laboratory
of Intelligent Information Processing and Real-time Industrial System (Wuhan
University of Science and Technology) under Grant ZNXX2022004, in part by
the project funded by Hubei Key Laboratory of Intelligent Robot (Wuhan Institute
of Technology) under Grant HBIR202205, and in part by the National Key R&D
Program of China under Grant 2018AAA0100101. We would like to begin by acknowledging Yingjue Chen
and Keke Zhang who have unselfishly given their valuable time in arranging raw
materials. Their assistance has been invaluable to the completion of this book. The
authors are especially grateful to their families for their encouragement and never-ending support when it was most required. Finally, we would like to thank the editors
at Springer for their professional and efficient handling of this book.
Chapter 1
Accelerated Algorithms for Distributed
Convex Optimization
1.1 Introduction
In the past decades, with the development of artificial intelligence and the emergence of 5G, many researchers have become interested in distributed optimization. This chapter considers a class of widely studied distributed optimization problems in which each node cooperatively attempts to optimize a global cost function
in the context of local interactions and local computations [1]. Instances of such
formulation characterized by distributed computing have several important and
widespread applications in various fields, including wireless sensor networks for
decision-making and information-processing [2], distributed resource allocation
in smart grids [3], distributed learning in robust control [4], and time-varying
formation control [5, 6], among many others [7–13]. Unlike traditional centralized
optimization, distributed optimization involves multiple nodes that gain access to
their private local information over networks, and typically no central coordinator
(node) can acquire the entire information over the networks.
Recently, an increasing number of distributed algorithms have emerged, built on various local computation schemes for individual nodes. Some well-known approaches for different networks depend on distributed (sub)gradient descent, with extensions that handle interaction delays, asynchronous updates, stochastic (sub)gradient scenarios, etc. [14–22]. These algorithms are intuitive and flexible with respect to the cost functions and networks; however, their convergence rates are quite slow owing to the diminishing step-size that is required to guarantee convergence to an exact optimal solution [14]. The convergence rate of these algorithms, even for strongly convex functions, is only sublinear [15]. With a constant step-size, an algorithm can attain a linear rate, but only at the cost of converging to a suboptimal solution [20]. Methods that resolve this exactness-speed dilemma, such as the distributed alternating direction method of multipliers (ADMM) [23, 24] and distributed dual decomposition [25], are based on the Lagrangian dual and have nice provable convergence rates (a linear rate for strongly convex functions) [26]. In addition, extensions to various real-world factors, including stochastic errors [27] and privacy preservation [28], and techniques including the proximal (sub)gradient [29] and formation-containment control [30], have been extensively studied. However, because a sub-problem must be solved at each iteration, the computational complexity of these methods is considerably high. To overcome these difficulties effectively, quite a few approaches have been proposed that achieve linear convergence for smooth and strongly convex cost functions [31–38]. Nonetheless, these approaches [31–38] are only suitable for undirected networks.
Distributed optimization over directed networks was first studied in [39], where the (sub)gradient-push (SP) method was employed to eliminate the requirement of network balancing, i.e., by using column-stochastic weights. Since SP is built on (sub)gradient descent with diminishing step-size, it also suffers from a slow sublinear convergence rate. To accelerate convergence, Xi and Khan [40] proposed a linearly convergent distributed method (DEXTRA) with constant step-size by combining the push-sum strategy with the protocol (EXTRA) in [31]. Further, Xi et al. [41] (fixed directed networks) and Nedic et al. [42] (time-varying directed networks) combined the push-sum strategy with distributed inexact gradient tracking under constant step-size (ADD-OPT [41] and Push-DIGing [42]) to achieve linear convergence to the exact optimal solution. Then, Lü et al. [43, 44] extended the work of [42] to non-uniform step-sizes and showed linear convergence. A different class of approaches that do not utilize the push-sum mechanism has recently been proposed in [45–50], where both row- and column-stochastic weights are adopted simultaneously to achieve linear convergence over directed networks. It is noteworthy that although these approaches [39–50] avoid the construction of doubly-stochastic weights, they require each node to know (at least) its own out-degree exactly. Only then can the nodes in the networks of [39–50] adjust their outgoing weights to ensure that each column of the weight matrix sums to one. This requirement, however, is likely to be unrealistic in broadcast-based interaction schemes (i.e., where a node neither accesses its out-neighbors nor regulates its outgoing weights).
In this chapter, the algorithm that we construct depends crucially on gradient tracking and is a variant of the methods that appeared in [47–55]. To be specific, Qu and Li [54] combined gradient tracking with the distributed Nesterov gradient descent (DNGD) method [55] and thereby investigated two accelerated distributed Nesterov methods, Acc-DNGD-SC and Acc-DNGD-NSC, which exhibit fast convergence compared with the centralized gradient descent (CGD) method for different cost functions. Although the convergence rates are improved, the two approaches in [54] assume that the interaction networks are undirected, which limits the applicability of the methods in many fields, such as wireless sensor networks. To remove this deficiency, Xin et al. [48] established an acceleration and generalization of first-order methods with gradient tracking and a momentum term, i.e., ABm, which overcame the conservatism (eigenvector estimation or doubly-stochastic weights) of the related work by implementing both row- and column-stochastic weights. In this setting, some interesting generalized methods for random link failures [46] and interaction delays [49] were proposed. Regrettably, the construction of column-stochastic weights demands that each node possess at least its out-degree information, which is arduous to implement, for example, in broadcast-based interaction scenarios. In light of this challenge, Xin et al. [52] investigated the case of a row-stochastic weight matrix, which avoids the need for global information about the network, and proposed a fast distributed method (FROST) with non-uniform step-sizes motivated by the idea of [51]. Related works also involve the issues of demand response and economic scheduling in power systems [53, 56]. However, the methods in [51–53, 56] do not adopt momentum terms [54, 55, 57], through which nodes acquire more information from in-neighbors in the network for fast convergence. Moreover, two accelerated methods based on Nesterov's momentum for distributed optimization over arbitrary networks were presented in [50]. Unfortunately, the related work [50] neither considers non-uniform step-sizes nor provides a rigorous theoretical analysis of the methods. Hence, it is of great significance to discuss such a challenging issue due to its practicality.
The main interest of this chapter is to study the distributed convex optimization problem over a directed network. To solve this problem, a linearly convergent algorithm is designed that utilizes non-uniform step-sizes, momentum terms, and a row-stochastic weight matrix. We hope to develop a broad theory of distributed convex optimization, and the underlying purpose of designing a distributed optimization algorithm is to adapt to and promote real scenarios.
1.2 Preliminaries
1.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors. Let R and R^p denote the set of real numbers and of p-dimensional real column vectors, respectively. The subscripts i and j are utilized to denote the indices of nodes, and the superscript t denotes the iteration index of an algorithm; e.g., x_i^t denotes the variable of node i at time t. We let the notations 1n and 0n denote the column vectors with all entries equal to one and zero, respectively. Let In and zij denote the identity matrix of size n and the entry of matrix Z in its i-th row and j-th column, respectively. The Euclidean norm for vectors and the induced 2-norm for matrices are represented by the symbol || · ||2. Let the notation Z = diag{y} represent the diagonal matrix of the vector y = [y1, y2, . . . , yn]^T, which satisfies zii = yi, ∀i = 1, . . . , n, and zij = 0, ∀i ≠ j. We define the symbol diag{Z} as a diagonal matrix whose diagonal elements coincide with those of the matrix Z. The transposes of a vector z and a matrix W are denoted by z^T and W^T, respectively. Let e_i = [0, . . . , 1, . . . , 0]^T denote the i-th canonical basis vector (with the 1 in the i-th position). The gradient of a differentiable function f : R^p → R at z is denoted as ∇f(z) ∈ R^p. A non-negative square matrix Z ∈ R^{n×n} is row-stochastic if Z1n = 1n, column-stochastic if Z^T 1n = 1n, and doubly stochastic if Z1n = 1n and Z^T 1n = 1n.
Consider a set of n nodes connected over a directed network. The global objective is to find x ∈ R^p that minimizes the average of all local cost functions, i.e.,

    min_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^{n} fi(x),    (1.1)
where fi : R^p → R is the local cost function privately known by node i. If there exists a directed path between any two nodes, then the network G is said to be strongly connected. In addition, the following assumptions are adopted.
Assumption 1.1 ([51]) The network G corresponding to the set of nodes is directed
and strongly connected.
Remark 1.1 Assumption 1.1 is fundamental to assure that nodes in the network can
always affect others directly or indirectly when studying distributed optimization
problems [39–53].
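Assumption 1.1 is straightforward to verify numerically. The following is a minimal sketch (not from the book; the adjacency-matrix representation and the ring example are our own) that tests strong connectivity by checking reachability of every node from node 0 along forward and reversed edges:

import numpy as np

def reachable(adj, root):
    # Breadth-first search over a boolean adjacency matrix; adj[i, j] = True
    # means there is a directed edge from j to i (j is an in-neighbor of i).
    seen = {root}
    frontier = [root]
    while frontier:
        nxt = []
        for u in frontier:
            for v in np.nonzero(adj[:, u])[0]:  # edges leaving u
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return seen

def is_strongly_connected(adj):
    n = adj.shape[0]
    # Strongly connected iff all nodes are reachable from node 0 along
    # forward edges and along reversed edges (Kosaraju-style test).
    return len(reachable(adj, 0)) == n and len(reachable(adj.T, 0)) == n

# Directed ring on 4 nodes: 0 -> 1 -> 2 -> 3 -> 0 (strongly connected).
ring = np.zeros((4, 4), dtype=bool)
for u in range(4):
    ring[(u + 1) % 4, u] = True
print(is_strongly_connected(ring))  # True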
1.3 Algorithm Development

On the basis of the above section, we first review the centralized Nesterov gradient descent method (CNGD) and then propose the directed distributed Nesterov-like gradient tracking algorithm, named D-DNGT, to solve problem (1.1).
Here, CNGD, derived from [57], is briefly introduced for an L̄-smooth and μ̄-strongly convex cost function. At each time t ≥ 0, CNGD keeps three vectors y^t, x^t, v^t ∈ R^p, updated as

    y^t = (x^t + αv^t)/(1 + α)
    x^{t+1} = y^t − (1/L̄)∇f(y^t)    (1.4)
    v^{t+1} = (1 − α)v^t + (αμ̄/γ)y^t − (α/γ)∇f(y^t),

where α = √(μ̄/L̄) and γ is the parameter of the underlying estimate sequence. Taking γ = μ̄, (1.4) can be written in the simplified momentum form

    x^{t+1} = y^t − (1/L̄)∇f(y^t)    (1.5)
    y^{t+1} = x^{t+1} + β(x^{t+1} − x^t),

where β = (√L̄ − √μ̄)/(√L̄ + √μ̄). It is well known that, among all centralized gradient approaches, CNGD [57] achieves the optimal convergence rate in terms of first-order oracle complexity. Under Assumptions 1.2 and 1.3, the convergence rate of CNGD (1.5) is O((1 − √(μ̄/L̄))^t), whose dependence on the condition number L̄/μ̄ improves over CGD's rate O((1 − μ̄/L̄)^t) in the large L̄/μ̄ regime. In this chapter, we devote ourselves to the study of a directed distributed Nesterov-like gradient tracking (D-DNGT) algorithm, which is not only suitable for a directed network but also converges linearly and exactly to the optimal solution of (1.1). To the best of our knowledge, this problem has not yet been addressed and is worthwhile to study.
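To make the scheme concrete, the following is a minimal sketch of CNGD in the momentum form (1.5) on a strongly convex quadratic; the test function, its data, and the iteration count are illustrative choices of ours, not from the book:

import numpy as np

# CNGD in the form (1.5) on f(x) = 0.5 * x^T A x - b^T x, whose gradient is
# A x - b. The quadratic test problem below is a toy example of our own.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
A = Q.T @ Q + np.eye(5)           # positive definite Hessian
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)    # exact minimizer, for reference

eigs = np.linalg.eigvalsh(A)
L, mu = eigs.max(), eigs.min()    # smoothness / strong-convexity constants
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

grad = lambda x: A @ x - b
x = np.zeros(5)
y = x.copy()
for t in range(200):
    x_next = y - grad(y) / L            # x^{t+1} = y^t - (1/L) grad f(y^t)
    y = x_next + beta * (x_next - x)    # y^{t+1} = x^{t+1} + beta (x^{t+1} - x^t)
    x = x_next
print(np.linalg.norm(x - x_star))       # close to machine precision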
We now describe D-DNGT for dealing with problem (1.1) in a distributed manner. Each node i ∈ V at time t ≥ 0 stores four variables: x_i^t ∈ R^p, y_i^t ∈ R^p, s_i^t ∈ R^n, and z_i^t ∈ R^p. For t > 0, node i ∈ V updates its variables as follows:

    x_i^{t+1} = Σ_{j=1}^{n} rij y_j^t + βi(x_i^t − x_i^{t−1}) − αi z_i^t
    y_i^{t+1} = x_i^{t+1} + βi(x_i^{t+1} − x_i^t)
    s_i^{t+1} = Σ_{j=1}^{n} rij s_j^t    (1.6)
    z_i^{t+1} = Σ_{j=1}^{n} rij z_j^t + ∇fi(y_i^{t+1})/[s_i^{t+1}]_i − ∇fi(y_i^t)/[s_i^t]_i,

where αi > 0 and βi ≥ 0 are the local step-size and momentum coefficient of node i, and the weights rij satisfy

    rij > ε, j ∈ N_i^in;  rij = 0, j ∉ N_i^in;  rii = 1 − Σ_{j∈N_i^in} rij > ε, ∀i,    (1.7)

where 0 < ε < 1 (see footnote 1). Each node i ∈ V starts with initial states x_i^0 = y_i^0 ∈ R^p, s_i^0 = e_i, and z_i^0 = ∇fi(y_i^0) (see footnote 2).
Denote R = [rij] ∈ R^{n×n} as the collection of the weights rij, i, j ∈ V, in (1.7), which is obviously row-stochastic. In essence, the update of z_i^t in (1.6) is a distributed inexact gradient tracking step, where each local cost function's gradient is scaled by [s_i^t]_i, which is generated by the third update in (1.6). Actually, the update of s_i^t in (1.6) is a consensus iteration aiming to estimate the left Perron eigenvector w = [w1, . . . , wn]^T (associated with the eigenvalue 1) of the weight matrix R, satisfying 1n^T w = 1. This iteration is similar to that employed in [51–53]. To sum up, D-DNGT (1.6) transforms CNGD (1.5) into a distributed form via gradient tracking and can be applied to a directed network.
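The update (1.6) can be prototyped in a few lines. The sketch below runs D-DNGT on a directed ring with uniform row-stochastic weights (in the spirit of footnote 1) and scalar quadratic local costs; the graph, costs, step-sizes, and momentum coefficients are toy choices of ours, not the simulation setup of Sect. 1.5:

import numpy as np

# Toy run of D-DNGT (1.6): n nodes, scalar decision variables (p = 1, as in
# Remark 1.3), quadratic local costs f_i(x) = 0.5 * a_i * (x - c_i)^2.
rng = np.random.default_rng(1)
n = 5
a = rng.uniform(1.0, 2.0, n)
c = rng.uniform(-1.0, 1.0, n)
x_star = np.sum(a * c) / np.sum(a)        # minimizer of (1/n) sum f_i

grad = lambda i, y: a[i] * (y - c[i])

# Directed ring plus self-loops; uniform row-stochastic weights.
R = np.zeros((n, n))
for i in range(n):
    in_nbrs = [i, (i - 1) % n]            # node i hears itself and node i-1
    for j in in_nbrs:
        R[i, j] = 1.0 / len(in_nbrs)

alpha = 0.05 * np.ones(n)                 # non-uniform step-sizes allowed
beta = 0.1 * np.ones(n)                   # momentum coefficients
x = rng.standard_normal(n)
x_old = x.copy()                          # x^{-1} = x^0
y = x.copy()
S = np.eye(n)                             # s_i^0 = e_i
z = np.array([grad(i, y[i]) for i in range(n)])

for t in range(500):
    x_new = R @ y + beta * (x - x_old) - alpha * z
    y_new = x_new + beta * (x_new - x)
    S_new = R @ S                         # Perron eigenvector estimation
    z = R @ z + np.array([grad(i, y_new[i]) / S_new[i, i]
                          - grad(i, y[i]) / S[i, i] for i in range(n)])
    x_old, x, y, S = x, x_new, y_new, S_new

print(np.abs(x - x_star).max())           # residual decays linearly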
Remark 1.3 For the sake of brevity, we mainly concentrate on the one-dimensional case, i.e., p = 1; the multi-dimensional case can be proven similarly.
Define x^t = [x_1^t, . . . , x_n^t]^T ∈ R^n, y^t = [y_1^t, . . . , y_n^t]^T ∈ R^n, z^t = [z_1^t, . . . , z_n^t]^T ∈ R^n, S^t = [s_1^t, . . . , s_n^t]^T ∈ R^{n×n}, ∇F(y^t) = [∇f1(y_1^t), . . . , ∇fn(y_n^t)]^T ∈ R^n, and S̃^t = diag{S^t}. Therefore, with Dα = diag{α1, . . . , αn} and Dβ = diag{β1, . . . , βn}, the aggregated form of D-DNGT (1.6) can be written as follows:

    x^{t+1} = Ry^t + Dβ(x^t − x^{t−1}) − Dα z^t
    y^{t+1} = x^{t+1} + Dβ(x^{t+1} − x^t)
    S^{t+1} = RS^t    (1.8)
    z^{t+1} = Rz^t + [S̃^{t+1}]^{−1}∇F(y^{t+1}) − [S̃^t]^{−1}∇F(y^t).
1 It is worth noticing that the weights rij, i, j ∈ V, associated with the network G given in (1.7) are valid. For all i ∈ V, the conditions on the weights in (1.7) can be satisfied by setting rij = 1/|N_i^in|, ∀j ∈ N_i^in, and rij = 0 otherwise.
2 Suppose that each node possesses and knows its unique identifier in the network, e.g., 1, . . . , n [45–50].
In this subsection, some distributed optimization methods that are not only suitable for directed networks but also related to D-DNGT (1.6) are discussed on the basis of an intuitive explanation. In particular, we consider ADD-OPT/Push-DIGing [41, 42], FROST [52], and ABm [48] (see footnote 3).
(a) Relation to ADD-OPT/Push-DIGing ADD-OPT [41] (Push-DIGing [42] is suitable for time-varying networks, in comparison with ADD-OPT) keeps updating four variables x_i^t, s_i^t, y_i^t, and z_i^t ∈ R for each node i. Starting from the initial states s_i^0 = 1, z_i^0 = ∇fi(y_i^0), and an arbitrary x_i^0, the updating rule of ADD-OPT is given by

    x_i^{t+1} = Σ_{j=1}^{n} cij x_j^t − α z_i^t
    s_i^{t+1} = Σ_{j=1}^{n} cij s_j^t,   y_i^{t+1} = x_i^{t+1}/s_i^{t+1}    (1.9)
    z_i^{t+1} = Σ_{j=1}^{n} cij z_j^t + ∇fi(y_i^{t+1}) − ∇fi(y_i^t),

where α > 0 is a constant step-size and the weights C = [cij] ∈ R^{n×n} are column-stochastic.
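The auxiliary variable s_i^t in (1.9) implements the push-sum correction: with column-stochastic weights, the raw iterates alone drift toward a Perron-weighted combination, but the ratio x^t/s^t recovers consensus on the right value. A small sketch (toy weights and initial values of our own) illustrates this debiasing on a pure averaging task:

import numpy as np

# Push-sum debiasing sketch: with a column-stochastic C, the iterates x^t
# alone do not converge to the average, but the ratios x^t / s^t do.
C = np.array([[0.5, 0.0, 0.3],
              [0.5, 0.5, 0.3],
              [0.0, 0.5, 0.4]])      # every column sums to one
x = np.array([3.0, 6.0, 9.0])        # average is 6
s = np.ones(3)
for t in range(60):
    x = C @ x
    s = C @ s
print(x / s)                         # -> [6. 6. 6.] (approximately)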
3 Notice that some notation used in the methods reviewed here may conflict with the notation describing the distributed optimization problem/algorithm/analysis throughout the chapter. Therefore, the symbols of this subsection should not be applied to other parts.
(b) Relation to FROST FROST [52] corresponds to D-DNGT (1.6) without the momentum terms (i.e., βi = 0, ∀i):

    x_i^{t+1} = Σ_{j=1}^{n} rij x_j^t − αi z_i^t
    s_i^{t+1} = Σ_{j=1}^{n} rij s_j^t    (1.10)
    z_i^{t+1} = Σ_{j=1}^{n} rij z_j^t + ∇fi(x_i^{t+1})/[s_i^{t+1}]_i − ∇fi(x_i^t)/[s_i^t]_i,

where αi > 0 is a step-size locally chosen at each node i and the row-stochastic weights R = [rij] ∈ R^{n×n} comply with (1.7); the initialization is x_i^0 ∈ R, s_i^0 = e_i, and z_i^0 = ∇fi(x_i^0). FROST utilizes row-stochastic weights with non-uniform step-sizes among the nodes and exhibits fast convergence over a directed network, converging at a linear rate to the optimal solution under Assumptions 1.1–1.3.
(c) Relation to ABm The ABm method, investigated in [48], combines gradient tracking with a momentum term and utilizes non-uniform step-sizes; it is described as follows:

    x_i^{t+1} = Σ_{j=1}^{n} rij x_j^t − αi z_i^t + βi(x_i^t − x_i^{t−1})
    z_i^{t+1} = Σ_{j=1}^{n} cij z_j^t + ∇fi(x_i^{t+1}) − ∇fi(x_i^t),    (1.11)

initialized with z_i^0 = ∇fi(x_i^0) and an arbitrary x_i^0 at each node i, where, as before, αi > 0 and βi ≥ 0 represent the local step-size and the momentum coefficient of node i. By simultaneously implementing both row-stochastic (R = [rij] ∈ R^{n×n}) and column-stochastic (C = [cij] ∈ R^{n×n}) weights, it is deduced from [48] that ABm reduces to AB [45] when βi = 0, ∀i, and AB lies at the heart of existing methods that employ gradient tracking [42, 43, 48].
Notice that ADD-OPT/Push-DIGing, FROST, and D-DNGT, described above, each contain a non-linear term derived from the division by the eigenvector learning term ((1.6), (1.9), and (1.10)). ABm eliminates this non-linear calculation and is still suitable for directed networks. However, ABm requires each node to access its out-degree information to build column-stochastic weights, which, as explained earlier, is difficult to realize directly in a distributed manner. It is worth highlighting that our algorithm, D-DNGT, extends CNGD to a distributed form and, in comparison with CNGD [57] and Acc-DNGD-SC/Acc-DNGD-NSC [54], is suitable for directed networks. In addition, D-DNGT combines FROST with two kinds of momentum terms (heavy-ball momentum and Nesterov momentum), which ensures that nodes acquire more information from their in-neighbors than under FROST and thus achieve much faster convergence.
1.4 Convergence Analysis
In this section, we will prove that D-DNGT (1.6) converges at a linear rate to the optimal solution x* provided that the coefficients (non-uniform step-sizes and momentum coefficients) are bounded by properly chosen constants. The following notations and relations are employed. Recalling that R is irreducible and row-stochastic with positive diagonals, under Assumption 1.1, there exists a normalized left Perron eigenvector w = [w1, . . . , wn]^T ∈ R^n (wi > 0, ∀i) of R such that w^T R = w^T, 1n^T w = 1, and (R)^∞ := lim_{t→∞}(R)^t = 1n w^T (see footnote 4).
Before showing the main results, we introduce some auxiliary results. First, the following crucial lemma is given, which is a direct implication of Assumption 1.1 and (1.7) (see Section II-B in [32]).

Lemma 1.4 ([32]) Suppose that Assumption 1.1 holds and that the weight matrix R = [rij] ∈ R^{n×n} follows (1.7). Then, there exist a norm || · || and a constant 0 < ρ < 1 such that

    ||Rx − (R)^∞ x|| ≤ ρ||x − (R)^∞ x||

for all x ∈ R^n.
According to the result established in Lemma 1.4, in the following, we present
an additional lemma from the Markov chain and consensus theory [60].
4 Throughout the chapter, for any arbitrary matrix/vector/scalar Z, we utilize the symbol (Z)^t to represent the t-th power of Z, to distinguish it from the iteration index of variables.
Lemma 1.5 ([60]) Let S^t be generated by (1.8). Then, there exist 0 < θ < ∞ and 0 < λ < 1 such that

    ||S^t − S^∞||2 ≤ θ(λ)^t, ∀t ≥ 0,

where S^∞ := lim_{t→∞} S^t = (R)^∞.
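Lemma 1.5 is easy to observe numerically: iterating S^{t+1} = RS^t from S^0 = In drives every row of S^t geometrically toward w^T. A small sketch with toy ring weights (our own choice):

import numpy as np

# Numerical illustration of Lemma 1.5: S^{t+1} = R S^t with S^0 = I converges
# geometrically to S^infty = 1 w^T, where w is the left Perron eigenvector of
# the row-stochastic R. The ring weights below are a toy choice.
n = 5
R = np.zeros((n, n))
for i in range(n):
    R[i, i] = R[i, (i - 1) % n] = 0.5

# Left Perron eigenvector: w^T R = w^T, normalized so that 1^T w = 1.
vals, vecs = np.linalg.eig(R.T)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
w = w / w.sum()
S_inf = np.outer(np.ones(n), w)

S = np.eye(n)
for t in range(1, 31):
    S = R @ S
    if t % 10 == 0:
        print(t, np.linalg.norm(S - S_inf, 2))  # decays like lambda^t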
For convenience of the convergence analysis, we will make frequent use of the following well-known lemma (see, e.g., [32] for a proof).

Lemma 1.8 ([32]) Suppose that Assumptions 1.2–1.3 hold. Since the global cost function f is μ̄-strongly convex and L̄-smooth, for all x ∈ R and 0 < ε < 2/L̄, we get

    ||x − ε∇f(x) − x*||2 ≤ l||x − x*||2,

where l = max{|1 − L̄ε|, |1 − μ̄ε|}, x* is the optimal solution to (1.1), and ∇f(x) is the gradient of f(x) at x.
Lemma 1.9 Suppose that Assumption 1.1 holds. Then, for all t > 0, the consensus error ||x^{t+1} − (R)^∞ x^{t+1}|| admits the bound stated in (1.13), where the inequality in (1.13) is obtained from Lemma 1.4 and the fact that (R)^∞ R = (R)^∞. The desired result of Lemma 1.9 is then acquired.
The next lemma presents the bound on the optimality residual associated with the weighted average ||(R)^∞ x^{t+1} − 1n x*||2 (notice that (R)^∞ x^{t+1} = 1n x̄^{t+1}).

Lemma 1.10 Suppose that Assumptions 1.2 and 1.3 hold. If 0 < n(w^T α) < 2/L̄, then the inequality (1.14) holds for all t > 0, where κ3 = d1||(R)^∞||2 and l1 = max{|1 − L̄n(w^T α)|, |1 − μ̄n(w^T α)|}; θ and λ are introduced in Lemma 1.5.
Proof Notice that (R)^∞ R = (R)^∞. Recalling the updates of x^t and y^t in D-DNGT (1.8), we get from Lemma 1.7 that

    ||(R)^∞ x^{t+1} − 1n x*||2
    = ||(R)^∞(x^t + 2Dβ(x^t − x^{t−1}) − Dα z^t + (Dα − D̄α)(R)^∞ z^t) − 1n x*||2
    ≤ ||(R)^∞ x^t − (R)^∞ Dα(R)^∞[S̃^t]^{−1}∇F(y^t) − 1n x*||2 + · · · .    (1.15)
We now discuss the first term in the inequality of (1.15). Note that (R)^∞ = 1n w^T and ∇F(1n x̄^t) = [∇f1(x̄^t), . . . , ∇fn(x̄^t)]^T. By utilizing 1n w^T Dα 1n w^T = (w^T α)1n w^T, one obtains (1.16), where ∇f(x̄^t) = (1/n)1n^T ∇F(1n x̄^t). By Lemma 1.8, when 0 < n(w^T α) < 2/L̄, Λ1 is bounded by

    Λ1 ≤ l1 √n ||w^T x^t − x*||2 = l1 ||(R)^∞ x^t − 1n x*||2,    (1.17)

where l1 = max{|1 − L̄n(w^T α)|, |1 − μ̄n(w^T α)|}. Then, Λ2 can be bounded as in (1.18),
where ∇F(x^t) = [∇f1(x_1^t), . . . , ∇fn(x_n^t)]^T. Since ∇f(x̄^t) = (1/n)1n^T ∇F(1n x̄^t), the bound (1.19) follows from Assumption 1.2. Next, by employing Lemma 1.6 and the relation S^∞[S̃^∞]^{−1} = 1n 1n^T, we have (1.20), where ŝ = sup_{t≥0} ||S^t||2 and s̃ = sup_{t≥0} ||[S̃^t]^{−1}||2. The lemma follows by plugging (1.16)–(1.20) into (1.15).
For the bound on the estimate difference ||x^{t+1} − x^t||, the following lemma is shown.

Lemma 1.11 Suppose that Assumption 1.2 holds. For all t > 0, the stated bound on ||x^{t+1} − x^t|| holds.

Proof Recalling that (R)^∞ R = (R)^∞, the claim follows from the updates of x^t and y^t in D-DNGT (1.8).

The next lemma (Lemma 1.12) bounds the gradient tracking error:

    ||z^{t+1} − (R)^∞ z^{t+1}|| ≤ κ4 κ6(1 + β̂)||x^t − (R)^∞ x^t|| + κ6 d2(1 + β̂)α̂||z^t||2 + · · · .

Proof By the update of z^t in (1.8),

    ||z^{t+1} − (R)^∞ z^{t+1}||
    ≤ ||In − (R)^∞|| ||[S̃^{t+1}]^{−1}∇F(y^{t+1}) − [S̃^t]^{−1}∇F(y^t)|| + ρ||z^t − (R)^∞ z^t||,    (1.24)

where we employ the triangle inequality and Lemma 1.4 to deduce the inequality. As for the first term of the inequality in (1.24), we apply the update of y^t in D-DNGT (1.8) and the result in Lemma 1.6 to obtain (1.25). Combining Lemma 1.11 with (1.25), the result in Lemma 1.12 is obtained.
The final lemma provides the bound on the estimate ||z^t||2 that is needed to derive the aforementioned linear system.

Lemma 1.13 Assume that Assumption 1.2 holds. Then, a corresponding inequality can be established for all t > 0. In view of Lemma 1.7, using S^∞[S̃^∞]^{−1} = 1n 1n^T and (R)^∞ = S^∞, one obtains (1.28). Substituting (1.28) and (1.29) into (1.27) yields the desired result in Lemma 1.13. The proof is completed.
With the supporting relationships, i.e., Lemmas 1.9–1.13, in hand, the main convergence results of D-DNGT are now established as follows.
For the sake of convenience, we define wmin = min_{i∈V}{wi}, ν1 = κ2 d1 nL̂, ν2 = κ2 nL̂, ν3 = κ2 d1, ν4 = d1 nL̂, ν5 = d1 ŝ s̃ L̂, ν6 = d2 d1 nL̂, ν7 = d2 nL̂, ν8 = d2 d1, ν9 = κ4 κ6, ν10 = κ6 d2 d1 nL̂, ν11 = κ6 d2 nL̂, ν12 = κ6 + κ5 κ6, ν13 = κ5 κ6, ν14 = κ6 d2 d1, ν15 = κ2 α̂ ŝ(s̃)^2 θ, ν16 = ŝ(s̃)^2 θ α̂, ν17 = d2 α̂ ŝ(s̃)^2 θ, ν18 = (2||In − (R)^∞|| + κ6(1 + β̂)α̂ ŝ)(s̃)^2 θ d2, ν19 = ν13 η3 + ν10 η3 α̂, ν20 = ν9 η1 + ν10 η1 α̂ + ν11 η2 α̂ + ν12 η3 + ν10 η3 α̂ + ν14 η4 α̂, and ν21 = η4(1 − ρ) − ν9 η1 − (ν10 η1 + ν11 η2 + ν14 η4)α̂. Then, the first result, i.e., Theorem 1.14, is introduced below.
and γ41 = ν9 + ν10 α̂ + ν9 β̂ + ν10 α̂ β̂, γ42 = ν11 α̂ + ν11 α̂ β̂, γ43 = ν12 β̂ + ν13 β̂^2 + ν10 α̂ β̂ + ν10 α̂ β̂^2, and γ44 = ρ + ν14 α̂ + ν14 α̂ β̂; φ1^t = ν15(λ)^t ||∇F(y^t)||2, φ2^t = ν16(λ)^t ||∇F(y^t)||2, φ3^t = ν17(λ)^t ||∇F(y^t)||2, and φ4^t = ν18(λ)^t ||∇F(y^t)||2.
Assuming in addition that the largest step-size satisfies

    0 < α̂ < min{ 1/(nL̄), η1(1 − ρ)/(ν1 η1 + ν2 η2 + ν3 η4), (η3 − κ4 η1)/(ν6 η1 + ν7 η2 + ν8 η4), (η4(1 − ρ) − ν9 η1)/(ν10 η1 + ν11 η2 + ν14 η4) },    (1.31)
then the spectral radius of Γ, denoted ρ(Γ), is strictly less than 1, where η1, η2, η3, and η4 are arbitrary constants such that

    η1 > 0,  η2 > (ν4 η1 + κ3 η4)/(μ̄nwmin),  η3 > κ4 η1,  η4 > ν9 η1/(1 − ρ).    (1.33)
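The mechanism behind Theorem 1.14 is the classical result (Theorem 8.1.29 in [60]) that a nonnegative matrix Γ satisfying Γη < η elementwise for some positive η must have ρ(Γ) < 1. A toy numerical check (the matrix below is our own example, not the Γ of (1.30)):

import numpy as np

# Toy check of the spectral-radius argument used in Theorem 1.14: for a
# nonnegative Gamma, Gamma @ eta < eta elementwise with eta > 0 implies
# rho(Gamma) < 1.
Gamma = np.array([[0.6, 0.1, 0.0, 0.1],
                  [0.2, 0.5, 0.1, 0.0],
                  [0.0, 0.2, 0.6, 0.1],
                  [0.1, 0.0, 0.1, 0.6]])
eta = np.ones(4)
print(np.all(Gamma @ eta < eta))                 # True: all row sums < 1
print(np.max(np.abs(np.linalg.eigvals(Gamma))))  # spectral radius < 1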
Proof First, plugging Lemma 1.13 into Lemmas 1.9–1.12 and rearranging the acquired inequalities, it is immediate to verify (1.30). Next, we provide conditions under which the relation ρ(Γ) < 1 holds. According to Theorem 8.1.29 in [60], we know that, for a positive vector η = [η1, . . . , η4]^T ∈ R^4, if Γη < η, then ρ(Γ) < 1 holds. By the definition of Γ, the inequality Γη < η is equivalent to

    (κ1 η3 + ν1 η3 α̂)β̂ < η1(1 − ρ) − (ν1 η1 + ν2 η2 + ν3 η4)α̂
    (2κ2 η3 + ν5 η3 α̂)β̂ < η2(1 − l1) − (ν4 η1 + κ3 η4)α̂
    (κ5 η3 + ν6 η3 α̂)β̂ < η3 − κ4 η1 − (ν6 η1 + ν7 η2 + ν8 η4)α̂    (1.34)
    2ν19 β̂ < −ν20 + √((ν20)^2 + 4ν19 ν21).
When 0 < α̂ < 1/(nL̄), it follows from Lemma 1.10 that l1 = 1 − μ̄n(w^T α) ≤ 1 − μ̄nwmin α̂. To ensure the positivity of β̂ (i.e., that the right-hand sides of (1.34) are always positive), (1.34) further implies that

    α̂ < η1(1 − ρ)/(ν1 η1 + ν2 η2 + ν3 η4)
    η2 > (ν4 η1 + κ3 η4)/(μ̄nwmin)
    α̂ < (η3 − κ4 η1)/(ν6 η1 + ν7 η2 + ν8 η4),  η3 > κ4 η1    (1.35)
    α̂ < (η4(1 − ρ) − ν9 η1)/(ν10 η1 + ν11 η2 + ν14 η4),  η4 > ν9 η1/(1 − ρ).
Lemma 1.16 ([39]) Let {v^t}, {u^t}, {a^t}, and {b^t} be non-negative sequences such that, for all t ≥ 0,

    v^{t+1} ≤ (1 + a^t)v^t − u^t + b^t.

Also, let Σ_{t=0}^{∞} a^t < ∞ and Σ_{t=0}^{∞} b^t < ∞. Then, lim_{t→∞} v^t = v for a non-negative constant v, and Σ_{t=0}^{∞} u^t < ∞.
    ϕ^{t+1} ≤ Γϕ^t + P^t Q^t.    (1.36)
Iterating (1.36) over t yields

    ϕ^t ≤ (Γ)^t ϕ^0 + Σ_{k=0}^{t−1} (Γ)^{t−k−1} P^k Q^k.    (1.37)
Since the spectral radius of Γ is strictly less than 1, it can be concluded from Lemma 1.16 in [52] that ||(Γ)^t||2 ≤ ϑ(δ0)^t and ||(Γ)^{t−k−1} P^k||2 ≤ ϑ(δ0)^t for some ϑ > 0 and λ < δ0 < 1. Taking the 2-norm on both sides of (1.37) yields

    ||ϕ^t||2 ≤ ||(Γ)^t||2 ||ϕ^0||2 + Σ_{k=0}^{t−1} ||(Γ)^{t−k−1} P^k||2 ||Q^k||2
            ≤ ϑ||ϕ^0||2 (δ0)^t + ϑ(δ0)^t Σ_{k=0}^{t−1} ||Q^k||2.    (1.38)
Further manipulation yields

    ||ϕ^t||2 ≤ (ϑ||ϕ^0||2 + (1 + d1)(1 + β̂)L̂ϑ Σ_{k=0}^{t−1} ||ϕ^k||2 + ϑt||∇F(1n x*)||2)(δ0)^t.    (1.40)

Define v^t = Σ_{k=0}^{t−1} ||ϕ^k||2, ν22 = (1 + d1)(1 + β̂)L̂ϑ, and p^t = ϑ||ϕ^0||2 + ϑt||∇F(1n x*)||2; then (1.40) implies (1.41), which is equivalent to (1.42). Applying Lemma 1.16, we achieve that v^t converges and thus is bounded. Following from (1.41), we obtain that lim_{t→∞} ||ϕ^t||2/(δ1)^t ≤ lim_{t→∞} (ν22 v^t + p^t)(δ0)^t/(δ1)^t = 0 for all δ0 < δ1 < 1, and thus there exist a positive constant m and an arbitrarily small constant τ such that the bound (1.43) holds for all t ≥ 0.
Hence, each node can choose a relatively wide step-size. This is in contrast to the earlier work on non-uniform step-sizes within the framework of gradient tracking [33, 35, 43, 44], which depends on the heterogeneity of the step-sizes (||(In − W)α||2/||Wα||2, with W the weight matrix, in [35], and α̂/α̃, α̃ = min_{i∈V}{αi}, in [33, 43, 44]). Moreover, the analysis in [33, 35, 43, 44] showed that those algorithms converge linearly to the optimal solution if and only if both the heterogeneity and the largest step-size are small. However, the largest step-size there obeys a bound that is a function of the heterogeneity, so there is a trade-off between the tolerance of heterogeneity and the largest step-size that can be achieved. Finally, the bounds on the non-uniform step-sizes in this chapter allow the existence of zero step-sizes among some (not all) of the nodes, provided the largest step-size is positive and sufficiently small.
1.4.4 Discussion
The idea of D-DNGT can be applied to other directed distributed gradient tracking methods to relax the condition that the weight matrices be only column-stochastic [41, 42] or both row- and column-stochastic [45, 46]. Next, three possible Nesterov-like optimization algorithms are presented. In this chapter, we only highlight these extensions and verify their feasibility by means of simulations; a rigorous theoretical analysis of the three possible algorithms is left for future work.
(a) D-DNGT with Only Column-Stochastic Weights [41, 42] Here, we present an extended algorithm, named D-DNGT-C, obtained by applying the momentum terms to ADD-OPT [41]/Push-DIGing [42] (whose weight matrices are only column-stochastic). Specifically, the updates of D-DNGT-C are stated as follows:

    x_i^{t+1} = Σ_{j=1}^{n} cij h_j^t + βi(x_i^t − x_i^{t−1}) − αi z_i^t
    h_i^{t+1} = x_i^{t+1} + βi(x_i^{t+1} − x_i^t)
    s_i^{t+1} = Σ_{j=1}^{n} cij s_j^t,   y_i^{t+1} = h_i^{t+1}/s_i^{t+1}    (1.44)
    z_i^{t+1} = Σ_{j=1}^{n} cij z_j^t + ∇fi(y_i^{t+1}) − ∇fi(y_i^t),

initialized with x_i^0 = h_i^0 = y_i^0 ∈ R, s_i^0 = 1, and z_i^0 = ∇fi(y_i^0), where, as before, C = [cij] ∈ R^{n×n} is column-stochastic, and αi > 0 and βi ≥ 0 represent the local step-size and the momentum coefficient of node i. Unlike ADD-OPT [41]/Push-DIGing [42], D-DNGT-C adds, by means of column-stochastic weights, two types of momentum terms (heavy-ball momentum and Nesterov momentum) to ensure that nodes acquire more information from in-neighbors in the network and thereby achieve fast convergence.
(b) D-DNGT with Both Row- and Column-Stochastic Weights [45, 46] Consider that D-DNGT with both row- and column-stochastic weights does not need the eigenvector estimation present in D-DNGT (1.6) or D-DNGT-C (1.44). Hence, an extended algorithm (named D-DNGT-RC), which utilizes both row-stochastic (R = [rij] ∈ R^{n×n}) and column-stochastic (C = [cij] ∈ R^{n×n}) weights, is presented as follows:

    x_i^{t+1} = Σ_{j=1}^{n} rij y_j^t + βi(x_i^t − x_i^{t−1}) − αi z_i^t
    y_i^{t+1} = x_i^{t+1} + βi(x_i^{t+1} − x_i^t)    (1.45)
    z_i^{t+1} = Σ_{j=1}^{n} cij z_j^t + ∇fi(y_i^{t+1}) − ∇fi(y_i^t),

where x_i^0 = y_i^0 ∈ R and z_i^0 = ∇fi(y_i^0), and αi > 0 and βi ≥ 0 represent the local step-size and the momentum coefficient of node i. D-DNGT-RC not only avoids the additional iterations of eigenvector learning but also guarantees that nodes obtain more information from in-neighbors, which may yield faster convergence than [45] and [46].
(c) D-DNGT-RC with Interaction Delays [49] Note that nodes may confront arbitrary but uniformly bounded interaction delays in the process of gaining information from in-neighbors [49]. Specifically, to solve problem (1.1), we denote by ς_ij^t (see footnote 5) an arbitrary, a priori unknown delay induced by the interaction link (j, i) at time t ≥ 0. Then, the updates of D-DNGT-RC with delays (D-DNGT-RC-D) become

    x_i^{t+1} = Σ_{j=1}^{n} rij y_j^{t−ς_ij^t} + βi(x_i^t − x_i^{t−1}) − αi z_i^t
    y_i^{t+1} = x_i^{t+1} + βi(x_i^{t+1} − x_i^t)    (1.46)
    z_i^{t+1} = Σ_{j=1}^{n} cij z_j^{t−ς_ij^t} + ∇fi(y_i^{t+1}) − ∇fi(y_i^t).
5 For all t > 0, the interaction delays ς_ij^t are assumed to be uniformly bounded; that is, there exists some finite ς̂ > 0 such that 0 ≤ ς_ij^t ≤ ς̂. In addition, each node can access its own estimate without delay, i.e., ς_ii^t = 0, ∀i ∈ V and t > 0.
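An implementation of (1.46) only has to keep a short history of in-neighbor states and read the copy that is ς_ij^t steps old. The following is a minimal sketch of the delayed mixing step; the sizes, weights, and delay draws are toy choices of ours:

import numpy as np

# Sketch of the delayed mixing step in (1.46): node i combines the values
# y_j^{t - delay[i][j]} kept in a history buffer. Delays are bounded by
# s_hat and self-delays are zero (footnote 5). Toy sizes and delays.
n, s_hat = 4, 2
R = np.full((n, n), 1.0 / n)              # toy row-stochastic weights
rng = np.random.default_rng(2)
history = [rng.standard_normal(n) for _ in range(s_hat + 1)]  # y^{t-2..t}

def delayed_mix(history, R, delay):
    # delay[i][j] in {0, ..., s_hat}; history[-1 - d] is the state d steps old.
    n = R.shape[0]
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            d = 0 if i == j else delay[i][j]   # own state is never delayed
            out[i] += R[i, j] * history[-1 - d][j]
    return out

delay = rng.integers(0, s_hat + 1, size=(n, n))
print(delayed_mix(history, R, delay))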
1.5 Numerical Examples
In this section, the proposed algorithms are tested on a distributed logistic regression problem:

    min_{x,v} f(x, v) = Σ_{i=1}^{n} fi(x, v),

where x ∈ R^p and v ∈ R are the optimization variables for learning the separating hyperplane. Here, the local cost function fi is given by

    fi(x, v) = (ω/2)(||x||2^2 + v^2) + Σ_{j=1}^{mi} ln(1 + exp(−(c_ij^T x + v)b_ij)),
where each node i ∈ {1, . . . , n} privately knows mi training examples; (c_ij, b_ij) ∈ R^p × {−1, +1}, where c_ij is the p-dimensional feature vector of the j-th training sample at the i-th node, drawn from a Gaussian distribution with zero mean, and b_ij is the corresponding label following a Bernoulli distribution. In terms of parameter design, we choose n = 10, mi = 10 for all i, and p = 2. The network topology, a directed and strongly connected network, is depicted in Fig. 1.1. In addition, we utilize a simple uniform weighting strategy, rij = 1/|N_i^in|, ∀i, to construct the row-stochastic weights.
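The local cost fi and its gradient used in this experiment can be coded directly. The sketch below follows the stated setup (Gaussian features, labels in {−1, +1}, mi = 10, p = 2); the regularization weight ω = 0.1 and the random seed are our own assumptions, since the excerpt does not specify them:

import numpy as np

# Local logistic-regression cost f_i and its gradient. The value omega = 0.1
# is an assumption; the text leaves it unspecified.
rng = np.random.default_rng(3)
p, m_i, omega = 2, 10, 0.1
C = rng.standard_normal((m_i, p))              # rows are the c_ij
b = rng.choice([-1.0, 1.0], size=m_i)          # labels b_ij

def f_i(x, v):
    margins = -(C @ x + v) * b
    return 0.5 * omega * (x @ x + v * v) + np.log1p(np.exp(margins)).sum()

def grad_f_i(x, v):
    margins = -(C @ x + v) * b
    sig = 1.0 / (1.0 + np.exp(-margins))       # d/dm log(1 + e^m) = sigma(m)
    gx = omega * x - C.T @ (sig * b)
    gv = omega * v - np.dot(sig, b)
    return gx, gv

x, v = np.zeros(p), 0.0
print(f_i(x, v), grad_f_i(x, v))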
The simulation results are plotted in Figs. 1.2, 1.3, and 1.4. Figure 1.2 indicates that D-DNGT with momentum terms accelerates convergence in comparison with the applicable algorithms without momentum terms. Figure 1.3 shows that
Fig. 1.2 Performance comparisons between D-DNGT and the methods without momentum terms (residual versus time steps)
D-DNGT with two momentum terms (heavy-ball momentum [48] and Nesterov momentum [50, 54, 55]) improves the convergence compared with the applicable algorithms with a single momentum term. We note that although the eigenvector learning embedded in D-DNGT may slow down convergence, D-DNGT is more suitable for broadcast-based protocols than the other optimization methods (AB, ADD-OPT/Push-DIGing, ABm, and ABN) because it only requires row-stochastic weights. Finally, it is concluded from Fig. 1.4 that the algorithms with momentum terms successfully accelerate convergence regardless of whether the interaction links undergo interaction delays or the weight matrices are only column-stochastic or both row- and column-stochastic.
Fig. 1.3 Performance comparisons between D-DNGT and the methods with momentum terms (residual versus time steps)
Fig. 1.4 Performance comparisons between the extensions of D-DNGT and their closely related methods (residual versus time steps)
1.6 Conclusion
References
1. S. Yang, Q. Liu, J. Wang, Distributed optimization based on a multiagent system in the presence
of communication delays. IEEE Trans. Syst., Man, Cybern., Syst. 47(5), 717–728 (2017)
2. J. Chen, A. Sayed, Diffusion adaptation strategies for distributed optimization and learning
over networks. IEEE Trans. Signal Process. 60(8), 4289–4305 (2012)
3. K. Li, Q. Liu, S. Yang, J. Cao, G. Lu, Cooperative optimization of dual multiagent system for
optimal resource allocation. IEEE Trans. Syst., Man, Cybern., Syst. 50(11), 4676–4687 (2020)
4. S. Wang, C. Li, Distributed robust optimization in networked system. IEEE Trans. Cybern.
47(8), 2321–2333 (2017)
5. X. Dong, G. Hu, Time-varying formation tracking for linear multi-agent systems with multiple
leaders. IEEE Trans. Autom. Control 62(7), 3658–3664 (2017)
6. X. Dong, G. Hu, Time-varying formation control for general linear multi-agent systems with
switching directed topologies. Automatica 73, 47–55 (2016)
7. C. Shi, G. Yang, Augmented Lagrange algorithms for distributed optimization over multi-agent
networks via edge-based method. Automatica 94, 55–62 (2018)
8. S. Zhu, C. Chen, W. Li, B. Yang, X. Guan, Distributed state estimation of sensor-network
systems subject to Markovian channel switching with application to a chemical process. IEEE
Trans. Syst. Man Cybern. Syst. 48(6), 864–874 (2018)
9. D. Jakovetic, A unification and generalization of exact distributed first order methods. IEEE
Trans. Signal Inform. Process. Over Netw. 5(1), 31–46 (2019)
10. Z. Wu, Z. Li, Z. Ding, Z. Li, Distributed continuous-time optimization with scalable adaptive
event-based mechanisms. IEEE Trans. Syst. Man Cybern. Syst. 50(9), 3252–3257 (2020)
11. K. Scaman, F. Bach, S. Bubeck, Y. Lee, L. Massoulie, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (PMLR), vol. 70 (2017), pp. 3027–3036
12. X. He, T. Huang, J. Yu, C. Li, Y. Zhang, A continuous-time algorithm for distributed
optimization based on multiagent networks. IEEE Trans. Syst. Man Cybern. Syst. 49(12),
2700–2709 (2019)
13. Y. Zhu, W. Ren, W. Yu, G. Wen, Distributed resource allocation over directed graphs via
continuous-time algorithms. IEEE Trans. Syst. Man Cybern. Syst. 51(2), 1097–1106 (2021)
14. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE
Trans. Autom. Control 54(1), 48–61 (2009)
15. A. Nedic, A. Ozdaglar, P. Parrilo, Constrained consensus and optimization in multi-agent
networks. IEEE Trans. Autom. Control 55(4), 922–938 (2010)
16. H. Li, S. Liu, Y. Soh, L. Xie, Event-triggered communication and data rate constraint for
distributed optimization of multiagent systems. IEEE Trans. Syst. Man Cybern. Syst. 48(11),
1908–1919 (2018)
17. D. Yuan, Y. Hong, D. Ho, G. Jiang, Optimal distributed stochastic mirror descent for strongly
convex optimization. Automatica 90, 196–203 (2018)
18. I. Matei, J. Baras, Performance evaluation of the consensus-based distributed subgradient
method under random communication topologies. IEEE J. Sel. Topics Signal Process. 5(4),
754–771 (2011)
19. C. Xi, U. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans.
Autom. Control 62(8), 3986–3992 (2017)
20. D. Yuan, D. Ho, G. Jiang, An adaptive primal-dual subgradient algorithm for online distributed
constrained optimization. IEEE Trans. Cybern. 48(11), 3045–3055 (2018)
21. C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning. IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
22. J. Zhu, C. Xu, J. Guan, D. Wu, Differentially private distributed online algorithms over time-
varying directed networks. IEEE Trans. Signal Inform. Process. Over Netw. 4(1), 4–17 (2018)
23. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in
decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014)
24. J. Mota, J. Xavier, P. Aguiar, M. Puschel, D-ADMM: a communication-efficient distributed
algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)
25. H. Terelius, U. Topcu, R. Murray, Decentralized multi-agent optimization via dual decomposi-
tion. IFAC Proc. Volumes 44(1), 11245–11251 (2011)
26. E. Wei, A. Ozdaglar, On the O(1/k) convergence of asynchronous distributed alternating
direction method of multipliers, in 2013 IEEE Global Conference on Signal and Information
Processing (2013). https://doi.org/10.1109/GlobalSIP.2013.6736937
27. M. Hong, T. Chang, Stochastic proximal gradient consensus over random networks. IEEE
Trans. Signal Process. 65(11), 2933–2948 (2017)
28. H. Xiao, Y. Yu, S. Devadas, On privacy-preserving decentralized optimization through
alternating direction method of multipliers (2019). Preprint arXiv:1902.06101
29. A. Chen, A. Ozdaglar, A fast distributed proximal-gradient method, in 2012 50th Annual
Allerton Conference on Communication, Control, and Computing (Allerton) (2012). https://
doi.org/10.1109/Allerton.2012.6483273
30. X. Dong, Y. Hua, Y. Zhou, Z. Ren, Y. Zhong, Theory and experiment on formation-containment
control of multiple multirotor unmanned aerial vehicle systems. IEEE Trans. Autom. Sci. Eng.
16(1), 229–240 (2019)
31. W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
32. G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans.
Control Netw. Syst. 5(3), 1245–1260 (2018)
33. A. Nedic, A. Olshevsky, W. Shi, C. Uribe, Geometrically convergent distributed optimization
with uncoordinated step-sizes, in 2017 American Control Conference (ACC) (2017). https://
doi.org/10.23919/ACC.2017.7963560
34. M. Maros, J. Jalden, A geometrically converging dual method for distributed optimization over
time-varying graphs. IEEE Trans. Autom. Control 66(6), 2465–2479 (2021)
35. J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over
stochastic networks. IEEE Trans. Autom. Control 63(2), 434–448 (2018)
36. S. Pu, A. Nedic, Distributed stochastic gradient tracking methods. Math. Program. 187(1),
409–457 (2021)
37. Y. Tian, Y. Sun, B. Du, G. Scutari, ASY-SONATA: Achieving geometric convergence for
distributed asynchronous optimization, in 2018 56th Annual Allerton Conference on Communi-
cation, Control, and Computing (Allerton) (2018). https://doi.org/10.1109/ALLERTON.2018.
8636055
38. M. Maros, J. Jalden, Panda: A dual linearly converging method for distributed optimization
over time-varying undirected graphs, in 2018 IEEE Conference on Decision and Control
(CDC) (2018). https://doi.org/10.1109/CDC.2018.8619626
39. A. Nedic, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE
Trans. Autom. Control 60(3), 601–615 (2015)
40. C. Xi, U. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans.
Autom. Control 62(10), 4980–4993 (2017)
41. C. Xi, R. Xin, U. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans.
Autom. Control 63(5), 1329–1339 (2018)
42. A. Nedic, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
43. Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-
varying directed graphs and uncoordinated step-sizes. Inf. Sci. 422, 516–530 (2018)
44. Q. Lü, H. Li, Z. Wang, Q. Han, W. Ge, Performing linear convergence for distributed
constrained optimisation over time-varying directed unbalanced networks. IET Control Theory
Appl. 13(17), 2800–2810 (2019)
45. R. Xin, U. Khan, A linear algorithm for optimization over directed graphs with geometric
convergence. IEEE Control Syst. Lett. 2(3), 315–320 (2018)
46. S. Pu, W. Shi, J. Xu, A. Nedic, Push-pull gradient methods for distributed optimization in
networks. IEEE Trans. Autom. Control 66(1), 1–16 (2021)
47. F. Saadatniaki, R. Xin, U. Khan, Decentralized optimization over time-varying directed graphs
with row and column-stochastic matrices. IEEE Trans. Autom. Control 65(11), 4769–4780
(2020)
48. R. Xin, U. Khan, Distributed heavy-ball: a generalization and acceleration of first-order
methods with gradient tracking. IEEE Trans. Autom. Control 65(6), 2627–2633 (2020)
49. C. Zhao, X. Duan, Y. Shi, Analysis of consensus-based economic dispatch algorithm under
time delays. IEEE Trans. Syst. Man Cybern. Syst. 50(8), 2978–2988 (2020)
50. R. Xin, D. Jakovetic, U. Khan, Distributed Nesterov gradient methods over arbitrary graphs.
IEEE Signal Process. Lett. 26(8), 1247–1251 (2019)
51. C. Xi, V. Mai, E. Abed, U. Khan, Linear convergence in optimization over directed graphs with
row-stochastic matrices. IEEE Trans. Autom. Control 63(10), 3558–3565 (2018)
52. R. Xin, C. Xi, U. Khan, FROST-Fast row-stochastic optimization with uncoordinated step-
sizes. EURASIP J. Advanc. Signal Process. 2019(1), 1–14 (2019)
53. H. Li, Q. Lü, T. Huang, Convergence analysis of a distributed optimization algorithm with a
general unbalanced directed communication network. IEEE Trans. Netw. Sci. Eng. 6(3), 237–
248 (2019)
54. G. Qu, N. Li, Accelerated distributed Nesterov gradient descent. IEEE Trans. Autom. Control
65(6), 2566–2581 (2020)
55. D. Jakovetic, J. Xavier, J. Moura, Fast distributed gradient methods. IEEE Trans. Autom.
Control 59(5), 1131–1146 (2014)
56. H. Li, Q. Lü, X. Liao, T. Huang, Accelerated convergence algorithm for distributed constrained
optimization under time-varying general directed graphs. IEEE Trans. Syst. Man Cybern. Syst.
50(7), 2612–2622 (2020)
57. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Springer Science
& Business Media, Berlin, 2013)
58. H. Wang, X. Liao, T. Huang, C. Li, Cooperative distributed optimization in multiagent
networks with delays. IEEE Trans. Syst. Man Cybern. Syst. 45(2), 363–369 (2015)
59. A. Defazio, On the curved geometry of accelerated optimization (2018). Preprint
arXiv:1812.04634
60. R. Horn, C. Johnson, Matrix Analysis (Cambridge University Press, New York, 2013)
61. T. Yang, Q. Lin, Z. Li, Unified convergence analysis of stochastic momentum methods for
convex and non-convex optimization (2016). Preprint arXiv:1604.03257
Chapter 2
Projection Algorithms for Distributed
Stochastic Optimization
Abstract This chapter focuses on introducing and solving the problem of composite constrained convex optimization with a sum of smooth convex functions and non-smooth regularization terms (ℓ1 norm) subject to locally general constraints. Each of the smooth objective functions is further regarded as the average of several constituent functions, a structure motivated by modern large-scale information processing problems in machine learning (where the samples of a training dataset are randomly distributed across multiple computing nodes). We present a novel computation-efficient distributed stochastic gradient algorithm that makes use of both the variance-reduction methodology and the distributed stochastic gradient projection method with constant step-size to solve the problem in a distributed manner. Theoretical study shows that the proposed algorithm can find the exact optimal solution in expectation when each constituent (smooth) function is strongly convex, provided that the constant step-size is less than an explicitly calculated upper bound. Compared with existing distributed methods, the proposed technique not only has a low computation cost in terms of the overall number of local gradient evaluations but is also suited to addressing general constrained optimization problems. Finally, numerical evidence is provided to show the appealing performance of the proposed algorithm.
2.1 Introduction
Given the limited computational and storage capacity of nodes, it has become unrealistic to handle large-scale tasks centrally on a single computing node [1]. Distributed optimization is a classic topic [2–9], yet it has recently aroused considerable interest in many emerging applications (large-scale tasks), such as parameter estimation [3, 4], network attacks [5], machine learning [6], IoT networks [7], and others. At least two facts [8] have contributed to this resurgence of interest: (a) recent developments in high-performance computing platforms
gradient tracking [38]. However, in practice, these methods converge slowly due to the large variance coming from the stochastic gradients and the adoption of a carefully tuned sequence of decaying step-sizes. To address this deficiency, various variance-reduction techniques have been leveraged in developing stochastic gradient descent methods, yielding representative centralized methods such as S2GD [39], SAG [40], SAGA [41], SVRG [42, 43], and SARAH [44]. The idea of the variance-reduction technique is to reduce the variance of the stochastic gradient and thereby substantially improve convergence.
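To make the variance-reduction idea concrete, an SVRG-style estimator [42] replaces the plain stochastic gradient ∇fj(x) by ∇fj(x) − ∇fj(x̃) + ∇f(x̃), where x̃ is a periodically refreshed snapshot; the estimator stays unbiased while its variance vanishes as x and x̃ approach the optimum. A minimal centralized sketch on a toy least-squares problem (data, step-size, and epoch counts are our own choices):

import numpy as np

# SVRG-style variance-reduced estimator on a toy least-squares problem
# f(x) = (1/m) sum_j 0.5 * (a_j^T x - y_j)^2. At a snapshot point x_tilde,
# the full gradient is stored; each inner step samples one index j.
rng = np.random.default_rng(4)
m, p = 50, 3
A = rng.standard_normal((m, p))
y = rng.standard_normal(m)
grad_j = lambda x, j: (A[j] @ x - y[j]) * A[j]   # constituent gradient
full_grad = lambda x: A.T @ (A @ x - y) / m

x = np.zeros(p)
step = 0.05
for epoch in range(30):
    x_tilde = x.copy()
    mu = full_grad(x_tilde)                      # snapshot full gradient
    for _ in range(m):
        j = rng.integers(m)
        # Unbiased estimator whose variance shrinks near the optimum:
        g = grad_j(x, j) - grad_j(x_tilde, j) + mu
        x = x - step * g
print(np.linalg.norm(full_grad(x)))              # close to zero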
Motivated by the centralized variance-reduced methods, distributed variance-reduced methods have been extensively studied and outperform their centralized counterparts in handling large-scale tasks. Of relevance to our work are the recent developments in [45] and [46]. The distributed stochastic averaging gradient method (DSA) proposed in [45] incorporates the variance-reduction technique of SAGA [41] into the algorithm design ideas of EXTRA [14]; it not only obtains the expected linear convergence for distributed stochastic optimization for the first time but also performs better than the previous works [14, 35] in dealing with machine learning problems. Similar works include DSBA [47], diffusion-AVRG [48], ADFS [49], SAL-Edge [50], GT-SAGA/GT-SVRG [2, 51, 52], and Network-DANE [8], utilizing various strategies. However, to the best of the authors' knowledge, no existing methods focus on solving general composite constrained convex optimization problems. Recently, the distributed neurodynamic-based consensus algorithm proposed in [46] was developed to solve the problem of a sum of smooth convex functions and ℓ1 norms subject to locally general constraints (linear equality, convex inequality, and bounded constraints), which generalizes the work in [53] to the case where the objective function and the constraint conditions are more general. In particular, based on Lyapunov stability theory, the method in [46] can achieve consensus at the global optimal solution with constant step-size. The work in [46] is insightful, but unfortunately, the algorithm does not take into account the high computational cost of evaluating the full gradient of the local objective function at each iteration.
In this chapter, we are concerned with solving the composite constrained convex
optimization problem whose objective is a sum of smooth convex functions and
non-smooth regularization terms (the $\ell_1$ norm), where each smooth objective
function is further composed of the average of several constituent functions and
the locally general constraints are constituted by linear equality, convex inequality,
and bounded constraints.
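Although the formal problem statement is given later in the chapter, a plausible
formalization consistent with the above description reads as follows, where the
symbols $n$, $d$, $q_i$, $\lambda$, $A_i$, $b_i$, $g_i$, and $\Omega_i$ are our
own illustrative notation:

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \lambda \|x\|_1, \qquad f_i(x) = \frac{1}{q_i}\sum_{j=1}^{q_i} f_{i,j}(x),$$

$$\text{s.t.} \quad A_i x = b_i, \quad g_i(x) \leq \mathbf{0}, \quad x \in \Omega_i, \quad i = 1, \ldots, n,$$

where each constituent function $f_{i,j}$ is smooth and only node $i$ has access
to $f_i$ and its local constraints.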
To this end, we propose a computation-efficient distributed stochastic gradient
algorithm that is adaptable and facilitates real-world applications. In general, the
novelties of the present work are summarized as follows:
(i) We propose and analyze a novel computation-efficient distributed stochastic
gradient algorithm by leveraging the variance-reduction technique and the
distributed stochastic gradient projection method with constant step-size. In
contrast with most existing distributed methods [29–33, 45, 47–51, 53], the
2.2 Preliminaries
2.2.1 Notation
If not particularly stated, the vectors mentioned in this chapter are column vectors.
Let $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^{m \times n}$ denote the set of
real numbers, $n$-dimensional real column vectors, and $m \times n$ real matrices,
respectively. The $n \times n$ identity matrix is denoted as $I_n$, and the two
column vectors of all ones and all zeros (of appropriate dimensions) are denoted as
$\mathbf{1}$ and $\mathbf{0}$, respectively. A quantity (possibly a vector) of node
$i$ is indexed by a subscript $i$; e.g., let $x_i^k$ be the estimate of node $i$ at
time $k$. We use $\chi_{\max}(A)$ and $\chi_{\min}(A)$ to represent the largest and
the smallest eigenvalues of a real symmetric matrix $A$, respectively. We let the
symbols $x^T$ and $A^T$ denote the transposes of a vector $x$ and a matrix $A$.
The Euclidean norm (of vectors) and the $\ell_1$ norm are denoted as $\|\cdot\|$
and $\|\cdot\|_1$, respectively. We let $\|x\|_A = \sqrt{x^T A x}$, where
$A \in \mathbb{R}^{n \times n}$ is a positive semi-definite matrix. The Kronecker
product and the Cartesian product are represented by the symbols $\otimes$ and
$\prod$, respectively. Given a random estimator $x$, the probability and expectation
are represented by $\mathbb{P}[x]$ and $\mathbb{E}[x]$, respectively. We utilize
$Z = \mathrm{diag}\{x\}$ to represent the diagonal matrix of a vector
$x = [x_1, x_2, \ldots, x_n]^T$, which satisfies $z_{ii} = x_i$ for all
$i = 1, \ldots, n$ and $z_{ij} = 0$ for all $i \neq j$. Denote
$(\cdot)^+ = \max\{0, \cdot\}$.
For a set $\Omega \subseteq \mathbb{R}^d$, the projection of a vector
$x \in \mathbb{R}^d$ onto $\Omega$ is denoted by $P_\Omega(x)$, i.e.,
$P_\Omega(x) = \arg\min_{y \in \Omega} \|y - x\|^2$. Notice that this projection
always exists and is unique if $\Omega$ is nonempty, closed, and convex [53].
Moreover, let $\Omega$ be a nonempty closed convex set; then the projection
operator $P_\Omega(\cdot)$ has the following properties: (a)
$(y - P_\Omega(y))^T (P_\Omega(y) - x) \geq 0$ for any $x \in \Omega$ and
$y \in \mathbb{R}^d$; and (b) $\|P_\Omega(y) - P_\Omega(x)\| \leq \|y - x\|$ for
any $x, y \in \mathbb{R}^d$.
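As a concrete illustration, the following Python sketch computes the Euclidean
projection onto a box constraint, where it reduces to coordinate-wise clipping, and
numerically checks the nonexpansiveness property (b); the constraint set, names,
and step-size are our own illustrative choices, not objects defined in this chapter.

import numpy as np

def project_box(y, lo, hi):
    # Euclidean projection onto the box {x : lo <= x <= hi}; for a box
    # the projection reduces to coordinate-wise clipping.
    return np.clip(y, lo, hi)

def projected_step(x, grad, step, lo, hi):
    # One projected (stochastic) gradient step: descend, then project,
    # so the iterate always remains feasible.
    return project_box(x - step * grad, lo, hi)

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
# Property (b): the projection operator is nonexpansive.
assert (np.linalg.norm(project_box(y, -1.0, 1.0) - project_box(x, -1.0, 1.0))
        <= np.linalg.norm(y - x) + 1e-12)

Nonexpansiveness is what keeps projected gradient iterations stable: projecting
after each step never increases the distance between two trajectories, while
feasibility is preserved at every iterate.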