
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 31, NO. 8, AUGUST 2020

Accelerating Federated Learning via Momentum Gradient Descent

Wei Liu, Li Chen, Yunfei Chen, Senior Member, IEEE, and Wenyi Zhang, Senior Member, IEEE

Abstract—Federated learning (FL) provides a communication-efficient approach to solving machine learning problems concerning distributed data, without sending raw data to a central server. However, existing works on FL only utilize first-order gradient descent (GD) and do not take preceding iterations into account in the gradient update, which can potentially accelerate convergence. In this article, we consider a momentum term which relates to the last iteration. The proposed momentum federated learning (MFL) uses momentum gradient descent (MGD) in the local update step of the FL system. We establish global convergence properties of MFL and derive an upper bound on the MFL convergence rate. Comparing the upper bounds on the MFL and FL convergence rates, we provide conditions under which MFL accelerates the convergence. For different machine learning models, the convergence performance of MFL is evaluated based on experiments with the MNIST and CIFAR-10 datasets. Simulation results confirm that MFL is globally convergent and further reveal significant convergence improvement over FL.

Index Terms—Accelerating convergence, distributed machine learning, federated learning, momentum gradient descent

1 INTRODUCTION
ECENTLY, data-intensive machine learning has been In order to overcome these challenges, cutting down
R applied in various fields, such as autonomous driving
[1], speech recognition [2], image classification [3] and disease
transmission distance and reducing the amount of uploaded
data from edge devices to the network center are two effec-
detection [4] since this technique provides beneficial solu- tive ways. To reduce transmission distance, mobile edge
tions to extract the useful information hidden in data. It now computing (MEC) in [8] is an emerging technique where the
becomes a common tendency that machine-learning systems computation and storage resources are pushed to proximity
are deployed in architectures that include tens of thousands of edge devices where the local task and data offloaded by
of processors [5]. Great amount of data is generated by vari- users can be processed. In this way, the distance of large-
ous parallel and distributed physical objects. scale data transmission is greatly shortened and the latency
Collecting data from edge devices to the central server has a significant reduction [9]. Using machine learning for
is necessary for distributed machine learning scenarios. the prediction of uploaded task execution time achieves a
In the process of distributed data collection, there exist shorter processing delay [10], and dynamic resource sched-
significant challenges such as energy efficiency problems uling was studied to optimize resources allocation of MEC
and system latency problems. The energy efficiency of system in [11]. To reduce the uploaded data size, model-
distributed data collection was considered in wireless sen- based compression approaches, where raw data are com-
sor networks (WSNs) due to limited battery capacity of pressed and represented by well-established model parame-
sensors [6]; In fifth-generation (5G) cellular networks, a ters, demonstrate significant compression performance [12].
round-trip delay from terminals through the network Lossy compression is also an effective strategy to decrease
back to terminals demands much lower latencies, poten- the uploaded data size [13], [14]. Compressed sensing, where
tially down to 1 ms, to facilitate human tactile to visual the sparse data of the edge can be efficiently sampled and
feedback control [7]. Thus, the challenges of data aggrega- reconstructed with transmitting a much smaller data size,
tion in distributed system urgently require communica- was applied to data acquisition of Internet of Things (IoT)
tion-efficient solutions. network [15]. All the aforementioned works need to collect
raw data from individual device.
To avoid collecting raw data for machine learning in dis-
 W. Liu, L. Chen, and W. Zhang are with the Department of Electronic tributed scenarios, a novel approach named Federated Learn-
Engineering and Information Science, University of Science and Technology ing (FL) has emerged as a promising solution [16]. The work
of China, Hefei, Anhui 230052, China.
E-mail: [email protected], {chenli87, wenyizha}@ustc.edu.cn. in [17] provided a fundamental architecture design of FL.
 Y. Chen is with the School of Engineering, University of Warwick, CV4 Considering the growing computation capability of edge
7AL Coventry, United Kingdom. E-mail: [email protected]. nodes (devices), FL decentralizes the centralized machine
Manuscript received 6 Oct. 2019; revised 14 Jan. 2020; accepted 15 Feb. 2020. learning task and assigns the decomposed computing tasks
Date of publication 19 Feb. 2020; date of current version 23 Mar. 2020. to the edge nodes where the raw data are stored and learned
(Corresponding author: Li Chen.)
Recommended for acceptance by J. Zhai. at the edge nodes. After a fixed iteration interval, each edge
Digital Object Identifier no. 10.1109/TPDS.2020.2975189 node transmits its learned model parameters to the central

server. This strategy can substantially decrease the consumption of communication resources and improve communication efficiency. To further improve the energy efficiency of FL, an adaptive FL approach was proposed in [17], where the aggregation frequency can be adjusted adaptively to minimize the loss function under a fixed resource budget. To reduce the uplink communication costs, the work in [18] proposed structured and sketched update methods, and compression techniques were adopted to reduce the parameter dimension. In [19], gradient selection and adaptive adjustment of the learning rate were used for efficient compression. For secure aggregation of high-dimensional data, the work in [20] provided a communication-efficient approach, where the server can compute the sum of model parameters from edge nodes without knowing the contribution of each individual node. In [21], under unbalanced resource distribution at the network edge, FL with client (edge node) selection was proposed for actively managing the client aggregation according to their resource conditions. In [22], non-i.i.d. data distribution was studied.

However, existing FL solutions generally use gradient descent (GD) for loss function minimization. GD is a one-step method where the next iteration depends only on the current gradient. The convergence rate of GD can be improved by accounting for more preceding iterations [23]. Thus, by introducing the last iteration, which is named the momentum term, momentum gradient descent (MGD) can accelerate the convergence [24], [25]. Due to the improved convergence of gradient methods brought by momentum, there are several works which apply stochastic gradient descent (SGD) with momentum in the field of distributed machine learning. In [26], momentum is applied to the update at each aggregation round for improving both optimization and generalization. In [27], the linear convergence of distributed SGD with momentum is proven. All these works with momentum are generally based on stochastic GD. Compared with SGD, deterministic gradient descent (DGD) can realize more precise training results with improved generalization and fast convergence under convex optimization [28].

Motivated by the above observations, we propose a new federated learning design, Momentum Federated Learning (MFL), in this paper. In the proposed MFL design, we introduce a momentum term in the FL local update and leverage MGD (in our paper, MGD means DGD with momentum) to perform local iterations. Furthermore, the global convergence of the proposed MFL is proven, and we derive the theoretical convergence bound of MFL. Compared with FL [17], the proposed MFL has an accelerated convergence rate under certain conditions. On the basis of the MNIST and CIFAR-10 datasets, we numerically study the proposed MFL and obtain its loss function curve. The experimental results show that MFL converges faster than FL for different machine learning models.

The contributions of this paper are summarized as follows:

- MFL design: According to the characteristic that MGD facilitates machine learning convergence in the centralized situation, we propose the MFL design where MGD is adopted to optimize the loss function in local updates. The proposed MFL can improve the convergence rate of the distributed learning problem significantly.
- Convergence analysis for MFL: We prove that the proposed MFL is globally convergent on convex optimization problems, and derive its theoretical upper bound on convergence rate. We conduct a comparative analysis of convergence performance between the proposed MFL and FL. It is proven that MFL improves the convergence rate of FL under certain conditions.
- Evaluation based on MNIST and CIFAR-10 datasets: We evaluate the proposed MFL's convergence performance via simulation based on the MNIST and CIFAR-10 datasets with different machine learning models such as support vector machine (SVM), linear regression, logistic regression and convolutional neural network (CNN). Then an experimental comparison is made between FL and the proposed MFL. The simulation results show that MFL is convergent and confirm that MFL provides a significant improvement of convergence rate.

The remaining part of this paper is organized as follows. We introduce the system model to solve the learning problem in distributed scenarios in Section 2 and subsequently elaborate the existing solutions in Section 3. In Section 4, we describe the design of MFL in detail. Then in Sections 5 and 6, we present the convergence analysis of MFL and the comparison between FL and MFL, respectively. Finally, we show experimental results in Section 7 and draw a conclusion in Section 8.

2 SYSTEM MODEL

Fig. 1. The simplified structure of the learning system for distributed user data.

In this paper, considering a simplified system model, we discuss the distributed network as shown in Fig. 1. This model has $N$ edge nodes and a central server. These $N$ edge nodes, which have limited communication and computation resources, contain local datasets $D_1, D_2, \ldots, D_i, \ldots, D_N$, respectively. So the global dataset is $D \triangleq D_1 \cup D_2 \cup \cdots \cup D_N$. Assume that $D_i \cap D_j = \emptyset$ for $i \neq j$. We define the number of samples in node $i$ as $|D_i|$, where $|\cdot|$ denotes the size of the set. The total number of all nodes' samples is $|D|$, and $|D| = \sum_{i=1}^{N} |D_i|$. The central server connects all the edge nodes for information transmission.

We define the global loss function at the central server as $F(w)$, where $w$ denotes the model parameter. Different machine learning models correspond to different $F(\cdot)$ and $w$. We use $w^*$ to represent the optimal parameter minimizing the value of $F(w)$. Based on the presented model,

the learning problem is to minimize $F(w)$, and it can be formulated as follows:

$$w^* \triangleq \arg\min_{w} F(w). \quad (1)$$

Because of the complexity of the machine learning model and the original dataset, finding a closed-form solution of the above optimization problem is usually intractable. So algorithms based on gradient iterations are used to solve (1). If raw user data are collected and stored in the central server, we can use centralized learning solutions to (1), while if raw user data are distributed over the edge nodes, FL and the proposed MFL can be applied to solve this learning problem.

Under the situation where FL or MFL solutions are used, the local loss function of node $i$ is denoted by $F_i(w)$, which is defined merely on $D_i$. Then we define the global loss function $F(w)$ on $D$ as follows:

Definition 1 (Global loss function). Given the loss function $F_i(w)$ of edge node $i$, we define the global loss function on all the distributed datasets as

$$F(w) \triangleq \frac{\sum_{i=1}^{N} |D_i| F_i(w)}{|D|}. \quad (2)$$

3 EXISTING SOLUTIONS

In this section, we introduce two existing solutions to the learning problem expressed by (1). These two solutions are the centralized learning solution and the FL solution, respectively.

3.1 Centralized Learning Solution

In centralized machine learning, the model is embedded in the central server and each edge node needs to send its raw data to the central server. In this situation, edge nodes consume communication resources for data transmission, but without incurring computation resource consumption.

After the central server has collected all datasets from the edge nodes, a usual way to solve the learning problem expressed by (1) is GD as a basic gradient method. Further, MGD is an improved gradient method which adds a momentum term to speed up the learning process [24].

3.1.1 GD

The update rule for GD is as follows:

$$w(t) = w(t-1) - \eta \nabla F(w(t-1)). \quad (3)$$

In (3), $t$ denotes the iteration index and $\eta > 0$ is the learning step size. The model parameter $w$ is updated along the direction of the negative gradient. Using the above update rule, GD can solve the learning problem with continuous iterations.

3.1.2 MGD

As an improvement of GD, MGD introduces the momentum term, and we present its update rules as follows:

$$d(t) = \gamma d(t-1) + \nabla F(w(t-1)) \quad (4)$$

$$w(t) = w(t-1) - \eta d(t), \quad (5)$$

where $d(t)$ is the momentum term which has the same dimension as $w(t)$, $\gamma$ is the momentum attenuation factor, $\eta$ is the learning step size and $t$ is the iteration index. By iterations of (4) and (5) with $t$, $F(w)$ can potentially converge to the minimum faster compared with GD. The convergence range of MGD is $-1 < \gamma < 1$ with a bounded $\eta$, and if $0 < \gamma < 1$, MGD has a faster convergence rate than GD under a small $\eta$ typically used in simulations [29, Result 3].

3.2 FL Solution

In contrast with centralized learning solutions, FL avoids collecting and uploading the distributed data because of the limited communication resources at edge nodes and the privacy protection of local data. It decouples the machine learning task from the central server to each edge node to avoid storing user data in the server and to reduce the communication resource consumption. All of the edge nodes make up a federation in coordination with the central server.

The FL design and convergence analysis are presented in [17], where the FL network is studied thoroughly. In an FL system, each edge node uses the same machine learning model. We use $\tau$ to denote the global aggregation frequency, i.e., the update interval. Each node $i$ has its local model parameter $\tilde{w}_i(t)$, where the iteration index is denoted by $t = 0, 1, 2, \ldots$ (in this paper, an iteration means a local update). We use $[k]$ to denote the aggregation interval $[(k-1)\tau, k\tau]$ for $k = 1, 2, 3, \ldots$. At $t = 0$, the local model parameters of all nodes are initialized to the same value. When $t > 0$, $\tilde{w}_i(t)$ is updated locally based on GD, which is the local update. After $\tau$ local updates, global aggregation is performed and all edge nodes send the updated model parameters to the centralized server synchronously.

The learning process of FL is described as follows.

3.2.1 Local Update

When $t \in [k]$, local updates are performed in each edge node by

$$\tilde{w}_i(t) = \tilde{w}_i(t-1) - \eta \nabla F_i(\tilde{w}_i(t-1)),$$

which follows GD exactly.

3.2.2 Global Aggregation

When $t = k\tau$, global aggregation is performed. Each node sends $\tilde{w}_i(k\tau)$ to the central server synchronously. The central server takes a weighted average of the received parameters from the $N$ nodes to obtain the globally updated parameter $w(k\tau)$ by

$$w(k\tau) = \frac{\sum_{i=1}^{N} |D_i| \tilde{w}_i(k\tau)}{|D|}.$$

Then $w(k\tau)$ is sent back to all edge nodes as their new parameters, and the edge nodes perform local updates for the next iteration interval.

In [17, Lemma 2], the FL solution has been proven to be globally convergent for convex optimization problems and to exhibit good convergence performance. So FL is an effective solution to the distributed learning problem presented in (1).

TABLE 1
MFL Notation Summary

$T$; $K$; $N$: number of total local iterations; number of global aggregations / number of intervals; number of edge nodes
$t$; $k$; $\tau$; $[k]$: iteration index; interval index; aggregation frequency with $\tau = T/K$; the interval $[(k-1)\tau, k\tau]$
$w^*$; $w^f$: globally optimal parameter of $F(\cdot)$; the optimal parameter that MFL can obtain in Algorithm 1
$\eta$; $\beta$; $\rho$; $\gamma$: the learning step size of MGD or GD; the $\beta$-smoothness parameter of $F_i(\cdot)$; the $\rho$-Lipschitz parameter of $F_i(\cdot)$; the momentum attenuation factor which decides the proportion of the momentum term in MGD
$D_i$; $D$: the local dataset of node $i$; the global dataset
$\delta_i$; $\delta$: the upper bound between $\nabla F(w)$ and $\nabla F_i(w)$; the average of $\delta_i$ over all nodes
$F_i(\cdot)$; $F(\cdot)$: the loss function of node $i$; the global loss function
$d(t)$; $w(t)$: the global momentum parameter at iteration round $t$; the global model parameter at iteration round $t$
$\tilde{d}_i(t)$; $\tilde{w}_i(t)$: the local momentum parameter of node $i$ at iteration round $t$; the local model parameter at iteration round $t$
$d_{[k]}(t)$; $w_{[k]}(t)$: the momentum parameter of centralized MGD at iteration round $t$ in $[k]$; the model parameter of centralized MGD at iteration round $t$ in $[k]$
$\theta_{[k]}(t)$; $\theta$; $p$: the angle between the vectors $\nabla F(w_{[k]}(t))$ and $d_{[k]}(t)$; $\theta$ is the maximum of $\theta_{[k]}(t)$ for $1 \le k \le K$ with $t \in [k]$; $p$ is the maximum ratio of $\|d_{[k]}(t)\|$ to $\|\nabla F(w_{[k]}(t))\|$ for $1 \le k \le K$ with $t \in [k]$

4 DESIGN OF MFL

In this section, we introduce the design of MFL to solve the distributed learning problem shown in (1). We first discuss the motivation of our work. Then we present the design of MFL in detail and the learning problem based on the federated system. The main notations of the MFL design and analysis are summarized in Table 1.

4.1 Motivation

Since MGD improves the convergence rate of GD [24], we want to apply MGD to the local update steps of FL and expect that the proposed MFL will accelerate the convergence rate for federated networks.

First, we illustrate the intuitive influence on the optimization problem after introducing the momentum term into gradient updating methods. Considering GD, the update reduction of the parameter is $\eta \nabla F(w(t-1))$, which is only proportional to the gradient of $w(t-1)$. The update direction of GD is always along the gradient descent direction, so an oscillating update path can be caused, as shown by the GD update path in Fig. 2. However, the update reduction of the parameter for MGD is a superposition of $\eta \nabla F(w(t-1))$ and $\gamma (w(t-2) - w(t-1))$, which is the momentum term. As shown by the MGD update path in Fig. 2, utilizing the momentum term can deflect the direction of the parameter update significantly toward the optimal descent direction and mitigate the oscillation caused by GD. In Fig. 2, GD has an oscillating update path and costs seven iterations to reach the optimal point while MGD only needs three iterations to do that, which demonstrates that mitigating the oscillation by MGD leads to a faster convergence rate.

Fig. 2. Comparison of MGD and GD.

Because the edge nodes of distributed networks are usually resource-constrained, solutions for convergence acceleration can attain higher resource utilization efficiency. Thus, motivated by the property that MGD improves the convergence rate, we use MGD to perform the local updates of FL, and this approach is named MFL.

In the following subsection, we design the MFL learning paradigm and propose the learning problem based on the MFL design.

4.2 MFL

In the MFL design, we use $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ to denote the momentum parameter and model parameter of node $i$, respectively. All edge nodes are set to embed the same machine learning model. So the local loss functions $F_i(w)$ have the same form for all nodes, and the dimensions of both the model parameters and the momentum parameters are consistent. The parameter setup of MFL is similar to that of FL. We use $t$ to denote the local iteration index for $t = 0, 1, \ldots$, $\tau$ to denote the aggregation frequency and $[k]$ to denote the interval $[(k-1)\tau, k\tau]$, where $k$ denotes the interval index for $k = 1, 2, \ldots$. At $t = 0$, the momentum parameters and the model parameters of all nodes are initialized to the same values, respectively. When $t \in [k]$, $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ are updated based on MGD, called the local update steps. When $t = k\tau$, MFL performs the global aggregation steps where $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ are sent to the central server synchronously. Then in the central server, the global momentum parameter $d(t)$ and the global model parameter $w(t)$ are obtained by taking a weighted average of the received parameters, respectively, and are sent back to all edge nodes for the next interval.

The learning rules of MFL include the local update and the global aggregation steps. By continuous alternation of local update and global aggregation, MFL can perform its learning process to minimize the global loss function $F(w)$. We describe the MFL learning process as follows.

First of all, we set initial values for $\tilde{d}_i(0)$ and $\tilde{w}_i(0)$. Then

1) Local Update: When $t \in [k]$, the local update is performed at each edge node by

$$\tilde{d}_i(t) = \gamma \tilde{d}_i(t-1) + \nabla F_i(\tilde{w}_i(t-1)) \quad (6)$$

$$\tilde{w}_i(t) = \tilde{w}_i(t-1) - \eta \tilde{d}_i(t). \quad (7)$$

According to (6) and (7), node $i$ performs MGD to optimize the loss function $F_i(\cdot)$ defined on its own dataset.
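The per-node computation in (6)-(7) can be sketched in a few lines of NumPy. The gradient oracle grad_Fi for the local loss $F_i$ and the hyperparameter values are assumptions made for illustration only.

```python
import numpy as np

def mfl_local_update(w_i, d_i, grad_Fi, eta=0.002, gamma=0.5):
    """One MFL local update at edge node i."""
    # Eq. (6): d_i(t) = gamma * d_i(t-1) + grad F_i(w_i(t-1))
    d_i = gamma * d_i + grad_Fi(w_i)
    # Eq. (7): w_i(t) = w_i(t-1) - eta * d_i(t)
    w_i = w_i - eta * d_i
    return w_i, d_i

# Example: a node whose local loss is F_i(w) = 0.5 * ||w - c||^2 for some center c.
c = np.array([1.0, -1.0])
w_i, d_i = np.zeros(2), np.zeros(2)
for _ in range(4):  # tau = 4 local updates between two global aggregations
    w_i, d_i = mfl_local_update(w_i, d_i, grad_Fi=lambda w: w - c)
```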

2) Global Aggregation: When $t = k\tau$, node $i$ transmits $\tilde{w}_i(k\tau)$ and $\tilde{d}_i(k\tau)$ to the central server, which takes weighted averages of the received parameters from the $N$ nodes to obtain the global parameters $w(k\tau)$ and $d(k\tau)$, respectively. The aggregation rules are presented as follows:

$$d(t) = \frac{\sum_{i=1}^{N} |D_i| \tilde{d}_i(t)}{|D|} \quad (8)$$

$$w(t) = \frac{\sum_{i=1}^{N} |D_i| \tilde{w}_i(t)}{|D|}. \quad (9)$$

Then the central server sends $d(k\tau)$ and $w(k\tau)$ back to all edge nodes, where $\tilde{d}_i(k\tau) = d(k\tau)$ and $\tilde{w}_i(k\tau) = w(k\tau)$ are set to enable the local update in the next interval $[k+1]$. Note that only when $t = k\tau$ can the values of the global parameters $w(t)$ and $d(t)$ be observed, but we define $d(t)$ and $w(t)$ for all $t$ to facilitate the following analysis. A typical alternation is shown in Fig. 3, which illustrates the learning steps of MFL in intervals $[k]$ and $[k+1]$.

Fig. 3. Illustration of MFL local update and global aggregation steps from interval $[k]$ to $[k+1]$.

Algorithm 1. MFL. The dataset in each node has been set, and the machine learning model embedded in the edge nodes has been chosen. We have set appropriate model parameters $\eta$ and $\gamma$.
Input:
  The limited number of local updates in each node $T$
  A given aggregation frequency $\tau$
Output:
  The final global model weight vector $w^f$
1: Set the initial values of $w^f$, $\tilde{w}_i(0)$ and $\tilde{d}_i(0)$.
2: for $t = 1, 2, \ldots, T$ do
3:   Each node $i$ performs a local update in parallel according to (6) and (7).  // Local update
4:   if $t == k\tau$ where $k$ is a positive integer then
5:     Set $\tilde{d}_i(t) \leftarrow d(t)$ and $\tilde{w}_i(t) \leftarrow w(t)$ for all nodes, where $d(t)$ and $w(t)$ are obtained by (8) and (9), respectively.  // Global aggregation
       Update $w^f \leftarrow \arg\min_{w \in \{w^f, w(k\tau)\}} F(w)$
6:   end if
7: end for

The learning problem of MFL to attain the optimal model parameter is presented as (1). However, the edge nodes have limited computation resources with a finite number of local iterations. We assume that $T$ is the number of local updates and $K$ is the corresponding number of global aggregations. Thus, we have $t \le T$ and $k \le K$ with $T = K\tau$. Considering that $w(t)$ is unobservable for $t \neq k\tau$, we use $w^f$ to denote the achievable optimal model parameter defined on the resource-constrained MFL network. Hence, the learning problem is to obtain $w^f$ within $K$ global aggregations, i.e.,

$$w^f \triangleq \arg\min_{w \in \{w(k\tau):\ k = 1, 2, \ldots, K\}} F(w). \quad (10)$$

The optimization algorithm of MFL is summarized in Algorithm 1.
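The following is a self-contained Python sketch of Algorithm 1, assuming each node $i$ exposes a gradient oracle for its local loss $F_i$ and a dataset size $|D_i|$, and that a function evaluating the global loss $F$ is available for tracking $w^f$. The helper names and the toy quadratic losses in the usage example are assumptions for illustration only.

```python
import numpy as np

def mfl(local_grads, sizes, F_global, dim, T=1000, tau=4, eta=0.002, gamma=0.5):
    """Sketch of Algorithm 1: MFL with local MGD updates and weighted aggregation."""
    N, total = len(local_grads), float(sum(sizes))
    w = [np.zeros(dim) for _ in range(N)]      # local model parameters w_i(t)
    d = [np.zeros(dim) for _ in range(N)]      # local momentum parameters d_i(t)
    w_f = np.zeros(dim)
    best = F_global(w_f)
    for t in range(1, T + 1):
        for i in range(N):                     # local update, Eqs. (6)-(7)
            d[i] = gamma * d[i] + local_grads[i](w[i])
            w[i] = w[i] - eta * d[i]
        if t % tau == 0:                       # global aggregation, Eqs. (8)-(9)
            d_glob = sum(sizes[i] * d[i] for i in range(N)) / total
            w_glob = sum(sizes[i] * w[i] for i in range(N)) / total
            d = [d_glob.copy() for _ in range(N)]
            w = [w_glob.copy() for _ in range(N)]
            if F_global(w_glob) < best:        # keep the best aggregated model as w^f
                best, w_f = F_global(w_glob), w_glob.copy()
    return w_f

# Toy usage: two nodes with local losses F_i(w) = 0.5 * ||w - c_i||^2.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [lambda w, c=c: w - c for c in centers]
F = lambda w: sum(0.5 * np.sum((w - c) ** 2) for c in centers) / len(centers)
print(mfl(grads, sizes=[1, 1], F_global=F, dim=2))  # approaches the mean of the centers
```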

5 CONVERGENCE ANALYSIS

In this section, we first introduce some definitions and assumptions for the MFL convergence analysis. Then, based on these preliminaries, global convergence properties of MFL following Algorithm 1 are established and an upper bound on the MFL convergence rate is derived. MFL convergence performance with respect to the related parameters is also analyzed.

5.1 Preliminaries

First of all, to facilitate the analysis, we assume that $F_i(w)$ satisfies the following conditions:

Assumption 1. For $F_i(w)$ in node $i$, we assume the following conditions:
1) $F_i(w)$ is convex;
2) $F_i(w)$ is $\rho$-Lipschitz, i.e., $|F_i(w_1) - F_i(w_2)| \le \rho \|w_1 - w_2\|$ for some $\rho > 0$ and any $w_1$, $w_2$;
3) $F_i(w)$ is $\beta$-smooth, i.e., $\|\nabla F_i(w_1) - \nabla F_i(w_2)\| \le \beta \|w_1 - w_2\|$ for some $\beta > 0$ and any $w_1$, $w_2$;
4) $F_i(w)$ is $\mu$-strongly convex, i.e., $\alpha F_i(w_1) + (1-\alpha) F_i(w_2) \ge F_i(\alpha w_1 + (1-\alpha) w_2) + \frac{\alpha(1-\alpha)\mu}{2} \|w_1 - w_2\|^2$, $\alpha \in [0, 1]$, for some $\mu > 0$ and any $w_1$, $w_2$ [30, Theorem 2.1.9].

Because guaranteeing the global convergence of centralized MGD requires that the objective function be strongly convex [24], it is necessary to assume condition 4. Assumption 1 is satisfied for some learning models such as SVM, linear regression and logistic regression, whose loss functions are presented in Table 2. Experimental results presented in Section 7.2.1 show that for non-convex models such as CNN, whose loss function does not satisfy Assumption 1, MFL also performs well. From Assumption 1, we can obtain the following lemma:

TABLE 2
Loss Functions of Three Machine Learning Models

SVM: $\frac{\lambda}{2}\|w\|^2 + \frac{1}{2|D_i|}\sum_j \max\{0,\ 1 - y_j w^T x_j\}$
Linear regression: $\frac{1}{2|D_i|}\sum_j \|y_j - w^T x_j\|^2$
Logistic regression: $-\frac{1}{|D_i|}\sum_j \left[ y_j \log \sigma(w, x_j) + (1 - y_j)\log(1 - \sigma(w, x_j)) \right]$, where $\sigma(w, x_j)$ is given in (23)

Lemma 1. $F(w)$ is convex, $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex.

Proof. According to the definition of $F(w)$ from (2), the triangle inequality and the definitions of $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex, we can derive that $F(w)$ is convex, $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex directly. □

Then we introduce the gradient divergence between $\nabla F(w)$ and $\nabla F_i(w)$ for any node $i$. It comes from the nature of the difference in dataset distributions.

Definition 2 (Gradient divergence). We define $\delta_i$ as the upper bound between $\nabla F(w)$ and $\nabla F_i(w)$ for any node $i$, i.e.,

$$\|\nabla F(w) - \nabla F_i(w)\| \le \delta_i. \quad (11)$$

Also, we define the average gradient divergence

$$\delta \triangleq \frac{\sum_i |D_i| \delta_i}{|D|}. \quad (12)$$

Boundedness of $\delta_i$ and $\delta$. Based on condition 3 of Assumption 1, we let $w_2 = w_i^*$, where $w_i^*$ is the optimal value minimizing $F_i(w)$. Because $F_i(w)$ is convex, we have $\|\nabla F_i(w_1)\| \le \beta \|w_1 - w_i^*\|$ for any $w_1$, which means $\|\nabla F_i(w)\|$ is finite for any $w$. According to Definition 1 and the linearity of the gradient operator, the global gradient $\nabla F(w)$ is obtained by taking a weighted average of $\nabla F_i(w)$. Therefore, $\|\nabla F(w)\|$ is finite, and $\|\nabla F(w) - \nabla F_i(w)\|$ has an upper bound, i.e., $\delta_i$ is bounded. Further, $\delta$ is still bounded from the linearity in (12).

Since the local update steps of MFL perform MGD, the upper bounds of the MFL and MGD convergence rates exhibit certain connections in the same interval. For the convenience of analysis, we use the variables $d_{[k]}(t)$ and $w_{[k]}(t)$ to denote the momentum parameter and the model parameter of centralized MGD in each interval $[k]$, respectively. This centralized MGD is defined on the global dataset and updated based on the global loss function $F(w)$. In interval $[k]$, the update rules of centralized MGD follow

$$d_{[k]}(t) = \gamma d_{[k]}(t-1) + \nabla F(w_{[k]}(t-1)) \quad (13)$$

$$w_{[k]}(t) = w_{[k]}(t-1) - \eta d_{[k]}(t). \quad (14)$$

At the beginning of interval $[k]$, the momentum parameter $d_{[k]}(t)$ and the model parameter $w_{[k]}(t)$ of centralized MGD are synchronized with the corresponding parameters of MFL, i.e.,

$$d_{[k]}((k-1)\tau) \triangleq d((k-1)\tau), \qquad w_{[k]}((k-1)\tau) \triangleq w((k-1)\tau).$$

For each interval $[k]$, the centralized MGD is performed by iterations of (13) and (14). In Fig. 4, we illustrate the distinctions between $F(w(t))$ and $F(w_{[k]}(t))$ intuitively.

Fig. 4. Illustration of the difference between MGD and MFL in intervals.

Compared with centralized MGD, the MFL aggregation interval with $\tau > 1$ brings a global update delay, because centralized MGD performs a global update on every iteration while MFL is only allowed to spread its global parameter to the edge nodes after $\tau$ local updates. Therefore, the convergence performance of MFL is worse than that of MGD, which essentially stems from the imbalance between several computation rounds and one communication round in the MFL design. The following subsection quantifies the resulting convergence performance gap between these two approaches.

5.2 Gap Between MFL and Centralized MGD in Interval $[k]$

First, considering a special case, we examine the gap between MFL and centralized MGD for $\tau = 1$. From an intuitive perspective, MFL performs global aggregation after every local update and there is no global parameter update delay, i.e., the performance gap is zero. In Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/TPDS.2020.2975189, we prove theoretically that MFL is equivalent to MGD for $\tau = 1$.

Now considering the general case for any $\tau \ge 1$, the upper bound on the gap between $w(t)$ and $w_{[k]}(t)$ can be derived as follows.

Proposition 1 (Gap between MFL and centralized MGD in intervals). Given $t \in [k]$, the gap between $w(t)$ and $w_{[k]}(t)$ can be expressed by

$$\|w(t) - w_{[k]}(t)\| \le h(t - (k-1)\tau), \quad (15)$$

where we define

$$A \triangleq \frac{(1 + \gamma + \eta\beta) + \sqrt{(1 + \gamma + \eta\beta)^2 - 4\gamma}}{2\gamma},$$

$$B \triangleq \frac{(1 + \gamma + \eta\beta) - \sqrt{(1 + \gamma + \eta\beta)^2 - 4\gamma}}{2\gamma},$$

$$E \triangleq \frac{A}{(A - B)(\gamma A - 1)},$$

$$F \triangleq \frac{B}{(A - B)(1 - \gamma B)},$$

and $h(x)$ yields

$$h(x) = \eta\delta\left[ E(\gamma A)^x + F(\gamma B)^x - \frac{1}{\eta\beta} - \frac{\gamma(\gamma^x - 1) - (\gamma - 1)x}{(\gamma - 1)^2} \right] \quad (16)$$

for $0 < \gamma < 1$ and any $x = 0, 1, 2, \ldots$.

Because $F(w)$ is $\rho$-Lipschitz from Lemma 1, it holds that

$$F(w(t)) - F(w_{[k]}(t)) \le \rho h(t - (k-1)\tau). \quad (17)$$

Proof. First, we derive an upper bound of $\|\tilde{w}_i(t) - w_{[k]}(t)\|$ for node $i$. On the basis of this bound, we extend the result from the local cases to the global one to obtain the final result. The detailed proof is presented in Appendix B, available in the online supplemental material. □

Because $h(1) = h(0) = 0$ and $h(x)$ increases with $x$ for $x \ge 1$, which are proven in Appendix C, available in the online supplemental material, we always have $h(x) \ge 0$ for $x = 0, 1, 2, \ldots$.

From Proposition 1, in any interval $[k]$, we have $h(0) = 0$ for $t = (k-1)\tau$, which fits the definition $w_{[k]}((k-1)\tau) = w((k-1)\tau)$. We still have $h(1) = 0$ for $t = (k-1)\tau + 1$. This means that there is no gap between MFL and centralized MGD when the local update is performed only once after the global aggregation.

It is easy to find that if $\tau = 1$, $t - (k-1)\tau$ is either 0 or 1. Because $h(1) = h(0) = 0$, the upper bound in (15) is zero, and there is no gap between $F(w(t))$ and $F(w_{[k]}(t))$ from (17). This is consistent with Appendix A, available in the online supplemental material, where MFL yields centralized MGD for $\tau = 1$. In any interval $[k]$, we have $t - (k-1)\tau \in [0, \tau]$. If $\tau > 1$, $t - (k-1)\tau$ can be larger than 1. When $x > 1$, we know that $h(x)$ increases with $x$. According to the definitions of $A$, $B$, $E$ and $F$, we can easily obtain $\gamma A > 1$, $\gamma B < 1$ and $E, F > 0$. Because $0 < \gamma < 1$, the last term decreases linearly with $x$ when $x$ is large. Therefore, the first exponential term $E(\gamma A)^x$ in (16) is dominant when $x$ is large, and the gap between $w(t)$ and $w_{[k]}(t)$ increases exponentially with $t$.

We also find that $h(x)$ is proportional to the average gradient divergence $\delta$. This is because the greater the local gradient divergences at different nodes are, the larger the gap will be. So, considering the extreme situation where all nodes have the same data samples ($\delta = 0$ because the local loss functions are the same), the gap between $w(t)$ and $w_{[k]}(t)$ is zero and MFL is equivalent to centralized MGD.

5.3 Global Convergence

We have derived an upper bound between $F(w(t))$ and $F(w_{[k]}(t))$ for $t \in [k]$. According to the definition of MFL, at the beginning of each interval $[k]$, we set $d_{[k]}((k-1)\tau) = d((k-1)\tau)$ and $w_{[k]}((k-1)\tau) = w((k-1)\tau)$. The global upper bound on the convergence rate of MFL can be derived based on Proposition 1.

The following definitions are made to facilitate the analysis. First, we use $\theta_{[k]}(t)$ to denote the angle between the vectors $\nabla F(w_{[k]}(t))$ and $d_{[k]}(t)$ for $t \in [k]$, i.e.,

$$\cos \theta_{[k]}(t) \triangleq \frac{\nabla F(w_{[k]}(t))^T d_{[k]}(t)}{\|\nabla F(w_{[k]}(t))\| \, \|d_{[k]}(t)\|},$$

and $\theta$ is defined as the maximum value of $\theta_{[k]}(t)$ for $1 \le k \le K$ with $t \in [k]$, i.e.,

$$\theta \triangleq \max_{1 \le k \le K,\ t \in [k]} \theta_{[k]}(t).$$

Then we define

$$p \triangleq \max_{1 \le k \le K,\ t \in [k]} \frac{\|d_{[k]}(t)\|}{\|\nabla F(w_{[k]}(t))\|},$$

and

$$v \triangleq \min_k \frac{1}{\|w((k-1)\tau) - w^*\|^2}.$$

Based on Proposition 1, which gives an upper bound on the loss function difference between MFL and centralized MGD, the global convergence rate of MFL can be derived as follows.

Lemma 2. If the following conditions are satisfied:
1) $\cos\theta \ge 0$, $0 < \eta\beta < 1$ and $0 \le \gamma < 1$;
there exists $\varepsilon > 0$ such that
2) $F(w_{[k]}(k\tau)) - F(w^*) \ge \varepsilon$ for all $k$;
3) $F(w(T)) - F(w^*) \ge \varepsilon$;
4) $v\alpha - \frac{\rho h(\tau)}{\tau\varepsilon^2} > 0$,
then we have

$$F(w(T)) - F(w^*) \le \frac{1}{T\left(v\alpha - \frac{\rho h(\tau)}{\tau\varepsilon^2}\right)}, \quad (18)$$

where we define

$$\alpha \triangleq \eta\left(1 - \frac{\beta\eta}{2}\right) + \eta\gamma(1 - \beta\eta)\cos\theta - \frac{\beta\eta^2\gamma^2 p^2}{2}.$$

Proof. The proof is presented in Appendix D, available in the online supplemental material. □

On the basis of Lemma 2, we further derive the following proposition, which demonstrates the global convergence of MFL and gives its upper bound on convergence rate.

Proposition 2 (MFL global convergence). Given $\cos\theta \ge 0$, $0 < \eta\beta < 1$, $0 \le \gamma < 1$ and $\alpha > 0$, we have

$$F(w^f) - F(w^*) \le \frac{1}{2Tv\alpha} + \sqrt{\frac{1}{4T^2 v^2 \alpha^2} + \frac{\rho h(\tau)}{v\alpha\tau}} + \rho h(\tau). \quad (19)$$

Proof. The specific proof is shown in Appendix E, available in the online supplemental material. □

According to Proposition 2, we get an upper bound of $F(w^f) - F(w^*)$ which is a function of $T$ and $\tau$. From inequality (19), we can find that MFL linearly converges to a lower bound $\sqrt{\frac{\rho h(\tau)}{v\alpha\tau}} + \rho h(\tau)$. Because $h(\tau)$ is related to $\tau$ and $\delta$, aggregation intervals ($\tau > 1$) and different data distributions collectively lead to MFL not converging to the optimum.

In the following, we discuss the influence of $\tau$ on the convergence bound. If $\tau = 1$, we have $\rho h(\tau) = 0$, so that $F(w^f) - F(w^*)$ linearly converges to zero as $T \to \infty$, and the convergence rate yields $\frac{1}{Tv\alpha}$. Noting that $h(\tau) > 0$ if $\tau > 1$, we can find that in this case $F(w^f) - F(w^*)$ converges to a non-zero bound $\sqrt{\frac{\rho h(\tau)}{v\alpha\tau}} + \rho h(\tau)$ as $T \to \infty$. On the one hand, if there is no communication resource limit, setting the aggregation frequency $\tau = 1$ and performing global aggregation after each local update reaches the optimal convergence performance of MFL. On the other hand, an aggregation interval ($\tau > 1$) lets MFL utilize the communication resources of each node effectively, but brings about a decline of convergence performance.

6 COMPARISON BETWEEN FL AND MFL

In this section, we make a comparison of convergence performance between MFL and FL.

A closed-form expression of the upper bound on the FL convergence rate has been derived in [17, Theorem 2]. It is presented as follows:

$$F(w^f_{FL}) - F(w^*) \le \frac{1}{2\eta\varphi T} + \sqrt{\frac{1}{4\eta^2\varphi^2 T^2} + \frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}} + \rho h_{FL}(\tau). \quad (20)$$

According to [17],

$$h_{FL}(\tau) = \frac{\delta}{\beta}\left((\eta\beta + 1)^\tau - 1\right) - \eta\delta\tau,$$

and $\varphi = v_{FL}\left(1 - \frac{\eta\beta}{2}\right)$, where the expression of $v_{FL}$ is consistent with that of $v$. Differing from $v$, the $w((k-1)\tau)$ in the definition of $v_{FL}$ is the global model parameter of FL.

We assume that both the MFL and FL solutions are applied in the system model proposed in Fig. 1. They are trained based on the same training dataset with the same machine learning model. The loss functions $F_i(\cdot)$ and global loss functions $F(\cdot)$ of MFL and FL are the same, respectively. The corresponding parameters of MFL and FL are equal, including $\tau$, $\eta$, $\rho$, $\delta$ and $\beta$. We set the same initial value $w(0)$ for MFL and FL. Because both MFL and FL are convergent, we have $v = \frac{1}{\|w(0) - w^*\|^2}$ and $w^* = w^*_{FL}$. Then, according to the definitions of $v$ and $v_{FL}$, we have $v = v_{FL}$. Therefore, the corresponding parameters of MFL and FL are the same, and we can compare the convergence rates between FL and MFL conveniently.

For convenience, we use $f_1(T)$ and $f_2(T)$ to denote the upper bounds on the convergence rates of MFL and FL, respectively. Then we have

$$f_1(T) \triangleq \frac{1}{2Tv\alpha} + \sqrt{\frac{1}{4T^2 v^2\alpha^2} + \frac{\rho h(\tau)}{v\alpha\tau}} + \rho h(\tau) \quad (21)$$

and

$$f_2(T) \triangleq \frac{1}{2\eta\varphi T} + \sqrt{\frac{1}{4\eta^2\varphi^2 T^2} + \frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}} + \rho h_{FL}(\tau). \quad (22)$$

We consider the special case of $\gamma \to 0$. For $v\alpha$ and $\eta\varphi$, we can obtain $v\alpha \to v\eta\left(1 - \frac{\beta\eta}{2}\right) = \eta\varphi$ from the definition of $\alpha$. Then, for $h(\tau)$ and $h_{FL}(\tau)$, we have $\gamma A \to \eta\beta + 1$ and $\gamma B \to 0$. Because $\frac{A}{A-B} \to 1$ and $\frac{B}{A-B} \to 0$, we can further get $E \to \frac{1}{\eta\beta}$ and $F \to 0$ from the definitions of $E$ and $F$. So, according to (16), we have

$$\lim_{\gamma \to 0} h(\tau) = \eta\delta\left[\frac{1}{\eta\beta}(\eta\beta + 1)^\tau - \frac{1}{\eta\beta} - \tau\right] = \frac{\delta}{\beta}\left((1 + \eta\beta)^\tau - 1\right) - \eta\delta\tau = h_{FL}(\tau).$$

Hence, by the above analysis under $\gamma \to 0$, we can find that MFL and FL have the same upper bound on convergence rate. This fact is consistent with the property that if $\gamma = 0$, MFL degenerates into FL and has the same convergence rate as FL.

To avoid complicated calculations over the expressions of $f_1(T)$ and $f_2(T)$, we have the following lemma.

Lemma 3. If there exists $T_1 \ge 1$ such that $\frac{1}{2Tv\alpha}$ dominates in $f_1(T)$ and $\frac{1}{2\eta\varphi T}$ dominates in $f_2(T)$ for $T < T_1$, i.e.,

$$\frac{1}{2Tv\alpha} \gg \max\left\{\rho h(\tau),\ \sqrt{\frac{\rho h(\tau)}{v\alpha\tau}}\right\}$$

and

$$\frac{1}{2\eta\varphi T} \gg \max\left\{\rho h_{FL}(\tau),\ \sqrt{\frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}}\right\},$$

then we have

$$f_1(T) \approx \frac{1}{Tv\alpha} \quad \text{and} \quad f_2(T) \approx \frac{1}{T\eta\varphi}$$

for $T < T_1$.

Proof. We can find such a $T_1$. For example, considering (21), if $\eta \to 0$, we have $\alpha \to 0$ from the definition of $\alpha$ and $h(\tau) \to 0$ from Appendix F, available in the online supplemental material. So we can easily derive $v\alpha\rho h(\tau) \to 0$ and $\sqrt{v\alpha\rho h(\tau)/\tau} \to 0$. Then we can find $T_1 \ge 1$ which satisfies $\frac{1}{2T} \gg v\alpha\rho h(\tau)$ and $\frac{1}{2T} \gg \sqrt{v\alpha\rho h(\tau)/\tau}$ for $T < T_1$. Hence, $\frac{1}{2Tv\alpha}$ dominates in $f_1(T)$ and $f_1(T) \approx \frac{1}{Tv\alpha}$. For the same reason, considering (22), if $\eta \to 0$, we have $\eta\varphi \to 0$ from the definition of $\varphi$ and $h_{FL}(\tau) \to 0$ from its definition. So we can easily derive $\eta\varphi\rho h_{FL}(\tau) \to 0$ and $\sqrt{\eta\varphi\rho h_{FL}(\tau)/\tau} \to 0$. Then, for $T < T_1$, $\frac{1}{2T} \gg \eta\varphi\rho h_{FL}(\tau)$ and $\frac{1}{2T} \gg \sqrt{\eta\varphi\rho h_{FL}(\tau)/\tau}$. Hence, $\frac{1}{2\eta\varphi T}$ dominates in $f_2(T)$ and $f_2(T) \approx \frac{1}{T\eta\varphi}$. □

Based on Lemma 3, we have the following proposition.

Proposition 3 (Accelerated convergence of MFL). If the following conditions are satisfied:
1) $\eta\beta \ll 1$;
2) $T < T_1$;
3) $0 < \gamma < \min\left\{\frac{2(1 - \eta\beta)\cos\theta}{\beta\eta p^2},\ 1\right\}$,

MFL converges faster than FL, i.e.,

$$f_1(T) < f_2(T).$$

Proof. From condition 1 and condition 2, we have $f_1(T) \approx \frac{1}{Tv\alpha}$ and $f_2(T) \approx \frac{1}{T\eta\varphi}$. Due to the definitions of $\alpha$ and $\varphi$, the inequality $0 < \gamma < \frac{2(1 - \beta\eta)\cos\theta}{\beta\eta p^2}$ is equivalent to $v\alpha > \eta\varphi$. So if $v\alpha > \eta\varphi$, it is obvious that $\frac{1}{Tv\alpha} < \frac{1}{T\eta\varphi}$, i.e., $f_1(T) < f_2(T)$. However, $0 < \gamma < 1$ is the condition of MFL convergence. Hence, condition 3 is the range of MFL convergence acceleration after combining with the MFL convergence guarantee $0 < \gamma < 1$. □

7 SIMULATION AND DISCUSSION

In this section, we build and evaluate MFL systems based on the MNIST and CIFAR-10 datasets. We first describe the simulation environment and the relevant parameter setups. Second, we present and evaluate the comparative simulation results of MFL, FL and MGD under different machine learning models, which include SVM, linear regression, logistic regression and CNN. Finally, extensive experiments are implemented to explore the impacts of $\gamma$, $\tau$ and non-i.i.d. data distribution on MFL convergence performance, and to investigate the communication efficiency of MFL compared with that of FL.

7.1 Simulation Setup

Using Python, we build a federated network framework where distributed edge nodes coordinate with the central server. In our network, the number of edge nodes can be chosen arbitrarily. SVM, linear regression, logistic regression and CNN are applied to model training. The loss functions of the first three models at node $i$ are presented in Table 2 [31], and the loss function of CNN is cross-entropy (see [32] for details). Note that $|D_i|$ is the number of training samples in node $i$, and the loss function of logistic regression is also cross-entropy. For logistic regression, the model output $\sigma(w, x_j)$ is a sigmoid function for the non-linear transform. It is defined by

$$\sigma(w, x_j) \triangleq \frac{1}{1 + e^{-w^T x_j}}. \quad (23)$$

In our experiments, training and testing samples are randomly allocated to each node, which means that the information of each node is uniform and the data distribution at the edge nodes is i.i.d. (Only in Section 7.2.2 is non-i.i.d. data distribution used; the rest of the experiments use i.i.d. data distribution.) We use FL and centralized MGD as benchmarks for comparison with MFL. For the SVM, linear regression and logistic regression models, deterministic gradient methods are performed for MFL, FL and centralized MGD. However, for the CNN model, stochastic gradient methods are used for MFL, FL and centralized MGD due to the large training data size.

SVM, linear regression and logistic regression are trained and tested on the MNIST dataset [33], which contains 50,000 training handwritten digits and 10,000 testing handwritten digits. In our experiments, we only utilize 5,000 training samples and 5,000 testing samples because of the limited processing capacities of GD and MGD. In this dataset, for the $j$th sample, $x_j$ is a 784-dimensional input vector which is vectorized from a 28 × 28 pixel matrix and $y_j$ is the scalar label corresponding to $x_j$. SVM, linear and logistic regression are used to classify whether the digit is even or odd. If the image of $x_j$ represents an even number, then we set $y_j = 1$; otherwise, $y_j = -1$. But for logistic regression, we set $y_j = 1$ for the even numbers and $y_j = 0$ for the odd ones.

CNN is trained based on the MNIST and CIFAR-10 datasets. The CIFAR-10 dataset includes 50,000 color images for training and 10,000 color images for testing, and has 10 different types of objects [34]. We use CNN to perform the classification among the 10 different labels under the MNIST and CIFAR-10 datasets, respectively.

For the experimental setups, we set 4 edge nodes in FL and MFL, and the training models are distributed onto all the edge nodes. The same initializations of model parameters are performed and the same data distributions are set for MFL and FL. Also, $\tilde{d}_i(0) = 0$ is set for every node $i$. We set the learning step size $\eta = 0.002$, which is sufficiently small, the SVM parameter $\lambda = 0.3$ and the total number of local iterations $T = 1{,}000$ for the following simulations.

7.2 Simulation Evaluation

In this subsection, we verify the convergence acceleration of MFL and explore the effects of non-i.i.d. data distribution, $\gamma$ and $\tau$ on MFL convergence by simulation evaluation. We further investigate the communication efficiency of MFL compared with that of FL.

7.2.1 Convergence

In our first simulation, the models of SVM, linear regression, logistic regression and CNN are trained and we verify the accelerated convergence of MFL. We set the aggregation frequency $\tau = 4$ and the momentum attenuation factor $\gamma = 0.5$. MFL, FL and MGD are performed based on the four machine learning models. MGD is implemented based on the global dataset, which is obtained by gathering the distributed data of all nodes. The global loss functions of the three solutions are defined based on the same global training and testing data.

The curves of loss function values and accuracy versus iterations are presented in Fig. 5. We can see that the loss function curves for all the learning models gradually converge with the iterations. Similarly, the test accuracy curves for SVM and CNN gradually rise until convergence. Therefore, the convergence of MFL is verified. We also see that the descent speeds of the MFL loss function curves on the four learning models are always faster than those of FL, while the centralized MGD convergence speeds are the fastest. So, compared with FL, MFL provides a significant improvement in convergence rate. MGD converges with the fastest speed because MFL and FL suffer the delay in the global gradient update for $\tau = 4$. Finally, comparing the results of CNN and SVM, we can conclude that based on the CNN model, MFL still shows convergence performance similar to what it shows in convex model training. So the proposed MFL can perform well in neural networks with non-convex loss functions.

Because linear and logistic regression cannot provide testing accuracy curves, we focus on the SVM model in the following experiments and further explore the impact of the MFL parameters on the convergence rate.

Fig. 5. Loss function values and testing accuracy under FL, MFL and MGD. (a) and (b) are the loss function and test accuracy curves of SVM, respectively; (c) and (d) are the loss function curves of linear regression and logistic regression, respectively; (e) and (f) are the loss function and test accuracy curves of CNN trained on MNIST, respectively; (g) and (h) are the loss function and test accuracy curves of CNN trained on CIFAR-10, respectively.

Fig. 6. (a) and (b) are the loss function and testing accuracy curves under different data distribution cases, respectively.

7.2.2 Effect of Non-i.i.d. Data Distribution

In this experiment, we consider three cases for distributing the data samples to the different nodes. The three data distribution cases at the edge nodes are representative of uniform information, totally non-uniform information and a mixture of the previous two cases, respectively. The specific settings of the three cases are as follows (a code sketch of the three allocation rules is given after this list):

- Case 1: For the uniform information distribution, each data sample is randomly allocated to a node. In this case, we consider that the data on each node have uniform characteristics. Therefore, this case satisfies i.i.d. data distribution and serves as a benchmark.
- Case 2: All data samples at an individual node have the same label (if there are more labels than nodes, each node could have samples with more than one label, but not with all of the labels). Because the global dataset has multiple labels, this case leads to a non-uniform information distribution, which means the characteristics brought by each node are not uniform. Thus, this case corresponds to totally non-i.i.d. data distribution.
- Case 3: In this case, the first half of the $N$ nodes follow the random allocation rule of Case 1 to obtain uniform information, and the second half of the nodes follow the allocation rule of Case 2. This case is a combination of uniform and non-uniform information. We use this case to explore the effect of the mixture of i.i.d. and non-i.i.d. data distribution.

In this experiment, SVM is used for the training of the MFL network under the above data distribution cases. We set the aggregation frequency $\tau = 4$ and the momentum attenuation factor $\gamma = 0.5$ for the general MFL algorithm.
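A minimal sketch of the three allocation rules is given below. It assumes the samples X and labels y are NumPy arrays and that the number of distinct labels is at least the number of nodes; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def case1_iid(X, y, n_nodes):
    # Case 1: each sample is randomly assigned to a node (uniform information).
    idx = rng.permutation(len(X))
    return [(X[p], y[p]) for p in np.array_split(idx, n_nodes)]

def case2_by_label(X, y, n_nodes):
    # Case 2: sort by label so each node mostly holds samples with the same label(s).
    idx = np.argsort(y, kind="stable")
    return [(X[p], y[p]) for p in np.array_split(idx, n_nodes)]

def case3_mixed(X, y, n_nodes):
    # Case 3: the first half of the nodes use Case 1, the second half use Case 2.
    half = len(X) // 2
    parts = case1_iid(X[:half], y[:half], n_nodes // 2)
    parts += case2_by_label(X[half:], y[half:], n_nodes - n_nodes // 2)
    return parts
```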

The experimental results are shown in Fig. 6. The two subfigures show the influence of the different data distributions on MFL convergence. We see that the loss function and testing accuracy curves of MFL always converge whether the data distribution at the edge nodes is i.i.d. or non-i.i.d., which means that even under non-uniform information distribution, MFL training still achieves the expected convergence and shows its robustness. We also see that Case 2 and Case 3 have worse performance than Case 1, because each node in Case 2 and Case 3 has totally or partially non-uniform information. Further, Case 2 shows the worst convergence performance. The comparison results illustrate that non-i.i.d. data distribution preserves MFL convergence but decreases MFL convergence performance.

Fig. 7. The influence of $\gamma$ on MFL convergence. (a) Loss function values with iterative times under different $\gamma$; (b) testing accuracy with iterative times under different $\gamma$; (c) loss function values with $\gamma$ when $T = 1000$.

7.2.3 Effect of $\gamma$

We evaluate the impact of $\gamma$ on the convergence rate of the loss function. In this simulation, we still set the aggregation frequency $\tau = 4$.

The experimental results are shown in Fig. 7. Subfigures (a) and (b) show how different values of $\gamma$ affect the convergence curves of the loss function and testing accuracy, respectively. We can see that if $\gamma = 0$, the loss function and accuracy curves of MFL overlap with the corresponding ones of FL because MFL is equivalent to FL for $\gamma = 0$. When $\gamma$ increases from 0 to 0.9, the convergence rates on both the loss function curves and the accuracy curves also gradually increase. Subfigure (c) shows the change of the final loss function value ($T = 1000$) with $0 < \gamma < 1$. From this subfigure, we can find that the final loss function values of MFL are always smaller than those of FL for $0 < \gamma < 1$. Compared with FL, the convergence performance of MFL is improved. This is because $\frac{2(1 - \beta\eta)\cos\theta}{\beta\eta p^2} > 1$ and, according to Proposition 3, the accelerated convergence range of MFL is $0 < \gamma < 1$. We can see that when $0 < \gamma < 0.95$, the loss function values decrease monotonically with $\gamma$, so the convergence rate of MFL increases with $\gamma$. When $\gamma > 0.95$, the loss function values of MFL start to increase with a gradual deterioration of MFL convergence performance, and in this situation MFL cannot remain convergent. If the $\gamma$ values are chosen to be close to 1, best around 0.9, MFL reaches the optimal convergence rate.

7.2.4 Effect of $\tau$

Finally, we evaluate the effect of different $\tau$ on the loss function of MFL. We record the final loss function values versus $\tau$ for the three cases of $T = 1{,}000$ for FL, $T = 1{,}000$ for MFL and $T = 60{,}000$ for MFL. We set $\gamma = 0.5$ for MFL. The curves for the three cases are presented in Fig. 8. Comparing FL with MFL for $T = 1{,}000$, we see that the final loss function values of MFL are smaller than those of FL for any $\tau$. As stated in Proposition 3, under a small magnitude of $T$ and $\eta = 0.002$, which is close to 0, MFL always converges much faster than FL. Further, for $T = 1{,}000$, the effect of $\tau$ on convergence is slight because the curves of FL and MFL are relatively flat. This can be explained by Lemma 3, where $\frac{1}{2\eta\varphi T}$ and $\frac{1}{2v\alpha T}$ dominate the convergence upper bounds when the magnitude of $T$ is small. For $T = 60{,}000$, the change of $\tau$ affects convergence significantly and the final loss function values gradually increase with $\tau$. As in the cases of $T = 1{,}000$ for MFL and FL, the case of $T = 60{,}000$ for MFL shows only a slight effect on convergence if $\tau < 100$. But if $\tau > 100$, MFL convergence performance gets worse with $\tau$. According to the above analysis of $\tau$, setting an appropriate aggregation frequency reduces convergence performance only slightly while decreasing the communication cost (in our cases, $\tau = 100$).

Fig. 8. Loss function values with $\tau$.

7.2.5 MFL Communication Efficiency

In this experiment, we evaluate the communication efficiency of MFL compared with FL under different values of $\gamma$. Because both the momentum and the weight are transmitted between the edge nodes and the central server, we can simply assume that the communication size of MFL is twice that of FL, due to the additional momentum parameters and the identical dimensions of momentum and weight. Therefore, we set the communication budget of one global aggregation for MFL to 1, so the communication budget of one global aggregation for FL is 0.5. The experiment is based on SVM, and we set the aggregation frequency $\tau = 4$. The experiment consumes 125 communication budgets for MFL or FL, so MFL performs 125 global aggregations and FL performs 250 global aggregations.
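Under this accounting, the number of global aggregations each scheme can afford follows directly from the per-aggregation budget. A small sketch, with the budget values taken from the text and the derived local-iteration counts shown only for illustration:

```python
TOTAL_BUDGET = 125            # total communication budget spent in the experiment
COST_MFL, COST_FL = 1.0, 0.5  # per-aggregation cost (MFL sends weights and momentum)

aggregations_mfl = int(TOTAL_BUDGET / COST_MFL)   # 125 global aggregations for MFL
aggregations_fl = int(TOTAL_BUDGET / COST_FL)     # 250 global aggregations for FL

TAU = 4
# Implied local iterations under tau = 4: 500 for MFL versus 1000 for FL.
print(aggregations_mfl * TAU, aggregations_fl * TAU)
```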

[4] A. Esteva et al., “Dermatologist-level classification of skin cancer


with deep neural networks,” Nature, vol. 542, no. 7639, 2017,
Art. no. 115.
[5] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, per-
spectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260,
2015.
[6] R. Subramanian and F. Fekri, “Sleep scheduling and lifetime max-
imization in sensor networks: Fundamental limits and optimal
solutions,” in Proc. Int. Conf. Inf. Process. Sensor Netw., 2006,
pp. 218–225.
[7] G. P. Fettweis, “The tactile internet: Applications and challenges,”
IEEE Veh. Technol. Magazine, vol. 9, no. 1, pp. 64–70, Mar. 2014.
[8] M. Patel et al., “Mobile-edge computing introductory technical
white paper,” White Paper, Mobile-Edge Computing (MEC) Industry
Initiative, pp. 1089–7801, 2014.
[9] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey
on mobile edge computing: The communication perspective,”
Fig. 9. Comparison of communication efficiency between MFL and FL. IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, Fourth
Quarter 2017.
[10] M. Hu, L. Zhuang, D. Wu, Y. Zhou, X. Chen, and L. Xiao, “Learning
driven computation offloading for asymmetrically informed edge
The experimental results are shown in Fig. 9. We see that the communication efficiency improves with increasing γ, because γ affects the convergence rate of MFL significantly, as presented in Section 7.2.3. We can see that MFL with a large value of γ performs better than FL, while a small γ results in worse performance at the same communication cost. For example, if γ = 0.6, MFL achieves better convergence with respect to communication cost than FL. Thus, MFL shows higher communication efficiency than FL for γ = 0.6.
8 CONCLUSION

In this paper, we have proposed MFL, which performs MGD in the local update steps to solve the distributed machine learning problem. First, we established the global convergence properties of MFL and derived an upper bound on the MFL convergence rate. This theoretical upper bound shows that the sequence generated by MFL converges linearly to the global optimum under certain conditions. Then, compared with FL, MFL provides accelerated convergence performance under the conditions given in Proposition 3. Finally, based on the MNIST and CIFAR-10 datasets, our simulation results have verified the convergence of MFL and confirmed its accelerated convergence.
ACKNOWLEDGMENTS

This work was supported by the National Key Research and Development Program of China under Grant 2018YFA0701603, the National Natural Science Foundation of China under Grant 61722114, and USTC Research Funds of the Double First-Class Initiative (No. YD3500002001).

Wei Liu received the BE degree in electronic information engineering from the University of Science and Technology of China, Hefei, China, in 2018. He is currently working toward the ME degree with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His research interests include distributed machine learning and accelerated computation.

Li Chen received the BE degree in electrical and information engineering from the Harbin Institute of Technology, Harbin, China, in 2009, and the PhD degree in electrical engineering from the University of Science and Technology of China, Hefei, China, in 2014. He is currently a faculty member with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His research interests include wireless IoT communications and wireless optical communications.

Yunfei Chen (Senior Member, IEEE) received the BE and ME degrees in electronics engineering from Shanghai Jiaotong University, Shanghai, P.R. China, in 1998 and 2001, respectively, and the PhD degree from the University of Alberta in 2006. He is currently working as an associate professor with the University of Warwick, United Kingdom. His research interests include wireless communications, cognitive radios, wireless relaying and energy harvesting.

Wenyi Zhang (Senior Member, IEEE) received the bachelor's degree in automation from Tsinghua University, in 2001, and the master's and PhD degrees in electrical engineering from the University of Notre Dame, in 2003 and 2006, respectively. He is currently a professor with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. He was affiliated with the Communication Science Institute, University of Southern California, as a postdoctoral research associate, and with Qualcomm Incorporated, Corporate Research and Development. His research interests include wireless communications and networking, information theory, and statistical signal processing. He was an editor for the IEEE Communications Letters, and is currently an editor for the IEEE Transactions on Wireless Communications.

