Accelerating Federated Learning via Momentum Gradient Descent
Abstract—Federated learning (FL) provides a communication-efficient approach to solving machine learning problems concerning distributed data, without sending raw data to a central server. However, existing works on FL only utilize first-order gradient descent (GD) and do not consider information from preceding iterations in the gradient update, which can potentially accelerate convergence. In this article, we consider a momentum term which relates to the last iteration. The proposed momentum federated learning (MFL) uses momentum gradient descent (MGD) in the local update step of the FL system. We establish global convergence properties of MFL and derive an upper bound on the MFL convergence rate. Comparing the upper bounds on the MFL and FL convergence rates, we provide conditions under which MFL accelerates convergence. For different machine learning models, the convergence performance of MFL is evaluated based on experiments with the MNIST and CIFAR-10 datasets. Simulation results confirm that MFL is globally convergent and further reveal a significant convergence improvement over FL.
Index Terms—Accelerating convergence, distributed machine learning, federated learning, momentum gradient descent
1 INTRODUCTION
Recently, data-intensive machine learning has been applied in various fields, such as autonomous driving [1], speech recognition [2], image classification [3] and disease detection [4], since this technique provides beneficial solutions to extract the useful information hidden in data. It has become a common tendency that machine-learning systems are deployed in architectures that include tens of thousands of processors [5]. A great amount of data is generated by various parallel and distributed physical objects.

Collecting data from edge devices to the central server is necessary for distributed machine learning scenarios. In the process of distributed data collection, there exist significant challenges such as energy efficiency and system latency. The energy efficiency of distributed data collection was considered in wireless sensor networks (WSNs) due to the limited battery capacity of sensors [6]; in fifth-generation (5G) cellular networks, the round-trip delay from terminals through the network back to terminals demands much lower latencies, potentially down to 1 ms, to facilitate human tactile to visual feedback control [7]. Thus, the challenges of data aggregation in distributed systems urgently require communication-efficient solutions.

In order to overcome these challenges, cutting down the transmission distance and reducing the amount of data uploaded from edge devices to the network center are two effective approaches. To reduce transmission distance, mobile edge computing (MEC) [8] is an emerging technique where computation and storage resources are pushed to the proximity of edge devices, so that the local tasks and data offloaded by users can be processed there. In this way, the distance of large-scale data transmission is greatly shortened and the latency is significantly reduced [9]. Using machine learning to predict the execution time of uploaded tasks achieves a shorter processing delay [10], and dynamic resource scheduling was studied to optimize the resource allocation of an MEC system in [11]. To reduce the uploaded data size, model-based compression approaches, where raw data are compressed and represented by well-established model parameters, demonstrate significant compression performance [12]. Lossy compression is also an effective strategy to decrease the uploaded data size [13], [14]. Compressed sensing, where sparse data at the edge can be efficiently sampled and reconstructed while transmitting a much smaller data size, was applied to data acquisition in Internet of Things (IoT) networks [15]. All of the aforementioned works need to collect raw data from individual devices.
To avoid collecting raw data for machine learning in distributed scenarios, a novel approach named Federated Learning (FL) has emerged as a promising solution [16]. The work in [17] provided a fundamental architecture design of FL. Considering the growing computation capability of edge nodes (devices), FL decentralizes the centralized machine learning task and assigns the decomposed computing tasks to the edge nodes, where the raw data are stored and learned. After a fixed iteration interval, each edge node transmits its learned model parameters to the central server.

W. Liu, L. Chen, and W. Zhang are with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui 230052, China. E-mail: [email protected], {chenli87, wenyizha}@ustc.edu.cn.
Y. Chen is with the School of Engineering, University of Warwick, CV4 7AL Coventry, United Kingdom. E-mail: [email protected].
Manuscript received 6 Oct. 2019; revised 14 Jan. 2020; accepted 15 Feb. 2020. Date of publication 19 Feb. 2020; date of current version 23 Mar. 2020. (Corresponding author: Li Chen.) Recommended for acceptance by J. Zhai. Digital Object Identifier no. 10.1109/TPDS.2020.2975189.
the learning problem is to minimize $F(w)$ and it can be formulated as follows:

$$w^* \triangleq \arg\min_{w} F(w). \quad (1)$$

Because of the complexity of the machine learning model and the original dataset, finding a closed-form solution of the above optimization problem is usually intractable. So algorithms based on gradient iterations are used to solve (1). If raw user data are collected and stored in the central server, we can use centralized learning solutions for (1), while if raw user data are distributed over the edge nodes, FL and the proposed MFL can be applied to solve this learning problem.

Under the situation where FL or MFL solutions are used, the local loss function of node $i$ is denoted by $F_i(w)$, which is defined merely on $D_i$. Then we define the global loss function $F(w)$ on $D$ as follows:

Definition 1 (Global loss function). Given the loss function $F_i(w)$ of edge node $i$, we define the global loss function on all the distributed datasets as

$$F(w) \triangleq \frac{\sum_{i=1}^{N} |D_i| F_i(w)}{|D|}. \quad (2)$$

3 EXISTING SOLUTIONS

In this section, we introduce two existing solutions to the learning problem expressed by (1). These two solutions are the centralized learning solution and the FL solution, respectively.

3.1 Centralized Learning Solution

In centralized machine learning, the model is embedded in the central server and each edge node needs to send its raw data to the central server. In this situation, edge nodes consume communication resources for data transmission, but without incurring computation resource consumption.

After the central server has collected all datasets from the edge nodes, a usual way to solve the learning problem expressed by (1) is GD as a basic gradient method. Further, MGD is an improved gradient method that adds a momentum term to speed up the learning process [24].

3.1.1 GD

The update rule for GD is as follows:

$$w(t) = w(t-1) - \eta \nabla F(w(t-1)). \quad (3)$$

In (3), $t$ denotes the iteration index and $\eta > 0$ is the learning step size. The model parameter $w$ is updated along the direction of the negative gradient. Using the above update rule, GD can solve the learning problem with continuous iterations.

3.1.2 MGD

As an improvement of GD, MGD introduces the momentum term and we present its update rules as follows:

$$d(t) = \gamma d(t-1) + \nabla F(w(t-1)) \quad (4)$$

$$w(t) = w(t-1) - \eta d(t), \quad (5)$$

where $d(t)$ is the momentum term, which has the same dimension as $w(t)$, $\gamma$ is the momentum attenuation factor, $\eta$ is the learning step size and $t$ is the iteration index. By iterating (4) and (5) over $t$, $F(w)$ can potentially converge to the minimum faster than with GD. The convergence range of MGD is $-1 < \gamma < 1$ with a bounded $\eta$, and if $0 < \gamma < 1$, MGD has a faster convergence rate than GD under a small $\eta$ typically used in simulations [29, Result 3].

3.2 FL Solution

In contrast with centralized learning solutions, FL avoids collecting and uploading the distributed data because of the limited communication resources at edge nodes and privacy protection for local data. It decouples the machine learning task from the central server to each edge node to avoid storing user data in the server and to reduce communication resource consumption. All edge nodes make up a federation in coordination with the central server.

The FL design and convergence analysis are presented in [17], where the FL network is studied thoroughly. In an FL system, each edge node uses the same machine learning model. We use $\tau$ to denote the global aggregation frequency, i.e., the update interval. Each node $i$ has its local model parameter $\tilde{w}_i(t)$, where the iteration index is denoted by $t = 0, 1, 2, \ldots$ (in this paper, an iteration means a local update). We use $[k]$ to denote the aggregation interval $[(k-1)\tau, k\tau]$ for $k = 1, 2, 3, \ldots$. At $t = 0$, the local model parameters of all nodes are initialized to the same value. When $t > 0$, $\tilde{w}_i(t)$ is updated locally based on GD, which is the local update. After $\tau$ local updates, global aggregation is performed and all edge nodes send the updated model parameters to the central server synchronously.

The learning process of FL is described as follows.

3.2.1 Local Update

When $t \in [k]$, local updates are performed in each edge node by

$$\tilde{w}_i(t) = \tilde{w}_i(t-1) - \eta \nabla F_i(\tilde{w}_i(t-1)),$$

which follows GD exactly.

3.2.2 Global Aggregation

When $t = k\tau$, global aggregation is performed. Each node sends $\tilde{w}_i(k\tau)$ to the central server synchronously. The central server takes a weighted average of the received parameters from the $N$ nodes to obtain the globally updated parameter $w(k\tau)$ by

$$w(k\tau) = \frac{\sum_{i=1}^{N} |D_i| \tilde{w}_i(k\tau)}{|D|}.$$

Then $w(k\tau)$ is sent back to all edge nodes as their new parameters, and the edge nodes perform local updates for the next iteration interval.

In [17, Lemma 2], the FL solution has been proven to be globally convergent for convex optimization problems and to exhibit good convergence performance. So FL is an effective solution to the distributed learning problem presented in (1).
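To make the centralized update rules of Section 3.1 concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) applies one GD step per (3) and one MGD step per (4) and (5) to a toy quadratic loss; the synthetic data, the step size $\eta = 0.002$ and $\gamma = 0.5$ are assumptions chosen only to make the example runnable.

```python
import numpy as np

# Illustrative sketch of the GD update (3) and the MGD updates (4)-(5) on a
# toy quadratic loss F(w) = 0.5 * ||X w - y||^2; all data here are synthetic.

def grad_F(w, X, y):
    """Gradient of the toy global loss F(w)."""
    return X.T @ (X @ w - y)

def gd_step(w, eta, X, y):
    """Eq. (3): w(t) = w(t-1) - eta * grad F(w(t-1))."""
    return w - eta * grad_F(w, X, y)

def mgd_step(w, d, eta, gamma, X, y):
    """Eqs. (4)-(5): d(t) = gamma*d(t-1) + grad F(w(t-1)); w(t) = w(t-1) - eta*d(t)."""
    d_next = gamma * d + grad_F(w, X, y)
    return w - eta * d_next, d_next

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
w_gd, w_mgd, d = np.zeros(10), np.zeros(10), np.zeros(10)
for _ in range(100):
    w_gd = gd_step(w_gd, 0.002, X, y)
    w_mgd, d = mgd_step(w_mgd, d, 0.002, 0.5, X, y)
```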
TABLE 1
MFL Notation Summary

$T$; $K$; $N$ — number of total local iterations; number of global aggregations (number of intervals); number of edge nodes
$t$; $k$; $\tau$; $[k]$ — iteration index; interval index; aggregation frequency with $\tau = T/K$; the interval $[(k-1)\tau, k\tau]$
$w^*$; $w^f$ — globally optimal parameter of $F(\cdot)$; the optimal parameter that MFL can obtain in Algorithm 1
$\eta$; $\beta$; $\rho$; $\gamma$ — the learning step size of MGD or GD; the $\beta$-smoothness parameter of $F_i(\cdot)$; the $\rho$-Lipschitz parameter of $F_i(\cdot)$; the momentum attenuation factor, which decides the proportion of the momentum term in MGD
$D_i$; $D$ — the local dataset of node $i$; the global dataset
$\delta_i$; $\delta$ — the upper bound between $\nabla F(w)$ and $\nabla F_i(w)$; the average of $\delta_i$ over all nodes
$F_i(\cdot)$; $F(\cdot)$ — the loss function of node $i$; the global loss function
$d(t)$; $w(t)$ — the global momentum parameter at iteration round $t$; the global model parameter at iteration round $t$
$\tilde{d}_i(t)$; $\tilde{w}_i(t)$ — the local momentum parameter of node $i$ at iteration round $t$; the local model parameter at iteration round $t$
$d_{[k]}(t)$; $w_{[k]}(t)$ — the momentum parameter of centralized MGD at iteration round $t$ in $[k]$; the model parameter of centralized MGD at iteration round $t$ in $[k]$
$\theta_{[k]}(t)$; $\theta$; $p$ — the angle between the vectors $\nabla F(w_{[k]}(t))$ and $d_{[k]}(t)$; $\theta$ is the maximum of $\theta_{[k]}(t)$ for $1 \le k \le K$ with $t \in [k]$; $p$ is the maximum ratio of $\|d_{[k]}(t)\|$ to $\|\nabla F(w_{[k]}(t))\|$ for $1 \le k \le K$ with $t \in [k]$

4 DESIGN OF MFL

In this section, we introduce the design of MFL to solve the distributed learning problem shown in (1). We first discuss the motivation of our work. Then we present the design of MFL in detail and the learning problem based on the federated system. The main notations of the MFL design and analysis are summarized in Table 1.

4.1 Motivation

Since MGD improves the convergence rate of GD [24], we want to apply MGD to the local update steps of FL and hope that the proposed MFL will accelerate the convergence rate for federated networks.

First, we illustrate the intuitive influence on the optimization problem after introducing the momentum term into gradient update methods. Considering GD, the update reduction of the parameter is $\eta\nabla F(w(t-1))$, which is only proportional to the gradient of $w(t-1)$. The update direction of GD is always along the gradient descent direction, so an oscillating update path can be caused, as shown by the GD update path in Fig. 2. However, the update reduction of the parameter for MGD is a superposition of $\eta\nabla F(w(t-1))$ and $\gamma(w(t-2) - w(t-1))$, which is the momentum term. As shown by the MGD update path in Fig. 2, utilizing the momentum term can deviate the direction of the parameter update toward the optimal descent significantly and mitigate the oscillation caused by GD. In Fig. 2, GD has an oscillating update path and costs seven iterations to reach the optimal point while MGD only needs three iterations to do that, which demonstrates that mitigating the oscillation by MGD leads to a faster convergence rate.

Fig. 2. Comparison of MGD and GD.

Because edge nodes of distributed networks are usually resource-constrained, solutions for convergence acceleration can attain higher resource utilization efficiency. Thus, motivated by the property that MGD improves the convergence rate, we use MGD to perform the local updates of FL, and this approach is named MFL.

In the following subsection, we design the MFL learning paradigm and propose the learning problem based on the MFL design.

4.2 MFL

In the MFL design, we use $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ to denote the momentum parameter and the model parameter of node $i$, respectively. All edge nodes are set to embed the same machine learning model, so the local loss functions $F_i(w)$ take the same form for all nodes, and the dimensions of both the model parameters and the momentum parameters are consistent. The parameter setup of MFL is similar to that of FL. We use $t$ to denote the local iteration index for $t = 0, 1, \ldots$, $\tau$ to denote the aggregation frequency and $[k]$ to denote the interval $[(k-1)\tau, k\tau]$, where $k$ denotes the interval index for $k = 1, 2, \ldots$. At $t = 0$, the momentum parameters and the model parameters of all nodes are initialized to the same values, respectively. When $t \in [k]$, $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ are updated based on MGD, called the local update steps. When $t = k\tau$, MFL performs the global aggregation steps, where $\tilde{d}_i(t)$ and $\tilde{w}_i(t)$ are sent to the central server synchronously. Then, in the central server, the global momentum parameter $d(t)$ and the global model parameter $w(t)$ are obtained by taking a weighted average of the received parameters, respectively, and are sent back to all edge nodes for the next interval.

The learning rules of MFL include the local update and the global aggregation steps. By continuous alternation of local update and global aggregation, MFL performs its learning process to minimize the global loss function $F(w)$. We describe the MFL learning process as follows.

First of all, we set initial values for $\tilde{d}_i(0)$ and $\tilde{w}_i(0)$. Then

1) Local Update: When $t \in [k]$, the local update is performed at each edge node by

$$\tilde{d}_i(t) = \gamma \tilde{d}_i(t-1) + \nabla F_i(\tilde{w}_i(t-1)) \quad (6)$$

$$\tilde{w}_i(t) = \tilde{w}_i(t-1) - \eta \tilde{d}_i(t). \quad (7)$$

According to (6) and (7), node $i$ performs MGD to optimize the loss function $F_i(\cdot)$ defined on its own dataset.
2) Global Aggregation: When $t = k\tau$, node $i$ transmits $\tilde{w}_i(k\tau)$ and $\tilde{d}_i(k\tau)$ to the central server, which takes weighted averages of the received parameters from the $N$ nodes to obtain the global parameters $w(k\tau)$ and $d(k\tau)$, respectively. The aggregation rules are presented as follows:

$$d(t) = \frac{\sum_{i=1}^{N} |D_i| \tilde{d}_i(t)}{|D|} \quad (8)$$

$$w(t) = \frac{\sum_{i=1}^{N} |D_i| \tilde{w}_i(t)}{|D|}. \quad (9)$$

Then the central server sends $d(k\tau)$ and $w(k\tau)$ back to all edge nodes, where $\tilde{d}_i(k\tau) = d(k\tau)$ and $\tilde{w}_i(k\tau) = w(k\tau)$ are set to enable the local update in the next interval $[k+1]$.

Note that only if $t = k\tau$ can the values of the global parameters $w(t)$ and $d(t)$ be observed. But we define $d(t)$ and $w(t)$ for all $t$ to facilitate the following analysis. A typical alternation is shown in Fig. 3, which illustrates the learning steps of MFL in intervals $[k]$ and $[k+1]$.

Fig. 3. Illustration of MFL local update and global aggregation steps from interval $[k]$ to $[k+1]$.

Algorithm 1. MFL. The dataset in each node has been set, and the machine learning model embedded in the edge nodes has been chosen. We have set appropriate model parameters $\eta$ and $\gamma$.
Input:
  The limited number of local updates in each node $T$
  A given aggregation frequency $\tau$
Output:
  The final global model weight vector $w^f$
1: Set the initial values of $w^f$, $\tilde{w}_i(0)$ and $\tilde{d}_i(0)$.
2: for $t = 1, 2, \ldots, T$ do
3:   Each node $i$ performs a local update in parallel according to (6) and (7). // Local update
4:   if $t == k\tau$ where $k$ is a positive integer then
5:     Set $\tilde{d}_i(t) \leftarrow d(t)$ and $\tilde{w}_i(t) \leftarrow w(t)$ for all nodes, where $d(t)$ and $w(t)$ are obtained by (8) and (9), respectively. // Global aggregation
       Update $w^f \leftarrow \arg\min_{w \in \{w^f, w(k\tau)\}} F(w)$
6:   end if
7: end for

The learning problem of MFL to attain the optimal model parameter is presented as (1). However, the edge nodes have limited computation resources with a finite number of local iterations. We assume that $T$ is the number of local updates and $K$ is the corresponding number of global aggregations. Thus, we have $t \le T$ and $k \le K$ with $T = K\tau$. Considering that $w(t)$ is unobservable for $t \ne k\tau$, we use $w^f$ to denote the achievable optimal model parameter defined on the resource-constrained MFL network. Hence, the learning problem is to obtain $w^f$ within $K$ global aggregations in particular, i.e.,

$$w^f \triangleq \arg\min_{w \in \{w(k\tau):\, k = 1, 2, \ldots, K\}} F(w). \quad (10)$$

The optimization algorithm of MFL is explained in Algorithm 1.
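As a concrete illustration of Algorithm 1, the following Python sketch (our own simplification, not the authors' implementation) runs the local MGD updates (6)-(7) and the weighted global aggregations (8)-(9). The local gradient function `grad_Fi`, the global loss `F` and the dataset sizes are hypothetical placeholders to be supplied by the caller.

```python
import numpy as np

# Hypothetical sketch of Algorithm 1 (not the authors' code). grad_Fi(i, w)
# returns node i's local gradient, F(w) evaluates the global loss, and
# sizes[i] = |D_i|; all three are placeholders supplied by the caller.

def mfl(grad_Fi, F, sizes, dim, T, tau, eta=0.002, gamma=0.5):
    weights = np.asarray(sizes, dtype=float) / np.sum(sizes)   # |D_i| / |D|
    N = len(sizes)
    w = [np.zeros(dim) for _ in range(N)]   # local model parameters w~_i
    d = [np.zeros(dim) for _ in range(N)]   # local momentum parameters d~_i
    w_f = np.zeros(dim)                     # best aggregated parameter so far
    for t in range(1, T + 1):
        for i in range(N):                  # local update, Eqs. (6)-(7)
            d[i] = gamma * d[i] + grad_Fi(i, w[i])
            w[i] = w[i] - eta * d[i]
        if t % tau == 0:                    # global aggregation, Eqs. (8)-(9)
            d_glob = sum(wt * di for wt, di in zip(weights, d))
            w_glob = sum(wt * wi for wt, wi in zip(weights, w))
            for i in range(N):              # broadcast back to every node
                d[i], w[i] = d_glob.copy(), w_glob.copy()
            if F(w_glob) < F(w_f):          # track the best w(k*tau), step 5
                w_f = w_glob.copy()
    return w_f
```

Setting `gamma = 0` in this sketch reduces the local step to the FL update of Section 3.2.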
5 CONVERGENCE ANALYSIS

In this section, we first introduce some definitions and assumptions for the MFL convergence analysis. Then, based on these preliminaries, the global convergence properties of MFL following Algorithm 1 are established and an upper bound on the MFL convergence rate is derived. The MFL convergence performance with respect to the related parameters is also analyzed.

5.1 Preliminaries

First of all, to facilitate the analysis, we assume that $F_i(w)$ satisfies the following conditions:

Assumption 1. For $F_i(w)$ in node $i$, we assume the following conditions:
1) $F_i(w)$ is convex;
2) $F_i(w)$ is $\rho$-Lipschitz, i.e., $|F_i(w_1) - F_i(w_2)| \le \rho\|w_1 - w_2\|$ for some $\rho > 0$ and any $w_1$, $w_2$;
3) $F_i(w)$ is $\beta$-smooth, i.e., $\|\nabla F_i(w_1) - \nabla F_i(w_2)\| \le \beta\|w_1 - w_2\|$ for some $\beta > 0$ and any $w_1$, $w_2$;
4) $F_i(w)$ is $\mu$-strongly convex, i.e., $a F_i(w_1) + (1-a)F_i(w_2) \ge F_i(a w_1 + (1-a)w_2) + \frac{a(1-a)\mu}{2}\|w_1 - w_2\|^2$, $a \in [0, 1]$, for some $\mu > 0$ and any $w_1$, $w_2$ [30, Theorem 2.1.9].

Because guaranteeing the global convergence of centralized MGD requires that the objective function be strongly convex [24], it is necessary to assume condition 4. Assumption 1 is satisfied for some learning models such as SVM, linear regression and logistic regression, whose loss functions are presented in Table 2. Experimental results presented in Section 7.2.1 show that for non-convex models such as CNN, whose loss function does not satisfy Assumption 1, MFL also performs well.

TABLE 2
Loss Function of Three Machine Learning Models

From Assumption 1, we can obtain the following lemma:
Lemma 1. $F(w)$ is convex, $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex.

Proof. According to the definition of $F(w)$ in (2), the triangle inequality and the definitions of the $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex properties, we can derive directly that $F(w)$ is convex, $\rho$-Lipschitz, $\beta$-smooth and $\mu$-strongly convex. □

Then we introduce the gradient divergence between $\nabla F(w)$ and $\nabla F_i(w)$ for any node $i$. It comes from the difference in the distributions of the local datasets.

Definition 2 (Gradient divergence). We define $\delta_i$ as the upper bound between $\nabla F(w)$ and $\nabla F_i(w)$ for any node $i$, i.e.,

$$\|\nabla F(w) - \nabla F_i(w)\| \le \delta_i. \quad (11)$$

Also, we define the average gradient divergence

$$\delta \triangleq \frac{\sum_i |D_i| \delta_i}{|D|}. \quad (12)$$

Boundedness of $\delta_i$ and $\delta$. Based on condition 3 of Assumption 1, we let $w_2 = w_i^*$, where $w_i^*$ is the optimal value minimizing $F_i(w)$. Because $F_i(w)$ is convex, we have $\|\nabla F_i(w_1)\| \le \beta\|w_1 - w_i^*\|$ for any $w_1$, which means $\|\nabla F_i(w)\|$ is finite for any $w$. According to Definition 1 and the linearity of the gradient operator, $\nabla F(w)$ is obtained by taking a weighted average of the $\nabla F_i(w)$. Therefore, $\|\nabla F(w)\|$ is finite, and $\|\nabla F(w) - \nabla F_i(w)\|$ has an upper bound, i.e., $\delta_i$ is bounded. Further, $\delta$ is also bounded from the linearity in (12).

Since the local update steps of MFL perform MGD, the upper bounds on the MFL and MGD convergence rates exhibit certain connections in the same interval. For the convenience of analysis, we use the variables $d_{[k]}(t)$ and $w_{[k]}(t)$ to denote the momentum parameter and the model parameter of centralized MGD in each interval $[k]$, respectively. This centralized MGD is defined on the global dataset and updated based on the global loss function $F(w)$. In interval $[k]$, the update rules of centralized MGD follow

$$d_{[k]}(t) = \gamma d_{[k]}(t-1) + \nabla F(w_{[k]}(t-1)) \quad (13)$$

$$w_{[k]}(t) = w_{[k]}(t-1) - \eta d_{[k]}(t). \quad (14)$$

At the beginning of interval $[k]$, the momentum parameter $d_{[k]}(t)$ and the model parameter $w_{[k]}(t)$ of centralized MGD are synchronized with the corresponding parameters of MFL, i.e.,

$$d_{[k]}((k-1)\tau) \triangleq d((k-1)\tau), \qquad w_{[k]}((k-1)\tau) \triangleq w((k-1)\tau).$$

For each interval $[k]$, the centralized MGD is performed by iterations of (13) and (14). In Fig. 4, we illustrate the distinctions between $F(w(t))$ and $F(w_{[k]}(t))$ intuitively.

Fig. 4. Illustration of the difference between MGD and MFL in intervals.

Compared with centralized MGD, the MFL aggregation interval with $\tau > 1$ brings a global update delay, because centralized MGD performs a global update on every iteration while MFL only spreads its global parameter to the edge nodes after $\tau$ local updates. Therefore, the convergence performance of MFL is worse than that of MGD, which essentially comes from the imbalance between several computation rounds and one communication round in the MFL design. The following subsection provides the resulting convergence performance gap between these two approaches.

5.2 Gap between MFL and Centralized MGD in Interval [k]

First, we consider the special case of the gap between MFL and centralized MGD for $\tau = 1$. From an intuitive perspective, MFL performs global aggregation after every local update and there is no global parameter update delay, i.e., the performance gap is zero. In Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/TPDS.2020.2975189, we prove theoretically that MFL is equivalent to MGD for $\tau = 1$.

Now considering the general case for any $\tau \ge 1$, the upper bound of the gap between $w(t)$ and $w_{[k]}(t)$ can be derived as follows.

Proposition 1 (Gap between MFL and centralized MGD in intervals). Given $t \in [k]$, the gap between $w(t)$ and $w_{[k]}(t)$ can be expressed by

$$\|w(t) - w_{[k]}(t)\| \le h(t - (k-1)\tau), \quad (15)$$

where we define

$$A \triangleq \frac{(1 + \gamma + \eta\beta) + \sqrt{(1 + \gamma + \eta\beta)^2 - 4\gamma}}{2\gamma},$$

$$B \triangleq \frac{(1 + \gamma + \eta\beta) - \sqrt{(1 + \gamma + \eta\beta)^2 - 4\gamma}}{2\gamma},$$

$$E \triangleq \frac{A}{(A - B)(\gamma A - 1)}, \qquad F \triangleq \frac{B}{(A - B)(1 - \gamma B)},$$
and $h(x)$ is the function defined in (16). Since $F(\cdot)$ is $\rho$-Lipschitz by Lemma 1, (15) further implies

$$F(w(t)) - F(w_{[k]}(t)) \le \rho h(t - (k-1)\tau). \quad (17)$$

Proof. First, we derive an upper bound of $\|\tilde{w}_i(t) - w_{[k]}(t)\|$ for node $i$. On the basis of this bound, we extend the result from the local cases to the global one to obtain the final result. The detailed proving process is presented in Appendix B, available in the online supplemental material. □

Because $h(1) = h(0) = 0$ and $h(x)$ increases with $x$ for $x \ge 1$, which are proven in Appendix C, available in the online supplemental material, we always have $h(x) \ge 0$ for $x = 0, 1, 2, \ldots$.

From Proposition 1, in any interval $[k]$, we have $h(0) = 0$ for $t = (k-1)\tau$, which fits the definition $w_{[k]}((k-1)\tau) = w((k-1)\tau)$. We also have $h(1) = 0$ for $t = (k-1)\tau + 1$. This means that there is no gap between MFL and centralized MGD when only one local update is performed after the global aggregation.

It is easy to find that if $\tau = 1$, $t - (k-1)\tau$ is either 0 or 1. Because $h(1) = h(0) = 0$, the upper bound in (15) is zero, and there is no gap between $F(w(t))$ and $F(w_{[k]}(t))$ from (17). This is consistent with Appendix A, available in the online supplemental material, where MFL yields centralized MGD for $\tau = 1$. In any interval $[k]$, we have $t - (k-1)\tau \in [0, \tau]$. If $\tau > 1$, $t - (k-1)\tau$ can be larger than 1. When $x > 1$, we know that $h(x)$ increases with $x$. According to the definitions of $A$, $B$, $E$ and $F$, we can easily obtain $\gamma A > 1$, $\gamma B < 1$ and $E, F > 0$. Because $0 < \gamma < 1$, the last term decreases linearly with $x$ when $x$ is large. Therefore, the first exponential term $E(\gamma A)^x$ in (16) is dominant when $x$ is large, and the gap between $w(t)$ and $w_{[k]}(t)$ increases exponentially with $t$.

Also, we find that $h(x)$ is proportional to the average gradient divergence $\delta$. This is because the greater the local gradient divergences at different nodes are, the larger the gap will be. So, considering the extreme situation where all nodes have the same data samples ($\delta = 0$ because the local loss functions are the same), the gap between $w(t)$ and $w_{[k]}(t)$ is zero and MFL is equivalent to centralized MGD.

5.3 Global Convergence

We have derived an upper bound between $F(w(t))$ and $F(w_{[k]}(t))$ for $t \in [k]$. According to the definition of MFL, at the beginning of each interval $[k]$ we set $d_{[k]}((k-1)\tau) = d((k-1)\tau)$ and $w_{[k]}((k-1)\tau) = w((k-1)\tau)$. The global upper bound on the convergence rate of MFL can be derived based on Proposition 1.

The following definitions are made to facilitate the analysis. First, we use $\theta_{[k]}(t)$ to denote the angle between the vectors $\nabla F(w_{[k]}(t))$ and $d_{[k]}(t)$ for $t \in [k]$, i.e., $\cos\theta_{[k]}(t) = \frac{\langle \nabla F(w_{[k]}(t)),\, d_{[k]}(t)\rangle}{\|\nabla F(w_{[k]}(t))\|\,\|d_{[k]}(t)\|}$. Then we define

$$p \triangleq \max_{1 \le k \le K,\; t \in [k]} \frac{\|d_{[k]}(t)\|}{\|\nabla F(w_{[k]}(t))\|},$$

and

$$\omega \triangleq \min_{k} \frac{1}{\|w((k-1)\tau) - w^*\|^2}.$$

Based on Proposition 1, which gives an upper bound on the loss function difference between MFL and centralized MGD, the global convergence rate of MFL can be derived as follows.

Lemma 2. If the following conditions are satisfied:
1) $\cos\theta \ge 0$, $0 < \eta\beta < 1$ and $0 \le \gamma < 1$;
2) there exists $\varepsilon > 0$ such that $F(w_{[k]}(k\tau)) - F(w^*) \ge \varepsilon$ for all $k$;
3) $F(w(T)) - F(w^*) \ge \varepsilon$;
4) $\omega\alpha - \frac{\rho h(\tau)}{\tau\varepsilon^2} > 0$ hold,
then we have

$$F(w(T)) - F(w^*) \le \frac{1}{T\left(\omega\alpha - \frac{\rho h(\tau)}{\tau\varepsilon^2}\right)}, \quad (18)$$

where we define

$$\alpha \triangleq \eta\left[\left(1 - \frac{\beta\eta}{2} - \frac{\beta\eta^2\gamma^2 p^2}{2}\right) + \eta\gamma(1 - \beta\eta)\cos\theta\right].$$

Proof. The proof is presented in Appendix D, available in the online supplemental material. □

On the basis of Lemma 2, we further derive the following proposition, which demonstrates the global convergence of MFL and gives its upper bound on the convergence rate.

Proposition 2 (MFL global convergence). Given $\cos\theta \ge 0$, $0 < \eta\beta < 1$, $0 \le \gamma < 1$ and $\alpha > 0$, we have

$$F(w^f) - F(w^*) \le \frac{1}{2T\omega\alpha} + \sqrt{\frac{1}{4T^2\omega^2\alpha^2} + \frac{\rho h(\tau)}{\omega\alpha\tau}} + \rho h(\tau). \quad (19)$$

Proof. The specific proving process is shown in Appendix E, available in the online supplemental material. □

According to the above Proposition 2, we get an upper bound of $F(w^f) - F(w^*)$, which is a function of $T$ and $\tau$. From inequality (19), we can find that MFL linearly converges to a lower bound $\sqrt{\frac{\rho h(\tau)}{\omega\alpha\tau}} + \rho h(\tau)$. Because $h(\tau)$ is related to $\tau$ and $\delta$, aggregation intervals ($\tau > 1$) and different data distributions collectively lead to MFL not converging to the optimum.
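To illustrate the structure of the bound (19), the following sketch (our own, with purely hypothetical values for $\rho h(\tau)$, $\omega\alpha$ and $\tau$, since these analysis quantities are not computed in closed form here) evaluates the right-hand side for increasing $T$ and prints the residual floor $\sqrt{\rho h(\tau)/(\omega\alpha\tau)} + \rho h(\tau)$ that remains as $T \to \infty$.

```python
import math

# Hypothetical illustration of the upper bound (19); the constants below are
# arbitrary assumptions, not values derived from the paper's experiments.
rho_h_tau = 0.05     # rho * h(tau)
omega_alpha = 2.0    # omega * alpha
tau = 4

def mfl_bound(T):
    """Right-hand side of (19) for a given number of local iterations T."""
    a = 1.0 / (2 * T * omega_alpha)
    return a + math.sqrt(a * a + rho_h_tau / (omega_alpha * tau)) + rho_h_tau

floor = math.sqrt(rho_h_tau / (omega_alpha * tau)) + rho_h_tau
for T in (10, 100, 1000, 10000):
    print(T, round(mfl_bound(T), 4), "floor:", round(floor, 4))
```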
In the following, we discuss the influence of $\tau$ on the convergence bound. If $\tau = 1$, we have $\rho h(\tau) = 0$, so that $F(w^f) - F(w^*)$ linearly converges to zero as $T \to \infty$, and the convergence rate is $\frac{1}{T\omega\alpha}$. Noting that $h(\tau) > 0$ if $\tau > 1$, we can find that in this case $F(w^f) - F(w^*)$ converges to a non-zero bound $\sqrt{\frac{\rho h(\tau)}{\omega\alpha\tau}} + \rho h(\tau)$ as $T \to \infty$. On the one hand, if there is no communication resource limit, setting the aggregation frequency $\tau = 1$ and performing global aggregation after each local update reaches the optimal convergence performance of MFL. On the other hand, an aggregation interval ($\tau > 1$) lets MFL utilize the communication resources of each node effectively, but brings about a decline in convergence performance.

6 COMPARISON BETWEEN FL AND MFL

In this section, we make a comparison of the convergence performance between MFL and FL.

A closed-form solution of the upper bound on the FL convergence rate has been derived in [17, Theorem 2]. It is presented as follows:

$$F(w^f_{FL}) - F(w^*) \le \frac{1}{2\eta\varphi T} + \sqrt{\frac{1}{4\eta^2\varphi^2 T^2} + \frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}} + \rho h_{FL}(\tau). \quad (20)$$

According to [17],

$$h_{FL}(\tau) = \frac{\delta}{\beta}\left((\eta\beta + 1)^{\tau} - 1\right) - \eta\delta\tau,$$

and $\varphi = \omega_{FL}(1 - \frac{\eta\beta}{2})$, where the expression of $\omega_{FL}$ is consistent with that of $\omega$. Differing from $\omega$, $w((k-1)\tau)$ in the definition of $\omega_{FL}$ is the global model parameter of FL.

We assume that both the MFL and FL solutions are applied in the system model proposed in Fig. 1. They are trained based on the same training dataset with the same machine learning model. The loss functions $F_i(\cdot)$ and global loss functions $F(\cdot)$ of MFL and FL are the same, respectively. The corresponding parameters of MFL and FL are equal, including $\tau$, $\eta$, $\rho$, $\delta$ and $\beta$. We set the same initial value $w(0)$ for MFL and FL. Because both MFL and FL are convergent, we have $\omega = \frac{1}{\|w(0) - w^*\|^2}$. Then, according to the definitions of $\omega$ and $\omega_{FL}$, we have $\omega = \omega_{FL}$. Therefore, the corresponding parameters of MFL and FL are the same and we can compare the convergence rates of FL and MFL conveniently.

For convenience, we use $f_1(T)$ and $f_2(T)$ to denote the upper bounds on the convergence rates of MFL and FL, respectively. Then we have

$$f_1(T) \triangleq \frac{1}{2T\omega\alpha} + \sqrt{\frac{1}{4T^2\omega^2\alpha^2} + \frac{\rho h(\tau)}{\omega\alpha\tau}} + \rho h(\tau) \quad (21)$$

and

$$f_2(T) \triangleq \frac{1}{2\eta\varphi T} + \sqrt{\frac{1}{4\eta^2\varphi^2 T^2} + \frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}} + \rho h_{FL}(\tau). \quad (22)$$

We consider the special case of $\gamma \to 0$. For $\omega\alpha$ and $\eta\varphi$, we can obtain $\omega\alpha \to \omega\eta(1 - \frac{\beta\eta}{2}) = \eta\varphi$ from the definition of $\alpha$. Then, for $h(\tau)$ and $h_{FL}(\tau)$, we have $\gamma A \to \eta\beta + 1$ and $\gamma B \to 0$. Because $\frac{A}{A-B} \to 1$ and $\frac{B}{A-B} \to 0$, we can further get $E \to \frac{1}{\eta\beta}$ and $F \to 0$ from the definitions of $E$ and $F$. So, according to (16), we have

$$\lim_{\gamma\to 0} h(\tau) = \eta\delta\left[\frac{1}{\eta\beta}(\eta\beta + 1)^{\tau} - \frac{1}{\eta\beta} - \tau\right] = \frac{\delta}{\beta}\left((1 + \eta\beta)^{\tau} - 1\right) - \eta\delta\tau = h_{FL}(\tau).$$

Hence, by the above analysis under $\gamma \to 0$, we can find that MFL and FL have the same upper bound on the convergence rate. This fact is consistent with the property that if $\gamma = 0$, MFL degenerates into FL and has the same convergence rate as FL.

To avoid complicated calculations over the expressions of $f_1(T)$ and $f_2(T)$, we have the following lemma.

Lemma 3. If there exists $T_1 \ge 1$ such that $\frac{1}{2T\omega\alpha}$ dominates in $f_1(T)$ and $\frac{1}{2\eta\varphi T}$ dominates in $f_2(T)$ for $T < T_1$, i.e.,

$$\frac{1}{2T\omega\alpha} \gg \max\left\{\rho h(\tau),\; \sqrt{\frac{\rho h(\tau)}{\omega\alpha\tau}}\right\}$$

and

$$\frac{1}{2\eta\varphi T} \gg \max\left\{\rho h_{FL}(\tau),\; \sqrt{\frac{\rho h_{FL}(\tau)}{\eta\varphi\tau}}\right\},$$

then we have

$$f_1(T) \approx \frac{1}{T\omega\alpha} \quad \text{and} \quad f_2(T) \approx \frac{1}{T\eta\varphi}$$

for $T < T_1$.

Proof. We can find such a $T_1$. For example, considering (21), if $\eta \to 0$, we have $\alpha \to 0$ from the definition of $\alpha$ and $h(\tau) \to 0$ from Appendix F, available in the online supplemental material. So we can easily derive $\omega\alpha\rho h(\tau) \to 0$ and $\sqrt{\frac{\omega\alpha\rho h(\tau)}{\tau}} \to 0$. Then we can find $T_1 \ge 1$ which satisfies $\frac{1}{2T} \gg \omega\alpha\rho h(\tau)$ and $\frac{1}{2T} \gg \sqrt{\frac{\omega\alpha\rho h(\tau)}{\tau}}$ for $T < T_1$. Hence, $\frac{1}{2T\omega\alpha}$ dominates in $f_1(T)$ and $f_1(T) \approx \frac{1}{T\omega\alpha}$. For the same reason, considering (22), if $\eta \to 0$, we have $\eta\varphi \to 0$ from the definition of $\varphi$ and $h_{FL}(\tau) \to 0$ from its definition. So we can easily derive $\eta\varphi\rho h_{FL}(\tau) \to 0$ and $\sqrt{\frac{\eta\varphi\rho h_{FL}(\tau)}{\tau}} \to 0$. Then, for $T < T_1$, $\frac{1}{2T} \gg \eta\varphi\rho h_{FL}(\tau)$ and $\frac{1}{2T} \gg \sqrt{\frac{\eta\varphi\rho h_{FL}(\tau)}{\tau}}$. Hence, $\frac{1}{2\eta\varphi T}$ dominates in $f_2(T)$ and $f_2(T) \approx \frac{1}{T\eta\varphi}$. □

Based on Lemma 3, we have the following proposition.

Proposition 3 (Accelerated convergence of MFL). If the following conditions are satisfied:
1) $\eta\beta \ll 1$;
2) $T < T_1$;
3) $0 < \gamma < \min\left\{\frac{2(1-\eta\beta)\cos\theta}{\beta\eta p^2},\, 1\right\}$,
MFL converges faster than FL, i.e.,

$$f_1(T) < f_2(T).$$

Proof. From condition 1 and condition 2, we have $f_1(T) \approx \frac{1}{T\omega\alpha}$ and $f_2(T) \approx \frac{1}{T\eta\varphi}$. Due to the definitions of $\alpha$ and $\varphi$, the inequality $0 < \gamma < \frac{2(1-\beta\eta)\cos\theta}{\beta\eta p^2}$ is equivalent to $\omega\alpha > \eta\varphi$. So if $\omega\alpha > \eta\varphi$, it is obvious that $\frac{1}{T\omega\alpha} < \frac{1}{T\eta\varphi}$, i.e., $f_1(T) < f_2(T)$. However, $0 < \gamma < 1$ is the condition of MFL convergence. Hence, condition 3 is the range of MFL convergence acceleration after combining with the MFL convergence guarantee $0 < \gamma < 1$. □
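For completeness, the equivalence invoked in the proof can be checked directly from the definition of $\varphi$ and from the definition of $\alpha$ as reconstructed above; this expansion is our own addition, not text from the original:

$$\omega\alpha > \eta\varphi = \eta\,\omega\Bigl(1 - \tfrac{\beta\eta}{2}\Bigr)
\;\Longleftrightarrow\;
\alpha > \eta\Bigl(1 - \tfrac{\beta\eta}{2}\Bigr)
\;\Longleftrightarrow\;
\eta\gamma(1 - \beta\eta)\cos\theta > \frac{\beta\eta^2\gamma^2 p^2}{2}
\;\Longleftrightarrow\;
\gamma < \frac{2(1 - \beta\eta)\cos\theta}{\beta\eta p^2},$$

where the last step divides both sides by $\eta\gamma > 0$.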
7 SIMULATION AND DISCUSSION

In this section, we build and evaluate MFL systems based on the MNIST and CIFAR-10 datasets. We first describe the simulation environment and the relevant parameter setups. Second, we present and evaluate the comparative simulation results of MFL, FL and MGD under different machine learning models, which include SVM, linear regression, logistic regression and CNN. Finally, extensive experiments are implemented to explore the impacts of $\gamma$, $\tau$ and non-i.i.d data distribution on MFL convergence performance, and to investigate the communication efficiency of MFL compared with that of FL.

7.1 Simulation Setup

Using Python, we build a federated network framework where distributed edge nodes coordinate with the central server. In our network, the number of edge nodes can be chosen arbitrarily. SVM, linear regression, logistic regression and CNN are applied to model training. The loss functions of the first three models at node $i$ are presented in Table 2 [31], and the loss function of CNN is cross-entropy (see [32] for details). Note that $|D_i|$ is the number of training samples in node $i$ and the loss function of logistic regression is cross-entropy. For logistic regression, the model output $\sigma(w, x_j)$ is the sigmoid function for a non-linear transform. It is defined by

$$\sigma(w, x_j) \triangleq \frac{1}{1 + e^{-w^{\mathrm{T}} x_j}}. \quad (23)$$

In our experiments, training and testing samples are randomly allocated to each node, which means that the information of each node is uniform and the data distribution at the edge nodes is i.i.d. (only in Section 7.2.2 is non-i.i.d data distribution used; the rest of the experiments use i.i.d data distribution). We use FL and centralized MGD as benchmarks for comparison with MFL. For the SVM, linear and logistic regression models, deterministic gradient methods are performed for MFL, FL and centralized MGD. However, for the CNN model, stochastic gradient methods are used for MFL, FL and centralized MGD due to the large training data size.

SVM, linear and logistic regression are trained and tested on the MNIST dataset [33], which contains 50,000 training handwritten digits and 10,000 testing handwritten digits. In our experiments, we only utilize 5,000 training samples and 5,000 testing samples because of the limited processing capacities of GD and MGD. In this dataset, the $j$th sample $x_j$ is a 784-dimensional input vector which is vectorized from a 28 x 28 pixel matrix, and $y_j$ is the scalar label corresponding to $x_j$. SVM, linear and logistic regression are used to classify whether the digit is even or odd. If the image of $x_j$ represents an even number, then we set $y_j = 1$; otherwise, $y_j = -1$. But for logistic regression, we set $y_j = 1$ for an even number and $y_j = 0$ for an odd one.

CNN is trained based on the MNIST and CIFAR-10 datasets. The CIFAR-10 dataset includes 50,000 color images for training and 10,000 color images for testing, and has 10 different types of objects [34]. We use CNN to perform the classification among the 10 different labels under the MNIST and CIFAR-10 datasets, respectively.

For the experimental setups, we set 4 edge nodes in FL and MFL, and the training models are distributed to all the edge nodes. The same initializations of model parameters are performed and the same data distributions are set for MFL and FL. Also, $\tilde{d}_i(0) = 0$ is set for node $i$. We set the learning step size $\eta = 0.002$, which is sufficiently small, the SVM parameter to 0.3 and the total number of local iterations $T = 1{,}000$ for the following simulations.

7.2 Simulation Evaluation

In this subsection, we verify the convergence acceleration of MFL and explore the effects of non-i.i.d data distribution, $\gamma$ and $\tau$ on MFL convergence by simulation evaluation. We further investigate the communication efficiency of MFL compared with that of FL.

7.2.1 Convergence

In our first simulation, the models of SVM, linear regression, logistic regression and CNN are trained and we verify the accelerated convergence of MFL. We set the aggregation frequency $\tau = 4$ and the momentum attenuation factor $\gamma = 0.5$. MFL, FL and MGD are performed based on the four machine learning models. MGD is implemented based on the global dataset, which is obtained by gathering the distributed data from all nodes. The global loss functions of the three solutions are defined based on the same global training and testing data.

The curves of loss function values and accuracy with iterative times are presented in Fig. 5. We can see that the loss function curves for all the learning models gradually converge with iterative times. Similarly, the test accuracy curves for SVM and CNN gradually rise until convergence with iterative times. Therefore, the convergence of MFL is verified. We also see that the descent speeds of the MFL loss function curves on the four learning models are always faster than those of FL, while the centralized MGD convergence speeds are the fastest. So, compared with FL, MFL provides a significant improvement in convergence rate. MGD converges with the fastest speed because MFL and FL suffer the delay in global gradient update for $\tau = 4$. Finally, comparing the results of CNN and SVM, we can conclude that based on the CNN model, MFL still shows convergence performance similar to what it shows in convex model training. So the proposed MFL can perform well in neural networks with non-convex loss functions.
Fig. 5. Loss function values and testing accuracy under FL, MFL and MGD. (a) and (b) are the loss function and test accuracy curves of SVM, respectively; (c) and (d) are the loss function curves of linear regression and logistic regression, respectively; (e) and (f) are the loss function and test accuracy curves of CNN trained on MNIST, respectively; (g) and (h) are the loss function and test accuracy curves of CNN trained on CIFAR-10, respectively.
Fig. 6. (a) and (b) are loss function and testing accuracy curves under different data distribution cases, respectively.
Because linear and logistic regression cannot provide testing accuracy curves, we focus on the SVM model in the following experiments and further explore the impact of the MFL parameters on the convergence rate.

7.2.2 Effect of Non-i.i.d. Data Distribution

In this experiment, we consider three cases for distributing the data samples to the different nodes. The three data distribution cases at the edge nodes are representative of uniform information, totally non-uniform information and a mixture of the previous two cases, respectively. The specific settings of the three cases are as follows (a sketch of the corresponding allocation rules is given after this list):

Case 1: For the uniform information distribution, each data sample is randomly allocated to a node. In this case, we consider that the data on each node have uniform characteristics. Therefore, this case satisfies i.i.d data distribution and serves as a benchmark.

Case 2: All data samples at an individual node have the same label (if there are more labels than nodes, each node could have samples with more than one label, but not the total number of labels). Because the global dataset has multiple labels, this case leads to a non-uniform information distribution, which means the characteristics brought by each node are not uniform. Thus, this case corresponds to totally non-i.i.d data distribution.

Case 3: In this case, the first half of the $N$ nodes follow the random allocation rule of Case 1 to obtain uniform information, and the second half of the nodes follow the allocation rule of Case 2. This case is a combination of uniform and non-uniform information. We use this case to explore the effect of the mixture of i.i.d and non-i.i.d data distribution.
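The three allocation rules can be prototyped as in the sketch below (our own illustration, not the authors' code); `labels` is a hypothetical array of per-sample labels, and the function only returns the sample indices assigned to each node.

```python
import numpy as np

# Hypothetical sketch of the three data-distribution cases. labels is an
# array of per-sample labels; returns a list of index arrays, one per node.

def partition(labels, num_nodes, case, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    if case == 1:                      # Case 1: uniform, i.i.d. allocation
        return np.array_split(idx, num_nodes)
    if case == 2:                      # Case 2: group by label -> non-i.i.d.
        by_label = np.argsort(labels, kind="stable")
        return np.array_split(by_label, num_nodes)
    if case == 3:                      # Case 3: half i.i.d., half non-i.i.d.
        half = len(labels) // 2
        iid = np.array_split(idx[:half], num_nodes // 2)
        rest = idx[half:]
        rest_sorted = rest[np.argsort(labels[rest], kind="stable")]
        non_iid = np.array_split(rest_sorted, num_nodes - num_nodes // 2)
        return iid + non_iid
    raise ValueError("case must be 1, 2 or 3")
```

With 4 nodes and the 10 MNIST labels, Case 2 in this sketch gives each node two to three labels, matching the parenthetical remark in the case description.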
Fig. 7. The influence of $\gamma$ on MFL convergence. (a) Loss function values with iterative times under different $\gamma$; (b) testing accuracy with iterative times under different $\gamma$; (c) loss function values with $\gamma$ when $T = 1000$.
In this experiment, SVM is used for the training of the MFL network under the above data distribution cases. We set the aggregation frequency $\tau = 4$ and the momentum attenuation factor $\gamma = 0.5$ for the general MFL algorithm.

The experimental results are shown in Fig. 6. The two subfigures show the influence of the different data distributions on MFL convergence. We see that the loss function and testing accuracy curves of MFL always converge whether the data distribution at the edge nodes is i.i.d or non-i.i.d, which means that even under non-uniform information distribution, MFL training still achieves the expected convergence and shows its robustness. We also see that Case 2 and Case 3 have worse performance than Case 1, because each node in Case 2 and Case 3 has totally or partially non-uniform information. Further, Case 2 shows the worst convergence performance. The comparison results illustrate that non-i.i.d data distribution still retains MFL convergence but decreases MFL convergence performance.

7.2.3 Effect of $\gamma$

We evaluate the impact of $\gamma$ on the convergence rate of the loss function. In this simulation, we still set the aggregation frequency $\tau = 4$.

The experimental results are shown in Fig. 7. Subfigures (a) and (b) show how different values of $\gamma$ affect the convergence curves of the loss function and testing accuracy, respectively. We can see that if $\gamma = 0$, the loss function and accuracy curves of MFL overlap with the corresponding ones of FL because MFL is equivalent to FL for $\gamma = 0$. When $\gamma$ increases from 0 to 0.9, we can see that the convergence rates on both the loss function curves and the accuracy curves also gradually increase. Subfigure (c) shows the change of the final loss function value ($T = 1000$) with $0 < \gamma < 1$. From this subfigure, we can find that the final loss function values of MFL are always smaller than those of FL for $0 < \gamma < 1$. Compared with FL, the convergence performance of MFL is improved. This is because $\frac{2(1-\beta\eta)\cos\theta}{\beta\eta p^2} > 1$ and, according to Proposition 3, the accelerated convergence range of MFL is $0 < \gamma < 1$. We can see that when $0 < \gamma < 0.95$, the loss function values decrease monotonically with $\gamma$, so the convergence rate of MFL increases with $\gamma$. When $\gamma > 0.95$, the loss function values of MFL start to increase with a gradual deterioration of MFL convergence performance, and in this situation MFL cannot remain convergent. If the $\gamma$ values are chosen to be close to 1, best around 0.9, MFL reaches the optimal convergence rate.

7.2.4 Effect of $\tau$

Finally, we evaluate the effect of different $\tau$ on the loss function of MFL. We record the final loss function values with $\tau$ based on the three cases of $T = 1{,}000$ for FL, $T = 1{,}000$ for MFL and $T = 60{,}000$ for MFL. We set $\gamma = 0.5$ for MFL. The curves for the three cases are presented in Fig. 8. Comparing FL with MFL for $T = 1{,}000$, we see that the final loss function values of MFL are smaller than those of FL for any $\tau$. As declared in Proposition 3, under a small magnitude of $T$ and $\eta = 0.002$, which is close to 0, MFL always converges much faster than FL. Further, for $T = 1{,}000$, the effect of $\tau$ on convergence is slight because the curves of FL and MFL are relatively flat. This can be explained by Lemma 3, where $\frac{1}{2\eta\varphi T}$ and $\frac{1}{2\omega\alpha T}$ dominate the convergence upper bound when the magnitude of $T$ is small. For $T = 60{,}000$, the change of $\tau$ affects convergence significantly and the final loss function values gradually increase with $\tau$. As in the cases of $T = 1{,}000$ for MFL and FL, the case of $T = 60{,}000$ for MFL shows only a slight effect on convergence if $\tau < 100$. But if $\tau > 100$, MFL convergence performance gets worse with $\tau$. According to the above analysis of $\tau$, setting an appropriate aggregation frequency reduces the convergence performance only slightly while bringing a decline in communication cost (in our cases, $\tau = 100$).
[28] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 315-323.
[29] N. Qian, "On the momentum term in gradient descent learning algorithms," Neural Netw., vol. 12, no. 1, pp. 145-151, 1999.
[30] Y. Nesterov, Lectures on Convex Optimization, vol. 137. Berlin, Germany: Springer, 2018.
[31] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2014.
[32] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[34] A. Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., University of Toronto, 2009.

Wei Liu received the BE degree in electronic information engineering from the University of Science and Technology of China, Hefei, China, in 2018. He is currently working toward the ME degree with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. His research interests include distributed machine learning and accelerated computation.

Yunfei Chen (Senior Member, IEEE) received the BE and ME degrees in electronics engineering from Shanghai Jiaotong University, Shanghai, P.R. China, in 1998 and 2001, respectively, and the PhD degree from the University of Alberta, in 2006. He is currently working as an associate professor with the University of Warwick, United Kingdom. His research interests include wireless communications, cognitive radios, wireless relaying and energy harvesting.

Wenyi Zhang (Senior Member, IEEE) received the bachelor's degree in automation from Tsinghua University, in 2001, and the master's and PhD degrees in electrical engineering from the University of Notre Dame, in 2003 and 2006, respectively. He is currently a professor with the Department of Electronic Engineering and Information Science, University of Science and Technology of China. He was affiliated with the Communication Science Institute, University of Southern California, as a postdoctoral research associate, and with Qualcomm Incorporated, Corporate Research and Development. His research interests include wireless communications and networking, information theory, and statistical signal processing. He was an editor for the IEEE Communications Letters, and is currently an editor for the IEEE Transactions on Wireless Communications.