Abstract—Wireless traffic prediction is essential for cellular networks to realize intelligent network operations, such as load-aware resource management and predictive control. Existing prediction approaches usually adopt centralized training architectures and require transferring huge amounts of traffic data, which may raise delay and privacy concerns in certain scenarios. In this work, we propose a novel wireless traffic prediction framework named Dual Attention-Based Federated Learning (FedDA), by which a high-quality prediction model is trained collaboratively by multiple edge clients. To simultaneously capture the various wireless traffic patterns and keep raw data locally, FedDA first groups the clients into different clusters by using a small augmentation dataset. Then, a quasi-global model is trained and shared among clients as prior knowledge, aiming to solve the statistical heterogeneity challenge confronted by federated learning. To construct the global model, a dual attention scheme is further proposed that aggregates the intra- and inter-cluster models, instead of simply averaging the weights of local models. We conduct extensive experiments on two real-world wireless traffic datasets, and the results show that FedDA outperforms state-of-the-art methods. The average mean squared error performance gains on the two datasets are up to 10% and 30%, respectively.

Index Terms—wireless traffic prediction, federated learning, deep neural networks
I. INTRODUCTION
Since the commercialization of fifth generation (5G) communication networks in 2019, the preliminary research on the potential features and enabling technologies for the sixth generation (6G) of communications has attracted extensive attention in academia and industry [1, 2]. These include a set of emerging technologies and novel paradigms, e.g., terahertz spectrum, space-air-ground communications, large reflecting surfaces, and cognitive radios [3]. In addition, the communication research community has reached a consensus that artificial intelligence (AI) is the key to implementing novel paradigms, coordinating heterogeneous networks, organizing various communication resources, and enabling truly smart 6G communications in the 2030s [4]. In particular, AI aided by big data and high-rate real-time transmission capability is expected to be the most efficient approach to reduce network overhead and elevate the quality of service (QoS) of both access and core networks [5].

To facilitate the fusion of AI and communication networks, wireless traffic prediction is indispensable. Wireless traffic prediction [6, 7] estimates future traffic data volume and provides the decision basis for communication network management and optimization [8]. With the predicted traffic data, proactive measures can be taken to mitigate the network congestion and outages caused by burst transmissions. Moreover, the heterogeneous service requirements, which are expected to become common in 6G communication networks [9], can be satisfied at a lower cost with the help of wireless traffic prediction. This will lead to a significant improvement in the QoS from both the network's and the user's perspectives.

Currently, most wireless traffic prediction approaches rely on a centralized learning strategy and involve transferring huge amounts of raw data to a datacenter to learn a generalized prediction model. However, the frequent transmission of training data and the associated signaling overhead can easily exhaust the network capacity and negatively impact payload transmissions. Thus, new wireless traffic prediction approaches that can cope with these challenges are needed.

The emergence and success of federated learning (FL) [10–14] make it possible to solve the prediction problem while keeping data locally. In the FL setting, many clients, e.g., mobile devices, base stations (BSs), or companies, collaboratively train a prediction model under the orchestration of a central server. Only intermediate gradients or model parameters obtained by local training are sent to the central server, instead of the raw data. There are strong reasons to adopt FL in next-generation communications [15]. First, advances in edge computing have paved the way for easily implementing FL in reality. As edge clients are equipped with abundant computing resources [16], a powerful centralized datacenter is no longer essential and the delay of transferring raw data can be considerably reduced. In addition, FL facilitates unprecedented large-scale, flexible data collection and model training. The edge clients can proactively collect data during day hours and then jointly update the global model during night hours, to improve efficiency and accuracy for next-day usage.
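For concreteness, the following minimal sketch illustrates one communication round of generic federated averaging in the style of FedAvg [10] (not the FedDA aggregation proposed in this paper): every client trains a local copy of the shared model on its own data, and only the resulting parameters, never the raw traffic records, are uploaded and averaged at the server. The toy linear model, the helper names, and the synthetic client data are assumptions of this sketch.

```python
import copy
import torch
import torch.nn as nn

def local_update(global_model: nn.Module, data, epochs: int = 1, lr: float = 0.01):
    """Train a local copy of the global model on one client's data; only the
    resulting parameters (never the raw data) leave the client."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict(), sum(len(x) for x, _ in data)

def fedavg_round(global_model: nn.Module, client_data):
    """One communication round: average the uploaded client parameters,
    weighted by the number of local samples (FedAvg-style aggregation)."""
    states, sizes = zip(*(local_update(global_model, d) for d in client_data))
    total = float(sum(sizes))
    averaged = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
                for k in states[0]}
    global_model.load_state_dict(averaged)
    return global_model

# Toy usage: four clients, each holding one synthetic batch of (features, target).
clients = [[(torch.randn(8, 6), torch.randn(8, 1))] for _ in range(4)]
model = fedavg_round(nn.Linear(6, 1), clients)
```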
Despite the promising application prospects, accurate wireless traffic prediction under the FL setting is still a major research challenge, especially for network-wide prediction. This is because user mobility can cause sophisticated spatio-temporal coupling among wireless traffic, which can hardly be captured and modeled. Furthermore, different BSs may have distinct traffic patterns, which makes the traffic data highly heterogeneous, and learning and prediction on this kind of heterogeneous data is very challenging.

Therefore, to cope with the wireless traffic prediction issues of future communication networks, we propose a novel wireless traffic prediction framework named dual attention-based federated learning (FedDA), by which a high-quality prediction model is trained collaboratively by multiple BSs. The FedDA framework relies on a set of state-of-the-art training paradigms, including a data augmentation-assisted clustering strategy, an intermediate and auxiliary training model, a dual attention-based model aggregation, and a hierarchical aggregation structure. Specifically, the processing of FedDA can be split into three stages to ensure a high-accuracy, transferable, and secure training process for wireless traffic prediction.

We first introduce an augmentation-assisted clustering strategy to group all BSs, i.e., clients in the context of FL, into a number of clusters depending on their augmented traffic patterns and geographic locations. Then, leveraging the augmented data collected from distributed BSs, a quasi-global prediction model can be constructed at the central server. This quasi-global model is used to mitigate the generalization difficulty of the global model caused by the statistical heterogeneity among traffic data collected from different clusters. Finally, instead of simply averaging the model weights collected from local clients to yield the global model, a dual attention-based model aggregation mechanism and a hierarchical aggregation structure are adopted at the central server. By introducing the dual attention and hierarchical settings, an adequate equilibrium between generality and specialty can be achieved.
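As an illustration of the first stage only, the sketch below groups clients by concatenating a coarse summary of their augmented traffic with their geographic coordinates; plain k-means is used here merely as a stand-in for the iterative clustering strategy proposed in the paper, and the feature construction, the synthetic data, and the number of clusters are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_clients, series_len = 100, 168                      # e.g., one week of hourly traffic per BS
augmented_traffic = rng.gamma(2.0, 1.0, size=(n_clients, series_len))
locations = rng.uniform(0, 10, size=(n_clients, 2))   # (x, y) coordinates of the BSs

# Feature vector per client: average daily traffic profile plus location.
daily_profile = augmented_traffic.reshape(n_clients, 7, 24).mean(axis=1)
features = StandardScaler().fit_transform(np.hstack([daily_profile, locations]))

cluster_ids = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(features)
clusters = {k: np.flatnonzero(cluster_ids == k) for k in range(16)}
```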
Following the proposed FedDA framework and the descriptions given above, the contributions of this paper can be summarized as follows:
• We design a data augmentation-assisted iterative clustering strategy, which takes the augmented data and geographic locations of clients as the clustering reference to simultaneously capture the various traffic patterns of clients and protect data privacy.
• We introduce a quasi-global model, which is an intermediate and auxiliary tool to mitigate the generalization difficulty of the global model caused by the statistical heterogeneity among traffic patterns collected from different clients.
• We propose the FedDA framework consisting of two advanced settings for aggregation, namely the dual attention-based model aggregation mechanism and the hierarchical aggregation structure. In this way, the central server can not only capture the cluster-specific data patterns but also ensure the transferability of the global model.
• We verify the effectiveness and efficiency of the FedDA framework by testing on two real-world datasets and comparing the experimental results with those generated by existing algorithms.

The rest of this paper is organized as follows. Section II reviews related works on wireless traffic prediction and FL. Section III gives the system model and problem statement. In Section IV, we introduce our proposed dual attention mechanism in detail, including the data augmentation-assisted client clustering, the mathematical expression of dual attention, and the corresponding optimization techniques. Section V presents and discusses all the experiments. Finally, Section VI concludes this paper. Our code is available at https://github.com/chuanting/FedDA.
II. RELATED WORK

As the present work is closely related to wireless traffic prediction and FL, we review the most related achievements and milestones of these two research topics in this section.

A. Wireless Traffic Prediction

Recently, wireless traffic prediction has received a lot of attention, as many tasks in wireless communications require accurate traffic modeling and prediction capabilities. Wireless traffic prediction is essentially a time series prediction problem. The methods to solve it can be roughly classified into three categories, i.e., simple methods, parametric methods, and non-parametric methods.

Historical average and naïve methods are representatives of the first category [17]. The former predicts all future values as the average of the historical data, while the latter takes the last observation as the future value. This kind of prediction method involves no complex computations, and is thus quite simple and easy to implement. However, as simple methods fail to capture the hidden patterns of wireless traffic, their prediction performance is relatively poor.
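For concreteness, a minimal sketch of these two simple baselines on a synthetic univariate traffic series is given below; the series and the one-day horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
traffic = rng.gamma(2.0, 1.0, size=7 * 24)      # one week of hourly traffic for one cell

history, horizon = traffic[:-24], 24
ha_forecast = np.full(horizon, history.mean())  # historical average: mean of all past values
naive_forecast = np.full(horizon, history[-1])  # naive: repeat the last observation

actual = traffic[-24:]
mse = lambda y_hat: float(np.mean((actual - y_hat) ** 2))
print(f"HA MSE: {mse(ha_forecast):.3f}, naive MSE: {mse(naive_forecast):.3f}")
```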
For the second category, i.e., parametric methods, the wireless traffic is modeled and predicted based on tools from statistics and probability theory. The most classical method is the AutoRegressive Integrated Moving Average (ARIMA) model [18]. To characterize the self-similarity and burstiness of wireless traffic, ARIMA and its variants were explored in [19, 20]. In a recent study, [21] first decomposed the wireless traffic into regularity and randomness components. The authors then demonstrated that the regularity component can be predicted through the ARIMA model, but the prediction of the random component is impossible. Besides the ARIMA model, the α-stable model [22], entropy theory [23], and covariance functions [24] were also explored to perform wireless traffic prediction.
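A parametric baseline of this kind can be fitted with an off-the-shelf ARIMA implementation, as in the minimal sketch below; it assumes statsmodels is available, and the synthetic series and the order (2, 1, 2) are purely illustrative rather than values taken from the cited studies.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
t = np.arange(14 * 24)
traffic = 5 + 2 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.3, t.size)  # two weeks, daily cycle

train, test = traffic[:-24], traffic[-24:]
model = ARIMA(train, order=(2, 1, 2)).fit()     # (p, d, q) chosen for illustration only
forecast = model.forecast(steps=24)             # one-day-ahead prediction

print(f"ARIMA MSE: {np.mean((test - forecast) ** 2):.3f}")
```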
As machine learning and AI techniques [25] continue their fast evolution, non-parametric methods have become strong competitors to parametric methods for wireless traffic prediction. In particular, recent years have witnessed an obvious trend of solving wireless traffic prediction problems with deep neural networks. In [26], the authors proposed a deep belief network-based prediction method for wireless mesh networks. In [6], a hybrid deep learning framework was designed on the basis of autoencoders and Long Short-Term Memory networks (LSTM) to simultaneously capture the spatial and temporal dependence among different cells. Aiming to perform prediction on multiple cells, the researchers also introduced a multi-task learning framework based on LSTM [27]. Besides, city-scale wireless traffic predictions are investigated in [7, 28], in which the authors introduced novel prediction frameworks by modeling spatio-temporal dependence over cross-domain datasets.

All aforementioned works mainly focus on wireless traffic prediction in a centralized way. Our proposed framework differs from the above works in that we try to solve the wireless traffic prediction problem with a distributed architecture and federated learning.
B. Federated Learning
FL provides a distributed training architecture that can be jointly applied with many machine learning algorithms, in
the α_m and β, the parameters of the output model can be updated by the gradient descent algorithm. We first calculate the derivative of (4) with respect to w_O^t and then update w_O^t by taking a step in the opposite direction of the obtained gradient, where γ is a predetermined step size that controls how much w_O should move in every iteration. The whole procedure of our proposed FedDA is summarized in Algorithm 2.
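Since equation (4) and its derivative are not reproduced in this excerpt, the snippet below is only an illustrative sketch of a gradient-style server update guided by attention weights, in the spirit of attentive aggregation (cf. FedAtt [31]): the output model is pulled toward the local models and the quasi-global model with a step size γ. The softmax weighting over parameter distances and the way β enters the update are assumptions of this sketch, not the exact FedDA rule.

```python
import numpy as np

def attentive_update(w_out, local_models, w_quasi, gamma=0.1, beta=0.3):
    """Move the output model w_out toward the local models and the quasi-global
    model, weighted by attention scores derived from parameter distances."""
    dists = np.array([np.linalg.norm(w_out - w_m) for w_m in local_models])
    alpha = np.exp(-dists) / np.exp(-dists).sum()      # closer models get larger weight
    grad = sum(a * (w_out - w_m) for a, w_m in zip(alpha, local_models))
    grad = grad + beta * (w_out - w_quasi)             # pull toward the quasi-global prior
    return w_out - gamma * grad                        # gradient step with step size gamma

# Toy usage with flattened parameter vectors.
rng = np.random.default_rng(3)
w_out = rng.normal(size=10)
local_models = [w_out + 0.1 * rng.normal(size=10) for _ in range(5)]
w_quasi = w_out + 0.05 * rng.normal(size=10)
w_next = attentive_update(w_out, local_models, w_quasi)
```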
V. EXPERIMENTS

In this section, extensive experiments are conducted to validate the effectiveness and efficiency of FedDA. We begin with a brief introduction of the datasets, evaluation metrics, and baseline methods. Then, the experimental settings, such as the learning rate and batch size, are given. After that, we present the experimental results, including the overall prediction performance of various methods and the influence of learning (hyper-)parameters on the prediction performance.
A. Dataset and Evaluation Metrics

The datasets used in this paper come from the Big Data Challenge [34] launched by Telecom Italia and mainly consist of Call Detail Records (CDR) of two Italian areas, i.e., the city of Milan and the province of Trento.
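The metric definitions from this subsection are not fully reproduced in this excerpt; the short sketch below simply shows how the two error measures reported in Table I, MSE and MAE, can be computed.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between the real and predicted traffic values."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error between the real and predicted traffic values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

y_true, y_pred = [0.8, 1.2, 0.5], [0.7, 1.0, 0.6]
print(mse(y_true, y_pred), mae(y_true, y_pred))   # 0.02, 0.133...
```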
B. Baselines

series dataset and normally has better prediction performance than linear models and shallow-learning models.
• FedAvg [10]: FedAvg is first proposed in the pioneering

C. Experimental Settings and Overall Results

Without loss of generality, we randomly select 100 cells from each dataset and carry out experiments on the three kinds of wireless traffic of these cells. The traffic from the first seven weeks is used to train the prediction models and the traffic from the last week is used for testing. When constructing training samples using a sliding window scheme, the lengths of the closeness dependence p and the periodicity dependence q are both set to 3. Considering that the edge clients have limited computing power and thermal constraints, a relatively lightweight LSTM network is adopted. Specifically, the network has two LSTM layers, each with 64 hidden neurons, followed by a linear layer that maps the features to predictions. All baselines except the shallow learning algorithms share the same network architecture for the sake of fairness. Unless otherwise specified, we run 100 communication rounds between the local clients and the central server and report results on the final model. The regularization term ρ is determined through a grid search with values ranging from −0.3 to 0.3 and a step size of 0.1. The cluster size C is set to 16. Similar to the standard settings in FL [10], the values of local epochs and local batch size are set to 1 and 20, respectively. In each communication round, 10% of the cells are involved in model training. We use stochastic gradient descent (SGD) with a learning rate of 0.01 as the optimizer to update our models.
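A minimal sketch of the kind of lightweight predictor described above is given below: two stacked LSTM layers with 64 hidden neurons each, followed by a linear output layer. The way the p=3 closeness and q=3 periodicity lags are stacked into a length-6 input sequence is an assumption of this sketch rather than the exact feature layout used in the paper.

```python
import torch
import torch.nn as nn

class LightweightLSTM(nn.Module):
    # Two stacked LSTM layers with 64 hidden units each, followed by a
    # linear layer that maps the last hidden state to a one-step prediction.
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)             # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1, :])   # predict the next traffic value

# Example: closeness (p=3) and periodicity (q=3) lags stacked into a length-6 sequence.
x = torch.randn(20, 6, 1)                 # local batch size of 20, as in the settings above
model = LightweightLSTM()
y_hat = model(x)                           # shape: (20, 1)
```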
TABLE I: Prediction performance comparisons among different methods in terms of MSE and MAE on two datasets (‘↑’
denotes the performance gain of FedDA over FedAtt).
Methods | Milano MSE (SMS, Call, Internet) | Milano MAE (SMS, Call, Internet) | Trento MSE (SMS, Call, Internet) | Trento MAE (SMS, Call, Internet)
Lasso 0.7580 0.3003 0.4380 0.6231 0.4684 0.5475 4.7363 1.6277 5.9121 1.3182 0.8258 1.5391
SVR 0.4144 0.0919 0.1036 0.3528 0.1852 0.2220 5.2285 1.7919 5.9080 1.0390 0.5656 1.0470
LSTM 0.5608 0.1379 0.1697 0.4287 0.2458 0.2936 3.6947 1.1378 4.6976 0.9426 0.5013 1.1193
FedAvg 0.3744 0.0776 0.1096 0.3386 0.1838 0.2319 2.2287 1.6048 4.7988 0.7416 0.5319 1.0668
FedAtt 0.3667 0.0774 0.1096 0.3375 0.1837 0.2321 2.1558 1.5967 4.7645 0.7444 0.5306 1.0629
FedDA (ϕ=1) 0.3559 0.0752 0.1118 0.3353 0.1820 0.2367 2.1468 1.4925 4.4335 0.7478 0.5140 1.0212
FedDA (ϕ=10) 0.3481 0.0753 0.1062 0.3321 0.1810 0.2275 2.0719 1.1699 3.9266 0.7320 0.4543 0.9504
FedDA (ϕ=100) 0.3322 0.0659 0.1033 0.3214 0.1741 0.2211 1.9703 1.0592 2.4473 0.6920 0.4281 0.7471
↑ (ϕ=100) +9.4% +14.9% +5.8% +4.8% +5.2% +4.7% +8.6% +33.7% +48.6% +7.0% +19.3% +29.7%
Fig. 4: Comparisons between predictions and the real values and the corresponding error analysis (SMS, Call, and Internet volumes versus time index; percentage histograms of the absolute error). (a) Results on the Milano dataset for randomly selected cells. (b) Results on the Trento dataset for randomly selected cells.
The experimental results of the different prediction methods are presented in Table I. Note that in Table I, our proposed method has three variations, depending on how many augmented data samples are shared, i.e., ϕ=1, ϕ=10, and ϕ=100. In practice, the amount of transferred data samples can be flexibly adjusted according to the network situation. It can be seen from this table that our proposed method, FedDA, outperforms all the baseline methods for all kinds of wireless traffic in both datasets, even when only 1% of the augmentation data is shared. Here and in the following experiments, we report the results of FedDA with ϕ=100 unless otherwise specified. Specifically, for the SMS, Call, and Internet service traffic of the Milano dataset, compared with the best-performing baseline method, namely FedAtt, FedDA offers MSE gains of 9.4%, 14.9%, and 5.8%, respectively. Likewise, for the Trento dataset, FedDA yields performance gains of 8.6% (SMS), 33.7% (Call), and 48.6% (Internet), respectively. In terms of the MAE metric, though the improvements are not as remarkable as for MSE, an average performance gain of 4.9% (18.7%) can still be achieved on the Milano (Trento) dataset. We can also notice that the prediction performance of FedDA improves consistently with the increase of the shared augmentation data size. This is because the quasi-global model can capture the traffic patterns better when more data samples are available. The success of FedDA can be attributed to the following reasons:
• Compared with fully distributed algorithms (SVR and LSTM), which only consider the temporal dependence of wireless traffic, FedDA can capture both the spatial and the temporal dependence by means of model fusion, and is thus more robust;
• Compared with conventional FL algorithms (FedAvg and FedAtt), the introduced clustering strategy makes the learning process of FedDA case-specific, and the dual attention scheme greatly reduces the impact of data heterogeneity. Therefore, FedDA has a high generalization ability;
• FedDA can balance between capturing the unique characteristics of a cluster and the shared macro traffic patterns among different clusters, and thus produces more accurate predictions.

Besides, the shown results also indicate that, in comparison with fully distributed algorithms, FL-based algorithms can achieve better predictions, and FedAtt achieves the second-best prediction performance, followed by the classic FedAvg. The comparisons between the predicted and real values of different algorithms are given in Fig. 4.
Fig. 5: Prediction accuracy versus communication rounds (R squared score of FedDA and FedAtt for the SMS, Call, and Internet traffic of the Milano and Trento datasets).

Fig. 6: Influence of ρ on prediction performance for the Milano dataset (MSE and MAE of FedDA, FedAvg, and FedAtt for the SMS, Call, and Internet traffic).
Fig. 7: Influence of ρ on prediction performance for the Trento dataset (MSE and MAE of FedDA, FedAvg, and FedAtt for the SMS, Call, and Internet traffic).

Fig. 8: Normalized MSE and MAE of FedDA under different cluster sizes (C=1, C=16, and C=32). (a) Results on the Milano dataset. (b) Results on the Trento dataset.
In both figures, the value of ρ ranges from −0.3 to 0.3 with a step size of 0.1, and the obtained MSE (MAE) results are given in the upper (lower) three subfigures. Besides FedDA, the other two FL-based methods, i.e., FedAvg and FedAtt, are also included for comparison purposes, but their MSE and MAE results remain constant as they are not affected by ρ. We can tell from Fig. 6 and Fig. 7 that for FedDA, with the increase of ρ, the values of MSE and MAE first decrease rapidly and then slowly increase. For the other two baselines, FedAtt achieves slightly better results than FedAvg. Though ρ has a great influence on the prediction performance, FedDA can generally yield lower MSE and MAE values than FedAvg and FedAtt. Taking the SMS traffic of the Milano dataset as an example, the obtained MSE results are always better than those of FedAvg and FedAtt regardless of the choice of ρ; for the MAE metric, a similar conclusion holds except for ρ=−0.3, at which FedDA achieves worse results than FedAvg and FedAtt. Nonetheless, the optimal value of ρ can be determined by using a grid search strategy during model training, and the cost is low. The results in these two figures demonstrate that the dual attention scheme in FedDA can indeed improve prediction performance by introducing prior knowledge, and that the influence varies across datasets.
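As a small illustration of this tuning step, the sketch below performs the grid search over ρ from −0.3 to 0.3 with a step size of 0.1; the validation routine is a stub standing in for a full FedDA training run.

```python
import numpy as np

def validation_mse(rho: float, seed: int) -> float:
    """Stub for one FedDA training run with regularization term rho.

    In practice this would train for the configured number of communication
    rounds and return the validation MSE; it is faked here so that the grid
    search itself is runnable."""
    rng = np.random.default_rng(seed)
    return float(0.33 + 0.05 * rho ** 2 + rng.normal(0, 0.005))

rhos = [round(r, 1) for r in np.arange(-0.3, 0.31, 0.1)]    # -0.3, -0.2, ..., 0.3
scores = {rho: validation_mse(rho, seed=i) for i, rho in enumerate(rhos)}
best_rho = min(scores, key=scores.get)                      # lowest validation MSE wins
print(f"best rho: {best_rho}, validation MSE: {scores[best_rho]:.4f}")
```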
F. Effect of C on Prediction Performance

The cluster size C determines how many cells are involved in model aggregation, and it also affects the final prediction performance. Thus, in this subsection, we explore how the cluster size affects the prediction performance of FedDA; the obtained MSE and MAE results are plotted in Fig. 8. In particular, Fig. 8a (Fig. 8b) shows the results on the Milano (Trento) dataset. We consider three scenarios, i.e., C=1, C=16, and C=32. Note that C=1 means no clustering is adopted in prediction. In addition, to make the comparison clearer, the MSE and MAE results of C=16 and C=32 are normalized based on the results of C=1. We can observe that the choices of C yield different influences on the prediction performance of FedDA. In most cases, introducing the clustering strategy can indeed lead to lower prediction errors. Specifically, we can observe from Fig. 8a that FedDA achieves considerable performance improvements when the cluster size is 16 or 32 for the SMS and Call traffic. For the Internet traffic, though the performance degrades slightly when C=16, it improves when C=32. For the Trento dataset, introducing the clustering strategy always yields better prediction performance than not, especially for the Internet traffic, on which the improvement is up to 50%. Overall, the results in Fig. 8 demonstrate the superiority of introducing clustering into FedDA. This is because the cluster size C controls the specialty of FedDA and thereby affects the global model. If no clustering strategy is involved, all data is mixed together to generate the global model. In this case, some unique traffic patterns hidden in the data cannot be captured by FedDA, which leads to performance degradation.

VI. CONCLUSION

In this work, we investigated the wireless traffic prediction problem and proposed a novel framework called FedDA. To deal with the heterogeneity of wireless traffic data, we proposed a data-sharing strategy in FedDA that transfers a small augmented traffic dataset to the central server, by which a quasi-global model is obtained and shared among all BSs. Besides, we also introduced an iterative clustering algorithm to cluster BSs into different groups by considering both the wireless traffic patterns and the geo-location information. To enhance the generalization ability of the global model, we proposed a dual attention-based model aggregation scheme that pays attention to the unequal contributions of the different local models and the quasi-global model. The aggregation scheme is applied over a hierarchical architecture so as to capture both intra-cluster and inter-cluster patterns of wireless data traffic. Finally, we verified the effectiveness and efficiency of FedDA on two real-world datasets.
REFERENCES

[1] K. David and H. Berndt. 6G vision and requirements: Is there any need for beyond 5G? IEEE Vehicular Technology Magazine, 13(3):72–80, 2018.
[2] B. Zong, C. Fan, X. Wang, X. Duan, B. Wang, and J. Wang. 6G technologies: Key drivers, core requirements, system architectures, and enabling technologies. IEEE Vehicular Technology Magazine, 14(3):18–27, 2019.
[3] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini. What should 6G be? Nature Electronics, 3(1):20–29, 2020.
[4] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. A. Zhang. The roadmap to 6G: AI empowered wireless networks. IEEE Communications Magazine, 57(8):84–90, 2019.
[5] M. Yao, M. Sohul, V. Marojevic, and J. H. Reed. Artificial intelligence defined 5G radio access networks. IEEE Communications Magazine, 57(3):14–20, 2019.
[6] J. Wang, J. Tang, Z. Xu, Y. Wang, G. Xue, X. Zhang, and D. Yang. Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach. In 2017 IEEE Conference on Computer Communications (INFOCOM), pages 1–9, 2017.
[7] C. Zhang, H. Zhang, J. Qiao, D. Yuan, and M. Zhang. Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data. IEEE Journal on Selected Areas in Communications, 37(6):1389–1401, 2019.
[8] Y. Xu, F. Yin, W. Xu, J. Lin, and S. Cui. Wireless traffic prediction with scalable Gaussian process: Framework, algorithms, and verification. IEEE Journal on Selected Areas in Communications, 37(6):1291–1306, 2019.
[9] N. Kato, B. Mao, F. Tang, Y. Kawamoto, and J. Liu. Ten challenges in advancing machine learning technologies toward 6G. IEEE Wireless Communications, 27(3):96–103, 2020.
[10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
[11] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2):1–19, 2019.
[12] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
[13] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
[14] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[15] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong. Federated learning over wireless networks: Optimization model design and analysis. In 2019 IEEE Conference on Computer Communications (INFOCOM), pages 1387–1395, 2019.
[16] H. Liu, B. Liu, H. Zhang, L. Li, X. Qin, and G. Zhang. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 436-437:247–267, 2018.
[17] R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[18] J. D. Hamilton. Time Series Analysis. Princeton University Press, Princeton, NJ, 1994.
[19] Y. Shu, M. Yu, O. Yang, J. Liu, and H. Feng. Wireless traffic modeling and prediction using seasonal ARIMA models. IEICE Transactions on Communications, 88(10):3992–3999, 2005.
[20] B. Zhou, D. He, and Z. Sun. Traffic predictability based on ARIMA/GARCH model. In 2006 2nd Conference on Next Generation Internet Design and Engineering, pages 200–207, Apr. 2006.
[21] F. Xu, Y. Lin, J. Huang, D. Wu, H. Shi, J. Song, and Y. Li. Big data driven mobile traffic understanding and forecasting: A time series approach. IEEE Transactions on Services Computing, 9(5):796–805, 2016.
[22] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang. The learning and prediction of application-level traffic data in cellular networks. IEEE Transactions on Wireless Communications, 16(6):3899–3912, 2017.
[23] R. Li, Z. Zhao, X. Zhou, J. Palicot, and H. Zhang. The prediction analysis of cellular radio access network traffic: From entropy theory to networking practice. IEEE Communications Magazine, 52(6):234–240, 2014.
[24] X. Chen, Y. Jin, S. Qiang, W. Hu, and K. Jiang. Analyzing and modeling spatio-temporal dependence of cellular traffic at city scale. In 2015 IEEE International Conference on Communications (ICC), pages 3585–3591, 2015.
[25] H. Liu, B. Xu, D. Lu, and G. Zhang. A path planning approach for crowd evacuation in buildings based on improved artificial bee colony algorithm. Applied Soft Computing, 68:360–376, 2018.
[26] L. Nie, D. Jiang, S. Yu, and H. Song. Network traffic prediction based on deep belief network in wireless mesh backbone networks. In 2017 IEEE Wireless Communications and Networking Conference (WCNC), pages 1–5, 2017.
[27] C. Qiu, Y. Zhang, Z. Feng, P. Zhang, and S. Cui. Spatio-temporal wireless traffic prediction with recurrent neural network. IEEE Wireless Communications Letters, 7(4):554–557, 2018.
[28] C. Zhang, H. Zhang, D. Yuan, and M. Zhang. Citywide cellular traffic prediction based on densely connected convolutional neural networks. IEEE Communications Letters, 22(8):1656–1659, 2018.
[29] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.
[30] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, 2020.
[31] S. Ji, S. Pan, G. Long, X. Li, J. Jiang, and Z. Huang. Learning private neural language modeling with attentive aggregation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2019.
[32] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI'17), pages 1655–1661. AAAI Press, 2017.
[33] Q. Wen, L. Sun, X. Song, J. Gao, X. Wang, and H. Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020.
[34] G. Barlacchi, M. D. Nadai, R. Larcher, A. Casella, C. Chitic, G. Torrisi, F. Antonelli, A. Vespignani, A. Pentland, and B. Lepri. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Scientific Data, 2:150055, 2015.
[35] Telecom Italia. Telecommunications - SMS, Call, Internet - TN, 2015. URL https://doi.org/10.7910/DVN/QLCABU.
[36] Telecom Italia. Telecommunications - SMS, Call, Internet - MI, 2015. URL https://doi.org/10.7910/DVN/EGZHFV.
[37] H. Feng, Y. Shu, S. Wang, and M. Ma. SVM-based models for predicting WLAN traffic. In 2006 IEEE International Conference on Communications (ICC), volume 2, pages 597–602, 2006.
[38] A. Cameron and F. A. G. Windmeijer. R-squared measures for count data regression models with applications to health-care utilization. Journal of Business & Economic Statistics, 14(2):209–220, 1996.