zhang -2021

The document presents a novel framework called Dual Attention-Based Federated Learning (FedDA) for wireless traffic prediction, which addresses the challenges of centralized training by enabling collaborative model training among multiple edge clients while keeping raw data local. FedDA employs a dual attention mechanism to aggregate models from different clusters of clients, improving prediction accuracy and mitigating issues related to statistical heterogeneity in traffic data. Experimental results demonstrate that FedDA outperforms existing methods, achieving significant reductions in prediction error on real-world datasets.


Dual Attention-Based Federated Learning for Wireless Traffic Prediction


Chuanting Zhang, Shuping Dang, Basem Shihada, Mohamed-Slim Alouini
Computer, Electrical and Mathematical Science and Engineering Division,
King Abdullah University of Science and Technology
Thuwal, Saudi Arabia
Email: {chuanting.zhang, shuping.dang, basem.shihada, slim.alouini}@kaust.edu.sa
arXiv:2110.05183v1 [cs.DC] 11 Oct 2021

Abstract—Wireless traffic prediction is essential for cellular networks to realize intelligent network operations, such as load-aware resource management and predictive control. Existing prediction approaches usually adopt centralized training architectures and require transferring huge amounts of traffic data, which may raise delay and privacy concerns in certain scenarios. In this work, we propose a novel wireless traffic prediction framework named Dual Attention-Based Federated Learning (FedDA), by which a high-quality prediction model is trained collaboratively by multiple edge clients. To simultaneously capture the various wireless traffic patterns and keep raw data local, FedDA first groups the clients into different clusters by using a small augmentation dataset. Then, a quasi-global model is trained and shared among clients as prior knowledge, aiming to solve the statistical heterogeneity challenge confronting federated learning. To construct the global model, a dual attention scheme is further proposed that aggregates the intra- and inter-cluster models, instead of simply averaging the weights of local models. We conduct extensive experiments on two real-world wireless traffic datasets, and the results show that FedDA outperforms state-of-the-art methods. The average mean squared error performance gains on the two datasets are up to 10% and 30%, respectively.

Index Terms—wireless traffic prediction, federated learning, deep neural networks

I. INTRODUCTION

Since the commercialization of fifth generation (5G) communication networks in 2019, the preliminary research on the potential features and enabling technologies for the sixth generation (6G) communications has attracted extensive attention in academia and industry [1, 2]. There are a set of emerging technologies and novel paradigms, e.g., terahertz spectrum, space-air-ground communications, large reflecting surfaces, and cognitive radios [3]. In addition, the communication research community has reached a consensus that artificial intelligence (AI) is the key to implementing novel paradigms, coordinating heterogeneous networks, organizing various communication resources, and enabling truly smart 6G communications in the 2030s [4]. In particular, AI aided by big data and high-rate real-time transmission capability is expected to be the most efficient approach to reduce network overhead and elevate the quality of service (QoS) of both access and core networks [5].

To facilitate the fusion of AI and communication networks, wireless traffic prediction is indispensable. Wireless traffic prediction [6, 7] estimates traffic data volume in the future and provides the decision basis of communication network management and optimization [8]. With the predicted traffic data, proactive measures can be taken to mitigate the network congestion and outage caused by burst transmissions. Moreover, the heterogeneous service requirements, which are expected to become common in 6G communication networks [9], can be well satisfied at a lower cost by wireless traffic prediction. This will lead to a significant improvement in the QoS from both the network's and the user's perspectives.

Currently, most wireless traffic prediction approaches focus on a centralized learning strategy and involve transferring huge amounts of raw data to a datacenter to learn a generalized prediction model. However, frequent transmission of training data and signaling overhead could easily exhaust the network capacity and yield negative impacts on payload transmissions. Thus, new wireless traffic prediction approaches that can cope with the above challenges are needed.

The emergence and success of federated learning (FL) [10–14] make it possible to solve the prediction problem while keeping data local. In the FL setting, many clients, e.g., mobile devices, base stations (BSs), or companies, collaboratively train a prediction model under the orchestration of a central server. Only intermediate gradients or model parameters obtained by local training are sent to the central server, instead of the raw data. There are good reasons to support FL in next-generation communications [15]. First, the advances of edge computing have paved the way for easily implementing FL in reality. As edge clients are equipped with abundant computing resources [16], a powerful centralized datacenter is no longer essential and the delay of transferring raw data can be considerably reduced. In addition, FL facilitates unprecedented large-scale flexible data collection and model training. The edge clients can proactively collect data during day hours, then jointly update the global model during night hours, to improve efficiency and accuracy for next-day usage.

Despite the promising application prospect, accurate wireless traffic prediction under the FL settings is still a major research challenge, especially network-wide prediction. This is because user mobility can cause sophisticated spatio-temporal coupling among wireless traffic, which can hardly be captured and modeled. Furthermore, different BSs may have distinct traffic patterns, which makes the traffic data highly
heterogeneous, and learning and prediction on this kind of heterogeneous data is very challenging.

Therefore, to cope with the wireless traffic prediction issues for future communication networks, we propose a novel wireless traffic prediction framework named dual attention-based federated learning (FedDA), by which a high-quality prediction model is trained collaboratively by multiple BSs. The FedDA framework relies on a set of state-of-the-art training paradigms, including a data augmentation-assisted clustering strategy, an intermediate and auxiliary training model, a dual attention-based model aggregation, and a hierarchical aggregation structure. Specifically, the processing of FedDA can be split into three stages to ensure a high-accuracy, transferable, and secure training process for wireless traffic prediction.

We first introduce an augmentation-assisted clustering strategy to group all BSs, i.e., clients in the context of FL, into a number of clusters depending on their augmented traffic patterns and geographic locations. Then, leveraging the augmented data collected from distributed BSs, a quasi-global prediction model can be constructed at the central server. This quasi-global model is used to mitigate the generalization difficulty of the global model caused by the statistical heterogeneity among traffic data collected from different clusters. Finally, instead of simply averaging the model weights collected from local clients to yield the global model, a dual attention-based model aggregation mechanism and a hierarchical aggregation structure are adopted at the central server. By introducing the dual attention and hierarchical settings, an adequate equilibrium between generality and specialty can be achieved.

Following the proposed FedDA framework and the descriptions given above, the contributions of this paper can be summarized as follows:
• We design a data augmentation-assisted iterative clustering strategy, which takes the augmented data and geographic locations of clients as clustering reference to simultaneously capture various traffic patterns of clients and protect data privacy.
• We introduce a quasi-global model, which is an intermediate and auxiliary tool to mitigate the generalization difficulty of the global model caused by the statistical heterogeneity among traffic patterns collected from different clients.
• We propose the FedDA framework consisting of two advanced settings for aggregation, which are the dual attention-based model aggregation mechanism and the hierarchical aggregation structure. In this way, the central server can not only capture the cluster-specific data patterns but also ensure the transferability of the global model.
• We verify the effectiveness and efficiency of the FedDA framework by testing on two real-world datasets and compare the experimental results with those generated by existing algorithms.

The rest of this paper is organized as follows. Section II reviews related works on wireless traffic prediction and FL. Section III gives the system model and problem statement. In Section IV, we introduce our proposed dual attention mechanism in detail, including the data augmentation-assisted client clustering, mathematical expression of dual attention, and the corresponding optimization techniques. Section V presents and discusses all the experiments. Finally, Section VI concludes this paper. Our code is available at https://github.com/chuanting/FedDA.

II. RELATED WORK

As the present work is closely related to wireless traffic prediction and FL, we review the most related achievements and milestones of these two research topics in this section.

A. Wireless Traffic Prediction

Recently, wireless traffic prediction has received a lot of attention as many tasks in wireless communications require accurate traffic modeling and prediction capabilities. Wireless traffic prediction is essentially a time series prediction problem. The methods to solve it can be roughly classified into three categories, i.e., simple methods, parametric methods, and non-parametric methods.

Historical average and naïve methods are representatives of the first category [17]. The former predicts all future values as the average of the historical data, while the latter takes the last observation as the future. This kind of prediction method involves no complex computations and is thus quite simple and easy to implement. However, as simple methods fail to capture the hidden patterns of wireless traffic, their prediction performance is relatively poor.

For the second category, i.e., parametric methods, the wireless traffic is modeled and predicted based on tools from statistics and probability theory. The most classical method is AutoRegressive Integrated Moving Average (ARIMA) [18]. To characterize the self-similarity and burstiness of wireless traffic, ARIMA and its variants were explored in [19, 20]. In a recent study, [21] first decomposed the wireless traffic into regularity and randomness components. Then the authors demonstrated that the regularity component can be predicted through the ARIMA model, but the prediction of random components is impossible. Besides the ARIMA model, the α-stable model [22], entropy theory [23], and covariance functions [24] were also explored to perform wireless traffic prediction.

As machine learning and AI techniques [25] continue their fast evolution, the non-parametric methods have become strong competitors to parametric methods for wireless traffic prediction. Particularly, recent years have witnessed an obvious trend in solving wireless traffic prediction problems based on deep neural networks. In [26], the authors proposed a deep belief network-based prediction method for wireless mesh networks. In [6], a hybrid deep learning framework was designed on the basis of autoencoder and Long Short-Term Memory networks (LSTM) to simultaneously capture the spatial and temporal dependence among different cells. Aiming to perform prediction on multiple cells, the researchers also introduced a multi-task learning framework by using LSTM [27]. Besides, city-scale wireless traffic predictions are also investigated in [7, 28], in which the authors introduced novel prediction frameworks by modeling spatio-temporal dependence over cross-domain datasets.

All aforementioned works mainly focus on wireless traffic prediction in the centralized way. Our proposed framework in this paper differs from the above works in that we try to solve the wireless traffic prediction problem with a distributed architecture and federated learning.
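As a concrete reference point for the first category reviewed in this subsection, the two simple methods can be sketched in a few lines. This is a minimal illustration and not code from the paper: the function names and the example traffic values are assumptions made for the sketch.

```python
from statistics import fmean


def historical_average_forecast(history, horizon):
    """Predict every future value as the mean of the observed history."""
    mean = fmean(history)
    return [mean] * horizon


def naive_forecast(history, horizon):
    """Predict every future value as the last observation."""
    return [history[-1]] * horizon


# Made-up hourly traffic volumes, used only for illustration.
traffic = [12.0, 15.0, 11.0, 18.0]
average_prediction = historical_average_forecast(traffic, horizon=2)
naive_prediction = naive_forecast(traffic, horizon=2)
print(average_prediction)  # [14.0, 14.0]
print(naive_prediction)    # [18.0, 18.0]
```

Neither function looks at the order or periodicity of the history, which is why such methods miss the hidden patterns mentioned above.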
B. Federated Learning

FL provides a distributed training architecture that can be jointly applied with many machine learning algorithms, in particular deep neural networks. In FL, a global model can be obtained by aggregating local clients' models. To obtain the global model, [10] introduced an aggregation method called federated averaging (FedAvg). Research shows that when the client data is independent and identically distributed (IID), FedAvg achieves similar performance compared with centralized learning. However, when the client data is non-IID, the performance of FedAvg degrades greatly. To solve this problem, [29] proposed a data-sharing strategy by creating a small subset of data which is globally shared among all the client devices. This strategy can solve the statistical heterogeneity challenge confronting FL. In [30], the authors introduced FedProx, which can be viewed as a generalization and re-parameterization of FedAvg, to tackle heterogeneity in federated networks. Besides, [31] introduced an attentive federated aggregation scheme, called FedAtt, by considering unequal contributions from different clients to the global model. This scheme improves the generalization ability of the global model and has been successfully used to solve the language modeling problem. For a more detailed introduction on the development of FL, please refer to a recent survey [12].

Our work is inspired by the above research, but we mainly focus on the wireless traffic prediction problem, which is different from the above works. Also, because the wireless traffic data is highly heterogeneous, we propose a novel FedDA framework to solve this statistical challenge.

Fig. 1: System diagram of FedDA with five BSs in two clusters (legend: central server, base station, edge server, wireless traffic).

III. PROBLEM FORMULATION AND PRELIMINARIES

In this section we provide the problem formulation of wireless traffic prediction and give the implementation details of FL corresponding to the formulated problem.

A. Problem Formulation

Given K BSs, each BS has its own local wireless traffic data, denoted as $d^k = \{d_1^k, d_2^k, \cdots, d_Z^k\}$ with a total of Z time intervals. The wireless traffic prediction problem can be described as the prediction of future traffic volume based on the current and the previous traffic volumes. Suppose $d_z^k$ is the target traffic volume to be predicted; then the wireless traffic prediction problem can be described as

$$\hat{d}_z^k = f(d_{z-1}^k, d_{z-2}^k, \cdots, d_1^k; w), \tag{1}$$

where $f(\cdot)$ denotes the chosen prediction model and $w$ the corresponding parameters. The prediction model $f(\cdot)$ can be either in a linear form like linear regression or in a nonlinear form like deep neural networks.

For machine learning based wireless traffic prediction techniques, it is common to use only part of the historical traffic data as input features to reduce complexity. Thus, based upon $d^k$, a set of input-output pairs $\{x_i^k, y_i^k\}_{i=1}^{n}$ can be obtained by using a sliding window scheme. $x_i^k$ denotes the historical traffic data related to $y_i^k$, and we set it to $\{d_{z-1}^k, \cdots, d_{z-p}^k, d_{z-\phi \cdot 1}^k, \cdots, d_{z-\phi \cdot q}^k\}$. Here p and q are the sliding window sizes for capturing the closeness dependence and period dependence of wireless traffic data, and $\phi$ is the periodicity [28, 32]. As we focus only on the one-step-ahead prediction problem, $y_i^k = d_z^k$. Thus, the problem formulated in (1) can be reformulated as

$$\hat{y}_i^k = f(x_i^k; w). \tag{2}$$

Our objective is to minimize the prediction error over all K BSs, thus the parameters $w$ can be obtained by solving

$$\arg\min_{w} \left\{ \frac{1}{Kn} \sum_{k=1}^{K} \sum_{i=1}^{n} L(f(x_i^k; w), y_i^k) \right\}, \tag{3}$$

where L is the loss function, which typically takes the form $|f(x_i^k; w) - y_i^k|^2$ or $|f(x_i^k; w) - y_i^k|$.

B. Preliminaries of Federated Learning

We try to solve (3) in a distributed manner, particularly under the cross-silo FL settings [14], and assume data are located at geo-distributed clients. After initializing the global model characterized by the parameters $w$, FL works as follows at the t-th training round:
1) The central server sends the global model $w^t$ to all BSs;
2) The BS treats the global model as prior knowledge and updates its local model based upon $w^t$ with its local data. The local model update rule is given as follows: $w_k^{t+1} \leftarrow w_k^t - \eta \nabla_{w^t} L(f(x^k; w^t), y^k)$, where $\eta$ is the learning rate and $\nabla_{w^t}$ is the gradient of the loss function with respect to $w^t$;
3) The BS sends $w_k^{t+1}$ to the central server;
4) The central server performs model aggregation (also known as federated optimization) based on local models. The most classical model aggregation scheme is federated averaging [10], which can be written as $w^{t+1} \leftarrow \frac{1}{K} \sum_{k=1}^{K} w_k^{t+1}$.

By running the above steps iteratively until the termination conditions are satisfied, a final global model $w$ can be obtained.

Fig. 2(a): Wireless traffic augmentation strategy.
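The sliding-window construction and the four FL steps of this section can be sketched together as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a linear model $f(x; w) = x^\top w$ with squared loss in place of a neural network, one full-batch gradient step per round, synthetic sinusoidal traffic, and hypothetical helper names (`sliding_window_samples`, `local_update`, `fedavg_round`, `mean_loss`).

```python
import numpy as np


def sliding_window_samples(d, p, q, phi):
    """Build (x_i, y_i) pairs from one BS's traffic series d.

    x_i = [d[z-1], ..., d[z-p], d[z-phi*1], ..., d[z-phi*q]] and y_i = d[z].
    """
    start = max(p, q * phi)  # first index z that has a full history
    xs, ys = [], []
    for z in range(start, len(d)):
        closeness = [d[z - j] for j in range(1, p + 1)]
        period = [d[z - phi * j] for j in range(1, q + 1)]
        xs.append(closeness + period)
        ys.append(d[z])
    return np.array(xs), np.array(ys)


def local_update(w, x, y, eta):
    """Step 2: one gradient step on the mean squared loss of f(x; w) = x @ w."""
    grad = 2.0 * x.T @ (x @ w - y) / len(y)
    return w - eta * grad


def fedavg_round(w_global, client_data, eta):
    """Steps 1-4: broadcast w, update locally, upload, and average the models."""
    local_models = [local_update(w_global.copy(), x, y, eta) for x, y in client_data]
    return np.mean(local_models, axis=0)


def mean_loss(w, client_data):
    """Average squared error over all clients, i.e. objective (3) for this model."""
    return float(np.mean([np.mean((x @ w - y) ** 2) for x, y in client_data]))


# Synthetic hourly traffic for K = 3 BSs (assumed data, two weeks each).
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
clients = []
for k in range(3):
    series = np.sin(2 * np.pi * hours / 24 + k) + 0.1 * rng.standard_normal(hours.size)
    clients.append(sliding_window_samples(series, p=3, q=3, phi=24))

w = np.zeros(6)  # p + q = 6 input features
initial_loss = mean_loss(w, clients)
for t in range(50):
    w = fedavg_round(w, clients, eta=0.05)
final_loss = mean_loss(w, clients)
print(initial_loss, final_loss)
```

With p = q = 3 and φ = 24 each sample has six features; the loss printed after 50 rounds should be lower than the initial one, since averaging one gradient step per client is a gradient step on the averaged objective.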

IV. PROPOSED FRAMEWORK

In this section, we give a detailed introduction of our proposed wireless traffic prediction framework, FedDA; a demo FedDA system diagram with five BSs in two clusters is shown in Fig. 1. Specifically, FedDA consists of three steps:
1) Each BS performs the traffic augmentation procedure and sends the augmented data to the central server;
2) The central server yields a quasi-global model and clusters the BSs into different groups based on the augmented data and the location information of BSs;
3) The BSs cooperatively train a global model under the orchestration of the central server by using the dual attention-based federated optimization.

Fig. 2(b): Augmented versus original traffic data (legend: augmented traffic, original traffic; horizontal axis from 0 to 150, vertical axis from -2 to 2).

Fig. 2: Illustration on wireless traffic data augmentation and the comparison between the augmented and the original traffic data of a BS.
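Step 1, the augmentation illustrated in Fig. 2 and detailed in Section IV-A, can be sketched as follows. This is one possible reading of the procedure, stated as an assumption: the hourly series is cut into weekly slices, the slices are averaged point by point to give a single weekly profile, and the profile is standardized. The function name `augment_traffic` and the synthetic traffic are illustrative, not taken from the paper.

```python
import numpy as np

HOURS_PER_WEEK = 24 * 7


def augment_traffic(series, slice_len=HOURS_PER_WEEK):
    """Average weekly slices point by point, then standardize the profile."""
    n_slices = len(series) // slice_len
    weekly = np.asarray(series[: n_slices * slice_len], dtype=float)
    weekly = weekly.reshape(n_slices, slice_len)   # one row per week
    profile = weekly.mean(axis=0)                  # average for each time point
    return (profile - profile.mean()) / profile.std()  # zero mean, unit variance


# Synthetic hourly traffic of one BS over eight weeks (assumed data).
rng = np.random.default_rng(0)
t = np.arange(8 * HOURS_PER_WEEK)
raw = 50 + 20 * np.sin(2 * np.pi * t / 24) + 5 * rng.standard_normal(t.size)

augmented = augment_traffic(raw)
first_week = raw[:HOURS_PER_WEEK]
# Pearson correlation between the augmented profile and one original week.
similarity = float(np.corrcoef(augmented, first_week)[0, 1])
print(augmented.shape, similarity)
```

The augmented profile has 168 points regardless of how many weeks of raw data a BS holds, which is consistent with the remark that the augmented data is much smaller than the raw data.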
In the following, we explain how to augment wireless traffic data and analyze the similarities between augmented data and original traffic data. After that, we introduce the iterative clustering strategy by taking into consideration both locations and traffic patterns of BSs, followed by a detailed elaboration of our proposed dual attention-based model aggregation scheme.

A. Wireless Traffic Data Augmentation

As urban areas have different functions to support the daily operation of a city, the traffic patterns of spatially distributed BSs differ a lot. Besides, users have different mobility and communication behaviors, which further enlarge the pattern diversity of wireless traffic. Therefore, wireless traffic data of different BSs has a high level of heterogeneity and is essentially non-IID. Performing FL over non-IID data is quite challenging as weight divergence exists when performing model aggregation on the server side [29]. Herein, we propose a data sharing strategy by creating a small augmentation dataset from the original wireless traffic dataset to overcome the statistical heterogeneity challenge.

Our augmentation strategy works as follows. We first split the dataset into weekly slices based on the time index. For the traffic of each week we compute the statistical average value for each time point and treat the obtained result as the augmented data. Finally, the augmented traffic is standardized to have zero mean and unit variance. An illustration of the augmentation procedure is displayed in Fig. 2a, and the comparison between the augmented wireless traffic and the original traffic can be found in Fig. 2b.

It can be seen from Fig. 2 that the proposed augmentation strategy produces the augmented data easily, compared with traditional time series data augmentation strategies in either the time domain or the frequency domain [33]. Though the strategy is simple and straightforward, it works well and achieves a Pearson correlation coefficient of 0.9526, which indicates high similarities between augmented data and original data.

Each BS sends a very small part, say ϕ%, of its augmented data to the central server, in which a global dataset that consists of a uniform distribution over y is centralized. Note that compared with the size of the raw data, the augmented data size is much smaller. Based upon this augmentation dataset, a quasi-global model can be trained and treated as prior knowledge for all BSs. We use the term 'quasi-global' because the model is trained using augmented data of all BSs, instead of the original data. Even so, this model can still be used as prior knowledge of all BSs because of the high similarities between the augmented data and the original data. We denote the quasi-global model by $w_Q$ in Fig. 1, and it has exactly the same network architecture as the local models and the global model.

B. Iterative Clustering for BSs

As mentioned earlier, spatially distributed BSs have different traffic patterns. To capture the pattern heterogeneity among BSs and train an accurate prediction model suited for most BSs, we perform clustering analysis for BSs and propose an iterative clustering strategy to achieve this purpose by taking into consideration both the geo-locations and the traffic patterns of BSs. The detail of our clustering strategy is summarized in Algorithm 1.

For an arbitrary BS k, the central server knows its location information $g_k$ (i.e., the longitude and latitude) and stores its augmented traffic data $\tilde{d}_k$. By using a random initialization of C cluster centers $\{v_c\}_{c=1}^{C}$, we perform the K-Means algorithm on the augmented data $\{\tilde{d}_k\}_{k=1}^{K}$ and obtain the cluster labels
of BSs (Line 3 of Algorithm 1). Then we use the location information $\{g_k\}_{k=1}^{K}$ as input and similarly perform the K-Means algorithm on it. This can yield C different clusters (Line 4 of Algorithm 1). If the clustering results on these two kinds of data are the same, then Algorithm 1 stops and returns the cluster label of each BS. If the yielded results are not the same, then based on the obtained cluster label on geo-location data, we compute the cluster center and use this to initialize the K-Means clustering on the traffic data. This indicates that the geo-location information is considered by the traffic pattern clustering process. The above steps are repeated on an iterative basis until the termination conditions are satisfied.

Algorithm 1: Iterative clustering strategy
Input: BS locations $\{g_k\}_{k=1}^{K}$, augmented traffic $\{\tilde{d}_k\}_{k=1}^{K}$, cluster size C, cluster centers $\{v_c\}_{c=1}^{C}$, and iteration threshold J.
Output: C clusters
1 Randomly initialize $\{v_c\}_{c=1}^{C}$
2 while j < J do
3     Group $\{\tilde{d}_k\}_{k=1}^{K}$ into C clusters $(l_1, l_2, \cdots, l_C)$ by using K-Means with $\{v_c\}_{c=1}^{C}$;
4     Group $\{g_k\}_{k=1}^{K}$ into C clusters $(l'_1, l'_2, \cdots, l'_C)$ by using K-Means
5     if $\{l_c\}_{c=1}^{C}$ is the same as $\{l'_c\}_{c=1}^{C}$ then
6         break
7     Update cluster centers $\{v_c\}_{c=1}^{C}$ based on $(l'_1, l'_2, \cdots, l'_C)$
8     j = j + 1

As shown in Fig. 1, the BSs are clustered into different clusters based on their geo-location and traffic pattern. After obtaining the cluster label of each BS, FedDA proceeds to the federated optimization, which will be introduced in the next subsection.

C. Dual Attention-Based Model Aggregation

One of the most fundamental parts of FL is the model aggregation scheme on the central server side, which involves constructing the final global model based on the received local ones. In this paper, we propose a novel federated optimization strategy, i.e., FedDA, for obtaining the global model. Specifically, we introduce the attention scheme into model aggregation in FedDA and quantify the contributions of both local models and the quasi-global model in a layer-wise manner. Fig. 3 shows an illustration of our proposed layer-wise dual attention-based model aggregation procedure. Note that we adopt a hierarchical learning scheme in FedDA. That is, there are two levels of model aggregation. The first one performs intra-cluster model aggregation, whose function is to obtain cluster models that capture the unique traffic patterns of each cluster. The second one performs inter-cluster model aggregation, after which the final global model that incorporates knowledge of all clusters is generated.

Fig. 3: Dual attention-based model aggregation.

In Fig. 3, $w_{I,m}^{t+1}$ denotes the m-th input model, and there are M input models in total. $w_O^{t+1}$ denotes the output model. For intra-cluster model aggregation in cluster c, $w_{I,m}^{t+1}$ belongs to a local model of that cluster, and $w_O^{t+1}$ stands for $w_{R,c}^{t+1}$. For inter-cluster model aggregation, $w_{I,m}^{t+1}$ belongs to $\{w_{R,c}^{t+1}\}_{c=1}^{C}$, and the output is the global model $w_G^{t+1}$.

The purpose of federated optimization on the central server side is to find an optimal global model that has a strong generalization ability over all BSs. To achieve this, the global model should find a balance between capturing the unique and the commonly shared traffic patterns of BSs. Thus, in our proposed scheme, we regard the optimization problem as finding a global model that is close to both the local models and the quasi-global model in the parameter space, considering their different contributions during the model aggregation procedure. Consequently, the optimization objective is to minimize the sum-weighted distance among different models by using self-adaptive scores as weights. The federated optimization problem is formally defined as

$$\arg\min_{w_O^{t+1}} \left\{ \sum_{m=1}^{M} \frac{1}{2} \alpha_m L(w_O^t, w_{I,m}^{t+1})^2 + \frac{1}{2} \rho \beta L(w_O^t, w_Q)^2 \right\}, \tag{4}$$

where $\alpha_m$ and $\beta$ represent attention weight vectors denoting the contributions of each layer of the m-th input model and the quasi-global model; $\rho$ is a task-dependent regularization parameter and can be manually set based on experiment requirements.

To obtain the weights $\alpha_m$, we use the attention mechanism and apply it to the layer-wise parameters. For the m-th input model, the parameter of the l-th layer is denoted as $w_{I,m}^l$. Similarly, the parameter of the l-th layer of the output model is denoted as $w_O^l$. The time stamps are omitted in $w_{I,m}^l$ and $w_O^l$ for simplicity. Based on the layer-wise parameters, the distance between $w_{I,m}^l$ and $w_O^l$ can be calculated by the Frobenius norm of their difference, which is expressed as

$$s_m^l = L(w_{I,m}^l, w_O^l) = \| w_{I,m}^l - w_O^l \|_2^2. \tag{5}$$

Subsequently, the softmax function is applied to $s_m^l$ and maps the non-normalized distance values to a probability distribution over the M input models. In this way, the contributions of these models can be determined. The standard softmax function $\sigma(\cdot)$ is described as

$$\alpha_m^l = \sigma(s_m^l) = \frac{e^{s_m^l}}{\sum_{m=1}^{M} e^{s_m^l}}. \tag{6}$$

Similarly, the values of $\beta$ can be obtained. After we get $\alpha_m$ and $\beta$, the parameters of the output model can be updated by the gradient descent algorithm. We first calculate the derivative of (4) with respect to $w_O^t$ and obtain the corresponding gradient

$$\nabla = \sum_{m=1}^{M} \alpha_m (w_O^t - w_{I,m}^{t+1}) + \rho \beta (w_O^t - w_Q). \tag{7}$$

With the derived gradient, the output model parameters can be updated by

$$w_O^{t+1} = w_O^t - \gamma \left( \sum_{m=1}^{M} \alpha_m (w_O^t - w_{I,m}^{t+1}) + \rho \beta (w_O^t - w_Q) \right), \tag{8}$$

where $\gamma$ is a predetermined step size that controls how much $w_O$ should move in the direction of the opposite gradient in every iteration. The whole procedure of our proposed FedDA is summarized in Algorithm 2.

Algorithm 2: Implementation of FedDA
Input: Wireless traffic data $\{x^k, y^k\}_{k=1}^{K}$; quasi-global model, $w_Q$; fraction of BSs, δ; learning rate of local BS, η; step size of server side, γ.
Output: Global model, $w_G$
1 for each round t = 1, 2, ··· do
2     m ← max(K · δ, 1)
3     $S_t$ ← a random set of m BSs
4     for each client k ∈ $S_t$ do
5         $w_k^{t+1} \leftarrow w_k^t - \eta \nabla_{w^t} L(f(x^k; w^t), y^k)$
6     for cluster c = 1, 2, ··· do
7         $S_c$ ← a set of BSs with cluster label c
8         Obtain $w_{R,c}^{t+1}$ by using Equations (5) to (8)
9     Obtain $w_G^{t+1}$ by using Equations (5) to (8)

V. EXPERIMENTS

In this section, extensive experiments are conducted to validate the effectiveness and efficiency of FedDA. We begin with a brief introduction of the dataset, evaluation metrics, and baseline methods. Then, experimental settings, such as learning rate and batch size, are given. After that, we present experimental results, including the overall prediction performance of various methods and the influence of learning (hyper-)parameters on prediction performance.

A. Dataset and Evaluation Metrics

The datasets used in this paper come from the Big Data Challenge [34] launched by Telecom Italia and are mainly Call Detail Records (CDR) of two Italian areas, i.e., the city of Milan and the province of Trentino. The area of Milan is divided into a grid of 10000 cells, and for Trentino, the grid has 6575 cells. In each cell, the users' telecommunication activities are served and logged by the BS, and thus we use BS or cell interchangeably to denote a cell. There are three types of wireless traffic, which correspond to SMS, voice Call, and Internet services. The traffic is logged every ten minutes over two months, from 11/01/2013 to 01/01/2014. For experiments in the following subsections, the traffic is resampled to hourly resolution to circumvent the data sparsity problem. These two datasets are publicly available and can be accessed on Harvard Dataverse [35, 36]. To evaluate prediction performance, two widely used regression metrics are adopted in this paper, i.e., mean squared error (MSE) and mean absolute error (MAE).

B. Baseline Methods

We compare our proposed wireless traffic prediction framework, FedDA, with five baseline methods.
• Lasso: A linear model for regression.
• Support Vector Regression (SVR) [37]: SVR is one of the most classical machine learning algorithms and has been successfully used for traffic prediction.
• LSTM [27]: LSTM has a strong ability to model time series datasets and normally has better prediction performance than linear models and shallow-learning models.
• FedAvg [10]: FedAvg was first proposed in the pioneering work of federated learning. It adopts an average of local weights for model aggregation.
• FedAtt [31]: This algorithm is similar to FedAvg. However, when performing model aggregation in the central server, it differentiates and quantifies the contributions of different client models to the global model.

The first three baselines are trained in a fully distributed way. That is, the model is trained per client. The latter two baselines and our FedDA are trained in a federated way.

C. Experimental Settings and Overall Results

Without loss of generality, we randomly select 100 cells from each dataset and carry out experiments on the three kinds of wireless traffic of these cells. The traffic from the first seven weeks is used to train prediction models and the traffic from the last week is used for testing. When constructing training samples using the sliding window scheme, the lengths of closeness dependence p and periodicity dependence q are both set to 3. Considering that the edge client has limited computing power and thermal constraints, a relatively lightweight LSTM network is adopted. Specifically, the network has two LSTM layers and each layer has 64 hidden neurons, followed by a linear layer that maps the features to predictions. All baselines except shallow learning algorithms share the same network architecture for the sake of fairness. Unless otherwise specified, we take 100 communication rounds between local clients and the central server and report results on the final model. The regularization term ρ is determined through a grid search with values ranging from −0.3 to 0.3 and step size
TABLE I: Prediction performance comparisons among different methods in terms of MSE and MAE on two datasets ('↑' denotes the performance gain of FedDA over FedAtt).

| Methods | Milano MSE SMS | Milano MSE Call | Milano MSE Internet | Milano MAE SMS | Milano MAE Call | Milano MAE Internet | Trento MSE SMS | Trento MSE Call | Trento MSE Internet | Trento MAE SMS | Trento MAE Call | Trento MAE Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lasso | 0.7580 | 0.3003 | 0.4380 | 0.6231 | 0.4684 | 0.5475 | 4.7363 | 1.6277 | 5.9121 | 1.3182 | 0.8258 | 1.5391 |
| SVR | 0.4144 | 0.0919 | 0.1036 | 0.3528 | 0.1852 | 0.2220 | 5.2285 | 1.7919 | 5.9080 | 1.0390 | 0.5656 | 1.0470 |
| LSTM | 0.5608 | 0.1379 | 0.1697 | 0.4287 | 0.2458 | 0.2936 | 3.6947 | 1.1378 | 4.6976 | 0.9426 | 0.5013 | 1.1193 |
| FedAvg | 0.3744 | 0.0776 | 0.1096 | 0.3386 | 0.1838 | 0.2319 | 2.2287 | 1.6048 | 4.7988 | 0.7416 | 0.5319 | 1.0668 |
| FedAtt | 0.3667 | 0.0774 | 0.1096 | 0.3375 | 0.1837 | 0.2321 | 2.1558 | 1.5967 | 4.7645 | 0.7444 | 0.5306 | 1.0629 |
| FedDA (ϕ=1) | 0.3559 | 0.0752 | 0.1118 | 0.3353 | 0.1820 | 0.2367 | 2.1468 | 1.4925 | 4.4335 | 0.7478 | 0.5140 | 1.0212 |
| FedDA (ϕ=10) | 0.3481 | 0.0753 | 0.1062 | 0.3321 | 0.1810 | 0.2275 | 2.0719 | 1.1699 | 3.9266 | 0.7320 | 0.4543 | 0.9504 |
| FedDA (ϕ=100) | 0.3322 | 0.0659 | 0.1033 | 0.3214 | 0.1741 | 0.2211 | 1.9703 | 1.0592 | 2.4473 | 0.6920 | 0.4281 | 0.7471 |
| ↑ (ϕ=100) | +9.4% | +14.9% | +5.8% | +4.8% | +5.2% | +4.7% | +8.6% | +33.7% | +48.6% | +7.0% | +19.3% | +29.7% |
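
The gain row of Table I can be checked with a few lines of arithmetic. The sketch below assumes that the gain is the relative error reduction of FedDA (ϕ=100) with respect to FedAtt; this formula is an assumption, since it is not stated explicitly here, but it reproduces the Trento MSE entries of the table. The function name `relative_gain` is illustrative only.

```python
# Values copied from Table I (Trento, MSE): FedAtt versus FedDA with phi=100.
fedatt = {"SMS": 2.1558, "Call": 1.5967, "Internet": 4.7645}
fedda = {"SMS": 1.9703, "Call": 1.0592, "Internet": 2.4473}


def relative_gain(baseline, proposed):
    """Percentage error reduction of `proposed` relative to `baseline` (assumed definition)."""
    return 100.0 * (baseline - proposed) / baseline


# Round to one decimal place, as in the table.
gains = {k: round(relative_gain(fedatt[k], fedda[k]), 1) for k in fedatt}
print(gains)  # {'SMS': 8.6, 'Call': 33.7, 'Internet': 48.6}
```

Because the table entries are themselves rounded to four decimals, recomputing other columns this way may differ from the printed gains in the last digit.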

[Figure 4 panels: for each of the SMS, Call, and Internet traffic, the traffic volume versus time index (ground truth, FedDA, FedAtt) and the CDF of the absolute error (percentage versus absolute error).]

(a) Results on the Milano dataset for randomly selected cells. (b) Results on the Trento dataset for randomly selected cells.

Fig. 4: Comparisons between predictions and the real values and the corresponding error analysis.
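
The error analysis in Fig. 4 relies on the empirical CDF of the absolute prediction error. As a minimal sketch of how such a value (for example, the share of errors smaller than 1) and the average absolute error could be computed, consider the following; the function names and the toy numbers are hypothetical and are not taken from the datasets.

```python
def abs_errors(y_true, y_pred):
    """Absolute prediction error for each time index."""
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def fraction_below(y_true, y_pred, threshold):
    """Empirical CDF of the absolute error evaluated at `threshold`."""
    errors = abs_errors(y_true, y_pred)
    return sum(e < threshold for e in errors) / len(errors)


def mean_abs_error(y_true, y_pred):
    """Average absolute prediction error."""
    errors = abs_errors(y_true, y_pred)
    return sum(errors) / len(errors)


# Hypothetical toy series, used only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 5.0, 3.8]
print(fraction_below(y_true, y_pred, 1.0))  # 0.75
```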

0.1. The cluster size C is set to 16. Similar to the standard settings in FL [10], the number of local epochs and the local batch size are set to 1 and 20, respectively. In each communication round, 10% of the cells are involved in model training. We use stochastic gradient descent (SGD) as the optimizer to update our models with a learning rate of 0.01.

The experimental results of different prediction methods are presented in Table I. Note that in Table I, our proposed method has three variations depending on how many augmented data samples are shared, i.e., ϕ=1, ϕ=10, and ϕ=100. In practice, the amount of transferred data samples can be flexibly adjusted according to the network situation. It can be seen from this table that our proposed method, FedDA, outperforms all the baseline methods for all kinds of wireless traffic in both datasets, even when only 1% of the augmentation data is shared. Here and in the following experiments, we report the results of FedDA with ϕ=100 unless otherwise specified. Specifically, for the SMS, Call, and Internet service traffic of the Milano dataset, compared with the best-performing method among the baselines, namely FedAtt, FedDA offers MSE gains of 9.4%, 14.9%, and 5.8%, respectively. Likewise, for the Trento dataset, FedDA yields performance gains of 8.6% (SMS), 33.7% (Call), and 48.6% (Internet), respectively. In terms of MAE, though the improvements are not as remarkable as for MSE, an average performance gain of 4.9% (18.7%) is still achieved for the Milano (Trento) dataset. We can also notice that the prediction performance of FedDA improves consistently as the size of the shared augmentation data increases. This is because the quasi-global model can capture the traffic patterns better when more data samples are available. The success of FedDA can be attributed to the following reasons:

• Compared with fully distributed algorithms (SVR and LSTM), which only consider the temporal dependence of wireless traffic, FedDA can capture both the spatial dependence and the temporal dependence by means of model fusion, and is thus more robust;
• Compared with conventional FL algorithms (FedAvg and FedAtt), the introduced clustering strategy makes the learning process of FedDA case-specific, and the dual attention scheme greatly reduces data heterogeneity. Therefore, FedDA has a high generalization ability;
• FedDA can balance between capturing the unique characteristics of a cluster and the shared macro traffic patterns among different clusters, and thus has more accurate predictions.
[Figure 5 panels: R-squared score versus communication rounds (0 to 100) for Milano-SMS, Milano-Call, Milano-Internet, Trento-SMS, Trento-Call, and Trento-Internet; curves for FedDA and FedAtt.]

Fig. 5: Prediction accuracy versus communication rounds.

[Figure 6 panels: MSE (upper) and MAE (lower) versus ρ for Milano-SMS, Milano-Call, and Milano-Internet; curves for FedDA, FedAvg, and FedAtt.]

Fig. 6: Influence of ρ on prediction performance for the Milano dataset.
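
The accuracy plotted in Fig. 5 is an R-squared score. As a rough sketch, assuming the standard coefficient-of-determination definition (the cited reference [38] discusses several variants, so the exact form used may differ), it can be computed as follows; the function name `r_squared` is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares between ground truth and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the ground truth.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy series: a perfect predictor scores 1, a mean predictor scores 0.
y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, y))                     # 1.0
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # 0.0
```

A score close to 1 means that the predictions explain most of the variance of the ground truth, which is why it can serve as an accuracy measure across traffic types with different scales.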

[Figure 7 panels: MSE (upper) and MAE (lower) versus ρ for Trento-SMS, Trento-Call, and Trento-Internet; curves for FedDA, FedAvg, and FedAtt.]

Fig. 7: Influence of ρ on prediction performance for the Trento dataset.

Besides, the results also indicate that, in comparison with fully distributed algorithms, FL-based algorithms achieve better predictions, and FedAtt achieves the second best prediction performance, followed by the classic FedAvg algorithm. This is rather intuitive, since there is no knowledge sharing when training prediction models with fully distributed algorithms. The lack of knowledge sharing results in a loss of prediction accuracy.

To further evaluate the predictive ability of different algorithms, the comparisons between predicted values and the real values are given in Fig. 4. The cumulative distribution functions (CDFs) of the absolute prediction error are also included in Fig. 4 to quantitatively measure the goodness of the prediction models. Fig. 4a (Fig. 4b) represents results on the Milano (Trento) dataset. More specifically, the left three subfigures of Fig. 4a (Fig. 4b) show the comparisons between predictions and the ground truth for the SMS, Call, and Internet service traffic of randomly selected cells, and the right three subfigures are the corresponding CDFs of errors. Here, we choose FedAtt as the benchmark for performance comparison, since it achieves the best performance among all baseline methods in Table I. By observing Fig. 4, we can tell that FedDA obtains consistently better prediction performance than FedAtt on all three kinds of wireless traffic. Meanwhile, it has smaller prediction errors, especially when the traffic volume becomes high and unstable. For prediction errors, taking the SMS traffic of the Trento dataset as an example, approximately 83% of the errors are smaller than 1 for FedDA, while the figure for FedAtt is about 76%. Moreover, the average prediction errors for FedAtt and FedDA are 0.65 and 0.54, respectively. Based on the above evaluation, we can conclude that FedDA achieves more accurate prediction results than the baseline methods.

D. Communication Rounds Versus Prediction Accuracy

In FL or any other distributed learning framework, it is assumed that communication resources are more precious than computation resources and that fewer communications are preferred. Thus, in this subsection, we report the prediction accuracy at each communication round (epoch) and use the R-squared score to denote the accuracy, as it reflects how well ground truth values are predicted by the model [38]. The obtained results are summarized in Fig. 5, in which the upper (lower) three subfigures represent results on the Milano (Trento) dataset. From Fig. 5 we can observe that FedDA achieves higher accuracy on both datasets and that its advantages are clearer on the Trento dataset. More importantly, FedDA needs far fewer communication rounds to achieve a given prediction accuracy. Taking the Milano dataset as an example, after 30 communication rounds, FedDA achieves accuracies of 0.74, 0.89, and 0.86 for the SMS, Call, and Internet traffic, respectively, whereas FedAtt achieves 0.69, 0.84, and 0.83. Thus we argue that our proposed method is communication-efficient, which is key for performing learning tasks at the edge.

E. Effect of ρ on Prediction Performance

As ρ is a task-dependent regularizer that strongly affects the prediction performance, we present the prediction performance of FedDA as a function of ρ and summarize the results in Fig. 6 and Fig. 7 for the Milano and Trento datasets,
[Figure 8 panels: normalized MSE and normalized MAE for the SMS, Call, and Internet traffic with C=1, C=16, and C=32.]


(a) Results on the Milano dataset. (b) Results on the Trento dataset.

Fig. 8: Cluster size versus prediction performance.

respectively. In both figures, the value of ρ ranges from −0.3 to 0.3 with a step size of 0.1, and the obtained MSE (MAE) results are given in the upper (lower) three subfigures. Besides FedDA, the other two FL-based methods, i.e., FedAvg and FedAtt, are also included for comparison, but their MSE and MAE results remain constant as they are not affected by ρ. We can tell from Fig. 6 and Fig. 7 that for FedDA, as ρ increases, the values of MSE and MAE first decrease rapidly and then slowly increase. For the other two baselines, FedAtt achieves slightly better results than FedAvg. Though ρ has a great influence on the prediction performance, FedDA generally yields lower MSE and MAE values than FedAvg and FedAtt. Taking the SMS traffic of the Milano dataset as an example, the obtained MSE results are always better than those of FedAvg and FedAtt regardless of the choice of ρ; for MAE, a similar conclusion holds except when ρ=−0.3, for which FedDA achieves worse results than FedAvg and FedAtt. Nonetheless, the optimal value of ρ can be determined by a grid search during model training, and the cost is low. The results in these two figures demonstrate that the dual attention scheme in FedDA can indeed improve prediction performance by introducing prior knowledge, and that the influence varies across datasets.

F. Effect of C on Prediction Performance

The cluster size C determines how many cells are involved in model aggregation and also affects the final prediction performance. Thus, in this subsection, we explore how the cluster size affects the prediction performance of FedDA; the obtained MSE and MAE results are plotted in Fig. 8. In particular, Fig. 8a (Fig. 8b) shows the results on the Milano (Trento) dataset. We consider three scenarios, i.e., C=1, C=16, and C=32. Note that C=1 means that no clustering is adopted in prediction. In addition, to make the comparison clearer, the MSE and MAE results of C=16 and C=32 are normalized by the results of C=1. We can observe that the choice of C influences the prediction performance of FedDA. In most cases, introducing the clustering strategy indeed leads to lower prediction errors. Specifically, we can observe from Fig. 8a that FedDA achieves considerable performance improvements for the SMS and Call traffic when the cluster size is 16 or 32. For the Internet traffic, though the performance degrades slightly when C=16, it improves when C=32. For the Trento dataset, introducing the clustering strategy always yields better prediction performance than not doing so, especially for the Internet traffic, on which the improvement is up to 50%. Overall, the results in Fig. 8 demonstrate the benefit of introducing clustering into FedDA. This is because the cluster size C controls the specialty of FedDA and thereby affects the global model. If no clustering strategy is involved, all data is mixed together to generate the global model. In this case, some unique traffic patterns hidden in the data cannot be captured by FedDA, which leads to performance degradation.

VI. CONCLUSION

In this work, we investigated the wireless traffic prediction problem and proposed a novel framework called FedDA. To deal with the heterogeneity of wireless traffic data, we proposed a data-sharing strategy in FedDA that transfers a small augmented traffic dataset to the central server, by which a quasi-global model is obtained and shared among all BSs. Besides, we also introduced an iterative clustering algorithm to cluster BSs into different groups by considering both the wireless traffic pattern and the geo-location information. To enhance the generalization ability of the global model, we proposed a dual attention-based model aggregation scheme that pays attention to the unequal contributions of different local models and the quasi-global model. The aggregation scheme is applied over a hierarchical architecture so as to capture both intra-cluster and inter-cluster patterns of wireless data traffic. Finally, we verified the effectiveness and efficiency of FedDA on two real-world datasets.

REFERENCES

[1] K. David and H. Berndt. 6G vision and requirements: Is there any need for beyond 5G? IEEE Vehicular Technology Magazine, 13(3):72–80, 2018.
[2] B. Zong, C. Fan, X. Wang, X. Duan, B. Wang, and J. Wang. 6G technologies: Key drivers, core requirements, system architectures, and enabling technologies. IEEE Vehicular Technology Magazine, 14(3):18–27, 2019.
[3] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini. What should 6G be? Nature Electronics, 3(1):20–29, 2020.
[4] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. A. Zhang. The roadmap to 6G: AI empowered wireless networks. IEEE Communications Magazine, 57(8):84–90, 2019.
[5] M. Yao, M. Sohul, V. Marojevic, and J. H. Reed. Artificial intelligence defined 5G radio access networks. IEEE Communications Magazine, 57(3):14–20, 2019.
[6] J. Wang, J. Tang, Z. Xu, Y. Wang, G. Xue, X. Zhang, and D. Yang. Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach. In 2017 IEEE Conference on Computer Communications (INFOCOM), pages 1–9, 2017.
[7] C. Zhang, H. Zhang, J. Qiao, D. Yuan, and M. Zhang. Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data. IEEE Journal on Selected Areas in Communications, 37(6):1389–1401, 2019.
[8] Y. Xu, F. Yin, W. Xu, J. Lin, and S. Cui. Wireless traffic prediction with scalable gaussian process: Framework, algorithms, and verification. IEEE Journal on Selected Areas in Communications, 37(6):1291–1306, 2019.
[9] N. Kato, B. Mao, F. Tang, Y. Kawamoto, and J. Liu. Ten challenges in advancing machine learning technologies toward 6G. IEEE Wireless Communications, 27(3):96–103, 2020.
[10] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 1273–1282, 2017.
[11] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2):1–19, 2019.
[12] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
[13] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
[14] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[15] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong. Federated learning over wireless networks: Optimization model design and analysis. In 2019 IEEE Conference on Computer Communications (INFOCOM), pages 1387–1395, 2019.
[16] H. Liu, B. Liu, H. Zhang, L. Li, X. Qin, and G. Zhang. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 436-437:247–267, 2018.
[17] R. J. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
[18] J. M. Hamilton. Time Series Analysis, volume 2. Princeton New Jersey, 1994.
[19] Y. Shu, M. Yu, O. Yang, J. Liu, and H. Feng. Wireless traffic modeling and prediction using seasonal arima models. IEICE Transactions on Communications, 88(10):3992–3999, 2005.
[20] B. Zhou, D. He, and Z. Sun. Traffic predictability based on ARIMA/GARCH model. In 2006 2nd Conference on Next Generation Internet Design and Engineering, pages 200–207, Apr. 2006.
[21] F. Xu, Y. Lin, J. Huang, D. Wu, H. Shi, J. Song, and Y. Li. Big data driven mobile traffic understanding and forecasting: A time series approach. IEEE Transactions on Services Computing, 9(5):796–805, 2016.
[22] R. Li, Z. Zhao, J. Zheng, C. Mei, Y. Cai, and H. Zhang. The learning and prediction of application-level traffic data in cellular networks. IEEE Transactions on Wireless Communications, 16(6):3899–3912, 2017.
[23] R. Li, Z. Zhao, X. Zhou, J. Palicot, and H. Zhang. The prediction analysis of cellular radio access network traffic: From entropy theory to networking practice. IEEE Communications Magazine, 52(6):234–240, 2014.
[24] X. Chen, Y. Jin, S. Qiang, W. Hu, and K. Jiang. Analyzing and modeling spatio-temporal dependence of cellular traffic at city scale. In 2015 IEEE International Conference on Communications (ICC), pages 3585–3591, 2015.
[25] H. Liu, B. Xu, D. Lu, and G. Zhang. A path planning approach for crowd evacuation in buildings based on improved artificial bee colony algorithm. Applied Soft Computing, 68:360–376, 2018.
[26] L. Nie, D. Jiang, S. Yu, and H. Song. Network traffic prediction based on deep belief network in wireless mesh backbone networks. In 2017 IEEE Wireless Communications and Networking Conference (WCNC), pages 1–5, 2017.
[27] C. Qiu, Y. Zhang, Z. Feng, P. Zhang, and S. Cui. Spatio-temporal wireless traffic prediction with recurrent neural network. IEEE Wireless Communications Letters, 7(4):554–557, 2018.
[28] C. Zhang, H. Zhang, D. Yuan, and M. Zhang. Citywide cellular traffic prediction based on densely connected convolutional neural networks. IEEE Communications Letters, 22(8):1656–1659, 2018.
[29] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
[30] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, 2020.
[31] S. Ji, S. Pan, G. Long, X. Li, J. Jiang, and Z. Huang. Learning private neural language modeling with attentive aggregation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2019.
[32] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 1655–1661. AAAI Press, 2017.
[33] Q. Wen, L. Sun, X. Song, J. Gao, X. Wang, and H. Xu. Time series data augmentation for deep learning: A survey. arXiv preprint arXiv:2002.12478, 2020.
[34] G. Barlacchi, M. D. Nadai, R. Larcher, A. Casella, C. Chitic, G. Torrisi, F. Antonelli, A. Vespignani, A. Pentland, and B. Lepri. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Scientific Data, 2:150055, 2015.
[35] Telecom Italia. Telecommunications - SMS, Call, Internet - TN, 2015. URL https://doi.org/10.7910/DVN/QLCABU.
[36] Telecom Italia. Telecommunications - SMS, Call, Internet - MI, 2015. URL https://doi.org/10.7910/DVN/EGZHFV.
[37] H. Feng, Y. Shu, S. Wang, and M. Ma. Svm-based models for predicting wlan traffic. In 2006 IEEE International Conference on Communications (ICC), volume 2, pages 597–602, 2006.
[38] A. Cameron and F. A. G. Windmeijer. R-squared measures for count data regression models with applications to health-care utilization. Journal of Business & Economic Statistics, 14(2):209–220, 1996.
