
Journal of King Saud University – Computer and Information Sciences 34 (2022) 5998–6007


Deep learning and optimisation for quality of service modelling


Krishnakumar Arunachalam a,*, Senthilkumaran Thangamuthu a, Vijayanand Shanmugam b, Mukesh Raju c, Kamali Premraj d

a Department of CSE, ACS College of Engineering, Bangalore, India
b Department of CSE, RajaRajeswari College of Engineering, Bangalore, India
c Department of Aerospace Engineering, ACS College of Engineering, Bangalore, India
d Department of Mathematics, Khadir Mohideen College, Adirampattinam, India

* Corresponding author. E-mail address: [email protected] (K. Arunachalam).
https://doi.org/10.1016/j.jksuci.2022.01.016

Article info

Article history: Received 23 May 2021; Revised 16 December 2021; Accepted 22 January 2022; Available online 10 February 2022.
Keywords: Optimisation; Deep learning; Machine learning; Time-series modelling.

Abstract

Machine learning is increasingly used to create digital twins for data collected from various underlying engineering processes. Such digital twins can be used in a wide variety of activities, such as optimisation and forecasting of future data. In this respect, forecasting the evolution of time-series data over future time-steps is often encountered in various engineering systems and applications. In particular, probabilistic forecasting of time-series data is often encouraged over point-based predictions, but is challenging to achieve. In this work, deep learning (DL) technology is combined with various state-of-the-art mathematical optimisation algorithms in order to effectively achieve 'confidence-based' probabilistic predictions of Quality of Service (QoS) data emanating from various low-powered Internet of Things (IoT) devices. The results demonstrate that Deep Neural Networks (DNN), if combined with the right mathematical optimisation algorithm, can help generate accurate probabilistic forecasts for both single time-series and a combination of multiple time-series data.

© 2022 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The use of Artificial Intelligence (AI) and Machine Learning (ML) approaches to create digital twins or machine learning models (or simply models) for data emanating from a variety of sources is becoming increasingly popular across various disciplines. The models are often used for various purposes such as understanding the source of data, forecasting the evolution of data, serving as a replacement for a computationally expensive data source during optimisation, etc. In particular, time-series data (i.e., data where the quantity-of-interest varies over time) is a form of data that is very often encountered in various sectors. Creating models and making forecasts for such time-varying data is one of the important aspects of machine learning.

Traditional time-series analysis methods such as linear, average and ARIMA models play a significant role in addressing the problem of forecasting time-varying data (Box and Jenkins, 1968; Díaz-Robles, 2008). However, some of the traditional methods such as ARIMA demand expertise in setting or tuning values for some of the model parameters (i.e., model hyperparameters), such as seasonality. Besides, the majority of the forecasting methods are developed or used in order to build a model and forecast a single time-series. In this kind of setting, models are built for each time-series separately, and forecasting for each time-series is made with the corresponding model. The manual selection and tuning of each model is widely done accounting for various factors such as seasonality, trend, autocorrelation structure, etc. State space models, exponential smoothing techniques and traditional Box–Jenkins techniques are some of the well-known methodologies used to model and forecast individual time-series data (Box and Jenkins, 1968).

Over the past few years, building a single model (sometimes referred to as the global model) that can approximate and forecast more than one time-series at one go has been gaining importance in many areas of application (Graves, 2014; Sutskever et al., 2014). Approximating multiple time-series using a single model can not only alleviate the various time-intensive manual steps, but can also allow each time-series to be approximated effectively using the related time-series (i.e., covariates) data (Kourentzes, 2013; Zhang et al., 1998). Additionally, forecasting the probability distribution of the time-series data rather than a point-estimate is another aspect that is becoming an important requirement in various areas of application. Various neural network architecture based methods exist in the literature which can be used to
build a probabilistic single model involving multiple time-series data (Sutskever et al., 2014).

The primary contribution of this work is threefold: (1) we demonstrate the use of Deep Neural Networks in building models combining multiple time-series data and making forecasts for all of the time-series involved; (2) further, we demonstrate the use of Deep Neural Networks in making probabilistic forecasts involving multiple time-series data; (3) furthermore, we demonstrate the effect of various state-of-the-art optimisers on building accurate Deep Neural Network based models for making probabilistic forecasts involving multiple time-series data. All of these contributions are demonstrated using a collection of open-source QoS time-series data originally emanating from low-powered IoT devices (White et al., 2018).

The remaining part of the paper is organised as follows. We begin by discussing the methodology, which incorporates both the dataset and the probabilistic model, in Section 2. Section 3 provides an overview of the various state-of-the-art optimisers employed in this paper. Section 4 provides the results along with the corresponding discussions. Section 5 concludes our findings along with potential future work.

2. Methodology

2.1. Dataset

The primary dataset used in this paper represents the QoS parameter (i.e., Throughput; Nallur and White (2017) and Xia (2008)) recorded over a month using various sensor devices for services invoked in a custom setup using Raspberry Pi devices. The dataset was made publicly available by White et al. (2018). The dataset contains various time-series data reflecting the variation of QoS with respect to time, thus serving as one of the ideal choices for our current work. The QoS parameter is measured using 10 different types of sensors, which include: pressure, altitude, humidity, two types of temperature sensors, LPG, CO, smoke, an infrared motion sensor and a photoresistor. The QoS is measured every 5 min over a period of a month. Further information on the dataset can be obtained from White et al. (2018). Additionally, we use a secondary dataset from Zheng et al. (2010) and Zheng et al. (2014) due to the existence of a wide variety of web services in that dataset (http://wsdream.github.io). This dataset contains the real-world QoS (i.e., Throughput) from 339 users on 5825 web services. We extract the data corresponding to 10 users (i.e., 10 time-series, each with 5825 data points) and use it for our work. More information on the dataset can be obtained from Zheng et al. (2010) and Zheng et al. (2014). Furthermore, the readers are referred to Syu and Wang (2021) for a comprehensive survey of various QoS datasets and state-of-the-art articles which use QoS in the context of forecasting and deep learning.

2.2. Probabilistic model

A probabilistic forecasting method with autoregressive Recurrent Neural Networks (RNN) (Sutskever, 2013), known as DeepAR (Salinas et al., 2019), is used in this work. The approach was originally proposed by Graves (2014) and Kaastra and Boyd (1996). Various concepts of RNNs play a significant role in the way the DeepAR technique works. For example, RNNs are deterministic in nature and are non-linear dynamic systems, as compared to traditional forecasting techniques such as exponential smoothing, which is a linear non-deterministic dynamic system (Hyndman et al., 2008; Zhang et al., 1998). Although the recursive nature of RNNs leads to an ill-conditioned optimisation problem, this is effectively alleviated by another variant of RNN known as Long Short-Term Memory (LSTM), which is used by DeepAR. Another concept from RNNs is the encoder-decoder framework, which is also utilised by DeepAR.

DeepAR allows RNNs to map an input vector $x = (x_1, \ldots, x_{n_x})$ to an output vector $y = (y_1, \ldots, y_{n_y})$ of varying lengths. It makes use of LSTM cells in its architecture, which allows the modelling of many time-series sequences simultaneously. Besides, the encoder-decoder framework allows the forecasts to be obtained in varying sequence lengths for many time-series sequences simultaneously. It is also possible to model individual time-series sequences based on the related time-series sequences (i.e., covariates) in DeepAR. This allows time-series sequences with little historical data to be fitted. Additionally, the quantile estimates of the forecasts can be obtained using Monte-Carlo sampling in DeepAR.

DeepAR models the conditional distribution of future values of each time-series

$z_{i,t_0:T} = [z_{i,t_0}, z_{i,t_0+1}, \ldots, z_{i,T}]$    (1)

as

$P(z_{i,t_0:T} \mid z_{i,1:t_0-1}, x_{i,1:T})$    (2)

given the past values of each time-series

$z_{i,1:t_0-1} = [z_{i,1}, \ldots, z_{i,t_0-2}, z_{i,t_0-1}]$,    (3)

where $x_{i,1:T}$ are covariates which are assumed to be known at all time-steps and $t_0$ is the last time-step at which the historical values of $z_{i,t}$ are known. Based on the autoregressive formulation of the RNN, the model distribution is assumed to be a product of likelihood factors

$Q_\Theta(z_{i,t_0:T} \mid z_{i,1:t_0-1}, x_{i,1:T}) = \prod_{t=t_0}^{T} Q_\Theta(z_{i,t} \mid z_{i,1:t-1}, x_{i,1:T})$    (4)

which, in turn, can be represented as a parameterisation of the output $h_{i,t}$ from an autoregressive RNN

$Q_\Theta(z_{i,t_0:T} \mid z_{i,1:t_0-1}, x_{i,1:T}) = \prod_{t=t_0}^{T} \ell(z_{i,t} \mid \theta(h_{i,t}, \Theta))$,    (5)

where $h_{i,t}$ is parameterised by a function represented by an RNN composed of multiple layers of LSTM cells. Further information on the mathematical formulation of the DeepAR approach can be obtained from Gers and Schmidhuber (2001), Gregor and Danihelka (2015), and Salinas et al. (2019).

The MxNet-GluonTS (Salinas et al., 2019) framework is used in our work to build probabilistic DeepAR models. It is an open-source machine learning framework much like Tensorflow, PyTorch, etc.; however, GluonTS mainly focuses on the time-series modelling aspect of machine learning problems. The DeepAR models in our work are built with a context length (i.e., historical time-steps) of 5, and the forecasts are made for 100 future time-steps. The DeepAR architecture is composed of 2 RNN layers, and each RNN layer contains 40 RNN cells. Data scaling is automatically applied by the DeepAR architecture on both the training and the validation dataset. RNN cell based dropout is used with a regularisation parameter of 0.1. A batch size of 32 is used during each epoch of the model training. No imputation or temporal activation regularisation is applied during the model training.
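For concreteness, the following is a minimal sketch (not the authors' published code) of how a DeepAR model with the settings described above could be assembled in MxNet-GluonTS. The variable `qos_series` is a hypothetical list of the 10 recorded sensor series, and the import paths follow a GluonTS 0.x release with the MXNet backend, so they may differ between versions.

```python
from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

# qos_series: assumed list of 10 pandas Series (one per IoT sensor), sampled every 5 minutes;
# the last 100 points of each series are held out for validation.
train_ds = ListDataset(
    [{"start": s.index[0], "target": s.values[:-100]} for s in qos_series],
    freq="5min",
)

estimator = DeepAREstimator(
    freq="5min",            # 5-minute sampling interval of the QoS data
    prediction_length=100,  # forecast horizon of 100 future time-steps
    context_length=5,       # 5 historical time-steps of context
    num_layers=2,           # 2 RNN (LSTM) layers
    num_cells=40,           # 40 cells per RNN layer
    dropout_rate=0.1,       # RNN-cell dropout regularisation
    trainer=Trainer(batch_size=32),  # batch size 32; argument location varies across GluonTS versions
)
predictor = estimator.train(train_ds)
```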
5999
K. Arunachalam, S. Thangamuthu, V. Shanmugam et al. Journal of King Saud University – Computer and Information Sciences 34 (2022) 5998–6007

2.3. Error metrics

The performance of models is often assessed by the values of various error metrics. In our work, we use the Root Mean Square Error (RMSE), the Normalized Root Mean Square Error (NRMSE) and various Weighted Quantile Loss (WQL) metrics in order to evaluate the performance of the models (Armstrong and Collopy, 1992). Both RMSE and NRMSE can be perceived as providing the overall robustness (i.e., how accurate the models are across many test sample points) of the models. However, the reliability (i.e., how inaccurate the models are across many test sample points) of the models is often not captured by RMSE and NRMSE. The WQL metric allows one to capture the reliability of the models, to some extent, at various confidence regions.

2.3.1. RMSE

The Root Mean Square Error (Chai and Draxler, 2014) is used to measure the error between two given datasets, i.e., comparing the predicted values with the known or observed values. If the RMSE value is small, then the predicted values are close to the actual values.

$\mathrm{RMSE} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$    (6)

where $n$ is the number of observations, $y_i$ denotes the predicted values and $\hat{y}_i$ denotes the observed values.

2.3.2. NRMSE

The Normalized Root Mean Square Error is a normalized version of the RMSE that can be represented as

$\mathrm{NRMSE} = \dfrac{\sqrt{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / n}}{\max(y_i) - \min(y_i)}$.    (7)

2.3.3. WQL

The Weighted Quantile Loss metric is used to evaluate the accuracy of a given model at various quantiles. It is particularly useful when dealing with the impact of underpredicting and overpredicting models (Srivastava, 2014). In our work, WQL is calculated at the 0.1, 0.5 and 0.9 quantiles and can be expressed as

$\mathrm{WQL} = 2\,\dfrac{\sum_{i=1}^{n} \left[ T \max\left(y_i - \hat{y}_i^{(T)},\, 0\right) + (1 - T) \max\left(\hat{y}_i^{(T)} - y_i,\, 0\right) \right]}{\sum_{i=1}^{n} |y_i|}$    (8)

where $T \in \{0.1, 0.5, \ldots, 0.9\}$ is the quantile in the set and $\hat{y}_i^{(T)}$ is the predicted $T$-quantile. Overall, for all the error metrics incorporated in our work (RMSE, NRMSE and WQL), small values indicate a superior model (Chen et al., 2017).

3. Optimisers

The following list of state-of-the-art optimisers is considered in this work. The optimisers are chosen in such a way that a wide variety of optimiser classes (i.e., adaptive learning methods, online learning methods, stochastic methods and Bayesian methods), which are available in the MxNet-GluonTS framework, can be investigated.

- AdaGrad
- RMSprop
- AdaDelta
- Adam
- Nadam
- Signum
- LBSGD
- DCASGD
- FTRL
- FTML
- SGLD

3.1. AdaGrad

The Adaptive Gradient (AdaGrad) optimiser is one of the adaptive learning rate methods (Duchi and Singer, 2011). The adaptive learning methods, as the name suggests, adapt the global learning rate for each learnable parameter of the model during the process of optimisation. To be precise, in order to obtain the learning rate for each learnable parameter, the optimiser divides the global learning rate by the root of the accumulated squared gradients of the corresponding parameter. The optimiser performs the update step using the following set of equations:

$X_t = X_{t-1} - \dfrac{\eta}{\sqrt{S_t + \epsilon}} \cdot G_t$    (9)

and

$S_t = S_{t-1} + G_t^2$,    (10)

where $\eta = 0.001$ and $\epsilon = 10^{-7}$.

3.2. RMSprop

Root Mean Square Propagation (RMSprop) is an optimiser used to train neural networks. It was proposed by Geoff Hinton, one of the originators of backpropagation. In order to improve the learning rate, RMSprop uses an exponentially decaying average instead of the sum of squared gradients (Ruder, 2016; Tieleman and Hinton, 2012). It accumulates the gradients over a fixed window instead of letting all past gradients enter the momentum calculation, which helps the model adjust its learning rate automatically. The optimiser performs the update step using the following set of equations:

$S_t = \beta S_{t-1} + (1 - \beta) G_t^2$    (11)

and

$X_t = X_{t-1} - \dfrac{\eta}{\sqrt{S_t + \epsilon}} \cdot G_t$,    (12)

where $\beta \approx 0.9$ and $\epsilon = 10^{-6}$.

3.3. AdaDelta

AdaDelta (or Adaptive Delta) belongs to the family of Stochastic Gradient Descent (SGD) algorithms and provides a flexible technique for hyperparameter tuning (Zeiler, 2012). The term Delta denotes the difference between the present weight and the updated weight. AdaDelta is an extension of AdaGrad that resolves the key disadvantage of AdaGrad, namely the accumulation of squared gradients: in AdaGrad the accumulated sum keeps growing, which shrinks the learning rate. AdaDelta adapts the learning rate by accumulating over a fixed window of size $w$ instead of over all past gradients. In this way, the learning rate is kept balanced even after many updates are completed; thus, a default learning rate is not needed in AdaDelta. The following set of equations plays an important role in the update step:

$\theta_{t+1} = \theta_t + \Delta\theta_t$    (13)

and

$\Delta\theta_t = -\dfrac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\sqrt{S_t + \epsilon}} \cdot G_t$.    (14)

As compared to RMSprop, the difference is the usage of the delta parameter in place of the learning rate parameter.
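To make the three update rules above concrete, the following is a small NumPy sketch of Eqs. (9)–(14) for a single parameter vector. It is illustrative only; during training the corresponding built-in MXNet optimisers are used, and the default decay constants shown here are assumptions.

```python
import numpy as np

def adagrad_step(x, g, state, lr=0.001, eps=1e-7):
    # Eq. (10): accumulate squared gradients; Eq. (9): per-parameter scaled update.
    state["S"] = state.get("S", np.zeros_like(x)) + g ** 2
    return x - lr * g / np.sqrt(state["S"] + eps)

def rmsprop_step(x, g, state, lr=0.001, beta=0.9, eps=1e-6):
    # Eq. (11): exponentially decaying average of squared gradients; Eq. (12): update.
    state["S"] = beta * state.get("S", np.zeros_like(x)) + (1 - beta) * g ** 2
    return x - lr * g / np.sqrt(state["S"] + eps)

def adadelta_step(x, g, state, rho=0.9, eps=1e-6):
    # Eqs. (13)-(14): the global learning rate is replaced by the RMS of past updates.
    state["S"] = rho * state.get("S", np.zeros_like(x)) + (1 - rho) * g ** 2
    dx = -np.sqrt(state.get("D", np.zeros_like(x)) + eps) / np.sqrt(state["S"] + eps) * g
    state["D"] = rho * state.get("D", np.zeros_like(x)) + (1 - rho) * dx ** 2
    return x + dx
```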
3.4. Adam

Adaptive Moment Estimation (Adam) is one of the adaptive learning methods (Kingma and Ba, 2014). Adam is, in effect, a combination of the RMSprop and the momentum-based gradient descent optimisers. Adam implements the decaying root-mean-square average of the squared gradients as in the RMSprop optimiser; besides, it also makes use of a decaying average of the momentum. The optimiser first calculates the moving average of the gradient ($V_t$) and of the squared gradient ($S_t$), with hyperparameters $\beta_1$ and $\beta_2$ varying between [0, 1], as follows:

$V_t = \beta_1 V_{t-1} + (1 - \beta_1) G_t$    (15)

$S_t = \beta_2 S_{t-1} + (1 - \beta_2) G_t^2$    (16)

where

$\hat{V}_t = \dfrac{V_t}{1 - \beta_1^t}, \qquad \hat{S}_t = \dfrac{S_t}{1 - \beta_2^t}$,

$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{\hat{S}_t} + \epsilon}\, \hat{V}_t$

and $\eta$ is the learning rate.

3.5. Nadam

The Nesterov-Accelerated Adaptive Moment Estimation (Nadam) optimiser is one of the many variants of the Adam optimiser. The optimiser can be perceived as a combination of both the RMSprop and the Polyak momentum optimisers (Dozat, 2016a; Nesterov, 1983). However, its primary feature is that the Polyak momentum component is replaced by Nesterov momentum (Dozat, 2016b), which leads to a lookahead gradient being used to update the weighted mean of the momentum. Overall, the Nadam optimiser is slightly faster than the Adam optimiser, and often requires limited hyperparameter tuning. For more information on the Nadam optimiser, the readers are referred to Dozat (2016b).

3.6. Signum

Signum is one of the optimisers often used for large-scale distributed training problems. Signum solves the optimisation problem by effectively making use of the sign of the gradients resulting from each minibatch instead of the gradients themselves. The optimiser has two mathematical variants in the way it estimates the compressed gradients (Bernstein et al., 2018). Having momentum equal to 0 results in the first variant of the Signum optimiser (see Eq. (17)),

$W_{i+1} = W_i - lr \cdot \mathrm{sign}(\mathrm{grad}(W_i))$,    (17)

whereas having momentum not equal to 0 results in the second variant of the Signum optimiser as follows:

$V_{i+1} = \beta V_i - \mathrm{grad}(W_i)$    (18)

$W_{i+1} = W_i - lr \cdot \mathrm{sign}(V_{i+1})$.    (19)

For more information on this optimiser, the readers are referred to Bernstein et al. (2018).

3.7. LBSGD

The Large Batch Stochastic Gradient Descent (LBSGD) optimiser implements a separate learning rate for each layer of the neural network. It uses a technique called Layer-wise Adaptive Rate Scaling (LARS) in order to achieve an individual learning rate for each layer. Other than this aspect, LBSGD often exhibits qualities similar to SGD. For more information on this optimiser, the readers are referred to You et al. (2017).

3.8. DCASGD

Delay Compensated Asynchronous Stochastic Gradient Descent (DCASGD) is one of the optimisers often used for large-scale distributed training problems. DCASGD can exhibit effects similar to sequential SGD, yet with no significant loss in speed when compared to the Asynchronous Stochastic Gradient Descent (ASGD) algorithm. DCASGD leverages a Taylor expansion of the gradient function and the Hessian matrix (Martens, 2010) of the loss function in order to rectify the delayed gradient issue associated with ASGD (see Eq. (20)):

$W_{i+r+1} = W_{i+r} - lr \cdot \left(\mathrm{grad}(W_i) + \lambda \cdot \mathrm{grad}(W_i)^2 \odot (W_{i+r} - W_i)\right)$,    (20)

where $\mathrm{grad}(W_i)$ is the delayed gradient, $W_{i+r}$ is the value of the parameter at the current iteration and $\lambda$ is the scale factor controlling the delay compensation. More on the mathematical formulation of the optimiser can be found in Zheng et al. (2020).

3.9. FTRL

Follow the Regularized Leader (FTRL) is one of the online learning optimisers (Mcmahan and Inc, 2021). There is quite a similarity between FTRL and the variants of gradient descent algorithms, due to the fact that online learning is quite similar to the training of neural networks (Mcmahan and Inc, 2021). The optimiser solves the optimisation problem by minimising an important quantity called the total regret (McMahan et al., 2013). The mathematical formulation of the optimiser can be obtained from McMahan et al. (2013).

3.10. FTML

Follow the Moving Leader (FTML) is closely related to the RMSprop and Adam optimisers in that it incorporates some of their positive characteristics while avoiding most of their pitfalls. In FTRL, the solution to the optimisation problem involves estimating the sum of all previous gradients for every update. However, this is not well suited to the non-convex loss functions of deep learning (Jain and Kar, 2017). This pitfall of FTRL is addressed by FTML, and more information on this can be obtained from Zheng and Kwok (2017).

3.11. SGLD

Stochastic Gradient Langevin Dynamics (SGLD) is one of the Bayesian-formulation-based Stochastic Gradient Descent optimisation approaches. It was proposed by Welling and Teh (2011). The SGLD approach is an iterative optimisation algorithm which adds Gaussian noise and reduces the learning rate towards zero at every update step. By using SGLD, the uncertainty or confidence intervals of the machine learning model parameters can be captured during the model training itself. The weight update in SGLD is done as follows:

$W_{i+1} = W_i + \dfrac{lr_{i+1}}{2} \cdot \mathrm{grad}(W_i) + \eta_{i+1}$,    (21)

where $\eta_{i+1}$ is drawn from a Gaussian distribution with zero mean and a variance of $lr_{i+1}$.
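All of the optimisers listed in this section are registered by name in MXNet, which is what makes it straightforward to swap them in and out of the training of the DeepAR network. The snippet below is a generic sketch of this mechanism rather than the exact GluonTS training internals; `net` stands for any Gluon network and the learning rate is a placeholder assumption.

```python
from mxnet import gluon

# Names of the Section 3 optimisers as registered in MXNet's optimiser registry.
OPTIMISERS = ["adagrad", "rmsprop", "adadelta", "adam", "nadam", "signum",
              "lbsgd", "dcasgd", "ftrl", "ftml", "sgld"]

def make_trainer(net, optimiser_name, lr=1e-3):
    # Attach the chosen optimiser to the parameters of a Gluon network.
    return gluon.Trainer(net.collect_params(), optimiser_name, {"learning_rate": lr})
```

In this way, a common DeepAR architecture can be re-trained once per optimiser, which is how the comparison reported in Section 4 is organised.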

Fig. 1. Quality of Service (QoS) data recorded from various IoT sensors (primary dataset). The top plots show the train data used to train the machine learning models. The
bottom plot shows the test data (the data after the red line) used to assess the accuracy of the models.

4. Results & discussions

Fig. 1 shows the QoS data recorded from the various IoT sensors described in Section 2.1 (i.e., the primary dataset; in the rest of the discussion, the terms 'data' and 'dataset' refer to the primary dataset unless specified explicitly). As the QoS data is recorded from 10 different IoT sensors, 10 different time-series datasets are used to both train and test the machine learning models. Each training dataset contains about 7000 recorded sample points (thus the total number of sample points is about 70000). The actual recorded data is further divided into a training dataset and a validation dataset. The training dataset is used to train the machine learning models, whereas the validation dataset is used to test or validate the accuracy of the machine learning models built. In our case, the actual recorded dataset of 7000 sample points is further divided into 6900 training sample points and 100 testing sample points. The ratio of the training dataset to the validation dataset, in our case, is therefore about 70:1 (i.e., 70 times more sample points are used in the training dataset than in the validation dataset). The rationale behind the selection of this ratio is the fact that deep learning architectures often demand a significantly larger amount of training data compared to traditional methods such as ARIMA (Box and Jenkins, 1968; Díaz-Robles, 2008). The secondary dataset contains 5725 training samples and 100 validation samples for each of the 10 users.

In the context of time-series modelling, the models, once they are built using the training data, are fed with the training data in order to subsequently provide the forecasted values. The forecasted values are then compared with the left-out validation (or test) dataset in order to assess the ability of the models to reproduce the validation dataset. In this work, the model assessment is carried out using the various error metrics described in Section 2.3. The top subplots (in Fig. 1) show the training dataset, whereas the bottom subplots show the validation dataset as a continuation of the training dataset (see the vertical dividing line).

Table 1 provides the quantitative assessment of the accuracy of the DeepAR models built with the various optimisers described in Section 3. To be precise, the table provides the values of the various error metrics computed using the common validation dataset for the common DeepAR architecture trained with different optimisers on the common training dataset. For all the error metrics shown in Table 1, the lower the value, the better the corresponding model accuracy; this, in turn, means that the actual and the forecasted values are quite close to each other (i.e., an intended and desired quality for the trained models). It can be observed that the DeepAR model trained with the FTML optimiser results in the highest accuracy by providing the lowest values for most of the error metrics. Additionally, the model trained with the RMSprop optimiser results in the second-best accuracy across the various error metrics. Besides, it can be noticed that most of the variants of the adaptive learning rate methods (AdaDelta, Adam, Nadam and Signum, except the AdaGrad based model, which results in the third-worst model) produce models with a good level of accuracy. As the name suggests, the adaptive nature of the learning rates of these optimisers can be the reason for their good performance. However, in the case of the AdaGrad optimiser, the accumulation of the squared gradients of the loss function can be blamed for its poor performance.

The Stochastic Gradient Descent based optimisers which are often used in large-scale distributed model training (LBSGD, DCASGD) result in models with the lowest level of accuracy for our problem (except Signum). The usage of the layer-wise learning rate and the delayed update of the node-specific gradients to the global gradient could be some of the reasons for the poor performance of the SGD distributed training optimisers. All the online learning algorithm based optimisers (FTML and FTRL) result in models with good performance. The nature of continual learning and updating of the parameters (i.e., once the true values are known, even after making predictions) could be the reason for their effectiveness in reducing the model loss significantly. Surprisingly, the Bayesian approach based optimiser (SGLD), which is known to provide confidence bounds for the model parameters instead of point estimates, results in a poor model for our problem. Although we can point to few definite reasons for its poor performance, we suspect the large training dataset and the associated training time requirements as some of the potential causes.

Figs. 2 and 3 provide the qualitative assessment of the accuracy of the DeepAR models built with the FTML (the best performing) and the DCASGD (the worst performing) optimisers for our problem. As all of the DeepAR models in this work are multivariate in nature (i.e., a single model is trained with data from all the sensors, and forecasts are made for all the sensors' test data at a single instance), the comparison between the actual and the forecasted QoS values for each sensor is provided separately in Figs. 2 and 3. The x-axis and the y-axis values in each subplot denote the time and the corresponding QoS value, respectively. The blue coloured thick line in each of the plots shows the actual QoS values observed over the period of time given on the horizontal axis. Similarly, the red coloured thick line in each of the plots shows the median of the forecasted QoS values obtained from the DeepAR models for the time-span corresponding to the test data. The red coloured line being close to the blue coloured line in Fig. 2 indicates that the forecasts from the DeepAR model with the FTML optimiser are very accurate. On the contrary, the red and the blue coloured lines are quite far apart in Fig. 3, which indicates that the forecasts from the DeepAR model with the DCASGD optimiser are not so accurate.

Table 1
Error metrics computed using validation dataset for DeepAR models built with various state-of-the-art optimisers (primary dataset).

Optimisers RMSE NRMSE WQuantileLoss[0.1] WQuantileLoss[0.5] WQuantileLoss[0.9]


AdaGrad 138.3502 1.5348 0.4720 0.9380 0.6416
RMSprop 76.0002 0.8431 0.2813 0.4919 0.2352
AdaDelta 67.2446 0.7460 0.2391 0.4767 0.2650
Adam 78.6634 0.8727 0.2253 0.4975 0.2343
Nadam 76.5916 0.8497 0.2188 0.4767 0.2931
Signum 74.2448 0.8237 0.2353 0.5013 0.3031
LBSGD 270.0204 2.9956 3.3709 1.4431 3.1426
DCASGD 3972.1857 44.0669 60.4109 15.3591 58.8736
FTRL 86.8979 0.9640 0.2803 0.5374 0.3633
FTML 62.8523 0.6973 0.2279 0.4417 0.2301
SGLD 1274.6325 14.1406 19.0340 5.2558 18.6897
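Aggregate metrics of the kind reported in Table 1 can be produced directly with GluonTS's backtesting utilities. The following is a hedged sketch assuming the `predictor` trained as in Section 2.2 and a hypothetical `test_ds` dataset that includes the 100 held-out time-steps; the import paths again follow a GluonTS 0.x release.

```python
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions

forecast_it, ts_it = make_evaluation_predictions(
    dataset=test_ds,      # series including the 100 held-out validation steps (assumed)
    predictor=predictor,  # DeepAR predictor trained with a given optimiser (assumed)
    num_samples=100,      # Monte-Carlo sample paths per probabilistic forecast
)
evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])
agg_metrics, item_metrics = evaluator(ts_it, forecast_it)

# Aggregate metrics corresponding to the columns of Table 1.
print(agg_metrics["RMSE"], agg_metrics["NRMSE"],
      agg_metrics["wQuantileLoss[0.1]"],
      agg_metrics["wQuantileLoss[0.5]"],
      agg_metrics["wQuantileLoss[0.9]"])
```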

Fig. 2. Probabilistic forecasts of Quality of Service (QoS) data recorded from various IoT sensors (primary dataset). The trained model uses the DeepAR architecture with the
Follow the Moving Leader (FTML) optimiser. The median predictions (thick red line) provide the most probable forecasts whereas the 50% (dark shades of red) and the 90% (light
shades of red) prediction intervals provide the probable forecasts for the 50% and the 90% of the times, respectively.
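The per-sensor panels in Figs. 2 and 3 follow the usual GluonTS plotting pattern for probabilistic forecasts; a minimal sketch is given below, where `forecast` and `actual_series` are assumed to be one entry each from the backtest iterators shown after Table 1.

```python
import matplotlib.pyplot as plt

# Observed QoS values (blue) followed by the probabilistic forecast (red):
# the median prediction plus shaded 50% and 90% prediction intervals.
actual_series[-300:].plot(color="b")
forecast.plot(prediction_intervals=(50.0, 90.0), color="r")
plt.legend(["observations", "median prediction", "50% interval", "90% interval"])
plt.show()
```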

Fig. 3. Probabilistic forecasts of Quality of Service (QoS) data recorded from various IoT sensors (primary dataset). The trained model uses the DeepAR architecture with the
Delay Compensated Asynchronous Stochastic Gradient Descent (DCASGD) optimiser. The median predictions (thick red line) provide the most probable forecasts whereas the 50%
(dark shades of red) and the 90% (light shades of red) prediction intervals provide the probable forecasts for the 50% and the 90% of the times, respectively.

Table 2
Error metrics computed using validation dataset for DeepAR models built with various state-of-the-art optimisers (secondary dataset).

Optimisers RMSE NRMSE WQuantileLoss[0.1] WQuantileLoss[0.5] WQuantileLoss[0.9]


AdaGrad 205.2272 4.4783 0.8034 3.3792 2.1068
RMSprop 181.9787 3.9710 1.2461 2.2904 2.5484
AdaDelta 183.8060 4.0109 1.3132 2.6880 2.5861
Adam 395.8409 8.6378 3.3331 5.5243 5.8701
Nadam 723.6690 15.7914 2.9535 12.6748 7.9863
Signum 832.1562 18.1588 5.0611 6.2796 12.8864
FTRL 659.6342 14.3941 2.7088 10.8286 7.3565
FTML 699.1053 15.2554 3.2520 10.8247 8.2223

Moreover, the light and the dark shades of red in Figs. 2 and 3 indicate the confidence regions for the forecasts (i.e., the forecasts which may appear 90% and 50% of the time, respectively). It can be clearly observed that both the 50% and the 90% prediction intervals (of the forecasts) encapsulate the actual QoS values in almost all of the plots in Fig. 2. However, this is quite the opposite for the DCASGD optimiser based model, as one can see (from Fig. 3) that both the 50% and the 90% prediction intervals are too far from both the median predictions and the actual QoS values.

Table 2 provides the quantitative assessment of the accuracy of the DeepAR models built with the various optimisers described in Section 3 for the secondary dataset described in Section 2.1. The overall ability of the models to accurately model the secondary dataset is low compared to the primary dataset. The explicitly volatile dynamics of the secondary dataset could be the reason behind the reduced level of performance (Zheng et al., 2010; Zheng et al., 2014). The RMSprop and the adaptive learning rate based optimisers carry over their good performance to the secondary dataset too. The Stochastic Gradient Descent based optimisers failed during the training process.

It is important to note that the DeepAR models built are multivariate models, as mentioned earlier. The multivariate models are quite advantageous over univariate models. With univariate models, a separate model has to be built for each dataset (thus leading to a total of 10 models, one for each IoT sensor-based dataset, or one per user in the case of the secondary dataset). For this reason, the univariate models often demand a large amount of computational and memory cost, which can easily be tackled by the multivariate models. Irrespective of the nature of the models built, there are no restrictions on the length of the future time-steps for which the forecasts need to be made. However, the longer the length of the future time-steps, the more inaccurate the forecasts may become.

5. Conclusions

Deep Neural Network (DNN) based architectures are demonstrated for the 'confidence-interval-based' probabilistic prediction of Quality of Service (QoS) data emanating from various Internet of Things (IoT) devices. The efficacy of the deep neural architectures in building models combining multiple time-series data and making forecasts for all of the time-series involved is also demonstrated. Additionally, the effect of various state-of-the-art optimisers on building accurate deep neural network based models for making probabilistic forecasts involving multiple time-series data is investigated. The results indicate that both the Follow the Moving Leader and the RMSprop optimisers are the best performers (for our primary dataset) in optimising the values of the weights of the DeepAR architectures. The results also indicate that the traditional adaptive learning methods such as AdaDelta, Adam, Nadam and Signum provide more accurate probabilistic models than the LBSGD and DCASGD Stochastic Gradient Descent based optimisers. Our future work will focus on the effect of the values of the optimisers' and the model architecture's individual parameters on the model accuracy.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Armstrong, J., Collopy, F., 1992. Error measures for generalizing about forecasting methods: Empirical comparisons. Int. J. Forecast. 8, 69–80. https://doi.org/10.1016/0169-2070(92)90008-W.
Bernstein, J., Wang, Y.X., Azizzadenesheli, K., Anandkumar, A., 2018. signSGD: Compressed optimisation for non-convex problems. arXiv:1802.04434.
Box, G.E.P., Jenkins, G., 1968. Some recent advances in forecasting and control. J. R. Stat. Soc. Ser. C (Appl. Stat.) 17 (2), 91–109.
Chai, T., Draxler, R., 2014. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geoscientific Model Develop. 7, 1247–1250. https://doi.org/10.5194/gmd-7-1247-2014.
Chen, C., Twycross, J., Garibaldi, J., 2017. A new accuracy measure based on bounded relative error for time series forecasting. PLOS ONE 12, 1–23. https://doi.org/10.1371/journal.pone.0174202.
Díaz-Robles, L.A., 2008. A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile. Atmos. Environ. 42 (35), 8331–8340.
Dozat, T., 2016a. Incorporating Nesterov momentum into Adam. International Conference on Learning Representations, 1–14.
Dozat, T., 2016b. Incorporating Nesterov momentum into Adam.
Duchi, J., Singer, Y., 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159.
Gers, F., Schmidhuber, J., 2001. Applying LSTM to time series predictable through time-window approaches. In: Dorffner, G. (Ed.), Artificial Neural Networks – ICANN 2001 (Proceedings), pp. 669–676.
Graves, A., 2014. Generating sequences with recurrent neural networks. arXiv:1308.0850.
Gregor, K., Danihelka, I., 2015. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Hyndman, R., Koehler, A., Ord, K., Snyder, R., 2008. Forecasting with Exponential Smoothing: The State Space Approach. https://doi.org/10.1007/978-3-540-71918-2.
Jain, P., Kar, P., 2017. Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10, 142–336.
Kaastra, I., Boyd, M., 1996. Designing a neural network for forecasting financial and economic time series. Neurocomputing 10 (3), 215–236.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations, 1–15.
Kourentzes, N., 2013. Intermittent demand forecasts with neural networks. Int. J. Prod. Econ. 143 (1), 198–206.
Martens, J., 2010. Deep learning via Hessian-free optimization. International Conference on Machine Learning, 735–742.
Mcmahan, H.B., Inc, G. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization.
McMahan, H.B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M., Hrafnkelsson, A.M., Boulos, T., Kubica, J., 2013. Ad click prediction: A view from the trenches. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, pp. 1222–1230. https://doi.org/10.1145/2487575.2488200.
Nallur, V., White, G., 2017. Quality of service approaches in IoT: A systematic mapping. J. Syst. Softw. 132, 186–203.
Nesterov, Y., 1983. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady Akademii Nauk SSSR 269, 543–547.
Ruder, S., 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Salinas, D., Flunkert, V., Gasthaus, J., 2019. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv:1704.04110.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958.

Sutskever, I., 2013. Training recurrent neural networks (Ph.D. dissertation). University of Toronto, Ontario, Canada.
Sutskever, I., Vinyals, O., Quoc, L., 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 3104–3112.
Syu, Y., Wang, C.M., 2021. QoS time series modeling and forecasting for web services: A comprehensive survey. IEEE Trans. Netw. Serv. Manage. 18, 926–944. https://doi.org/10.1109/TNSM.2021.3056399.
Tieleman, T., Hinton, G., 2012. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 26–31.
Welling, M., Teh, Y.W., 2011. Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning, 681–688.
White, G., Palade, A., Clarke, S., 2018. Forecasting QoS attributes using LSTM networks. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. https://doi.org/10.1109/IJCNN.2018.8489052.
Xia, F., 2008. QoS challenges and opportunities in wireless sensor/actuator networks. Sensors 8, 1099–1110. https://doi.org/10.3390/s8021099. URL: https://www.mdpi.com/1424-8220/8/2/1099.
You, Y., Gitman, I., Ginsburg, B., 2017. Large batch training of convolutional networks. arXiv:1708.03888.
Zeiler, M.D., 2012. ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
Zhang, G., Patuwo, B.E., Hu, M.Y., 1998. Forecasting with artificial neural networks: The state of the art. Int. J. Forecast. 14 (1), 35–62. https://doi.org/10.1016/S0169-2070(97)00044-7.
Zheng, S., Kwok, J.T., 2017. Follow the moving leader in deep learning. In: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, PMLR, pp. 4110–4119. URL: http://proceedings.mlr.press/v70/zheng17a.html.
Zheng, Z., Zhang, Y., Lyu, M.R., 2010. Distributed QoS evaluation for real-world web services. In: 2010 IEEE International Conference on Web Services, pp. 83–90. https://doi.org/10.1109/ICWS.2010.10.
Zheng, Z., Zhang, Y., Lyu, M.R., 2014. Investigating QoS of real-world web services. IEEE Trans. Serv. Comput. 7, 32–39. https://doi.org/10.1109/TSC.2012.34.
Zheng, S., Meng, Q., Wang, T., Chen, W., Yu, N., Ma, Z.M., Liu, T.Y., 2020. Asynchronous stochastic gradient descent with delay compensation. arXiv:1609.08326.
