
Neural basis expansion analysis with exogenous variables:

Forecasting electricity prices with NBEATSx


Kin G. Olivares^a, Cristian Challu^a, Grzegorz Marcjasz^b, Rafal Weron^b, Artur Dubrawski^a

^a Auton Lab, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
^b Department of Operations Research and Business Intelligence, Wroclaw University of Science and Technology, Wroclaw, Poland

arXiv:2104.05522v6 [cs.LG] 4 Apr 2022

Abstract
We extend the neural basis expansion analysis (NBEATS) to incorporate exogenous factors.
The resulting method, called NBEATSx, improves on a well performing deep learning model,
extending its capabilities by including exogenous variables and allowing it to integrate mul-
tiple sources of useful information. To showcase the utility of the NBEATSx model, we
conduct a comprehensive study of its application to electricity price forecasting (EPF) tasks
across a broad range of years and markets. We observe state-of-the-art performance, signifi-
cantly improving the forecast accuracy by nearly 20% over the original NBEATS model, and
by up to 5% over other well established statistical and machine learning methods specialized
for these tasks. Additionally, the proposed neural network has an interpretable configura-
tion that can structurally decompose time series, visualizing the relative impact of trend
and seasonal components and revealing the modeled processes’ interactions with exogenous
factors. To assist related work we made the code available in a dedicated repository.
Keywords: Deep Learning, NBEATS and NBEATSx models, Interpretable neural
network, Time series decomposition, Fourier series, Electricity price forecasting

1. Introduction
In the last decade, significant progress has been made in the application of deep learning
to forecasting tasks, with models such as the exponential smoothing recurrent neural network
(ESRNN; Smyl 2020) and the neural basis expansion analysis (NBEATS; Oreshkin et al. 2020),
outperforming classical statistical approaches in the recent M4 competition (Makridakis
et al., 2020). Despite this success we still identify two possible improvements, namely the
integration of time-dependent exogenous variables as their inputs and the interpretability of
the neural network outputs.


Corresponding author: Kin G. Olivares ([email protected])
Neural networks have proven powerful and flexible, yet there are several situations where
our understanding of the model’s predictions can be as crucial as their accuracy, which
constitutes a barrier for their wider adoption. The interpretability of the algorithm’s outputs
is critical because it encourages trust in its predictions, improves our knowledge of the
modeled processes, and provides insights that can improve the method itself.
Additionally, the absence of time-dependent covariates makes these powerful models
unsuitable for many applications. For instance, Electricity Price Forecasting (EPF) is a task
where covariate features are fundamental to obtain accurate predictions. For this reason, we
chose this challenging application as a test ground for our proposed forecasting methods.
In this work, we address the two mentioned limitations by first extending the neural basis
expansion analysis, allowing it to incorporate temporal and static exogenous variables. And
second, by further exploring the interpretable configuration of NBEATS and showing its use
as a time-series signal decomposition tool. We refer to the new method as NBEATSx. The
main contributions of this paper include:

(i) Incorporation of Exogenous Variables: We propose improvements to the NBEATS
model to incorporate time-dependent as well as static exogenous variables. For this
purpose, we have designed a special substructure built with convolutions, to clean and
encode useful information from these covariates, while respecting time dependencies
present in the data. These enhancements greatly improve the accuracy of the NBEATS
method, and extend its interpretability capabilities, so rare in neural forecasting.
(ii) Interpretable Time Series Signal Decomposition: Our method combines the
power of non-linear transformations provided by neural networks with the flexibility to
model multiple seasonalities and simultaneously account for interaction events such as
holidays and other covariates, all while remaining interpretable. The extended NBEATSx
architecture makes it possible to decompose its predictions into the classic set of level, trend, and
seasonality components, and to identify the effects of exogenous covariates.
(iii) Time Series Forecasting Comparison: We showcase the use of NBEATSx model on
five EPF tasks achieving state-of-the-art performance on all of the considered datasets.
We obtain accuracy improvements of almost 20% in comparison to the original NBEATS
and ESRNN architectures, and up to 5% over other well-established machine learning,
EPF-tailored methods (Lago et al., 2021a).

The remainder of the paper is structured as follows. Section 2 reviews relevant literature
on the developments and applications of deep learning to sequence modeling and current
approaches to EPF. Section 3 introduces mathematical notation and describes the NBEATSx
model. Section 4 explores our model’s application to time series decomposition and forecast-
ing over a broad range of electricity markets and time periods. Finally, Section 5 discusses
possible directions for future research, wraps up the results, and concludes the paper.

2. Literature Review
2.1. Deep Learning and Sequence Modeling
Deep learning (DL) has demonstrated significant utility in solving sequence modeling
problems, with applications to natural language processing, audio signal processing, and
computer vision. This subsection summarizes the critical DL developments in sequence
modeling that are building blocks of the NBEATS and ESRNN architectures.
For a long time, sequence modeling with neural networks and Recurrent Neural Net-
works (RNNs; Elman 1990) were treated as synonyms. The hidden internal activations of
the RNNs propagated through time provided these models with the ability to encode the
observed past of the sequence. This explains their great popularity in building different vari-
ants of the Sequence-to-Sequence models (Seq2Seq) applied to natural language processing
(Graves, 2013), and machine translation (Sutskever et al., 2014). Most progress on RNNs
was made possible by architectural innovations and novel training techniques that made
their optimization easier.
The adoption of convolutions and skip-connections within the recurrent structures were
important precursors for new advancements in sequence modeling, as deeper representations
endowed the models with longer effective memory. Examples of such precursors could
be found in WaveNet for audio generation and machine translation (van den Oord et al.,
2016), as well as the Dilated Recurrent Neural Network (DilRNN; Chang et al. 2017) and the
Temporal Convolutional Network (TCN; Bai et al. 2018).
Nowadays, Seq2Seq models and their derivatives can learn complex nonlinear temporal
dependencies efficiently; their use in the time series analysis domain has been a great success.
Seq2Seq models have recently shown better forecasting performance than classical
statistical methods, while greatly simplifying the forecasting systems into single-box mod-
els, such as the Multi Quantile Convolutional Neural Network (MQCNN; Wen et al. 2017), the
Exponential Smoothing Recurrent Neural Network (ESRNN; Smyl 2020), or the Neural Basis
Expansion Analysis (NBEATS; Oreshkin et al. 2020). For quite a while, academia resisted
broadly adopting these new methods (Makridakis et al., 2018), although their evident
success in challenges such as the M4 competition has motivated their wider adoption by the
forecasting research community (Benidis et al., 2020).

2.2. Electricity Price Forecasting


The Electricity Price Forecasting (EPF) task aims at predicting the spot (balancing,
intraday, day-ahead) and forward prices in wholesale markets. Since the workhorse of short-
term power trading is the day-ahead market, with its once-per-day uniform-price
auction (Mayer & Trück, 2018), the vast majority of research has focused on predicting
electricity prices for the 24 hours of the next day, either in a point (Weron, 2014; Lago et al.,
2021a) or a probabilistic setting (Nowotarski & Weron, 2018). There also are studies on
EPF for very short-term (Narajewski & Ziel, 2020), as well as mid- and long-term horizons
(Ziel & Steinert, 2018). The recent expansion of renewable energy generation and large-scale
battery storage has induced complex dynamics to the already volatile electricity spot prices,

turning the field into a prolific subject on which to test novel forecasting ideas and trading
strategies (Chitsaz et al., 2018; Gianfreda et al., 2020; Uniejewski & Weron, 2021).
Out of the numerous approaches to EPF developed over the last two decades, two classes
of models are of particular importance when predicting day-ahead prices – statistical (also
called econometric or technical analysis), in most cases based on linear regression, and
computational intelligence (also referred to as artificial intelligence, non-linear or machine
learning), with neural networks being the fundamental building block. Among the latter,
many of the recently proposed methods utilize deep learning (Wang et al. 2017; Lago et al.
2018a; Marcjasz 2020), or are hybrid solutions, that typically comprise data decomposition,
feature selection, clustering, forecast averaging and/or heuristic optimization to estimate
the model (hyper)parameters (Nazar et al., 2018; Li & Becker, 2021).
Unfortunately, as argued by Lago et al. (2021a), the majority of neural network EPF
research suffers from test periods that are too short and limited to a single market, a lack of
well-performing and established benchmark methods, and/or incomplete descriptions of the
pipeline and training methodology, resulting in poor reproducibility of the results. To address
these shortcomings, our models are compared across two-year out-of-sample periods from
five power markets and using two highly competitive benchmarks recommended in previous
studies: the Lasso Estimated Auto-Regressive (LEAR) model and a (relatively) parsimonious
Deep Neural Network (DNN).

3. NBEATSx Model
As a general overview, the NBEATSx framework decomposes the objective signal by per-
forming separate local nonlinear projections of the target data onto basis functions across
its different blocks. Figure 1 depicts the general architecture of the model. Each block con-
sists of a Fully Connected Neural Network (FCNN; Rosenblatt 1961) which learns expansion
coefficients for the backcast and forecast elements. The backcast model is used to clean the
inputs of subsequent blocks, while the forecasts are summed to compose the final prediction.
The blocks are grouped in stacks. Each of the potentially multiple stacks specializes in a
different variant of basis functions.
To continue the description of the NBEATSx, we introduce the following notation: the objective signal is represented by the vector $\mathbf{y}$, and the inputs for the model are the backcast window vector $\mathbf{y}^{back}$ of length $L$ and the forecast window vector $\mathbf{y}^{for}$ of length $H$, where $L$ denotes the length of the lags available as classic autoregressive features and $H$ is the forecast horizon treated as the objective. While the original NBEATS only admits as regressor the backcast period of the target variable $\mathbf{y}^{back}$, the NBEATSx incorporates covariates in its analysis, denoted with the matrix $\mathbf{X}$. Figure 1 shows an example where the target variable is the hourly electricity price, the backcast vector has a length $L$ of 96 hours, and the forecast horizon $H$ is 72 hours; in the example, the covariate matrix $\mathbf{X}$ is composed of wind power production and electricity load. For the EPF comparative analysis of Section 4.3.6, the horizon considered is $H = 24$, corresponding to day-ahead predictions, while the backcast input length $L = 168$ corresponds to a week of lagged values.

Figure 1: Building blocks of the NBEATSx are structured as a system of multilayer fully connected networks with ReLU-based nonlinearities. Blocks overlap using the doubly residual stacking principle for the backcast $\hat{\mathbf{y}}^{back}_{s,b}$ and forecast $\hat{\mathbf{y}}^{for}_{s,b}$ outputs of the $b$-th block within the $s$-th stack. The final predictions $\hat{\mathbf{y}}^{for}$ are composed by aggregating the outputs of the stacks.

For its predictions, the NBEATS model only receives a local vector of inputs corresponding
to the backcast period, making the computations exceptionally fast. The model can still
represent longer time dependencies through its local inputs from the exogenous variables;
for example, it can learn long seasonal effects from calendar variables.
Overall, as shown in Figure 1, the NBEATSx is composed of $S$ stacks of $B$ blocks each. The
input of the first block consists of the backcast $\mathbf{y}^{back}$, containing $L$ lags of the target time series $\mathbf{y}$, and the exogenous
matrix $\mathbf{X}$, while the inputs of each of the subsequent blocks include residual connections with
the backcast output of the previous block. We will describe in detail in the next subsections
the blocks, stacks, and model predictions.

3.1. Blocks
For a given s-th stack and b-th block within it, the NBEATSx model performs two trans-
formations, depicted in the blue rectangle of Figure 1. The first transformation, defined in
Equation (1), takes the input data $(\mathbf{y}^{back}_{s,b-1}, \mathbf{X}_{s,b-1})$ and applies a Fully Connected Neural Network (FCNN; Rosenblatt 1961) to learn hidden units $\mathbf{h}_{s,b} \in \mathbb{R}^{N_h}$ that are linearly adapted into the forecast $\boldsymbol{\theta}^{for}_{s,b} \in \mathbb{R}^{N_s}$ and backcast $\boldsymbol{\theta}^{back}_{s,b} \in \mathbb{R}^{N_s}$ expansion coefficients, with $N_s$ the dimension of the stack basis:

$$\mathbf{h}_{s,b} = \text{FCNN}_{s,b}\left(\mathbf{y}^{back}_{s,b-1}, \mathbf{X}_{s,b-1}\right), \qquad \boldsymbol{\theta}^{back}_{s,b} = \text{LINEAR}^{back}(\mathbf{h}_{s,b}), \qquad \boldsymbol{\theta}^{for}_{s,b} = \text{LINEAR}^{for}(\mathbf{h}_{s,b}) \tag{1}$$

The second transformation, defined in Equation (2), consists of a basis expansion operation between the learnt coefficients and the block's basis vectors $\mathbf{V}^{back}_{s,b} \in \mathbb{R}^{L \times N_s}$ and $\mathbf{V}^{for}_{s,b} \in \mathbb{R}^{H \times N_s}$; this transformation results in the backcast $\hat{\mathbf{y}}^{back}_{s,b}$ and forecast $\hat{\mathbf{y}}^{for}_{s,b}$ components:

$$\hat{\mathbf{y}}^{back}_{s,b} = \mathbf{V}^{back}_{s,b}\,\boldsymbol{\theta}^{back}_{s,b} \quad \text{and} \quad \hat{\mathbf{y}}^{for}_{s,b} = \mathbf{V}^{for}_{s,b}\,\boldsymbol{\theta}^{for}_{s,b} \tag{2}$$
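As a concrete illustration of these two transformations, the following PyTorch sketch implements a single block with a fixed basis. The class and argument names are illustrative, and flattening the exogenous matrix into the FCNN input is a simplification rather than the released implementation.

```python
import torch
import torch.nn as nn

class NBEATSxBlock(nn.Module):
    """One NBEATSx block: FCNN -> expansion coefficients -> basis projection (Eqs. 1-2)."""

    def __init__(self, backcast_size, exog_size, basis_back, basis_fore,
                 hidden_units=256, n_layers=2):
        super().__init__()
        # Fixed basis matrices V^back (L x Ns) and V^for (H x Ns).
        self.register_buffer("V_back", basis_back)
        self.register_buffer("V_fore", basis_fore)
        layers, in_dim = [], backcast_size + exog_size
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_units), nn.ReLU()]
            in_dim = hidden_units
        self.fcnn = nn.Sequential(*layers)                       # learns h_{s,b}
        self.theta_back = nn.Linear(hidden_units, basis_back.shape[1])
        self.theta_fore = nn.Linear(hidden_units, basis_fore.shape[1])

    def forward(self, y_back, x_flat):
        # Eq. (1): hidden units and their linear adaptation to expansion coefficients.
        h = self.fcnn(torch.cat([y_back, x_flat], dim=-1))
        theta_b, theta_f = self.theta_back(h), self.theta_fore(h)
        # Eq. (2): basis expansion into backcast and forecast components.
        return theta_b @ self.V_back.T, theta_f @ self.V_fore.T  # (batch, L), (batch, H)

# Illustrative instantiation matching the Figure 1 example: L = 96, H = 72,
# two exogenous variables over the forecast window, identity (generic) bases.
block = NBEATSxBlock(backcast_size=96, exog_size=2 * 72,
                     basis_back=torch.eye(96), basis_fore=torch.eye(72))
```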

3.2. Stacks and Residual Connections


The blocks are organized into stacks using the doubly residual stacking principle, which
is described in Equation (3) and depicted in the brown rectangle of Figure 1. The residual
backcast $\mathbf{y}^{back}_{s,b+1}$ allows the model to subtract the component associated with the basis of the $(s,b)$-th stack and block, $\mathbf{V}^{back}_{s,b}$, from $\mathbf{y}^{back}$, which can also be thought of as a sequential decomposition of the modeled signal. In turn, this methodology helps the optimization procedure, as it prepares the inputs of the subsequent layer and makes the downstream forecast easier. The stack forecast $\hat{\mathbf{y}}^{for}_{s}$ aggregates the partial forecasts from each block:

$$\mathbf{y}^{back}_{s,b+1} = \mathbf{y}^{back}_{s,b} - \hat{\mathbf{y}}^{back}_{s,b} \quad \text{and} \quad \hat{\mathbf{y}}^{for}_{s} = \sum_{b=1}^{B} \hat{\mathbf{y}}^{for}_{s,b} \tag{3}$$

3.3. Model predictions


The final predictions $\hat{\mathbf{y}}^{for}$ of the model, shown in the yellow rectangle of Figure 1, are obtained by summation of all the stack predictions:

$$\hat{\mathbf{y}}^{for} = \sum_{s=1}^{S} \hat{\mathbf{y}}^{for}_{s} \tag{4}$$

The additive generation of the forecast implies a very intuitive decomposition of the
prediction components when the bases within the blocks are interpretable.
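A minimal sketch of how Equations (3) and (4) combine the block outputs, assuming blocks with the interface of the block sketched in Section 3.1:

```python
def nbeatsx_forward(stacks, y_back, x_flat):
    """Doubly residual stacking (Eq. 3) and global forecast (Eq. 4).

    `stacks` is a list of stacks, each a list of blocks exposing
    block(residual, x_flat) -> (backcast, forecast)."""
    residual = y_back                                        # running backcast residual
    global_forecast = 0.0
    for stack in stacks:
        stack_forecast = 0.0
        for block in stack:
            backcast, forecast = block(residual, x_flat)
            residual = residual - backcast                   # remove what this block explains
            stack_forecast = stack_forecast + forecast       # Eq. (3): sum of block forecasts
        global_forecast = global_forecast + stack_forecast   # Eq. (4): sum over stacks
    return global_forecast
```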

3.4. NBEATSx Configurations
The original neural basis expansion analysis method proposed two configurations based
on the assumptions encoded in the learning algorithm by selecting the basis vectors $\mathbf{V}^{back}_{s,b}$
and $\mathbf{V}^{for}_{s,b}$ used in the blocks from Equation (2). A mindful selection of restrictions on the
basis allows the model to output an interpretable decomposition of the forecasts, while
allowing the basis to be freely determined can produce more flexible forecasts by effectively
removing any constraints on the form of the basis functions.
In this subsection, we present both interpretable and generic configurations, explaining
in particular how we propose to include the covariates in each case. We limit ourselves to
the analysis of the forecast basis, as the backcast basis analysis is almost identical, only
differing by its extension over time. We show an example in Appendix A.1.

3.4.1. Interpretable Configuration


The choice of basis vectors relies on time series decomposition techniques that are often
used to understand the structure of a given time series and patterns of its variation. Work
in this area ranges from classical smoothing methods and their extensions such as X-11-
ARIMA, X-12-ARIMA, and X-13-ARIMA-SEATS, to modern approaches such as TBATS (Livera
et al., 2011). To encourage interpretability, the blocks within each stack may use harmonic
functions, polynomial trends, and exogenous variables directly to perform their projections.
The partial forecasts of the interpretable configuration are described by Equations (5)-(7).
$$\hat{\mathbf{y}}^{trend}_{s,b} = \sum_{i=0}^{N_{pol}} \mathbf{t}^{i}\,\theta^{trend}_{s,b,i} \equiv \mathbf{T}\,\boldsymbol{\theta}^{trend}_{s,b} \tag{5}$$

$$\hat{\mathbf{y}}^{seas}_{s,b} = \sum_{i=0}^{\lfloor H/2-1 \rfloor} \cos\!\left(2\pi i \tfrac{\mathbf{t}}{N_{hr}}\right)\theta^{seas}_{s,b,i} + \sin\!\left(2\pi i \tfrac{\mathbf{t}}{N_{hr}}\right)\theta^{seas}_{s,b,i+\lfloor H/2 \rfloor} \equiv \mathbf{S}\,\boldsymbol{\theta}^{seas}_{s,b} \tag{6}$$

$$\hat{\mathbf{y}}^{exog}_{s,b} = \sum_{i=0}^{N_{x}} \mathbf{X}_{i}\,\theta^{exog}_{s,b,i} \equiv \mathbf{X}\,\boldsymbol{\theta}^{exog}_{s,b} \tag{7}$$

where the time vector $\mathbf{t} = [0, 1, 2, \ldots, H-2, H-1]^{\top}/H$ is defined discretely. When the basis $\mathbf{V}^{for}_{s,b}$ is $\mathbf{T} = [\mathbf{1}, \mathbf{t}, \ldots, \mathbf{t}^{N_{pol}}] \in \mathbb{R}^{H \times (N_{pol}+1)}$, where $N_{pol}$ is the maximum polynomial degree, the coefficients are those of a trend polynomial model. When the basis $\mathbf{V}^{for}_{s,b}$ is the harmonic matrix $\mathbf{S} = [\mathbf{1}, \cos(2\pi \tfrac{\mathbf{t}}{N_{hr}}), \ldots, \cos(2\pi \lfloor H/2-1 \rfloor \tfrac{\mathbf{t}}{N_{hr}}), \sin(2\pi \tfrac{\mathbf{t}}{N_{hr}}), \ldots, \sin(2\pi \lfloor H/2-1 \rfloor \tfrac{\mathbf{t}}{N_{hr}})] \in \mathbb{R}^{H \times (H-1)}$, the coefficient vector $\boldsymbol{\theta}^{for}_{s,b}$ can be interpreted as Fourier transform coefficients; the hyperparameter $N_{hr}$ controls the harmonic oscillations. The exogenous basis expansion can be thought of as a time-varying local regression when the basis is the matrix $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_{N_x}] \in \mathbb{R}^{H \times N_x}$, where $N_x$ is the number of exogenous variables. The resulting models can flexibly reflect common structural assumptions, in particular using the interpretable bases, as well as their combinations.
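For illustration, the trend and seasonality bases of Equations (5) and (6) can be generated deterministically from the horizon, as in the NumPy sketch below; the function names are ours, and the $i = 0$ harmonic contributes the constant cosine column together with an all-zero sine column that can be dropped.

```python
import numpy as np

def trend_basis(horizon, n_pol):
    """T = [1, t, ..., t^{N_pol}] with t = [0, ..., H-1] / H  (Eq. 5)."""
    t = np.arange(horizon) / horizon
    return np.stack([t ** i for i in range(n_pol + 1)], axis=1)   # (H, N_pol + 1)

def seasonality_basis(horizon, n_hr):
    """Harmonic cosine/sine basis with frequencies i = 0, ..., floor(H/2 - 1)  (Eq. 6)."""
    t = np.arange(horizon) / horizon
    freqs = np.arange(horizon // 2)
    cos = np.stack([np.cos(2 * np.pi * i * t / n_hr) for i in freqs], axis=1)
    sin = np.stack([np.sin(2 * np.pi * i * t / n_hr) for i in freqs], axis=1)
    return np.concatenate([cos, sin], axis=1)

# Example: day-ahead horizon H = 24, cubic trend, smoothest harmonics (N_hr = 1).
T = trend_basis(24, n_pol=3)        # shape (24, 4)
S = seasonality_basis(24, n_hr=1)   # shape (24, 24)
```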
In this paper, we propose including one more type of stack to specifically represent the
exogenous variable basis, as described in Equation (7) and depicted in Figure 1. In the
original NBEATS framework (Oreshkin et al., 2020), the interpretable configuration usually
consists of a trend stack followed by a seasonality stack, each containing three blocks. Our
NBEATSx extension of this configuration consists of three stacks, one for each type of factor
(trend, seasonal, exogenous). We refer to the original interpretable configuration and its
enhanced counterpart as the NBEATS-I and NBEATSx-I models, respectively.

3.4.2. Generic Configuration


For the generic configuration, the basis of the non-linear projection in Equation (2) corresponds to canonical vectors, that is, $\mathbf{V}^{for}_{s,b} = \mathbf{I}_{H \times H}$, an identity matrix of dimensionality equal to the forecast horizon $H$ that matches the coefficients' cardinality $|\boldsymbol{\theta}^{for}_{s,b}| = H$:

$$\hat{\mathbf{y}}^{gen}_{s,b} = \mathbf{V}^{for}_{s,b}\,\boldsymbol{\theta}^{for}_{s,b} = \boldsymbol{\theta}^{for}_{s,b} \tag{8}$$

This basis enables NBEATSx to effectively behave like a classic Fully Connected Neural
Network (FCNN). The output layer of the FCNN inside each block has $H$ neurons that correspond
to the forecast horizon, each producing the forecast for one particular time point
of the forecast period. This can be understood as the basis vectors being learned during
optimization, allowing the waveform of the basis of each stack to be freely determined in a
data-driven fashion. Compared to the interpretable counterpart described in Section 3.4.1,
the constraints on the form of the basis functions are removed. This affords the generic
variant more flexibility and power in representing complex data, but it can also lead to less
interpretable outcomes and a potentially higher risk of overfitting.
For the NBEATSx model with the generic configuration, we propose a new type of exogenous block that learns a context vector $\mathbf{C}_{s,b}$ from the time-dependent covariates with an encoder convolutional sub-structure:

$$\hat{\mathbf{y}}^{exog}_{s,b} = \sum_{i=1}^{N_c} \mathbf{C}_{s,b,i}\,\theta^{for}_{s,b,i} \equiv \mathbf{C}_{s,b}\,\boldsymbol{\theta}^{for}_{s,b} \quad \text{with} \quad \mathbf{C}_{s,b} = \text{TCN}(\mathbf{X}) \tag{9}$$

In the previous equation, a Temporal Convolutional Network (TCN; Bai et al. 2018) is em-
ployed as an encoder, but any neural network with a sequential structure will be compatible
with the backcast and forecast branches of the model, and could be used as an encoder. For
example, the WaveNet (van den Oord et al., 2016) can be an effective alternative to RNNs
as it is also able to capture long-term dependencies and interactions of covariates by stacking
multiple layers, while dilations keep the model computationally tractable. In
addition, convolutions have a very convenient interpretation as weighted moving-average
signal filters. The final linear projection and the additive composition of the predictions can
be interpreted as a decoder.
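A minimal sketch of such an exogenous block follows. A small stack of dilated causal convolutions stands in for the full TCN encoder, and, for brevity, the expansion coefficients are computed from the exogenous window alone rather than from the complete block input of Equation (1); all names are illustrative.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated convolution whose output at time t only depends on inputs up to t."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(x)[..., :-self.pad]     # chomp the right (future) side

class ExogenousBlock(nn.Module):
    """Exogenous generic block of Eq. (9): an encoder produces the context C_{s,b},
    and learned coefficients theta^for weight its channels to form the forecast."""
    def __init__(self, n_exog, horizon, n_channels=8, hidden_units=256):
        super().__init__()
        self.encoder = nn.Sequential(
            CausalConv1d(n_exog, n_channels, dilation=1), nn.ReLU(),
            CausalConv1d(n_channels, n_channels, dilation=2), nn.ReLU(),
            CausalConv1d(n_channels, n_channels, dilation=4), nn.ReLU())
        self.coeffs = nn.Sequential(
            nn.Linear(horizon * n_exog, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, n_channels))             # theta^for in Eq. (9)

    def forward(self, x):                        # x: (batch, n_exog, horizon)
        context = self.encoder(x)                # C_{s,b}: (batch, n_channels, horizon)
        theta = self.coeffs(x.flatten(1))        # (batch, n_channels)
        return torch.einsum("bc,bch->bh", theta, context)    # forecast: (batch, horizon)

# Example: 2 exogenous variables (e.g., load and wind forecasts) over a 24-hour horizon.
forecast = ExogenousBlock(n_exog=2, horizon=24)(torch.randn(32, 2, 24))
```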
The original NBEATS configuration includes only one generic stack with dozens of blocks,
while our proposed model includes both the generic and exogenous stacks, with the order
determined via data-driven hyperparameter tuning. We refer to this configuration as the
NBEATSx-G model.
3.4.3. Exogenous Variables
We distinguish the exogenous variables by whether they reflect static or time-dependent
aspects of the modeled data. The static exogenous variables carry time-invariant informa-
tion. When the model is built with common parameters to forecast multiple time series,
these variables allow sharing information within groups of time series with similar static vari-
able levels. Examples of static variables include designators such as identifiers of regions or
groups of products.
As for the time-dependent exogenous covariates, we discern two subtypes. First, we
consider seasonal covariates from the natural frequencies in the data. These variables are
useful for NBEATSx to identify seasonal patterns and special events inside and outside the
window lookback periods. Examples of these are the trends and harmonic functions from
Equation (5) and Equation (6). Second, we identify domain-specific temporal covariates
unique to each problem. The EPF setting typically includes day-ahead forecasts of electricity
load and production levels from renewable energy sources.

4. Empirical Evaluation
4.1. Electricity Price Forecasting Datasets
To evaluate our method’s forecasting capabilities, we consider short-term electricity price
forecasting tasks, where the objective is to predict day-ahead prices. Five major power
markets1 are used in the empirical evaluation, all comprised of hourly observations of the
prices and two influential temporal exogenous variables that extend for 2,184 days (312
weeks, six years). From the six years of available data for each market, we hold two years
out, to test the forecasting performance of the algorithms. The length and diversity of
the test sets allow us to obtain accurate and highly comprehensive measurements of the
robustness and the generalization capabilities of the models.
Table 1 summarizes the key characteristics of each market. The Nord Pool electric-
ity market (NP), which corresponds to the Nordic countries exchange, contains the hourly
prices and day-ahead forecasts of load and wind generation. The second dataset is the
Pennsylvania-New Jersey-Maryland market in the United States (PJM), which contains hourly
zonal prices in the Commonwealth Edison (COMED) and two day-ahead forecasts of load
at the system and COMED zonal levels. The remaining three markets are obtained from
the integrated European Power Exchange (EPEX). Belgium (EPEX-BE) and France (EPEX-FR)
markets share the day-ahead forecast generation in France as covariates since it is known to
be one of the best predictors for Belgian prices (Lago et al., 2018b). Finally, the German
market (EPEX-DE) contains the hourly prices, day-ahead load forecasts, and the country level
wind and solar generation day-ahead forecast.
Figure 2 displays the NP electricity price time series and its corresponding covariate
variables to illustrate the datasets. The NP market is the least volatile among the considered
markets, since most of its power comes from hydroelectric generation, renewable source

¹ For the sake of reproducibility we only consider datasets that are openly accessible in the EPFtoolbox library, https://github.com/jeslago/epftoolbox (Lago et al., 2021a).
Table 1: Datasets used in our empirical study. For the five day-ahead electricity markets considered, we
report the test period dates and two influential covariate variables.

Market Exogenous Variable 1 Exogenous Variable 2 Test Period


NP day-ahead load day-ahead wind generation 27-12-2016 to 24-12-2018
PJM 2 day-ahead system load 2 day-ahead COMED load 27-12-2016 to 24-12-2018
EPEX-FR day-ahead load day-ahead total France generation 04-01-2015 to 31-12-2016
EPEX-BE day-ahead load day-ahead total France generation 04-01-2015 to 31-12-2016
EPEX-DE day-ahead zonal load day-ahead wind and solar generation 04-01-2016 to 31-12-2017

volatility is negligible, and zero spikes are rare. The PJM market is transitioning from coal
generation to natural gas and some renewable sources, zero spikes are rare, but the system
exhibits higher volatility than NP. In EPEX-BE and EPEX-FR markets, negative prices and
spikes are more frequent, and as time passes, these markets begin to show increasing signs
of integration. Finally, the EPEX-DE market shows few price spikes, but the most frequent
negative and zero price events, due in great part to the impact of renewable sources.
The exogenous covariates are normalized following best practices drawn from the EPF
literature (Uniejewski et al., 2018). Preprocessing the inputs of neural networks is essential
to accelerate and stabilize the optimization (LeCun et al., 1998).
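As an illustration of this preprocessing step, the sketch below centers a price series by its median, scales it by the median absolute deviation, and optionally applies the arcsinh variance-stabilizing transformation. This mirrors the kind of normalization strategies listed in Table 2 (median, invariant, std), but it is our own simplified rendition rather than the exact pipeline used in the experiments.

```python
import numpy as np

def median_mad_normalize(series, use_asinh=True):
    """Center by the median, scale by the MAD, optionally apply arcsinh.

    Returns the transformed series together with the (median, mad) pair
    needed to invert the transformation on the model's forecasts."""
    median = np.median(series)
    mad = np.median(np.abs(series - median)) or 1.0     # guard against a zero MAD
    z = (series - median) / mad
    return (np.arcsinh(z) if use_asinh else z), (median, mad)

def denormalize(z, stats, use_asinh=True):
    median, mad = stats
    x = np.sinh(z) if use_asinh else z
    return x * mad + median
```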

4.2. Interpretable Time Series Signal Decomposition


In this subsection, we demonstrate the versatility of the proposed method and show how
a careful selection of the inductive bias, constituted by the assumptions used to learn the
modeled signal, endows NBEATSx with an outstanding ability to model complex dynamics
while enabling human understanding of its outputs, turning it into a unique and exciting
tool for time series analysis. Our method combines the power of non-linear transforma-
tions provided by neural networks with the flexibility to model multiple seasonalities, which can
be fractional, and simultaneously account for interaction events such as holidays and other
covariates. As described earlier, the interpretable configuration of the NBEATSx architecture
computes time-varying coefficients for slowly changing polynomial functions to model the
trend, harmonic functions to model the cyclical behavior of the signal, and exogenous co-
variates. Here, we show how this configuration can decompose a time series into the classic
set of level, trend, and seasonality components, while identifying the covariate effects.
In this time series signal decomposition example, we show how the NBEATSx-I model
benefits over NBEATS-I from explicitly accounting for information carried by exogenous co-
variates. Figure 3 shows the NP electricity market’s hourly price (EUR/MWh), for December
18, 2017 which is a day with high prices due to high load. Other days have a less pronounced
difference between the results obtained with the original NBEATS-I and the NBEATSx-I. We
selected a day with a higher than normal load for exposition purposes, to demonstrate quali-
tative differences in the forecasts. We can see a substantial difference in the forecast residual
magnitudes in the bottom row of Figure 3. NBEATS shows a strong negative bias. On the
other hand, NBEATSx-I is able to capture the evidently substantial explanatory value of the
exogenous features, resulting in a much more accurate forecast.


Figure 2: The top panel shows the day-ahead electricity price time series for the NordPool (NP) market.
The second and third panels show the day-ahead forecast for the system load and wind generation. The
training data is composed of the first four years of each dataset. The validation set is the year that follows
the training data (between the first and second dotted lines). For the held-out test set, the last two years
of each dataset are used (marked by the second dotted line). During evaluation, we recalibrate the model
updating the training set to incorporate all available data before each daily prediction. The recalibration
uses an early stopping set of 42 weeks randomly chosen from the updated training set (a sample selection is
marked with blue rectangles in the top panel).

[Figure 3: two columns of panels, (a) NBEATS and (b) NBEATSx, each showing the hourly price (EUR/MWh) with its level and forecast, followed by the trend, seasonality, exogenous (NBEATSx only), and residual components for December 18, 2017.]

Figure 3: Time series signal decomposition for NP electricity price day-ahead forecasts using interpretable
variants of NBEATS and NBEATSx. The top row of graphs shows the original signal and the level, the latter
is defined as the last available observation before the forecast. The second row shows the polynomial
trend components, the third and fourth rows display the complex seasonality modeled by nonlinear Fourier
projections and the exogenous effects of the electricity load on the price, respectively. The bottom row
graphs show the unexplained variation of the signal. The use of electricity load and production forecasts
turns out to be fundamental for accurate price forecasting.

4.3. Comparative Analysis
4.3.1. Evaluation Metrics
To ensure the comparability of our results with the existing literature, we opted to follow
the widely accepted practice of evaluating the accuracy of point forecasts with the following
metrics: mean absolute error (MAE), relative mean absolute error (rMAE), symmetric
mean absolute percentage error (sMAPE), and root mean squared error (RMSE), defined as:
$$\text{MAE} = \frac{1}{24 N_d}\sum_{d=1}^{N_d}\sum_{h=1}^{24} |y_{d,h}-\hat{y}_{d,h}|, \qquad \text{rMAE} = \frac{\sum_{d=1}^{N_d}\sum_{h=1}^{24} |y_{d,h}-\hat{y}_{d,h}|}{\sum_{d=1}^{N_d}\sum_{h=1}^{24} |y_{d,h}-\hat{y}^{naive}_{d,h}|},$$

$$\text{sMAPE} = \frac{200}{24 N_d}\sum_{d=1}^{N_d}\sum_{h=1}^{24} \frac{|y_{d,h}-\hat{y}_{d,h}|}{|y_{d,h}|+|\hat{y}_{d,h}|}, \qquad \text{RMSE} = \sqrt{\frac{1}{24 N_d}\sum_{d=1}^{N_d}\sum_{h=1}^{24} (y_{d,h}-\hat{y}_{d,h})^2},$$

where $y_{d,h}$ and $\hat{y}_{d,h}$ are the actual value and the forecast of the time series at day $d$ and hour $h$; for our experiments, given the two-year test set of each market, $N_d = 728$.
While regression-based models are estimated by minimizing squared errors, to train
neural networks we minimize absolute errors (see Section 4.3.3 below). Hence, both the
MAE and RMSE are highly relevant in our context. Since they are not easily comparable
across datasets – and given the popularity of such errors in forecasting practice (Makridakis
et al., 2020) – we have additionally computed a percentage and a relative measure. The
sMAPE is used as an alternative to MAPE, which in the presence of values close to zero
may degenerate (Hyndman & Koehler, 2006). The rMAE is calculated instead of the scaled
measure used in the M4 competition, for reasons explained in Sec. 5.4.2 of Lago et al.
(2021a). Its naı̈ve benchmark corresponds to the similar-day rule common in EPF: the forecast for a Monday, Saturday, and Sunday equals the value of the series observed on the same weekday of the previous week, while the forecast for Tuesday, Wednesday, Thursday, and Friday is the value observed on the previous day.
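For reference, the four accuracy measures and the similar-day naı̈ve benchmark entering the rMAE denominator can be computed as in the following NumPy sketch; array shapes (days × 24) and helper names are our own.

```python
import numpy as np

def epf_metrics(y, y_hat, y_naive):
    """MAE, rMAE, sMAPE and RMSE for day-ahead forecasts of shape (n_days, 24)."""
    abs_err = np.abs(y - y_hat)
    return {"MAE":   abs_err.mean(),
            "rMAE":  abs_err.sum() / np.abs(y - y_naive).sum(),
            "sMAPE": (200.0 * abs_err / (np.abs(y) + np.abs(y_hat))).mean(),
            "RMSE":  np.sqrt(((y - y_hat) ** 2).mean())}

def similar_day_naive(y, weekdays):
    """Similar-day rule: Mon/Sat/Sun repeat the same weekday of the previous week,
    Tue-Fri repeat the previous day. weekdays: 0 = Monday, ..., 6 = Sunday."""
    y_naive = np.empty_like(y)
    for d in range(y.shape[0]):
        lag = 7 if weekdays[d] in (0, 5, 6) else 1
        y_naive[d] = y[d - lag] if d - lag >= 0 else y[d]    # fallback at the series start
    return y_naive
```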

4.3.2. Statistical Tests


To assess which forecasting model provides better predictions, we rely on the Giacomini-
White test (GW; Giacomini & White 2006) of the multi-step conditional predictive ability,
which can be interpreted as a generalization of the Diebold-Mariano test (DM; Diebold &
Mariano 1995), widely used in the forecasting literature. Compared with the DM or other
unconditional tests, the GW test is valid under general assumptions such as heterogeneity
rather than stationarity of data. The GW test examines the null hypothesis of equal accuracy
specified in Equation (10), measured by the L1 norm of the daily errors of a pair of models A and B, conditioned on the information available at that moment in time, $\mathcal{F}_{d-1}$:

$$H_0: \; \mathbb{E}\!\left[\,\|\mathbf{y}_d - \hat{\mathbf{y}}^{A}_d\|_1 - \|\mathbf{y}_d - \hat{\mathbf{y}}^{B}_d\|_1 \;\middle|\; \mathcal{F}_{d-1}\right] \equiv \mathbb{E}\!\left[\Delta^{A,B}_d \;\middle|\; \mathcal{F}_{d-1}\right] = 0 \tag{10}$$

In practice, the available information $\mathcal{F}_{d-1}$ is replaced with a constant and lags of the error difference $\Delta^{A,B}_d$, and the test is performed using a linear regression with a Wald-like test; when the conditional information considered is only the constant, one recovers the original DM test.
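The following sketch illustrates the mechanics of that testing recipe: the daily loss differential is regressed on a constant and its own lags, and a Wald statistic is compared against a chi-squared distribution. It uses a simple homoskedastic covariance estimate, so it is a didactic approximation of the Giacomini-White procedure rather than the exact test used in Section 4.

```python
import numpy as np
from scipy import stats

def conditional_predictive_ability_test(loss_a, loss_b, lags=1):
    """Regress the loss differential on a constant and `lags` of itself, then run a
    Wald-type test that all coefficients are zero (cf. Eq. 10). With lags=0 the
    constant-only regression corresponds to an unconditional, DM-style comparison."""
    delta = np.asarray(loss_a) - np.asarray(loss_b)      # daily L1 loss differences
    T, k = len(delta) - lags, 1 + lags
    X = np.ones((T, k))
    for j in range(1, lags + 1):
        X[:, j] = delta[lags - j:len(delta) - j]         # lagged differentials
    d = delta[lags:]
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)
    resid = d - X @ beta
    sigma2 = resid @ resid / (T - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)                # homoskedastic covariance
    wald = beta @ np.linalg.inv(cov) @ beta
    p_value = 1.0 - stats.chi2.cdf(wald, df=k)
    return wald, p_value
```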

4.3.3. Training Methodology


The cornerstone of the training methodology for NBEATSx and the benchmark models
included in this work is the definition and use of the training, validation, early stopping,
and test datasets depicted in Figure 2. The training set for each of the five markets comprises
the first three years of data, the test set includes the last two years of data. The validation
set is defined as the year between the train and test set coverages. The early stopping set,
used for regularization, is either randomly sampled or corresponds to 42 weeks following the
time span of the training set. These sets are used in the hyperparameter optimization phase
and recalibration phase that we describe below.
During the hyperparameter optimization phase, model performance measured on the val-
idation set is used to guide the exploration of the hyperparameter space defined in Table 2.
During the recalibration phase, the optimally selected model, as defined by its hyperpa-
rameters, is re-trained for each day to include newly available information before the test
inference. In this phase, an early stopping set provides a regularization signal for the re-
training optimization.
To train the neural network, and as is common in the literature (Smyl, 2020; Oreshkin
et al., 2020), we minimize the mean absolute error (MAE) using stochastic gradient descent
with Adaptive Moments (ADAM; Kingma & Ba 2014). Figure A.2 in the Appendix compares
the training and validation trajectories for NBEATS and NBEATSx, as diagnostics to assess the
differences of the methods. The early stopping strategy halts the training procedure if a
specified number of consecutive iterations occur without improvements of the loss measured
on the early stopping set (Yao et al., 2007).
The NBEATSx model is implemented and trained in PyTorch (Paszke et al., 2019) and
can be run with both CPU and GPU resources. The code is available publicly in a dedi-
cated repository to promote reproducibility of the presented results and to support related
research.
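A compact sketch of the training loop just described (MAE loss, ADAM, early stopping on a held-out set) is given below; the data-loader interface, batch contents, and default values are illustrative assumptions rather than the released code.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, early_stop_loader,
                              max_iters=30000, patience=10, eval_every=100, lr=1e-3):
    """Minimize the MAE with ADAM; stop after `patience` evaluations without
    improvement of the loss on the early-stopping set (cf. Table 2)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mae = torch.nn.L1Loss()
    best_loss, best_state, bad_evals, it = float("inf"), None, 0, 0
    while it < max_iters and bad_evals < patience:
        for y_back, x, y_future in train_loader:         # assumed batch structure
            optimizer.zero_grad()
            loss = mae(model(y_back, x), y_future)
            loss.backward()
            optimizer.step()
            it += 1
            if it % eval_every == 0:                      # periodic early-stopping check
                with torch.no_grad():
                    val = sum(mae(model(yb, xb), yf).item()
                              for yb, xb, yf in early_stop_loader) / len(early_stop_loader)
                if val < best_loss:
                    best_loss, best_state, bad_evals = val, copy.deepcopy(model.state_dict()), 0
                else:
                    bad_evals += 1
            if it >= max_iters or bad_evals >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)                 # restore the best checkpoint
    return model
```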

4.3.4. Hyperparameter Optimization


We follow the practice of Lago et al. (2018a) to select the hyperparameters that define the
model, input features, and optimization settings. During this phase, the validation dataset
is used to guide the search for well performing configurations. To compare the benchmarks
and the NBEATSx, we rely on the same automated selection process: a Bayesian optimization
technique that efficiently explores the hyperparameter space using tree-structured Parzen
estimators (HYPEROPT; Bergstra et al. 2011). The architecture, optimization, and regular-
ization hyperparameters are summarized in Table 2. To have comparable results, during

Table 2: Hyperparameters of NBEATSx networks. They are common to all presented datasets. We list the
typical values we considered in our experiments. The configuration that performed best on the validation
set was selected automatically.

Hyperparameter Considered Values


Architecture Parameters
Input size, i.e., the length of the autoregressive feature window. L ∈ {168}
Output size is the forecast horizon for day ahead forecasting. H ∈ {24}
List of the architecture's type/number of stacks. {[Identity, TCN], [TCN, Identity], [Identity, WaveNet], [WaveNet, Identity]}
Type of activations used across the network. {SoftPlus, SeLU, PreLU, Sigmoid, ReLU, TanH, LReLU}
Blocks separated by residual links per stack (shared across stacks). {[1,1,1], [1, 1]}.
FCNN layers within each block. {2}
FCNN hidden neurons on each layer of a block. Nh ∈ {50, . . . , 500}
Exogenous Temp. convolution filter size (Equation 9) {2, . . . , 10}
Only interpretable, degree of trend polynomials. Npol ∈ {2, 3, 4}
Only interpretable, number of Fourier basis (seasonality smoothness). Nhr ∈ {1, 2}
Whether NBEATSx coefficients take input X (Equation (1)). {True, False}
Optimization and Regularization parameters
Initialization strategy for network weights. {orthogonal, he norm, glorot norm}
Initial learning rate for regression problem. Range(5e-4,1e-2)
The number of samples for each gradient step. {256, 512}
The decay constant allows large initial lr to escape local minima. {0.5}
Number of times the learning rate is halved during train. {3}
Maximum number of gradient descent iterations. {30000}
Iterations without validation loss improvement before stop. {10}
Frequency of validation loss measurements. {100}
Whether batch normalization is applied after each activation. {True, False}
The probability for dropout of neurons in the projection layers. Range(0,1)
The probability for dropout of neurons for the exogenous encoder. Range(0,1)
Constant to control the Lasso penalty used on the coefficients. Range(0, 0.1)
Constant that controls the influence of L2 regularization of weights. Range(1e-5,1e-0)
The objective loss function with which NBEATSx is trained. {MAE}
Random weeks from full dataset used to validate. {42}
Number of iterations of hyperparameter search. {1500}
Random seed that controls initialization of weights. DiscreteRange(1,1000)
Data Parameters
Rolling window sample frequency, for data augmentation. {1, 24}
Number of time windows included in the full dataset. 4 years
Number of validation weeks used for early stopping strategy. {40, 52}
Normalization strategy of model inputs. {none, median, invariant, std }

To have comparable results, during the hyperparameter optimization stage we used the same number of configurations as in Lago et al. (2018a). Note that some of the methods do not require any hyperparameter
optimization – e.g., the AR1 benchmark – and some might only have one hyper-parameter to
be determined, such as the regularization parameter in the LEARx method, which is typically
computed using the information criteria or cross-validation.
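As an illustration of this selection process, the snippet below wires a trimmed-down version of the Table 2 search space into HYPEROPT's tree-structured Parzen estimator. The parameter names are ours and the objective is a placeholder standing in for a full train-and-validate run of NBEATSx.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials

# Trimmed-down version of the Table 2 search space (names are illustrative).
space = {
    "hidden_units":  hp.quniform("hidden_units", 50, 500, 50),
    "activation":    hp.choice("activation", ["ReLU", "SoftPlus", "SeLU", "Sigmoid"]),
    "learning_rate": hp.loguniform("learning_rate", np.log(5e-4), np.log(1e-2)),
    "batch_size":    hp.choice("batch_size", [256, 512]),
    "dropout":       hp.uniform("dropout", 0.0, 1.0),
}

def objective(config):
    # Placeholder: a real objective would train NBEATSx with `config` (e.g., with the
    # loop sketched in Section 4.3.3) and return the validation MAE.
    rng = np.random.default_rng(int(config["hidden_units"]))
    return float(rng.uniform(1.5, 3.0))

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)    # the paper uses 1500 iterations (Table 2)
```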

4.3.5. Ensembling
In many recent forecasting competitions, and particularly in the M4 competition, most of
the top-performing models were ensembles (Atiya, 2020). It has been shown that in practice,
combining a diverse group of models can be a powerful form of regularization to reduce the
variance of predictions (Breiman, 1996; Nowotarski et al., 2014; Hubicka et al., 2019).
The techniques used by the forecasting community to induce diversity in the models
are plentiful. The original NBEATS model obtained its diversity from three sources, training
with different loss functions, varying the size of the input windows, and bagging models
with different random initializations (Oreshkin et al., 2020). They used the median as the
aggregation function for 180 different models. Interestingly, the original model did not rely
on regularization, such as L2 or dropout, as Oreshkin et al. (2020) found it to be good for
the individual models but detrimental to the ensemble.
In our case, we ensemble the NBEATSx model using two sources of diversity. The first
is a data augmentation technique controlled by the sampling frequency of the windows
used during training, as defined in the data parameters of Table 2. The second is
whether we randomly select the early stopping set or instead use the last 42
weeks preceding the test set. Combining the data augmentation and early stopping options,
we obtain four models that we ensemble using arithmetic mean as the aggregation function.
This technique is also used by the DNN benchmark (Lago et al., 2018a, 2021a).
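The aggregation step itself is a simple arithmetic mean over the four members, as sketched below with dummy predictions.

```python
import numpy as np

def ensemble_forecast(member_forecasts):
    """Arithmetic-mean aggregation of the ensemble members' day-ahead forecasts."""
    return np.mean(np.stack(member_forecasts, axis=0), axis=0)

# Four members: {data augmentation on/off} x {random vs. last-42-weeks early stopping}.
members = [np.random.rand(728, 24) for _ in range(4)]
y_hat_ensemble = ensemble_forecast(members)    # shape (728, 24)
```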

4.3.6. Forecasting Results


We conducted an empirical study involving two types of Autoregressive Models (AR1 and
ARx1; Weron 2014), the Lasso Estimated Auto-Regressive (LEARx; Uniejewski et al. 2016),
a parsimonious Deep Neural Network (DNN; Lago et al. 2018a, 2021a), the original Neural
Basis Expansion Analysis without exogenous covariates (NBEATS; Oreshkin et al. 2020), and
the Exponential Smoothing Recurrent Neural Network (ESRNN; Smyl 2020). This experiment
examines the effects of including the covariate inputs and comparing NBEATSx with state-of-
the-art methods for the electricity price day-ahead forecasting task.
Table 3 summarizes the performance of the ensembled models, where the NBEATSx ensemble
shows the best overall performance. It improves by 18.77% on average across all metrics and
markets when compared with the original NBEATS, and by 20.6% when compared to ESRNN without
time-dependent covariates. For the ensembled models, the NBEATSx RMSE improved on average
by 4.68%, MAE by 2.53%, rMAE by 1.97%, and sMAPE by 1.25%.
When comparing the NBEATSx ensemble against the DNN ensemble on individual markets, NBEATSx
improved by 5.38% on the NordPool market, by 2.48% on the French market, and by 2.81% on the
German market. The differences in NBEATSx performance on the PJM and
EPEX-BE markets, of 0.24% and 1.1% respectively, were not significant.
Figure 4 provides a graphical representation of the statistical significance from the
Giacomini-White test (GW) for the six ensembled models, across the five markets for the
MAE evaluation metric. A similar significance analysis is conducted for the single models.
The models included in the significance tests are the same as in Table 3: LEAR, DNN, ESRNN,
NBEATS, and our proposed methods, NBEATSx-G and NBEATSx-I. The p-value of each com-
parison indicates whether the predictions of the model in the column of a cell of the grids shown in Figure 4 significantly improve on the predictions of the model in the row of that cell. The NBEATSx-G model outperformed the DNN model in NP and EPEX-DE, while NBEATSx-I outperformed it in NP, EPEX-FR, and EPEX-DE. Moreover, no benchmark model significantly outperformed NBEATSx-I or NBEATSx-G in any market.
In the Appendix A we observe similar results for the single best models chosen from the
four possible configurations of the ensemble components described in Section 4.3.5. Table A2
summarizes the accuracy of the predictions measured with the MAE and Figure A.3 displays
Table 3: Forecast accuracy measures for day-ahead electricity price predictions of ensembled models. The
ESRNN and NBEATS do not include time dependent covariates. The reported metrics are mean absolute error
(MAE), relative mean absolute error (rMAE), symmetric mean absolute percentage error (sMAPE) and root
mean squared error (RMSE). The smallest errors in each row are highlighted in bold.
*
The LEARx results for EPEX-DE differ from Lago et al. (2021a) – the values presented there are revised
(Lago et al., 2021b)

Market    Metric   AR1     ESRNN   NBEATS   ARx1    LEARx*   DNN     NBEATSx-G   NBEATSx-I

NP        MAE      2.26    2.09    2.08     2.01    1.74     1.68    1.58        1.62
          rMAE     0.71    0.66    0.66     0.63    0.55     0.53    0.50        0.51
          sMAPE    6.47    6.04    5.96     5.84    5.01     4.88    4.63        4.70
          RMSE     4.08    3.89    3.94     3.71    3.36     3.32    3.16        3.27
PJM       MAE      3.83    3.59    3.49     3.53    3.01     2.86    2.91        2.90
          rMAE     0.79    0.74    0.72     0.73    0.62     0.59    0.60        0.60
          sMAPE    14.50   14.12   13.57    13.64   11.98    11.33   11.54       11.61
          RMSE     6.24    5.83    5.64     5.74    5.13     5.04    5.02        4.84
EPEX-BE   MAE      7.20    6.96    6.84     7.19    6.14     5.87    5.95        6.11
          rMAE     0.88    0.85    0.83     0.88    0.75     0.72    0.73        0.75
          sMAPE    16.26   15.84   15.80    16.11   14.55    13.45   13.86       14.02
          RMSE     18.62   16.84   17.13    18.07   15.97    15.97   15.76       15.80
EPEX-FR   MAE      4.65    4.65    4.74     4.56    3.98     3.87    3.81        3.79
          rMAE     0.78    0.78    0.80     0.76    0.67     0.65    0.64        0.64
          sMAPE    13.03   13.22   13.30    12.70   11.57    10.81   10.59       10.69
          RMSE     13.89   11.83   12.01    12.94   10.68    11.87   11.50       11.25
EPEX-DE   MAE      5.74    5.60    5.31     4.36    3.61     3.41    3.31        3.29
          rMAE     0.71    0.70    0.66     0.54    0.45     0.42    0.41        0.41
          sMAPE    21.37   20.97   19.61    17.73   14.74    14.08   13.99       13.99
          RMSE     9.63    9.09    8.99     7.38    6.51     5.93    5.72        5.65

the significance of the GW test. Ensembling improves the accuracy of NBEATSx by 3% on average across all markets when compared to the single best models.
Finally, regarding computational time complexity, NBEATSx maintains good performance. As shown in Table A1 in the Appendix, the time necessary to compute day-ahead predictions is on the order of milliseconds and comparable to that of the LEAR and DNN benchmarks. Additionally, the average time needed to perform a recalibration is only about 50 percent longer than that of the relatively parsimonious DNN.

[Figure 4: five grids of pairwise Giacomini-White p-values, one per market (NP, PJM, BE, FR, and DE; ensemble, MAE), each comparing AR1, ESRNN, NBEATS, ARx1, LEARx, DNN, NBEATSx-G, and NBEATSx-I.]
Figure 4: Results of the Giacomini-White test for the day-ahead predictions with mean absolute error (MAE)
applied to pairs of the ensembled models on the five electricity markets datasets. Each grid represents one
market. Each colored cell in a grid is plotted black, unless the predictions of the model corresponding to its
column of the grid outperforms the predictions of the model corresponding to its row of the grid. The color
scale reflects significance of the difference in MAE, with solid green representing the lowest p-values.

5. Conclusions
We have presented NBEATSx, a new method for univariate time series forecasting with
exogenous variables that extends the well-performing neural basis expansion analysis. The
resulting neural-network-based method has several valuable properties that make it suitable for a
wide range of forecasting tasks. The network is fast to optimize as it is mainly composed
of fully-connected layers. It can produce interpretable results, and achieves state-of-the-art
performance on forecasting tasks where consideration of exogenous variables is fundamental.
We demonstrated the utility of the proposed method using a set of benchmark datasets
from the electricity price forecasting domain, but it can be straightforwardly applied to fore-
casting problems in other domains. Qualitative evaluation shows that the interpretable
configuration of NBEATSx can provide valuable insights to the analyst, as it explains the
variation of the time series by separating it into trend, seasonality, and exogenous compo-
nents, in a fashion analogous to classic time series decomposition. Regarding the quantitative
forecasting performance, we observed no significant differences between ESRNN and NBEATS
without exogenous variables. At the same time, NBEATSx improves over NBEATS by nearly
20% and up to 5% over LEAR and DNN models specialized for the Electricity Price Forecasting
tasks. Finally, we found no significant trade-offs between the accuracy and interpretability
of NBEATSx-G and NBEATSx-I predictions.

The neural basis expansion analysis is a very flexible method capable of producing ac-
curate and interpretable forecasts, yet there is still room for improvement. For instance,
the harmonic functions could be augmented towards wavelets, or the convolutional encoder
that generates the covariate basis could be replaced with smoothing alternatives such as
splines. Additionally, one can extend the current non-interpretable method by regularizing
its outputs with smoothness constraints.

Acknowledgements
This work was partially supported by the Defense Advanced Research Projects Agency
(award FA8750-17-2-0130), the National Science Foundation (grant 2038612), the Space
Technology Research Institutes grant from NASA’s Space Technology Research Grants Pro-
gram, the U.S. Department of Homeland Security (award 18DN-ARI-00031), the Ministry
of Education and Science (MEiN, Poland; grant 0219/DIA/2019/48), the National Science
Center (NCN, Poland; grant 2018/30/A/HS4/00444), and Nixtla. Kin G. Olivares and Cris-
tian Challu want to thank Stefania La Vattiata, Max Mergenthaler and Federico Garza for
their support.

References
Atiya, A. F. (2020). Why does forecast combination work so well? International Journal of Forecasting,
36 , 197–200. URL: https://www.sciencedirect.com/science/article/pii/S0169207019300779.
doi:https://doi.org/10.1016/j.ijforecast.2019.03.010. M4 Competition.
Bai, S., Kolter, J. Z., & Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent
networks for sequence modeling. Computing Research Repository, abs/1803.01271 . URL: http://arxiv.
org/abs/1803.01271. arXiv:1803.01271.
Benidis, K., Rangapuram, S. S., Flunkert, V., Wang, B., Maddix, D., Turkmen, C., Gasthaus, J., Bohlke-
Schneider, M., Salinas, D., Stella, L., Callot, L., & Januschowski, T. (2020). Neural forecasting: Intro-
duction and literature overview. Computing Research Repository, . arXiv:2004.10240.
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization.
In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in Neural
Information Processing Systems (pp. 2546–2554). Curran Associates, Inc. volume 24. URL: https:
//proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24 , 123–140. URL: https://doi.org/10.
1023/A:1018054314350. doi:10.1023/A:1018054314350.
Chang, S., Zhang, Y., Han, W., Yu, M., Guo, X., Tan, W., Cui, X., Witbrock, M., Hasegawa-Johnson,
M. A., & Huang, T. S. (2017). Dilated recurrent neural networks. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing
Systems. Curran Associates, Inc. volume 30. URL: https://proceedings.neurips.cc/paper/2017/
file/32bb90e8976aab5298d5da10fe66f21d-Paper.pdf.
Chitsaz, H., Zamani-Dehkordi, P., Zareipour, H., & Parikh, P. (2018). Electricity price forecasting for
operational scheduling of behind-the-meter storage systems. IEEE Transactions on Smart Grid , 9 ,
6612–6622. doi:10.1109/TSG.2017.2717282.
Diebold, F., & Mariano, R. (1995). Comparing predictive accuracy. Journal of Business & Economic
Statistics, 13 , 253–265. URL: https://www.sas.upenn.edu/~fdiebold/papers/paper68/pa.dm.pdf.
doi:10.1080/07350015.1995.10524599.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14 , 179–211. URL: https://
onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1402_1.

Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74 ,
1545–1578. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-0262.2006.00718.
x. doi:https://doi.org/10.1111/j.1468-0262.2006.00718.x.
Gianfreda, A., Ravazzolo, F., & Rossini, L. (2020). Comparing the forecasting performances of linear
models for electricity prices with high RES penetration. International Journal of Forecasting, 36 , 974–
986. URL: https://www.sciencedirect.com/science/article/pii/S0169207019302596. doi:https:
//doi.org/10.1016/j.ijforecast.2019.11.002.
Graves, A. (2013). Generating sequences with recurrent neural networks. Computing Research Repository,
abs/1308.0850 . URL: http://arxiv.org/abs/1308.0850. arXiv:1308.0850.
Hubicka, K., Marcjasz, G., & Weron, R. (2019). A note on averaging day-ahead electricity price forecasts
across calibration windows. IEEE Transactions on Sustainable Energy, 10(1), 321–323. doi:10.1109/
TSTE.2018.2869557.
Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International
Journal of Forecasting, 22 , 679 – 688. URL: http://www.sciencedirect.com/science/article/pii/
S0169207006000239. doi:https://doi.org/10.1016/j.ijforecast.2006.03.001.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On large-batch training
for deep learning: Generalization gap and sharp minima. URL: http://arxiv.org/abs/1609.04836
published as a conference paper at the 5th International Conference for Learning Representations (ICLR),
Toulon, France, 2017.
Kingma, D. P., & Ba, J. (2014). ADAM: A method for stochastic optimization. URL: http://arxiv.
org/abs/1412.6980 published as a conference paper at the 3rd International Conference for Learning
Representations (ICLR), San Diego, 2015.
Koopmans, L. H. (1995). The spectral analysis of time series. Elsevier.
Lago, J., De Ridder, F., & De Schutter, B. (2018a). Forecasting spot electricity prices: Deep learn-
ing approaches and empirical comparison of traditional algorithms. Applied Energy, 221 , 386 –
405. URL: http://www.sciencedirect.com/science/article/pii/S030626191830196X. doi:https:
//doi.org/10.1016/j.apenergy.2018.02.069.
Lago, J., De Ridder, F., Vrancx, P., & De Schutter, B. (2018b). Forecasting day-ahead electricity
prices in Europe: The importance of considering market integration. Applied Energy, 211 , 890–
903. URL: https://www.sciencedirect.com/science/article/pii/S0306261917316999. doi:https:
//doi.org/10.1016/j.apenergy.2017.11.098.
Lago, J., Marcjasz, G., De Schutter, B., & Weron, R. (2021a). Forecasting day-ahead electric-
ity prices: A review of state-of-the-art algorithms, best practices and an open-access bench-
mark. Applied Energy, 293 , 116983. URL: https://www.sciencedirect.com/science/article/pii/
S0306261921004529. doi:https://doi.org/10.1016/j.apenergy.2021.116983.
Lago, J., Marcjasz, G., Schutter, B. D., & Weron, R. (2021b). Erratum to ’Forecasting day-ahead electricity
prices: A review of state-of-the-art algorithms, best practices and an open-access benchmark’ [Appl. Energy
293 (2021) 116983] . WORking papers in Management Science (WORMS) WORMS/21/12 Department
of Operations Research and Business Intelligence, Wroclaw University of Science and Technology. URL:
https://ideas.repec.org/p/ahh/wpaper/worms2112.html.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient backprop. In Neural Networks: Tricks
of the Trade (pp. 9–50). Berlin, Heidelberg: Springer Berlin Heidelberg. URL: https://doi.org/10.
1007/3-540-49430-8_2. doi:10.1007/3-540-49430-8_2.
Li, W., & Becker, D. (2021). Day-ahead electricity price prediction applying hybrid models of lstm-based
deep learning methods and feature selection algorithms under consideration of market coupling. Energy,
237 , 121543.
Livera, A. M. D., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex sea-
sonal patterns using exponential smoothing. Journal of the American Statistical Association, 106 ,
1513–1527. URL: https://doi.org/10.1198/jasa.2011.tm09771. doi:10.1198/jasa.2011.tm09771.
arXiv:https://doi.org/10.1198/jasa.2011.tm09771.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and machine learning forecasting

20
methods: Concerns and ways forward. PLoS One, 13(3), e0194889. URL: https://journals.plos.
org/plosone/article?id=10.1371/journal.pone.0194889.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020). The M4 competition: 100,000
time series and 61 forecasting methods. International Journal of Forecasting, 36 , 54–
74. URL: https://www.sciencedirect.com/science/article/pii/S0169207019301128. doi:https:
//doi.org/10.1016/j.ijforecast.2019.04.014. M4 Competition.
Marcjasz, G. (2020). Forecasting electricity prices using deep neural networks: A robust hyper-parameter
selection scheme. Energies, 13 , 13184605.
Mayer, K., & Trück, S. (2018). Electricity markets around the world. Journal of Commodity Markets, 9 ,
77–100. URL: https://doi.org/10.1016/j.jcomm.2018.02.001.
Narajewski, M., & Ziel, F. (2020). Econometric modelling and forecasting of intraday electricity prices.
Journal of Commodity Markets, 19 , 100107. doi:10.1016/j.jcomm.2019.100107.
Nazar, M. S., Fard, A. E., Heidari, A., Shafie-khah, M., & Catalão, J. P. (2018). Hybrid model using
three-stage algorithm for simultaneous load and price forecasting. Electric Power Systems Research, 165 ,
214–228. doi:10.1016/j.epsr.2018.09.004.
Nowotarski, J., Raviv, E., Trück, S., & Weron, R. (2014). An empirical comparison of alternative schemes
for combining electricity spot price forecasts. Energy Economics, 46 , 395–412. URL: https://ideas.
repec.org/a/eee/eneeco/v46y2014icp395-412.html. doi:10.1016/j.eneco.2014.07.0.
Nowotarski, J., & Weron, R. (2018). Recent advances in electricity price forecasting: A review of probabilistic
forecasting. Renewable and Sustainable Energy Reviews, 81 , 1548–1568. doi:https://doi.org/10.1016/
j.rser.2017.05.234.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior,
A. W., & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. CoRR, abs/1609.03499 .
URL: http://arxiv.org/abs/1609.03499. arXiv:1609.03499.
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2020). N-BEATS: neural basis expansion analysis
for interpretable time series forecasting. In 8th International Conference on Learning Representations,
ICLR 2020 . URL: https://openreview.net/forum?id=r1ecqn4YwB.
Paszke et al. (2019). Pytorch: An imperative style, high-performance Deep Learning library. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Infor-
mation Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc. URL: http://papers.neurips.
cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Rosenblatt, F. (1961). Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Tech-
nical Report Cornell Aeronautical Lab Inc Buffalo NY.
Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series
forecasting. International Journal of Forecasting, 36 , 75–85. URL: https://www.sciencedirect.com/
science/article/pii/S0169207019301153. doi:https://doi.org/10.1016/j.ijforecast.2019.03.
017. M4 Competition.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence learning with neural networks. In
Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural
Information Processing Systems. Curran Associates, Inc. volume 27.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. URL: https://
arxiv.org/abs/physics/0004057 in The 37th annual Allerton Conf. on Communication, Control, and
Computing, pp 368–377.
Uniejewski, B., Nowotarski, J., & Weron, R. (2016). Automated variable selection and shrinkage for day-
ahead electricity price forecasting. Energies, 9(8),621 . doi:10.3390/en9080621.
Uniejewski, B., & Weron, R. (2021). Regularized quantile regression averaging for probabilistic electricity
price forecasting. Energy Economics, 95 , 105121. URL: https://www.sciencedirect.com/science/
article/pii/S0140988321000268. doi:https://doi.org/10.1016/j.eneco.2021.105121.
Uniejewski, B., Weron, R., & Ziel, F. (2018). Variance stabilizing transformations for electricity spot price
forecasting. IEEE Transactions on Power Systems, 33 , 2219–2229. doi:10.1109/TPWRS.2017.2734563.
Wang, L., Zhang, Z., & Chen, J. (2017). Short-term electricity price forecasting with stacked denoising

21
autoencoders. IEEE Transactions on Power Systems, 32 , 2673–2681. doi:10.1109/TPWRS.2016.2628873.
Wen, R., Torkkola, K., Narayanaswamy, B., & Madeka, D. (2017). A Multi-horizon Quantile Recurrent
Forecaster. In 31st Conference on Neural Information Processing Systems NIPS 2017, Time Series
Workshop. URL: https://arxiv.org/abs/1711.11053. arXiv:1711.11053.
Weron, R. (2014). Electricity price forecasting: A review of the state-of-the-art with a look into the future.
International Journal of Forecasting, 30 , 1030–1081. URL: https://www.sciencedirect.com/science/
article/pii/S0169207014001083. doi:https://doi.org/10.1016/j.ijforecast.2014.08.008.
Yao, Y., Rosasco, L., & Andrea, C. (2007). On early stopping in gradient descent learning. Constructive
Approximation, 26(2), 289–315. URL: https://doi.org/10.1007/s00365-006-0663-2.
Ziel, F., & Steinert, R. (2018). Probabilistic mid- and long-term electricity price forecasting. Renewable and
Sustainable Energy Reviews, 94 , 251–266. URL: https://arxiv.org/abs/1703.10806.

Appendix A. Appendix
Appendix A.1. Forecast and Backcast Basis

[Figure A.1: two panels, (a) Trend Basis and (b) Harmonic Basis, each plotting the backcast basis and the forecast basis against the time index.]

Figure A.1: Examples of the polynomial and harmonic bases included in the interpretable configuration of the neural basis expansion analysis. The slowly varying basis functions allow NBEATS to model trends and seasonalities.

As discussed in Section 3.4, the interpretable configuration of the NBEATSx method projects onto polynomial functions for the trend, harmonic functions for the seasonality, and the exogenous variables. As shown in Figure A.1, the forecast and backcast components of the model rely on the same basis functions; the only difference lies in the span of their time indexes. In the EPF application of NBEATSx considered in this work, the backcast horizon corresponds to 168 hours, while the forecast horizon corresponds to 24 hours.
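For concreteness, the following NumPy sketch illustrates how such polynomial and harmonic bases can be generated. The exact normalization, frequency grid, and tensor layout used in our implementation may differ, so the snippet should be read as an illustration of the basis expansion rather than as the reference code.

```python
import numpy as np

def polynomial_basis(degree, length):
    """Rows are 1, t, t^2, ..., t^degree on a normalized time index t in [0, 1)."""
    t = np.arange(length, dtype=float) / length
    return np.stack([t ** p for p in range(degree + 1)])          # (degree+1, length)

def harmonic_basis(length, n_freqs):
    """A constant row plus cosine/sine waves with 1..n_freqs full cycles over the window."""
    t = np.arange(length, dtype=float) / length
    rows = [np.ones(length)]
    for k in range(1, n_freqs + 1):
        rows.append(np.cos(2 * np.pi * k * t))
        rows.append(np.sin(2 * np.pi * k * t))
    return np.stack(rows)                                         # (2*n_freqs+1, length)

# Backcast and forecast bases share the functional form; only the span differs (168 vs 24 hours).
trend_back, trend_fore = polynomial_basis(2, 168), polynomial_basis(2, 24)
seas_back,  seas_fore  = harmonic_basis(168, 12),  harmonic_basis(24, 12)

# A block outputs expansion coefficients theta; its forecast is a linear combination of the basis.
theta_trend    = np.array([10.0, 1.0, -0.5])   # illustrative quadratic-trend coefficients
forecast_trend = theta_trend @ trend_fore       # shape (24,)
```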

Appendix A.2. Training and Validation Curves

[Figure A.2: two panels, (a) Train set and (b) Validation set, each plotting MAE against the optimization iteration for NBEATSx and NBEATS.]

Figure A.2: Training and validation Mean Absolute Error (MAE) curves on the NP market. We show the curves for NBEATSx-G with exogenous variables and NBEATS without exogenous variables as a function of the optimization iterations. The four curves for each model correspond to different random seeds used for initialization.

To study the effects of exogenous variables on the NBEATS model, we performed diagnostics of the model training procedure. Figure A.2 shows the train and validation mean absolute error (MAE) for the NBEATS and NBEATSx models as training progresses. The curves correspond to the hyperparameter optimization phase described in Section 4.3.4. The models trained with and without exogenous variables display a considerable difference in their train and validation errors, as evidenced by the two separate clusters of trajectories. The exogenous variables, in this case the electricity load and production forecasts, significantly improve the neural basis expansion analysis.
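The sketch below illustrates how such diagnostic curves can be produced. To keep it self-contained it trains a plain fully connected network on synthetic 168-hour inputs and 24-hour targets; the data loaders, architecture and hyperparameters of the actual NBEATSx experiments are not reproduced here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in data: 168-hour input windows and 24-hour targets.
X_train, y_train = torch.randn(512, 168), torch.randn(512, 24)
X_val,   y_val   = torch.randn(128, 168), torch.randn(128, 24)

model     = nn.Sequential(nn.Linear(168, 256), nn.ReLU(), nn.Linear(256, 24))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mae       = nn.L1Loss()   # mean absolute error, the quantity plotted in Figure A.2

train_curve, val_curve = [], []
for iteration in range(2000):
    idx  = torch.randint(0, len(X_train), (256,))             # mini-batch of 256 windows
    loss = mae(model(X_train[idx]), y_train[idx])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if iteration % 50 == 0:                                    # periodic diagnostics
        with torch.no_grad():
            train_curve.append(mae(model(X_train), y_train).item())
            val_curve.append(mae(model(X_val), y_val).item())
```

Repeating the loop with different random seeds yields one trajectory per seed, as in Figure A.2.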

Appendix A.3. Computational Time
Table A1: Computational time performance in seconds for the top four most accurate models for the day-ahead electricity price forecasting task in the NP market, averaged for the four elements of the ensembles (time performance for the rest of the markets is almost identical).

                 LEARx    DNN      NBEATSx-G   NBEATSx-I
  Recalibration  18.57    50.65    75.02       81.61
  Prediction     0.0032   0.0041   0.0048      0.0054

We measured the computational time of the four most accurate algorithms with two metrics: the recalibration of the ensemble models selected from the hyperparameter optimization, and the computation of the predictions. For these experiments, we used a GeForce RTX 2080 GPU for the neural network models and an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz for LEAR.
The training time of the recalibration phase of NBEATSx remains efficient: the models recalibrate in 75 and 81 seconds, roughly 30 seconds more than the relatively simple DNN. The computational time of the prediction remains within milliseconds. Finally, the hyperparameter optimization scales linearly with the time of the recalibration phase and the number of evaluation steps of the optimization; in the case of NBEATSx-G, a hyperparameter search of 1000 steps takes approximately two days.4
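A minimal sketch of how these wall-clock times can be measured is given below; `model.fit` and `model.predict` in the usage comment are placeholders for the corresponding recalibration and prediction routines, not part of a specific API.

```python
import time

def average_seconds(fn, *args, repeats=4, **kwargs):
    """Average wall-clock seconds of fn(*args, **kwargs), one run per ensemble member."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)

# Hypothetical usage:
#   recalibration_time = average_seconds(model.fit, train_data)
#   prediction_time    = average_seconds(model.predict, test_window)
```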
Appendix A.4. Best Single Models
Table A2 shows that the best NBEATSx models yield improvements of 14.8% on average across all the evaluation metrics when compared to their NBEATS counterpart without exogenous covariates, and improvements of 23.9% when compared to ESRNN without time-dependent covariates. A perhaps more remarkable result is the statistically significant improvement in forecast accuracy over the LEAR and DNN benchmarks, ranging from 0.75% to 7.2% across all metrics and markets, with the exception of EPEX-BE. Compared to DNN, the RMSE improved on average by 4.9%, the MAE by 3.2%, the rMAE by 3.0%, and the sMAPE by 1.7%. When comparing the best NBEATSx models against the best DNN on individual markets, NBEATSx improved by 3.18% on the Nord Pool market (NP), 2.03% on the American (PJM), 2.65% on the French (EPEX-FR), and 5.24% on the German (EPEX-DE) power markets. The positive difference in performance of 0.53% on the Belgian (EPEX-BE) market was not statistically significant.
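For reference, the accuracy measures reported in Table A2 can be written as in the following NumPy sketch. The rMAE normalizes the MAE by that of a naive reference forecast; the choice of naive forecast and the exact sMAPE variant follow the conventions of the EPF literature (Lago et al., 2021a; Hyndman & Koehler, 2006) and are only sketched here.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def smape(y, y_hat):
    # Symmetric MAPE in percent; one of several common variants.
    return 100 * np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def rmae(y, y_hat, y_naive):
    # MAE relative to a naive reference forecast (e.g., a seasonal naive).
    return mae(y, y_hat) / mae(y, y_naive)
```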
Figure A.3 provides a graphical representation of the GW test, for the MAE evaluation metric, across the five markets. The models included in the significance tests are the same as in Table A2: the AR1 and ARx1 baselines, LEARx, DNN, ESRNN, NBEATS, and our proposed methods, NBEATSx-G and NBEATSx-I. The p-value of each individual comparison indicates whether the improvement in performance (measured by MAE or RMSE) of the model on the x-axis over the model on the y-axis is statistically significant. Both NBEATSx-G and NBEATSx-I outperformed the LEAR and DNN models in all markets, with the exception of Belgium. Moreover, no benchmark model outperformed NBEATSx-I or NBEATSx-G on any market.
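Figure A.3 relies on the one-sided, multivariate version of the GW test used in the open-access benchmark of Lago et al. (2021a). As a rough illustration of the underlying idea only, the sketch below implements a simplified, two-sided conditional predictive ability test for the one-step case, with a constant and the lagged loss differential as instruments; it is not the exact procedure behind the figure.

```python
import numpy as np
from scipy import stats

def gw_cpa_test(loss_a, loss_b):
    """Simplified two-sided Giacomini-White test of conditional predictive ability.

    loss_a, loss_b: per-period losses (e.g., daily MAE) of two competing models.
    Returns the chi-square statistic and its p-value (q = 2 degrees of freedom).
    """
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    d_next, d_lag = d[1:], d[:-1]               # loss differential and its lag
    T = len(d_next)
    H = np.column_stack([np.ones(T), d_lag])    # instruments h_t
    Z = H * d_next[:, None]                     # Z_t = h_t * d_{t+1}
    z_bar = Z.mean(axis=0)
    omega = Z.T @ Z / T                         # sample second-moment matrix of Z_t
    statistic = T * z_bar @ np.linalg.solve(omega, z_bar)
    p_value = 1.0 - stats.chi2.cdf(statistic, df=H.shape[1])
    return statistic, p_value
```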
4 For comparability, we use 1000 steps (Lago et al., 2021a); restricting the search to 300 steps yields similar results.
Table A2: Forecast accuracy measures for day-ahead electricity prices for the best single model out of the four models described in Subsection 4.3.5. The ESRNN and NBEATS are the original implementations and do not include time-dependent covariates. The reported metrics are mean absolute error (MAE), relative mean absolute error (rMAE), symmetric mean absolute percentage error (sMAPE) and root mean squared error (RMSE). The smallest errors in each row are highlighted in bold.
* The LEARx results for EPEX-DE differ from Lago et al. (2021a) – the values presented there are revised (Lago et al., 2021b).

  Market    Metric   AR1     ESRNN   NBEATS   ARx1    LEARx*   DNN     NBEATSx-G   NBEATSx-I

  NP        MAE      2.28    2.11    2.11     2.11    1.95     1.71    1.65        1.68
            rMAE     0.72    0.67    0.67     0.67    0.62     0.54    0.52        0.53
            sMAPE    6.51    6.09    6.06     6.10    5.62     4.97    4.83        4.89
            RMSE     4.08    3.92    3.98     3.84    3.60     3.36    3.27        3.33

  PJM       MAE      3.88    3.63    3.48     3.68    3.09     3.07    3.02        3.01
            rMAE     0.80    0.75    0.72     0.76    0.64     0.63    0.62        0.62
            sMAPE    14.66   14.26   13.56    14.09   12.54    12.00   11.97       11.91
            RMSE     6.26    5.87    5.59     5.94    5.14     5.20    5.06        5.00

  EPEX-BE   MAE      7.04    7.01    6.83     7.05    6.59     6.07    6.14        6.17
            rMAE     0.86    0.86    0.83     0.86    0.80     0.74    0.75        0.75
            sMAPE    16.29   15.95   16.03    16.21   15.95    14.11   14.68       14.52
            RMSE     17.25   16.76   16.99    17.07   16.29    15.95   15.46       15.43

  EPEX-FR   MAE      4.74    4.68    4.79     4.85    4.25     4.06    3.98        3.97
            rMAE     0.80    0.78    0.80     0.86    0.71     0.68    0.67        0.67
            sMAPE    13.49   13.25   13.62    16.21   13.25    11.49   11.07       11.29
            RMSE     13.68   11.89   12.09    17.07   10.75    11.77   11.61       11.08

  EPEX-DE   MAE      5.73    5.64    5.37     4.58    3.93     3.59    3.46        3.37
            rMAE     0.71    0.70    0.67     0.57    0.49     0.45    0.43        0.42
            sMAPE    21.22   21.09   19.71    18.52   16.80    14.68   14.78       14.34
            RMSE     9.39    9.17    9.03     7.69    6.53     6.08    5.84        5.64

[Figure A.3: five p-value grids, one per market — NP (single, MAE), PJM (single, MAE), BE (single, MAE), FR (single, MAE), DE (single, MAE) — each comparing AR1, ESRNN, NBEATS, ARx1, LEARx, DNN, NBEATSx-G and NBEATSx-I.]
Figure A.3: Results of the Giacomini-White test for the day-ahead predictions with mean absolute error (MAE), applied to pairs of the single models on the five electricity market datasets. Each grid represents one market. Each colored cell in a grid is plotted black, unless the predictions of the model corresponding to its column of the grid outperform the predictions of the model corresponding to its row of the grid. The color scale reflects the significance of the difference in MAE, with solid green representing the lowest p-values.

Appendix A.5. Comments on Hyperparameter Optimization
In this section, we summarize key empirical findings from the extensive hyperparameter optimization over the space defined by Table 2, carried out for the four models composing each dataset's ensemble. The regularities observed in the optimally selected hyperparameters can help define a more efficient and informed hyperparameter space and guide future experiments with the NBEATSx architecture.
Interpretable configuration observations:
1. Among quadratic, cubic and fourth-degree polynomials, Npol ∈ {2, 3, 4}, the most commonly selected basis for the day-ahead EPF task was quadratic, Npol = 2. As shown in Figure 3, the combination of a quadratic trend and harmonics already describes the average daily electricity price profiles successfully. Linear trends were omitted from the exploration as they proved to be fairly restrictive. In experiments on longer forecast horizons (H > 24), beyond the scope of this paper, we observed that more trend flexibility tended to be beneficial.
2. We did not observe a preference within the harmonic basis spectrum controlled by Nhr ∈ {1, 2}, the hyperparameter that controls the number of oscillations of the basis over the forecast horizon. We believe this is because the harmonic basis S ∈ R^{H×(H−1)} is flexible enough to already cover a broad spectrum of frequencies. Our intuition is that Nhr = 1 is a good setting unless there is an apparent mismatch between the time-series frequency and the sampling rate of the recorded observations, as in a Nyquist-frequency under-sampling or over-sampling phenomenon (Koopmans, 1995). This, however, is beyond the scope of this paper.

Hyperparameter optimization regularities:


1. Regarding the optimal activation functions, we found that the most frequently selected ones were SELU, PReLU, and sigmoid, while activations like ReLU, TanH, and LReLU were consistently outperformed. Sigmoid activations tend to make the optimization of the network difficult as the networks grow in depth.
2. Surprisingly, the hyperparameter optimization consistently preferred stochastic gradient batch sizes of 256 and 512 windows over 128. Our selection of the ADAM optimizer over classic SGD could explain this observation, as the machine learning community believes that longer SGD optimization with small mini-batches tends to have better generalization properties (Keskar et al., 2017). Additional research in this area would be interesting.
3. The batch normalization technique was often detrimental in combination with the
doubly-residual stack strategy of the NBEATSx method. The residual signals tend to
be close to zero, making the normalization numerically unstable.
4. The robust median normalization of the exogenous variables was consistently preferred over alternatives such as standard deviation normalization (a sketch of both scalings is given after this list).
5. Regarding the hidden units of the FCNN layers, the optimal configurations did not favor an information bottleneck behavior (Tishby et al., 1999). Almost half of the optimal models had a small number of hidden units in the first layers followed by a larger number of hidden units.
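The following sketch shows one plausible form of the median-based scaling referred to in item 4, together with the standard-deviation alternative it was compared against; the exact constants and the treatment of degenerate series in our implementation may differ.

```python
import numpy as np

def median_normalize(x):
    """Robust scaling: center by the median, scale by the median absolute deviation (MAD)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (mad + 1e-8)   # small epsilon guards against a zero MAD

def std_normalize(x):
    """Standard-deviation scaling, the alternative considered during the search."""
    return (x - np.mean(x)) / (np.std(x) + 1e-8)
```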
