
Electrical Power and Energy Systems 155 (2024) 109579


International Journal of Electrical Power and Energy Systems


journal homepage: www.elsevier.com/locate/ijepes

Optimized hybrid ensemble learning approaches applied to very short-term load forecasting
Marcos Yamasaki Junior a , Roberto Zanetti Freire b , Laio Oriel Seman c ,
Stefano Frizzo Stefenon d,e , Viviana Cocco Mariani f,g ,∗, Leandro dos Santos Coelho a,g
a Industrial and Systems Engineering Graduate Program, Pontifical Catholic University of Parana (PUCPR), Curitiba 80215-901, Brazil
b Universidade Tecnológica Federal do Paraná (UTFPR), Curitiba 80230-901, Brazil
c Department of Automation and Systems Engineering, Federal University of Santa Catarina (UFSC), Florianopolis 88040-900, Brazil
d Digital Industry Center, Fondazione Bruno Kessler, Trento 38123, Italy
e Department of Mathematics, Computer Science and Physics, University of Udine, Udine 33100, Italy
f Mechanical Engineering Graduate Program, Pontifical Catholic University of Parana (PUCPR), Curitiba 80215-901, Brazil
g Department of Electrical Engineering, Federal University of Parana (UFPR), Curitiba 81530-000, Brazil

ARTICLE INFO

Keywords: Electrical power systems; Ensemble learning; Machine Learning; Short-term load forecasting; Signals decomposition methods; Time series forecasting

ABSTRACT

The significance of accurate short-term load forecasting (STLF) for modern power systems' efficient and secure operation is paramount. This task is intricate due to cyclicity, non-stationarity, seasonality, and nonlinear power consumption time series data characteristics. The rise of data accessibility in the power industry has paved the way for machine learning (ML) models, which show the potential to enhance STLF accuracy. This paper presents a novel hybrid ML model combining Gradient Boosting Regressor (GBR), Extreme Gradient Boosting (XGBoost), k-Nearest Neighbors (kNN), and Support Vector Regression (SVR), examining both standalone and integrated models, coupled with signal decomposition techniques like STL, EMD, EEMD, CEEMDAN, and EWT. Through Automated Machine Learning (AutoML), these models are integrated and their hyperparameters optimized, predicting each load signal component using data from two sources: the National Operator of Electric System (ONS) and the Independent System Operators New England (ISO-NE), boosting prediction capacity. For the 2019 ONS dataset, combining EWT and XGBoost yielded the best results for very short-term load forecasting (VSTLF) with an RMSE of 1,931.8 MW, MAE of 1,564.9 MW, and MAPE of 2.54%. These findings highlight the necessity for diverse approaches to each VSTLF problem, emphasizing the adaptability and strength of ML models combined with signal decomposition techniques.

1. Introduction

Electricity is essential for the development of humanity, and its use is growing in different areas such as residential and commercial buildings, industry, medicine, transportation, public lighting, robotics and machinery, electro-valves, refrigeration, and air conditioning equipment, among others [1]. The energy resulting from electricity is present in different sectors, and often machines can work 24 h a day, non-stop, due to electricity. Electricity is non-storable and requires a stable electrical system with a constant balance between production, transmission, and demand [2]. There are many types of electricity generation units, such as thermal energy, which is converted into electricity; hydroelectric energy, which is converted from gravitational potential or kinetic energy of a hydro-power source; photovoltaic energy, which is converted from solar energy using solar cells; and wind energy, which is converted from mechanical energy using wind turbines [3].

Energy use increases with per capita Gross Domestic Product (GDP); richer countries usually consume more energy per person than developing countries. Therefore, this economic relationship is also based on how much energy the country can produce and how much energy it will need to produce in the coming years, so countries must plan and invest in the infrastructure needed for energy generation. The power utilities or power plant companies depend on the estimated electricity demand to meet the load connected to the grid [4].

Regarding the forecast horizon, Short Term Load Forecasting (STLF) can be useful for real-time applications such as controlling electric power generation units [5]. The medium-term forecast can be used for resource budgeting, and the long-term forecast can be used to plan an

∗ Corresponding author.
E-mail addresses: [email protected] (M. Yamasaki Junior), [email protected] (R.Z. Freire), [email protected] (L.O. Seman),
[email protected] (S.F. Stefenon), [email protected] (V.C. Mariani), [email protected] (L. dos Santos Coelho).

https://doi.org/10.1016/j.ijepes.2023.109579
Received 31 March 2023; Received in revised form 20 July 2023; Accepted 13 October 2023
Available online 26 October 2023
0142-0615/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).

expansion of the power system network. However, the electricity demand depends on weather conditions such as temperature, wind speed, precipitation, and other factors, and the demand changes depending on the day, season, trade, and industry activities such as peak hours, normal hours, weekdays, weekend days, holidays, and others. Thus, seasonality, trend, noise, outliers, and other aspects are present when power demand is studied, demanding from power supply companies accurate load forecasting techniques to decrease electricity production losses and costs [6].

Electricity time series data generally have nonlinear behavior, meaning random and periodic components are embedded in the series, such as time, seasonal events, economic activity behavior, measurement errors (outliers), missing values, and noise [7]. These components are intrinsic to the time series data, and each dataset will contain different features. Pre-processing and data transformation steps are then required to provide appropriate inputs for the forecasting models. With a proper methodology, Machine Learning (ML) models can handle such adversities, accurately predicting the next steps in forecasting tasks [8].

ML is a sub-area of Artificial Intelligence (AI) that provides models for classification, regression, and prediction that learn from experience. Experience comes from experimentally measured historical data; such data can be text, audio, images, videos, medical examinations, sensor measurements, and electricity consumption, among other available data sources [9]. ML models can learn from experience and use this knowledge to reproduce tasks such as predicting energy load consumption a week in advance, predicting weather conditions, or preventing failures in machines that are in operation. Regarding the load forecasting horizon, the number of steps ahead to be predicted can be grouped into four categories: Very Short Term Load Forecasting (VSTLF), STLF, Medium Term Load Forecasting (MTLF), and Long Term Load Forecasting (LTLF) [10]. The cut-off horizons separating these four categories are one day, two weeks, and three years, respectively.

Besides the advances in AI and ML applications, many challenges still exist to overcome. The training of some ML models is sometimes ineffective and superficial, requiring more computational effort and generalization capacity [11]. ML models do not have their parameters, called hyperparameters, automatically adjusted, and these directly affect their performance. Some recent studies, such as the one presented in [12], indicated that various individual ML models, when combined properly, can produce a better result than the individual ones. Additional works addressing the previously mentioned aspects are presented in the sequence.

Ensemble models have proved to be better than their worst individual predictor and, in some cases, better than their best individual predictor. Wang et al. [13] applied an adaptive decomposition method based on Variational Mode Decomposition (VMD) and SampEn to decompose the data and the eXtreme Gradient Boosting (XGBoost) for short-term load forecasting, which performed better than any individual learning algorithm. This work uses two different datasets to investigate ensemble models and decomposition techniques for VSTLF one day ahead. The main contributions of this research are summarized as follows:

(i) ML models, including Gradient Boosting Regressor (GBR), XGBoost, k-Nearest Neighbors (kNN), and Support Vector Regression (SVR), are designed and evaluated both as individual and as combined models for VSTLF.
(ii) A new hybrid framework based on GBR, XGBoost, kNN, and SVR integrated with signal decomposition methods, namely Seasonal-Trend decomposition using locally estimated scatterplot smoothing, LOESS (STL), Empirical Mode Decomposition (EMD), Ensemble Empirical Mode Decomposition (EEMD), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), and the Empirical Wavelet Transform (EWT), is developed to enhance the accuracy of classical ML structures in VSTLF.
(iii) The hyperparameter tuning of the ML models integrated with decomposition methods was based on an automated machine learning method.
(iv) An extensive evaluation based on two datasets, one from the ONS and the other from the Independent System Operators New England (ISO-NE), is performed to demonstrate the potential of the proposed ML integrated with decomposition methods in VSTLF. Pre-processing steps of data cleaning, feature engineering, and outlier processing were considered.
(v) Well-established metrics for VSTLF, such as RMSE, Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), with 10-fold cross-validation were adopted to evaluate the models. The statistical tests demonstrated that the proposed models provided superior performance using signal decomposition compared to ML models without the time series signal decomposition process in VSTLF.

The experiments of VSTLF were split into two datasets, ONS and ISO-NE. The objective was to evaluate how well the regression models can predict the electricity demand across different ML algorithms, with and without decomposition techniques. To achieve the lowest values of the RMSE and MAPE performance metrics for all evaluated ML models, an AutoML toolkit has been used to select the best hyperparameters.

The remainder of this study is organized as follows. Related works focusing on forecasting models using ML are described in Section 2. In Section 3, the forecasting models are presented. The framework adopted in this study is shown in Section 4. The two datasets and the main results obtained, together with the choice of hyperparameters and their justification, are presented in Section 5. Finally, the last section presents the conclusion and indicates steps for future work.

2. Related works

This section briefly presents short-term power system load forecasting models from the specific literature. Both EWT and improved density-based spatial clustering (IDBSCAN) models were used to forecast applications with noise in Zhang and Zhang [14]. The EWT decomposes the load into intrinsic mode functions (IMFs), which are predicted using rational models and long short-term memory (LSTM). The IDBSCAN and meteorological factors group the high-frequency components, which are overlaid to obtain the complete time series forecasting results.

Sankalpa et al. [15] considered voting regression (VR) for short-term load forecasting assuming five distinct models, i.e., three parametric multiple linear regressors and two non-parametric machine-learning models. The cross-validation (CV) procedure selects models, and the Blocked-CV technique yields the closest validation error to the test error. Assuming the VR method, the strategy outperformed the individual predictions of the models. The LSTM model and penalized quantile regression were used to forecast daily and weekly indoor loads in Duan [16]. The suggested technique outperforms some classical models in coverage probability by 6.4% to 20.9%.

Self-attention-based short-term load forecasting (STLF) employs non-parametric kernel density estimation to create customer electricity consumption feature curves, variational modal decomposition, and a maximum information coefficient for feature selection in Yu et al. [17]. The AdaBelief optimizer was applied to obtain the model parameters, and an Informer based on increased self-attention predicts intrinsic mode function components. The proposed STLF outperforms other models. Baliyan et al. [18] examined short-term load forecasting using hybrid neural networks, which combine neural networks and stochastic learning methods, including genetic algorithms and particle swarm optimization. Short-term load forecasting is crucial to power system efficiency and reliability.


In Groß et al. [19], it is noticed that decentralized flexibilities must be managed more efficiently as renewable production becomes increasingly volatile. The research investigates eight-day-ahead electricity load forecasting approaches for supermarkets, schools, and houses. Machine learning, statistics, and a median ensemble of forecasts were investigated. Compared to a naïve seasonal benchmark approach, nearly all strategies improved predicting accuracy by up to 35%. It concludes that finding the appropriate load forecasting strategy is a task-specific challenge.

The research from Yu et al. [20] presented a graph representation learning-based short-term load forecasting approach. The load graph encodes electricity use and the effects of the coronavirus disease of 2019 to anticipate future loads. A residual graph convolutional network models the non-linear correlations between the graph and future loads, and a graph concatenation model improves learning efficiency. The epidemic strongly correlates with regional electricity use, and the strategy outperforms other representative methods.

Sun et al. [21] proposed a study of hourly historical loads from Tai'an, Shandong Province, China, which show significant daily and weekly fluctuations. According to the authors, the Chinese Lunar Spring Festival and other factors affect load. A six-periodic, three-nonperiodic artificial neural network model is created to address this. The model was trained on data from January 2016 to August 2018 and showed that the daily prediction model with specified parameters can improve predicting accuracy.

Genov et al. [22] proposed one and three-day-ahead load forecasting in smart grids using feed-forward artificial neural networks, recurrent neural networks, and cross-learning algorithms. To test those strategies, they used high-resolution multi-seasonal electricity demand data from buildings in Belgium, Canada, and the UK. The optimal model depends on the accuracy metric, but both feedforward and recurrent neural network models perform well.

A clustering-based filter feature selection strategy to improve short-term load forecasting models is proposed in Subbiah and Chinnappan [23]. A recurrent neural network-based LSTM model for short-term load forecasting was compared to Multilayer Perceptron, Radial Basis Function, Support Vector Regression, and Random Forest. Fast Correlation Based Filters (FCBF), Mutual Information, and RReliefF (Relief for Regression) were evaluated to decrease the curse of dimensionality and increase performance — clustering groups load patterns and eliminates outliers. LSTM with RReliefF outperforms other models on two European datasets.

Candela Esclapez et al. [24] suggested automating variable processing and selection to increase forecasting accuracy and interpretability. The dataset for peninsular demand from the Spanish energy company (Red Eléctrica de España) was tested, finding that it reduces forecasting error by 0.16% in MAPE and 59.71 MWh in RMSE. The authors observed that heat affects consumption more than cold, and on hot days the temperature of the second prior day affects consumption more than that of the preceding one. The LSTM is currently widely used for time series forecasting due to its ability to deal with nonlinearities [25], which can be tail oscillations caused by abrupt variations in time series, resulting in higher frequencies in the signal and requiring more specialized models.

The reviewed works show that many models have been employed to address the power system load forecasting problem; a summary of the applied methods covered in this section, the considered evaluation metrics, the case of study, and the reference are presented in Table 1.

3. Employed methods

This study uses the AutoML framework to analyze various decomposition and regressor models. The best models are chosen based on their performance on the validation dataset. This choice is made even if the best model is a single regressor or a combination.

This section outlines the main decomposition techniques and regression models used in this research. The final model selection was performed using AutoML, which evaluated the performance of various combinations of models and picked the best-performing one. This approach reduces the need for manual intervention and enhances performance by combining the strengths of different models.

3.1. Decomposition techniques

This subsection delves into the analysis of decomposition techniques applicable to time series analysis evaluated in this paper. These techniques include Decomposition using Locally Estimated Scatterplot Smoothing, Empirical Mode Decomposition, Ensemble Empirical Mode Decomposition, Complete Ensemble Empirical Mode Decomposition with Adaptive Noise, and the Empirical Wavelet Transform.

3.1.1. Decomposition using Locally Estimated Scatterplot Smoothing (STL)
STL is a nonparametric and flexible method of decomposing a time series [26]. The decomposition model can be written as:

Y(t) = T(t) + S(t) + R(t)    (1)

where Y(t) is the observed series, T(t) is the trend component, S(t) is the seasonal component, and R(t) is the remainder.

3.1.2. Empirical Mode Decomposition (EMD)
EMD is a method that identifies and separates the IMFs in a time series [27]. The decomposed time series can be represented as:

Y(t) = Σ_{i=1}^{n} IMF_i(t) + R_n(t)    (2)

where Y(t) is the observed series, IMF_i(t) is the i-th intrinsic mode function, and R_n(t) is the residual.

3.1.3. Ensemble EMD (EEMD)
EEMD is an enhancement of the EMD approach, which includes a noise-assisted data analysis method [28]. This allows extracting more accurate and true intrinsic mode functions (IMFs). The EEMD decomposition model is similar to the EMD model and can be written as:

Y(t) = Σ_{i=1}^{N} (1/M) Σ_{j=1}^{M} IMF_ij(t) + R_N(t)    (3)

in which Y(t) is the original series, N is the total number of IMFs, M is the total number of trials with added white noise, IMF_ij(t) represents the j-th trial IMF for the i-th mode, and R_N(t) is the final residual.

3.2. Complete ensemble EMD with Adaptive Noise (CEEMDAN)

CEEMDAN further improves upon EEMD by adapting the noise level to the signal characteristics [29]. It reduces mode mixing and enhances the extraction of true and more physically meaningful IMFs. The decomposition model remains similar to the EMD and EEMD models.

3.3. Empirical Wavelet Transform (EWT)

EWT is an alternative to the standard wavelet transform, which better adapts to the signal characteristics [30]. The decomposition model can be represented as:

Y(t) = Σ_{i=1}^{n} W_i(t) + R_n(t)    (4)

where Y(t) is the observed series, W_i(t) is the i-th wavelet function, and R_n(t) is the residual.

These techniques are instrumental in performing comprehensive time series analysis, extracting underlying patterns, and improving predictive modeling. Each technique has its own set of advantages and


Table 1
Literature models to deal with the short-term load forecasting problem.
Technique Error metric Case/Dataset Reference
EWT and IDBSCAN MAPE, RMSE Chinese city [14]
VR-based Ensemble Learning MAPE Electricity Generating Authority of Thailand [15]
Long Short-Term Memory and Penalized Quantile Regression RMSE, MAE, CVRMSEa Office building Shanghai, China [16]
STLF with feature selection based considering VMD MAPE, RMSE Spanish regional level [17]
Hybrid Neural Networks with stochastic learning techniques MAPE Review based on distinct datasets [18]
Ensemble of Machine Learning Models and statistics NRMSEb Residential buildings, schools, and supermarkets in Germany [19]
Residual Graph Convolutional Network MAPE, RMSE Historical load information from major USA power markets [20]
Artificial Neural Networks and features selection MAE, MPE,c MAPE Tai’an, Shandong Province, China [21]
Recurrent Neural Networks and Cross-learning algorithms RMSE, MAPE Buildings in England, Canada, Belgium, and Green Energy Park [22]
LSTM and feature selection FCBF, Mutual Information, and RReliefF MSEd, RMSE Historical load and weather data of Switzerland and France [23]
Exogenous AutoRegressive (EAR) model and group of EAR networks MAPE, RMSE Spanish electricity operator [24]
Wavelet transform and LSTM with attention mechanism MSE, MAE, MAPE Hydroelectric power plant in Brazil [25]
a CVRMSE — Coefficient of Variance of the Root Mean Squared Error.
b NRMSE — Normalized Root Mean Squared Error.
c MPE — Mean Percentage Error.
d MSE — Mean Squared Error.

Fig. 1. Trend signal decomposition example.
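To make the decomposition step illustrated in Fig. 1 concrete, the following is a minimal sketch of splitting a load-like series into IMFs with EMD, assuming the PyEMD package (installed as EMD-signal); the synthetic series and its parameters are illustrative assumptions, not the ONS or ISO-NE data.

```python
# Minimal sketch: decomposing a load-like series into IMFs plus a trend-like
# residue with EMD (assumes the PyEMD package, `pip install EMD-signal`).
import numpy as np
from PyEMD import EMD

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                        # two weeks of hourly steps
load = (50_000
        + 8_000 * np.sin(2 * np.pi * hours / 24)   # daily cycle
        + 2_000 * np.sin(2 * np.pi * hours / 168)  # weekly cycle
        + rng.normal(0, 500, hours.size))          # measurement noise

imfs = EMD()(load)        # rows: IMF_1 ... IMF_n; the last row acts as a residue/trend
print(imfs.shape)         # (number of components, len(load))

# Each component would then be forecast separately and the predictions summed,
# following Eq. (2): Y(t) = sum_i IMF_i(t) + R_n(t).
```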

constraints, and their application depends on the nature of the time series. Fig. 1, using the EMD method, presents an example of how the decomposition is performed. Since all decompositions used here are focused on the trend, similar results would be found using different techniques. In this paper, the mean envelope is considered.

3.4. Regression models

This subsection presents the regression models applied to perform time series forecasting.

3.4.1. Autoregressive Moving Average (ARMA)
Autoregressive Moving Average (ARMA) models describe a weakly stationary stochastic time series in terms of two polynomials: the first is the AutoRegressive (AR) model, and the second is the Moving Average (MA) model [31]. AR(p) is written in Eq. (5) and MA(q) is written in Eq. (6):

X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t,    (5)

where p is the order of the AR model, φ_1, …, φ_p are parameters, c is a constant, and the random variable ε_t is white noise. The MA(q) is given by:

X_t = μ + ε_t + Σ_{i=1}^{q} θ_i ε_{t−i},    (6)

where q is the order of the MA model, θ_1, …, θ_q are the parameters of the model, μ is the expectation of X_t (usually considered equal to zero), and ε_t, ε_{t−i} are white noise terms.

Thus, the ARMA(p, q) model is written with AR(p) and MA(q) as follows:

X_t = c + ε_t + Σ_{i=1}^{p} φ_i X_{t−i} + Σ_{i=1}^{q} θ_i ε_{t−i}.    (7)

The AR, MA, ARMA, and similar models are used to forecast the observation one step ahead (t + 1), based on historical data of the same time series. Some requirements must be fulfilled, such as identifying stationarity, seasonality, and outliers. The autoregressive integrated moving average model is the variation of ARMA that supports differencing the time series to make it stationary, where a combination of several differences is applied to the model [32].

3.4.2. Random forests
Random Forests (RF), or random decision forests, are ensemble learning models for classification, regression, and other tasks. This method operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) [33], or the prediction (regression) [34], of the individual trees. The RF builds multiple decision trees and merges them for a more accurate and stable prediction. The model typically uses a subset of the input features to split each node in the decision tree, which leads to better performance and helps to avoid overfitting [35].

3.4.3. Gradient Boosting regressor
Gradient Boosting is an ML model for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model stage-wise like other boosting models do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function [36].

The idea of the GBR approach is that boosting can be interpreted as an optimization algorithm on a suitable cost function. The algorithm optimizes a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in machine learning and statistics fields beyond regression and classification [37].
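As a concrete illustration of the stage-wise fitting described above, the sketch below trains a scikit-learn GradientBoostingRegressor on simple calendar features; the feature construction and hyperparameter values are assumptions for demonstration, not the configuration tuned in this paper.

```python
# Minimal sketch of a Gradient Boosting Regressor for one-step-ahead load
# forecasting from calendar features (illustrative only; the "hour" and
# "weekday" features are assumptions, not the paper's exact inputs).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def calendar_features(n_hours: int) -> np.ndarray:
    hours = np.arange(n_hours)
    return np.column_stack([hours % 24, (hours // 24) % 7])   # hour, weekday

X = calendar_features(24 * 130)                     # 130 days of hourly steps
y = 50_000 + 8_000 * np.sin(2 * np.pi * np.arange(X.shape[0]) / 24)

gbr = GradientBoostingRegressor(
    n_estimators=300,        # number of boosting stages (weak learners)
    learning_rate=0.05,      # shrinkage applied to each stage
    max_depth=3,             # depth of each individual regression tree
    loss="squared_error",    # differentiable loss optimized along the negative gradient
)
gbr.fit(X[:-24], y[:-24])                           # train on all but the last day
y_hat = gbr.predict(X[-24:])                        # forecast the next 24 steps
```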


3.4.4. Extreme Gradient Boosting (XGBoost)
XGBoost is an open-source framework, available in several programming languages, often used in ML global competitions. It is one of the best GBoost tree implementations currently available. Its parallel tree-boosting capabilities make it considerably faster than other tree-based ensemble methods [38]. It was designed using the principles of GBoost, joining weak learners into strong learners. However, compared with gradient boosting, which is built sequentially, slowly learning from data to improve its prediction in the next iteration, XGBoost builds trees in parallel [39].

3.4.5. k-Nearest Neighbor regression (kNN)
The kNN predicts the target by local interpolation of the targets associated with the nearest neighbors in the training set, which means the nearest values of the training set may contribute more or less depending on the Euclidean distance to the sample associated with the target, which is commonly called a query. The kNN is used for recommendation systems, financial market prediction, and text categorization, among other fields [40].

3.4.6. Support Vector Regression (SVR) and Support Vector Machine (SVM)
The SVR is a supervised learning model that can be used to solve classification and regression problems [41]. A Support Vector Machine (SVM) training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximize the gap width between the two categories. New examples are then mapped into that space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVM can efficiently perform a non-linear classification using the kernel trick, implicitly mapping its inputs into high-dimensional feature spaces [42].

3.5. Stacking

The Automatic Relevance Determination (ARD) model [43], also known as Sparse Bayesian Learning or Relevance Vector Machine, is a type of Bayesian linear regression. It imposes a prior distribution on the weights, leading to automatic sparsity and relevance determination of features.

Let us denote the design matrix (with n samples and m features) as X ∈ R^{n×m} and the corresponding target values as y ∈ R^{n}. We also denote the weight vector as w ∈ R^{m}. The linear model can be represented as:

y = Xw + ε    (8)

where ε is the noise term, assumed to follow a Gaussian distribution with zero mean and variance σ², i.e., ε ∼ N(0, σ²). In ARD, each weight w_i in w is assumed to follow a Gaussian distribution with zero mean and its own variance α_i^{−1}:

w_i ∼ N(0, α_i^{−1})    (9)

The ARD model aims to find the Maximum a Posteriori (MAP) estimate of the weights, w*, and the hyperparameters α* and σ*²:

w*, α*, σ*² = arg max_{w, α, σ²} p(w, α, σ² | y)    (10)

This maximization problem can be solved iteratively using Expectation–Maximization or similar optimization algorithms. When α_i becomes very large, the corresponding weight w_i is pushed towards zero, leading to automatic sparsity. Hence, the ARD model can automatically determine the relevance of each feature and achieve sparsity in the weight vector, making it a useful meta-learner in stacking regression.

In stacking, multiple base regressor models are trained to predict the same output, and then a meta-learner model, such as the ARD model, is used to find the optimal way to combine these predictions into a final predicted output. Each base regressor, such as GBR, XGB, kNN, or SVR in the discussed framework, produces its prediction for the output. These individual predictions are then used as input features for the ARD model. In turn, the ARD model learns how best to combine these individual predictions to produce a final output that minimally deviates from the target.

One key advantage of the ARD model is its ability to perform automatic relevance determination, meaning it can evaluate the importance of each base regressor's prediction. The ARD model does this by assigning a weight to each base regressor's prediction in the final combined output, determining how much each base regressor's prediction should contribute to the final output. If a particular regressor's prediction is not contributing significantly to the accuracy of the final output, the ARD model assigns it a weight close to zero, effectively ignoring it.

This feature of the ARD model can be handy when one has many base regressors and wants to avoid overfitting by not relying too heavily on any one regressor. It also allows for more interpretable results, as it is possible to observe which base models the ARD model deemed most useful in making its predictions.

4. Model framework

The proposed framework presented in this study, as depicted in Fig. 2, is a six-stage process, wherein the stages are shown in different colors. These stages are:

• The first stage concerns data selection and preprocessing. This paper considers two datasets, ISO-NE and ONS, from 2015 to 2019. For the preprocessing, the data is cleaned (removing unused and unnecessary values from the dataset), the date and time are decoupled into more features or entries for the model, and outliers are evaluated. The transformations (Box–Cox and StandardScaler) are applied to avoid non-linear terms and bias in the forecasting results.
• The second stage involves selecting a decomposition method from a pool of methods, including STL, EMD, EEMD, CEEMDAN, EWT, or none. The decomposition is applied to denoise the signal. This stage employs a hybrid approach, combining trend decomposition with the forecasting model.
• The third stage involves splitting the decomposed data into IMFs and residues, organizing the data considering k-fold cross-validation, and setting up the hyperparameters for tuning.
• In the fourth stage, one or more regressor algorithms are selected for prediction for each IMF and residue. The options include GBR, XGB, kNN, and SVR. If more than one regressor is selected, the Stacking ensemble regressor combines the different models, with the ARD model serving as the meta-learner.
• In the fifth stage, the predictions from the various intrinsic mode functions are summed, resulting in decomposed data for the predicted data only. The transformations that were applied in the first stage are reverted.
• In the sixth stage, the models are tested on a separate test set to evaluate their performance. Each IMF has one model predictor. Finally, the performance of the models is assessed using error metrics, including RMSE, MAE, and MAPE.

During the preprocessing, the removed values include price, the energy component of the real-time locational marginal price (LMP), and the day-ahead LMP, among others. Null or invalid values are replaced by linearly interpolating the nearest values. Additionally, the year, month, day, weekday, holiday, and hour are separated into different features, and outliers are identified by the LOF algorithm and replaced by linear interpolation, similar to the method used in the first sub-stage.
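The stacking arrangement of Section 3.5, used in the fourth stage above, can be sketched with scikit-learn's StackingRegressor and ARDRegression as the meta-learner; the hyperparameter values below are placeholders rather than the values selected by AutoML in this work.

```python
# Minimal sketch of stacking GBR, XGBoost, kNN, and SVR under an ARD meta-learner.
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ARDRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

base_learners = [
    ("gbr", GradientBoostingRegressor(n_estimators=200)),
    ("xgb", XGBRegressor(n_estimators=200, learning_rate=0.05)),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
    ("svr", SVR(kernel="rbf", C=10.0)),
]

# The ARD meta-learner assigns near-zero weights to base predictions that do
# not reduce the error, as described in Section 3.5.
stack = StackingRegressor(estimators=base_learners,
                          final_estimator=ARDRegression())

# stack.fit(X_train, y_train); y_hat = stack.predict(X_valid)
```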


Fig. 2. Model framework overview.

The proposed framework provides a comprehensive, systematic approach to processing and evaluating time series data, ensuring optimal feature engineering, outlier treatment, and model selection for accurate forecasting.

4.1. Generalization and algorithm evaluation

Intending to provide a robust forecasting model in machine learning, it is necessary to train the model and evaluate its performance using different data and models to obtain a better response for the various inputs that the model may receive to predict the output data. With a wide range of possible data, splitting training and test data and cross-validating over different data portions results in a robust forecasting model that may respond reliably for most input data, which generalizes the regression model.

4.1.1. Error metrics
As can be verified in Table 1, some of the most relevant error metrics are RMSE, MAE, and MAPE; these are considered in this study and presented in Eqs. (11) to (13). In these equations, x_i is the i-th sample, x̂_i is the estimated sample, and N is the total number of samples for training/test.

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (x_i − x̂_i)² ),    (11)

MAE = (1/N) Σ_{i=1}^{N} |x_i − x̂_i|,    (12)

MAPE = (1/N) Σ_{i=1}^{N} |(x_i − x̂_i) / x_i|.    (13)

4.1.2. Cross-validation
Cross-validation is one of the model validation techniques for evaluating how the results of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. It is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model, and the other used to validate it. In typical cross-validation, the training and validation sets must cross over in successive rounds such that each data point has a chance of being validated. The basic form of cross-validation is k-fold cross-validation. There are other special cases of k-fold cross-validation, which may involve repeated rounds of k-fold cross-validation [44].

Cross-validation is often used for the evaluation or comparison of learning algorithms. In each iteration, one or more learning algorithms use k folds of data to learn and adjust the weights of one or more models, and subsequently, the trained models make predictions on the input data of the validation fold. The performance of each learning algorithm on each fold can be tracked using a set of predetermined performance metrics like accuracy or MSE. Upon completion, k values of the performance metric will be available for each algorithm. Different methodologies, such as averaging, are usually used to obtain an aggregated result from these values. These values can be used in a statistical hypothesis test to show that one algorithm is superior to another [45].

Since the order of the data is essential, cross-validation might be problematic for time series models. A more suitable approach might be to use rolling cross-validation. However, if a single summary statistic describes the performance, a stationary bootstrap may work. The bootstrap statistic needs to accept an interval of the time series and return the summary statistic [46]. The call to the stationary bootstrap must specify an appropriate mean interval length. The expanding window and sliding window methods are primarily used in this case. The training set and validation set respect the data order, expanding or sliding the training set over the dataset as shown in Figs. 3 and 4. For this research, the sliding method has been used since the results responded better when compared with other methods.

Fig. 5 corresponds to the data split of the ONS dataset for model training and validation, using a 10-fold cross-validation split. The gray curve is the original time series, and each color represents one fold of the cross-validation for the validation set. The training set comprises 130 days (3120 steps ahead), and the validation set contains 15 days (360 steps ahead), using the sliding window method.

4.2. Pre-processing

Before starting any regression task, one of the most important procedures is pre-processing the dataset to verify possible outliers and inconsistent data [48]. Pre-processing data might be a time-consuming task since the researcher must know every detail of the analyzed dataset and then must study how to identify outliers, what the policy will be after identifying an outlier (whether it should be removed or replaced), and, if it is decided to replace it, what the replacement value will be, and so on.

4.2.1. Outliers processing
Among outliers, there are many kinds. The usual types are point outliers, contextual outliers, or collective outliers. A point outlier is the simplest type of outlier; the single data point is far different from the rest of the distribution when compared with the other points. Contextual or conditional outliers can be noise in data, such as background noise when doing image recognition or RF noise when receiving a radio signal. It depends on the context in which the dataset is produced and how it is specified as a part of the problem formulation. For instance, a power utility company usually provides between 5 GW and 10 GW during the year. For a couple of hours, it is possible to see in the dataset power measurements below 1 GW or above 50 GW, due to some sensor error or failures in the power station, which are typically not seen as possible valleys or peaks in the measurement graphics [49].

Fig. 3. Expanding window example for cross-validation [47].

Fig. 4. Sliding window example for cross-validation [47].
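A minimal sketch of the sliding-window splitting illustrated in Fig. 4 is given below; the window lengths mirror the 130-day training and 15-day validation setup described in Section 4.1.2, but the helper function itself is an assumption for illustration, not the authors' implementation.

```python
# Sliding-window split (Fig. 4): a fixed-length training window followed by a
# fixed-length validation window, both shifted forward by one block per fold.
def sliding_window_splits(n_samples: int, train_len: int = 3120,
                          val_len: int = 360, n_folds: int = 10):
    step = val_len                          # shift by one validation block per fold
    for k in range(n_folds):
        start = k * step
        train_idx = range(start, start + train_len)
        val_idx = range(start + train_len, start + train_len + val_len)
        if val_idx[-1] >= n_samples:        # stop when the window runs past the data
            break
        yield list(train_idx), list(val_idx)

for train_idx, val_idx in sliding_window_splits(24 * 365 * 4):
    pass  # fit on train_idx, evaluate on val_idx
```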

Fig. 5. Forecast using 10-fold cross-validation on ONS dataset.

Even if some event could cause such a peak or valley, due to an intentional shutdown for maintenance or an emergency action, this is not helpful for forecasting models, since these values are very different from the regular measurements and only happen once or twice every three years. Collective outliers can be subsets of novelties in data, such as a signal indicating the discovery of new phenomena. The outliers of the ONS dataset are shown in Figs. 6(a), 6(b), 6(c), and 6(d), one for each region; they have been detected by the Local Outlier Factor (LOF) algorithm [50], where it was required to adjust the negative outlier factor to increase its detection performance for both datasets researched.

4.2.2. Interpolation
After detecting and removing outliers, there is a way to replace them properly, for instance, using the interpolation technique. Usually, a linear interpolation is sufficient to fill this gap in the dataset. Nevertheless, it has a better response for point outliers, where only a single point or a couple of points are outliers in an interval of samples, as shown in Fig. 6(a). Other techniques must be considered and tested to evaluate a better response for the regression model. The outliers, identified as NaN and zero values, were normalized by overwriting them, replacing them with the result of linearly interpolating neighboring values.
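Combining Sections 4.2.1 and 4.2.2, the sketch below flags outliers with the Local Outlier Factor and replaces them by linear interpolation of neighboring values; the contamination setting and the toy series are assumptions, and the paper's adjustment of the negative outlier factor is not reproduced here.

```python
# Sketch of the outlier pipeline: flag points with LOF, overwrite them with
# NaN, and fill the gaps by linear interpolation of neighboring values.
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

load = pd.Series(np.r_[np.full(100, 60_000.0), 500.0, np.full(100, 61_000.0)])

lof = LocalOutlierFactor(n_neighbors=24, contamination=0.01)
labels = lof.fit_predict(load.to_numpy().reshape(-1, 1))    # -1 marks outliers

cleaned = load.copy()
cleaned[labels == -1] = np.nan                # treat flagged points like NaN/zero values
cleaned = cleaned.interpolate(method="linear")  # replace with interpolated neighbors
```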


Fig. 6. ONS dataset analysis.

4.3. Feature engineering

Feature engineering is part of the pre-processing stage; it determines the entries or independent variables of the learning model and directly impacts model performance. This work has considered weather data, lagged demand values, and calendar values. To enhance the model training, new data entries have been included in the dataset to capture more specific events related to dates, such as holidays and weekdays. Day, month, and year have been readjusted to be in single columns so the model can process each as a unique feature. There was an inconsistency regarding the hour column, which has been fixed by shifting one hour earlier, since the day length was between 1 h and 24 h instead of 0 h to 23 h.

The weather data consists of temperature values or indicates the year's season (e.g., summer, fall, winter, spring), classified as an exogenous variable that can add relevant information for model prediction. As observed in [51], the weather and calendar data are 50% or more significant when compared with other exogenous variables such as socio-economic data (economic trends, GDP, number of employments), demographic information (birth rate, dwelling count, population), and others (number of sensors, occupants, devices).

The Partial Autocorrelation Function (PACF) identifies the order of an AR model, which can be used to calculate the most significant lags of a time series sequence, as the authors of [52] have done in their respective work. To select the most significant lags of the series, the maximum number of lags must be defined, which could be the total number of samples available in the series. If there is only one sample (one step ahead), then the model would not have any lagged data to train on, making forecasting impractical. Thus, to avoid this problem, the maximum number of lags is limited to a quarter of the series, which has been used here [53]. To measure the significance of lagged values of a time series, the lag must not be correlated with the actual value within the confidence interval; thereby, its confidence coefficient might be null. A rule often applied in the literature is a 95% confidence interval, considering 1.96/√N, where N is the total number of time series samples. Any lag with a correlation coefficient outside the confidence interval is considered significant.

Figs. 7(a) and 7(b) show the PACF for the ISO-NE dataset with 700 lags (around 30 days) and 200 lags (around eight days), respectively. The lags are in black, and the confidence interval is in red. One can notice the significant lags outside the confidence interval, among which the most significant terms are before 200 lags. That makes sense since the first values above a 0.25 correlation coefficient, as better shown in Fig. 7(b), are the most important ones, followed by some lags in the interval 140–170, which marks one week (168 steps ahead), where some patterns start repeating, like weekdays and working days (e.g., Monday to Monday), as the PACF identifies the people's behavior pattern very well.

4.4. Feature importance

To determine which features are the most important for the model, it is possible to use Random Forest to perform a pre-regression showing which features contribute more to the output, as shown in Fig. 8. The features Hour, followed by DryBulb (the air temperature), are the most important in the rank defined by the Random Forest algorithm, which is sufficient to perform time series forecasts. The hour of the day defines the people's behavior, which will demand more or less power, depending on temperature and the season of the year, where usually low temperatures demand more power consumption.


Fig. 7. ISO-NE dataset — Partial autocorrelation.
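The lag-selection rule discussed in Section 4.3 (keep lags whose PACF coefficient falls outside the ±1.96/√N band, with the maximum lag capped at a quarter of the series) can be sketched as follows, assuming the statsmodels library.

```python
# Sketch of PACF-based lag selection for a load series.
import numpy as np
from statsmodels.tsa.stattools import pacf

def significant_lags(series: np.ndarray, max_lags: int) -> list[int]:
    coeffs = pacf(series, nlags=max_lags)
    bound = 1.96 / np.sqrt(len(series))          # 95% confidence band
    return [lag for lag, c in enumerate(coeffs) if lag > 0 and abs(c) > bound]

# e.g. significant_lags(load, max_lags=len(load) // 4) limits the lags to a
# quarter of the series, as described above.
```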

Fig. 8. Feature importance - Relative importance over each dataset feature presented.
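The pre-regression behind Fig. 8 can be sketched with a Random Forest and its impurity-based importances; the column names are assumptions matching the features discussed above.

```python
# Sketch: rank candidate features by Random Forest importance before forecasting.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_features(df: pd.DataFrame, target: str = "load") -> pd.Series:
    X = df.drop(columns=[target])
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X, df[target])
    return (pd.Series(rf.feature_importances_, index=X.columns)
              .sort_values(ascending=False))

# rank_features(frame[["Hour", "DryBulb", "Weekday", "Month", "load"]])
```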

4.5. Data transformation

Data transformation is part of machine learning pre-processing, where some valuable transformations can be made to extract more meaningful inputs for the machine learning model. Examples include calendar adjustments, converting a simple date 2021-28-03 10:00:00 to different variables such as year, month, day, and hour, and mathematical transformations using logarithms, Fourier Series, power transformations, and more. Predictive models take simple patterns more quickly, so the data transformation purpose is to simplify historical patterns by removing known sources of variation or making the pattern more consistent across the entire dataset.
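A minimal sketch of the calendar decoupling and of the transformations detailed in the next subsections (StandardScaler, Eq. (14), and Box–Cox, Eq. (15)) is given below; the column names and the toy series are assumptions for illustration.

```python
# Sketch: split a timestamp into separate calendar features and apply the
# Box-Cox and standard-scaling transformations to the target.
import pandas as pd
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler

frame = pd.DataFrame({
    "timestamp": pd.date_range("2021-03-28 10:00", periods=48, freq="h"),
    "load": 50_000 + pd.Series(range(48)) * 10.0,
})

frame["year"] = frame["timestamp"].dt.year
frame["month"] = frame["timestamp"].dt.month
frame["day"] = frame["timestamp"].dt.day
frame["hour"] = frame["timestamp"].dt.hour
frame["weekday"] = frame["timestamp"].dt.weekday

frame["load_boxcox"], lam = boxcox(frame["load"])          # Eq. (15), lambda by max. log-likelihood
frame["load_scaled"] = StandardScaler().fit_transform(
    frame[["load_boxcox"]]).ravel()                        # Eq. (14): Z = (x - mu) / sigma
```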


Fig. 9. AutoML Neural Network Intelligence (NNI) overview — trial details.

4.5.1. StandardScaler
Standard scaler, or center-scaling transformation, is a way to standardize the observed data by subtracting the arithmetic mean μ of the training samples and dividing by the standard deviation of the training samples, as shown in Eq. (14). Standardizing datasets is a known requirement for most machine learning estimators, and they might behave unsatisfactorily if this requirement is not fulfilled for the individual features.

As an example, many elements of the objective function of a learning algorithm, such as the Radial Basis Function (RBF) kernel of SVR or the L1 and L2 regularizers of linear models, already assume that all features are centered around zero and have variance of the same order, so if a feature has a variance whose order of magnitude is larger than the others, it might dominate the objective function and may prevent the estimator from learning from the other features correctly as expected. The adopted scaler is given by:

Z = (x − μ) / σ.    (14)

4.5.2. Box–Cox transformation
Electricity demand time series data are usually pre-processed by Box–Cox transformations to remove any embedded trend and approximate a normal or Gaussian distribution:

ω_t = log(y_t), if λ = 0; ω_t = (y_t^λ − 1) / λ, otherwise,    (15)

where λ is the parameter estimated by maximizing the log-likelihood, ω_t represents the time series data after the Box–Cox transformation, and y_t is the original time series data [54]. Power transformations are often defined as ω_t = y_t^p.

4.5.3. Time series dimension transformation
Transforming calendar data from one-dimensional (1D) space to two-dimensional (2D) space may improve the model performance, increasing its accuracy on load forecasting by at least 1.5%; depending on the current model performance circumstances, it may be a wide step in the output results. Moon et al. [55] have shown a comparative table where 2D space results explain their correlation more effectively than 1D space results.

4.6. Automated Machine Learning (AutoML)

To define the model's best hyperparameters, the AutoML tool has been used in this research. The algorithm runs trial jobs by tuning different structures to search for the best hyperparameter setup. A rapid hyperparameter evaluation can be made by filtering out the top 20% of trials, as shown in Fig. 9; the red curves are the best results, followed by yellow and green, which are the worst ones. The best hyperparameters were used to tune the model and to compute the comparisons.

Considering hyperparameter tuning, the Tree-structured Parzen Estimator (TPE) is one of the state-of-the-art algorithms to speed up finding the best hyperparameters for the forecasting models. The TPE is part of the Sequential Model-Based Optimization (SMBO) approach, which sequentially creates models to approximate the performance of hyperparameters based on historical measurements.

The TPE is modeled by p(h|m) and p(m), where h represents hyperparameters and m is related to evaluation metrics. TPE models p(h|m) by transforming the generative process, replacing the distributions of the configuration prior with non-parametric densities, which is defined by:

p(h|m) = ℓ(h), if m < m*; p(h|m) = g(h), if m ≥ m*,    (16)

where ℓ(h) is the density formed by using the observations h^(j) such that the corresponding loss f(h^(j)) was less than m*, and g(h) is the density created from the remaining observations.

5. Results and discussion

The experiments are evaluated considering two datasets, ONS and ISO-NE. The training and validation stage consisted of evaluating the model by the error metrics, such as RMSE, MAE, and MAPE, using the approach with 10-fold cross-validation, with a training stage of 130 days or 3120 steps and a validation stage of 15 days ahead or 360 steps ahead for each fold (3600 steps forecasted in total), along the four years of the dataset, from 2015 to 2018. The test stage consisted of evaluating the model and forecasting the unforeseen data, which had not been used in the training stage.

The test set comprises a training stage of 17 days or 408 steps, and a validation stage of 20 sets of 24 steps ahead, which means the model forecasts one day or 24 steps within 10 different days throughout the year 2019. Then the average and standard deviation of the error metrics RMSE, MAE, and MAPE were calculated for result evaluation. The period of 2019 for the test set was explicitly chosen to be out of sight of the model training, since the data are unforeseen and the tests are consistent with the actual use case.

For the results presented, adding lags as additional entries for the models did not help to increase the accuracy; instead, it decreased the models' performance for load forecasting. Somehow, the lags misled the model predictions instead of guiding them to better results. Thus, they were not included in the model.

The computer configuration used to run the experiments has a CPU AMD Ryzen 3600x 3.8 GHz with 32 MB of cache memory, 16 GB of DDR4 random-access memory (RAM), 500 GB of Solid State Drive (SSD), and a Graphics Processing Unit (GPU) GeForce RTX 2060 with 6 GB DDR6 192 bits of internal memory. Considering the k-fold cross-validation method, for k = 10 the data was split into 90% for training and 10% for validation, and for k = 20 the data was split into 95% for training and 5% for validation.

Table 2 Table 4
Computation time for IMFs considering different decomposition methods. RMSE for different IMFs for each decomposition method using XGBoost algorithm for
IMFs EMD EEMD CEEMDAN forecasting.
Number of IMFs Decomposition method
Time (s)
STL-A EWT EMD EEMD CEEMDAN None
1 2.5 84.0 573.0
2 2.9 150.7 632.1 1 2892.4 2930.2 2968.8 2922.9
3 3.0 202.7 649.7 2 2854.6 2938.9 2967.1 2972.7
4 3.2 221.1 737.8 3 2918.8 3024.0 3298.5 2962.7 3377.4
5 3.2 240.2 767.8 4 2949.9 3691.0 3036.1 3618.0
6 3.2 252.6 751.9 5 2978.7 3999.9 2882.0 3632.3
7 3.1 254.7 749.4 6 2950.9 3920.9 2970.6 3546.6
8 3.1 265.5 737.8 7 3099.7 3666.0 2991.5 3546.0
9 3.5 267.9 734.7 8 3112.5 3679.3 2883.4 3560.7
9 2162.8 3690.8 2903.4 3554.6

Table 3
RMSE for different IMFs for each decomposition method using GBR algorithm for Table 5
forecasting. RMSE for different IMFs for each decomposition method using SVR algorithm for
Number of IMFs Decomposition methods forecasting.
Number of IMFs Decomposition method
STL-A EWT EMD EEMD CEEMDAN None
STL-A EWT EMD EEMD CEEMDAN None
1 3029.1 3008.0 2961.8 3030.0
2 2916.1 3233.2 2975.0 3156.7 1 4368.9 4469.3 4374.6 4305.8
3 2998.9 3022.0 3252.3 3139.2 3236.2 2 5123.7 4777.6 4476.2 4751.3
4 3021.8 3328.1 3117.6 3319.3 3 5280.4 4801.7 5113.7 5217.2 5154.7
5 3005.2 3538.9 3106.4 3216.8 4 4821.5 5612.5 5138.8 5654.3
6 3006.4 3593.3 2933.7 3413.5 5 4684.2 5591.9 5704.7 6134.8
7 3037.6 3455.1 3002.7 3378.8 6 4798.7 5828.8 5881.3 6540.6
8 3031.6 3486.6 2971.3 3356.1 7 4934.8 5925.8 6452.9 6526.3
9 3018.9 3519.8 2976.5 3357.0 8 5166.7 6087.2 6485.1 6613.9
9 4894.2 6120.6 6513.4 6633.0

5.1. ONS data analysis

This section presents the results of the experiments applying different ML models and decomposition techniques to the ONS dataset. The decomposition tests were performed to evaluate the number of IMFs for each decomposition method and how the RMSE and the computation time are affected by different IMF quantities. In the following subsections, both the computational-complexity and RMSE tests are shown in detail. Considering the computation time needed to decompose each dataset (ONS and ISO-NE, covering 2015 to 2018) into nine IMFs, STL-A has the shortest duration at 0.027 s, followed by EWT at 0.479 s; EMD takes 3.044 s, EEMD takes 247.368 s, and CEEMDAN takes the longest at 713.165 s. These high computation times do not affect model tuning, training, or testing, since the decomposed values are computed and saved beforehand and then reused during the forecasting execution; they do matter, however, whenever the decomposition methods themselves are tuned.

Table 2 shows the computation time for the decomposition methods used. The EMD times are essentially the same for all modes, the EEMD times grow with the number of IMFs, and the CEEMDAN times grow until the number of IMFs reaches five, after which the timings are similar or shorter. To evaluate the best number of IMFs for each decomposition method, a test was performed measuring the RMSE from one to nine IMFs, as shown in Table 3. The Seasonal and Trend decomposition using Locally Estimated Scatterplot Smoothing (STL) has no setting for the number of IMFs, since it always decomposes the series into trend, seasonal, and residual components, and the EWT decomposition cannot be executed with a single IMF.

Table 3 shows clearly which number of IMFs is best for each decomposition method, ranked by the lowest RMSE (the STL row serves only as an RMSE reference). For the GBR algorithm, the best configurations are 2 modes for EWT, 1 mode for EMD, 6 modes for EEMD, and 1 mode for CEEMDAN. EEMD and EWT have the lowest RMSE among the decomposition methods tested, which is not yet decisive since no hyperparameter tuning was performed; even so, they are already lower than forecasting without decomposition.

Table 4 presents the best number of IMFs for the XGBoost algorithm with each decomposition method, ranked by the lowest RMSE: 9 modes for EWT, 1 for EMD, 5 for EEMD, and 1 for CEEMDAN. EWT and EEMD have the lowest RMSE among the decomposition methods tested and are lower than forecasting without decomposition. Table 5 shows the best number of IMFs for the SVR algorithm: 5 modes for EWT, 1 for EMD, 1 for EEMD, and 1 for CEEMDAN. EMD and CEEMDAN have the lowest RMSE among the decomposition methods tested, although forecasting without decomposition yields the lowest error overall. Table 6 shows the best number of IMFs for the kNN algorithm: 6 modes for EWT, 1 for EMD, 1 for EEMD, and 1 for CEEMDAN. EWT has the lowest RMSE among the decomposition methods tested and is lower than forecasting without decomposition; however, all decomposition methods produce very similar results.

Table 6
RMSE for different IMFs for each decomposition method using kNN algorithm for forecasting.
Number of IMFs    Decomposition method
                  STL-A     EWT       EMD       EEMD      CEEMDAN   None
1                 –         –         4670.2    4672.6    4670.2    4670.2
2                 –         4745.8    4670.2    4672.6    4670.2    –
3                 4670.2    4722.9    4670.2    4672.6    4670.2    –
4                 –         4675.4    4670.2    4672.6    4670.2    –
5                 –         4671.8    4670.2    4672.6    4670.2    –
6                 –         4669.2    4670.2    4672.6    4670.2    –
7                 –         4669.2    4670.2    4672.6    4670.2    –
8                 –         4673.5    4670.2    4672.6    4670.2    –
9                 –         4677.9    4670.2    4672.6    4670.2    –

A bias test was performed to verify how far the model's forecasts deviated from the data in the training and test sets, as shown in Figs. 10(a) and 10(b). Values over-forecasted by the model sit above the cross-line, and under-forecasted values sit below it. For the training set, there are some over-forecasted values for one fold, visible as light blue circles. A few over-forecasted values also exist for the test set, but fewer than in the training set. Overall, the forecasted values are spread evenly around the cross-line reference. The test was performed with the best model (lowest RMSE) observed in the test-set results, in this case EWT with 9 IMF modes using the XGBoost algorithm, without hyperparameter tuning.
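The per-IMF search summarized in Tables 3–6 can be sketched in code. The snippet below is a minimal illustration rather than the authors' implementation: it assumes the hourly load is available as a NumPy array, uses PyEMD's EEMD as the decomposition, simple lag features, and scikit-learn's GradientBoostingRegressor for every component, and it scores a one-step-ahead reconstruction over the final 360 hours instead of the multi-step cross-validation used in the study.

```python
# Minimal sketch of the per-IMF search (illustrative assumptions, not the paper's exact setup).
import numpy as np
from PyEMD import EEMD                       # assumption: the PyEMD (EMD-signal) package
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


def lag_matrix(series, n_lags=24):
    """One-step-ahead supervised matrix: previous n_lags values -> next value."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y


def rmse_with_k_imfs(load, k, n_lags=24, horizon=360):
    """Decompose into at most k IMFs, forecast each component, and sum the parts."""
    imfs = EEMD().eemd(load, max_imf=k)       # rows are the extracted components
    residual = load - imfs.sum(axis=0)        # keep whatever the decomposition left over
    components = np.vstack([imfs, residual])

    y_hat = np.zeros(horizon)
    for comp in components:
        X, y = lag_matrix(comp, n_lags)
        model = GradientBoostingRegressor(random_state=0)
        model.fit(X[:-horizon], y[:-horizon])
        # One-step-ahead predictions with observed lags (simpler than the paper's setup).
        y_hat += model.predict(X[-horizon:])

    return float(np.sqrt(mean_squared_error(load[-horizon:], y_hat)))


# Hypothetical usage with an hourly load series stored in `load`:
# for k in range(1, 10):
#     print(k, rmse_with_k_imfs(load, k))
```

Summing the per-component forecasts reconstructs the aggregate load forecast, which is the recombination step implied whenever a decomposition method is paired with a regressor in the tables above.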


Fig. 10. ONS dataset bias verification: (a) training set, where each color of scattered data represents the considered time (given a two-week time window); (b) test set, where each color of scattered data represents 1 of the 20 days selected in 2019.

Fig. 11. ONS dataset — one sample of the forecasted set from the best test-set model (XGBoost + EWT with nine modes, no tuning).

In the overall training-set results, shown in Table 7, the EWT decomposition method combined with the XGBoost algorithm performed better than the other methods, both with and without hyperparameter tuning. In the test-set results, shown in Table 8, the EWT decomposition with XGBoost again performed best among all tests, while among the tuned configurations the EEMD decomposition with XGBoost had the lowest RMSE. The total model tuning time was around 517.4 h, and the total execution time of all tests, including the training and test sets, was around 1.80 h.

The best training-set result for the ONS dataset was the EWT decomposition method with XGBoost: RMSE of 2477.8 MW, MAE of 1925.3 MW, and MAPE of 3.08%, configured with nine IMFs and with model tuning. The best test-set result for the ONS dataset was also EWT with XGBoost: RMSE of 1931.8 MW, MAE of 1564.9 MW, and MAPE of 2.54%, configured with nine IMFs and without any hyperparameter tuning.

The best model for the test set was therefore not the hyperparameter-tuned configuration but the untuned one (Fig. 11). Even with stacking ensembles (XGBoost + GBR, XGBoost + GBR + SVR + kNN, and so on), the results were still far from expected. The reason may lie in the cross-validation setup: 15 days for the validation set with 10 folds (360 steps ahead per fold, or 3600 steps ahead in total) may not have been sufficient for hyperparameter tuning, leaving too many uncovered intervals across the 2015 to 2018 period. Using 20 folds, with 75 days for the training stage and 15 days ahead for the validation stage, might already be the adjustment needed to produce better results.

Although these models were not tuned, they performed better, probably because of overfitting during tuning: the model was trained too much on the same range and data sequence, with few disturbances and little variation along the training set. Alternatively, the amount of data may need to be increased for model robustness, since the current setup produces results incompatible with those expected for the test set.
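The fold layout discussed above can be made concrete with a short sketch. It assumes hourly data, so 15 days correspond to 360 steps, and uses scikit-learn's TimeSeriesSplit as a stand-in for the cross-validation applied in this work; the exact fold boundaries of the original experiments may differ.

```python
# Sketch of the 10-fold, 360-steps-ahead validation layout (hourly data assumed).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_hours = 4 * 365 * 24                        # roughly the 2015-2018 range
X = np.arange(n_hours).reshape(-1, 1)         # placeholder feature matrix

# 10 folds, each validating on the next 360 hourly steps (15 days ahead).
tscv = TimeSeriesSplit(n_splits=10, test_size=360)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold:2d}: train ends at hour {train_idx[-1]}, "
          f"validates hours {val_idx[0]}-{val_idx[-1]}")

# The adjustment raised above (20 folds, 75 days of training, 15 days of
# validation) would instead use a bounded sliding window, e.g.:
# TimeSeriesSplit(n_splits=20, max_train_size=75 * 24, test_size=360)
```

The commented alternative at the end mirrors the 20-fold, 75-day-training and 15-day-validation adjustment suggested in the text as a way to cover more of the 2015 to 2018 period.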


Table 7
ONS dataset — Training set results from 2015 to 2018 period, with 10-fold cross-validation, 360 steps ahead (15 days ahead) each fold.
Algorithm Method IMFs Tuning RMSE (MW) MAE (MW) MAPE (%) Duration (s) Tuning time (h)
GBR CEEMDAN 1 No 3410.8 (± 1634.7) 2728.6 (± 1270.4) 4.39 (± 2.04) 20 0a
GBR CEEMDAN 1 Yes 3314.9 (± 1728.5) 2653.5 (± 1287.3) 4.30 (± 2.10) 84 26.6
GBR EEMD 6 No 2935.8 (± 1618.8) 2305.8 (± 1294.0) 3.70 (± 2.04) 20 0a
GBR EEMD 6 Yes 3174.1 (± 1487.1) 2534.5 (± 1173.1) 4.06 (± 1.82) 447 69.3
GBR EMD 1 No 3029.1 (± 1576.2) 2352.7 (± 1198.6) 3.78 (± 1.89) 15 0a
GBR EMD 1 Yes 2975.7 (± 1644.5) 2337.4 (± 1280.1) 3.71 (± 1.94) 1862 105.5
GBR EWT 2 No 2916.1 (± 1761.3) 2271.3 (± 1354.2) 3.64 (± 2.18) 14 0a
GBR EWT 2 Yes 2608.5 (± 1566.9) 2105.7 (± 1274.0) 3.36 (± 1.97) 342 22.3
GBR None – No 3030.0 (± 1543.7) 2374.8 (± 1163.0) 3.81 (± 1.82) 12 0a
GBR STL-A – No 2998.9 (± 1650.8) 2307.5 (± 1248.9) 3.70 (± 2.00) 14 0a
GBR STL-A – Yes 2755.5 (± 1570.0) 2203.9 (± 1240.0) 3.54 (± 1.96) 170 31.3
KNN None – No 4670.2 (± 1438.1) 3707.5 (± 1221.5) 6.09 (± 1.93) 4 0a
SVR EMD 1 No 4368.9 (± 1041.3) 3564.8 (± 1041.7) 5.80 (± 1.72) 40 0a
SVR None – No 4305.8 (± 1015.6) 3493.1 (± 1016.2) 5.66 (± 1.65) 41 0a
XGBoost CEEMDAN 1 No 2968.8 (± 1808.9) 2315.7 (± 1365.1) 3.69 (± 2.14) 12 0a
XGBoost CEEMDAN 1 Yes 2743.1 (± 1571.5) 2146.3 (± 1218.2) 3.39 (± 1.88) 74 15.8
XGBoost EEMD 5 No 2882.0 (± 1546.7) 2271.7 (± 1256.6) 3.62 (± 1.91) 18 0a
XGBoost EEMD 5 Yes 2883.5 (± 1517.5) 2272.3 (± 1238.0) 3.62 (± 1.89) 45 19.1
XGBoost EMD 5 No 2892.4 (± 1766.9) 2262.3 (± 1381.6) 3.61 (± 2.15) 14 0a
XGBoost EMD 5 Yes 2682.0 (± 1649.3) 2107.1 (± 1276.8) 3.34 (± 1.99) 38 81.7
XGBoost EWT 9 No 3081.1 (± 2162.8) 2397.8 (± 1616.2) 3.83 (± 2.58) 746 0a
XGBoost EWT 9 Yes 2477.8 (± 1771.8) 1925.3 (± 1296.8) 3.08 (± 2.05) 1214 109.9
XGBoost None – No 2922.9 (± 1845.0) 2247.9 (± 1377.2) 3.59 (± 2.16) 12 0a
XGBoost STL-A – No 2918.8 (± 1612.6) 2319.1 (± 1293.5) 3.70 (± 1.99) 13 0a
XGBoost STL-A – Yes 2704.9 (± 1663.0) 2107.2 (± 1258.0) 3.34 (± 1.96) 33 35.9
XGB+SVR None – No 3004.3 (± 1780.2) 2338.3 (± 1301.2) 3.78 (± 2.05) 218 0a
XGB+GBR+SVR+KNN None – No 2920.7 (± 1638.1) 2282.1 (± 1237.2) 3.69 (± 1.96) 218 0a
XGB+KNN None – No 2878.9 (± 1800.4) 2240.9 (± 1350.9) 3.59 (± 2.12) 251 0a
XGB+GBR None – No 2746.3 (± 1732.4) 2155.0 (± 1291.9) 3.47 (± 2.04) 241 0a
Total 6228 517.4
a Tuning time (h) with value zero means less than one hour.

Table 8
ONS dataset — Test set results for the 2019 period, with 20 sets of 24 steps ahead (1 day ahead).
Algorithm Method IMFs Tuning RMSE (MW) MAE (MW) MAPE (%) Duration (s)
GBR CEEMDAN 1 No 2627.7 (± 1229.5) 2182.1 (± 1078.7) 3.56 (± 2.02) 3
GBR CEEMDAN 1 Yes 2748.4 (± 1244.1) 2316.5 (± 1047.0) 3.77 (± 1.95) 45
GBR EEMD 6 No 2255.7 (± 1304.2) 1936.1 (± 1186.3) 3.15 (± 2.04) 3
GBR EEMD 6 Yes 2654.8 (± 1275.8) 2258.8 (± 1048.4) 3.72 (± 2.00) 197
GBR EMD 1 No 2251.2 (± 1216.3) 1905.4 (± 1055.8) 3.07 (± 1.78) 1
GBR EMD 1 Yes 2584.4 (± 1568.2) 2219.8 (± 1400.3) 3.54 (± 2.23) 148
GBR EWT 2 No 2572.3 (± 1756.5) 2222.3 (± 1557.6) 3.56 (± 2.53) 1
GBR EWT 2 Yes 2861.1 (± 1835.2) 2480.8 (± 1632.3) 3.96 (± 2.66) 236
GBR None – No 2241.2 (± 1242.7) 1886.8 (± 1072.1) 3.07 (± 1.85) 1
GBR STL-A – No 2319.3 (± 1681.4) 1965.1 (± 1443.6) 3.19 (± 2.49) 2
GBR STL-A – Yes 2511.0 (± 1599.9) 2144.2 (± 1372.4) 3.55 (± 2.46) 78
KNN None – No 4679.7 (± 2580.8) 3990.5 (± 2291.6) 6.65 (± 4.28) 0a
SVR EMD 1 No 5359.5 (± 2942.2) 4518.3 (± 2718.9) 7.32 (± 4.72) 1
SVR None – No 5838.5 (± 4906.5) 4767.9 (± 3882.2) 7.75 (± 6.85) 1
XGBoost CEEMDAN 1 No 2344.0 (± 1524.2) 1966.0 (± 1333.4) 3.16 (± 2.14) 1
XGBoost CEEMDAN 1 Yes 2588.2 (± 1284.1) 2219.8 (± 1124.4) 3.63 (± 1.91) 24
XGBoost EEMD 5 No 2443.8 (± 1396.7) 2053.6 (± 1203.9) 3.31 (± 1.98) 3
XGBoost EEMD 5 Yes 2510.3 (± 1303.0) 2111.9 (± 1128.0) 3.41 (± 1.86) 17
XGBoost EMD 5 No 2333.2 (± 1552.0) 1948.4 (± 1344.9) 3.12 (± 2.17) 1
XGBoost EMD 5 Yes 2601.7 (± 1576.4) 2174.9 (± 1244.0) 3.59 (± 2.25) 13
XGBoost EWT 9 No 1931.8 (± 1511.3) 1564.9 (± 1246.9) 2.54 (± 2.14) 55
XGBoost EWT 9 Yes 2885.4 (± 1940.2) 2543.2 (± 1664.0) 4.29 (± 3.15) 345
XGBoost None – No 2308.9 (± 1588.5) 1954.8 (± 1405.9) 3.13 (± 2.26) 1
XGBoost STL-A – No 2193.4 (± 1697.9) 1899.9 (± 1520.9) 3.14 (± 2.65) 1
XGBoost STL-A – Yes 2603.4 (± 1335.4) 2321.7 (± 1202.5) 3.85 (± 2.31) 12
XGB+SVR None – No 2422.7 (± 1440.6) 2064.7 (± 1266.1) 3.25 (± 1.96) 9
XGB+GBR+SVR+KNN None – No 2363.7 (± 1310.8) 1981.3 (± 1108.3) 3.16 (± 1.83) 12
XGB+KNN None – No 2476.5 (± 1462.9) 2135.7 (± 1314.1) 3.35 (± 2.01) 9
XGB+GBR None – No 2283.0 (± 1306.4) 1910.1 (± 1129.3) 3.05 (± 1.86) 12
Total 1234
a Duration (s) with value zero means less than one second.
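Tables 7 and 8 also list stacked combinations (XGB+GBR, XGB+KNN, XGB+SVR, XGB+GBR+SVR+KNN). A minimal sketch of such an ensemble is given below, assuming scikit-learn's StackingRegressor and XGBoost's scikit-learn wrapper; the base-learner settings and the ridge meta-learner are illustrative assumptions, since the paper does not detail the stacking configuration.

```python
# Illustrative stacking ensemble in the spirit of the XGB+GBR+SVR+KNN rows above.
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("xgb", XGBRegressor(n_estimators=300, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("svr", SVR(C=10.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=RidgeCV(),   # simple linear meta-learner over base predictions
)
# Hypothetical usage with lag-feature matrices X_train, y_train, X_test:
# stack.fit(X_train, y_train)
# y_hat = stack.predict(X_test)
```

Here the meta-learner weighs the cross-validated predictions of the base regressors, which is one common way of combining the individual models compared in the tables.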


Table 9
RMSE for different IMFs for each decomposition method using GBR algorithm for forecasting.
Number of IMFs    Decomposition method
                  STL-A     EWT       EMD       EEMD      CEEMDAN   None
1                 –         –         1399.1    1442.2    1412.9    1375.0
2                 –         1500.8    1509.8    1433.5    1527.4    –
3                 1520.8    1552.6    1525.3    1409.6    1546.0    –
4                 –         1508.7    1468.0    1484.8    1588.5    –
5                 –         1499.3    1518.5    1497.5    1527.1    –
6                 –         1545.4    1490.7    1465.2    1512.6    –
7                 –         1579.2    1569.2    1507.4    1519.3    –
8                 –         1518.9    1539.0    1539.1    1543.8    –
9                 –         1536.2    1525.0    1536.9    1487.2    –

Table 10
RMSE for different IMFs for each decomposition method using XGBoost algorithm for forecasting.
Number of IMFs    Decomposition method
                  STL-A     EWT       EMD       EEMD      CEEMDAN   None
1                 –         –         1726.2    1662.2    1644.8    1643.3
2                 –         1740.0    1931.0    1642.2    1735.0    –
3                 1647.9    1666.3    1885.4    1625.7    1737.6    –
4                 –         1541.3    1826.4    1663.7    1741.6    –
5                 –         1506.4    1724.4    1647.2    1633.7    –
6                 –         1552.7    1737.7    1615.7    1627.8    –
7                 –         1520.9    1747.7    1541.0    1651.2    –
8                 –         1505.8    1688.1    1572.5    1605.7    –
9                 –         1510.4    1683.6    1512.3    1611.2    –

Table 11
RMSE for different IMFs for each decomposition method using kNN algorithm for forecasting.
Number of IMFs    Decomposition method
                  STL-A     EWT       EMD       EEMD      CEEMDAN   None
1                 –         –         2095.0    2095.9    2095.0    2095.0
2                 –         2136.2    2095.0    2095.9    2095.0    –
3                 2095.0    2112.6    2095.0    2095.9    2095.0    –
4                 –         2100.3    2095.0    2095.9    2095.0    –
5                 –         2101.0    2095.0    2095.9    2095.0    –
6                 –         2100.8    2095.0    2095.9    2095.0    –
7                 –         2104.9    2095.0    2095.9    2095.0    –
8                 –         2110.5    2095.0    2095.9    2095.0    –
9                 –         2114.0    2095.0    2095.9    2095.0    –

Table 12
RMSE for different IMFs for each decomposition method using SVR algorithm for forecasting.
Number of IMFs    Decomposition method
                  STL-A     EWT       EMD       EEMD      CEEMDAN   None
1                 –         –         2807.0    2831.8    2816.4    2833.1
2                 –         2948.3    2783.9    2834.9    2791.2    –
3                 3041.3    2960.6    2800.3    2807.0    2795.0    –
4                 –         2964.9    2803.1    2824.4    2793.2    –
5                 –         2964.1    2807.6    2812.6    2807.8    –
6                 –         2962.7    2832.9    2817.7    2834.1    –
7                 –         2964.4    2825.8    2849.0    2840.4    –
8                 –         2964.0    2838.1    2861.6    2844.7    –
9                 –         2966.2    2860.7    2852.0    2814.3    –

5.2. ISO-NE data analysis

This section presents the results of the different ML models and decomposition techniques applied to the ISO-NE utility dataset. The ML models used were GBR, XGBoost, kNN, and SVR.

The decomposition tests were also performed for the ISO-NE dataset to determine the best IMF quantity for each decomposition method and algorithm, based on the RMSE metric. To evaluate the best number of IMFs for each decomposition method, a test was performed measuring the RMSE from one to nine IMFs, as shown in Table 9. The STL decomposition has no setting for the number of IMFs, since it always decomposes the series into trend, seasonal, and residual components; the same holds for None, where no decomposition method is applied. Moreover, the EWT decomposition cannot be executed with a single IMF.

Table 9 shows the best number of IMFs for the GBR algorithm with each decomposition method, ranked by the lowest RMSE: 2 modes for EWT, 1 for EMD, 3 for EEMD, and 1 for CEEMDAN. EMD and EEMD have the lowest RMSE among the decomposition methods tested, but forecasting without decomposition has the lowest error overall. Table 10 shows the best number of IMFs for the XGBoost algorithm: 5 modes for EWT, 9 for EMD, 9 for EEMD, and 8 for CEEMDAN. EWT and EEMD have the lowest RMSE among the decomposition methods tested and are lower than forecasting without decomposition.

Table 11 shows the best number of IMFs for the kNN algorithm: 4 modes for EWT, 1 for EMD, 1 for EEMD, and 1 for CEEMDAN. All decomposition methods produce very similar results. Table 12 shows the best number of IMFs for the SVR algorithm: 2 modes for EWT, 2 for EMD, 3 for EEMD, and 2 for CEEMDAN. In this case, EMD and CEEMDAN have the lowest RMSE among the decomposition methods tested, and they are lower than forecasting without decomposition.

In the overall training-set results, shown in Table 13, the EEMD decomposition method with XGBoost performed better than the other models in terms of RMSE for untuned hyperparameters, while XGBoost without any decomposition obtained the lowest MAE and MAPE. In the test-set results, shown in Table 14, decomposition improved some algorithms, but GBR without any decomposition had the lowest RMSE, MAE, and MAPE overall. The total execution time of all tests, including the training and test sets, was around 13.1 min. The best training-set result for the ISO-NE dataset was EEMD with XGBoost, with an RMSE of 848.85 MW, configured with nine IMFs and without hyperparameter tuning; XGBoost without any decomposition achieved the lowest MAE (736.10 MW) and MAPE (5.36%), also without tuning. The best test-set result for the ISO-NE dataset was GBR without any decomposition method, with an RMSE of 1375.0 MW, MAE of 1042.7 MW, and MAPE of 7.38%, without any hyperparameter tuning.

The bias test was performed to verify how far the model's forecasts deviated from the data in the training and test sets, as shown in Figs. 12(a) and 12(b). As previously explained, values over-forecasted by the model lie above the cross-line, and under-forecasted values lie below it. For the training set, the forecasted values are widespread, as observed in Fig. 12(a), indicating that the model has difficulty predicting precisely; however, no bias was observed. For the test set, there are a few under-forecasted values below the cross-line reference, visible as blue circles in Fig. 12(b), so some bias was observed for this specific set. Apart from that, the overall forecasted values are spread around the cross-line reference. The test was performed with the best model (lowest RMSE) observed in the test-set results, in this case the GBR algorithm without decomposition and without hyperparameter tuning.
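The bias verification used in Figs. 10 and 12 can be reproduced with a short plotting sketch. The array names and styling below are assumptions; the essential element is the y = x cross-line against which over- and under-forecasts are judged.

```python
# Minimal sketch of the bias check: forecasts vs. observations around the y = x cross-line.
import numpy as np
import matplotlib.pyplot as plt


def bias_plot(y_true, y_pred, label="test set"):
    """Scatter forecasts against observations; points above the dashed line are
    over-forecasts, points below it are under-forecasts."""
    lo, hi = float(np.min(y_true)), float(np.max(y_true))
    plt.figure(figsize=(5, 5))
    plt.scatter(y_true, y_pred, s=8, alpha=0.4, label=label)
    plt.plot([lo, hi], [lo, hi], "k--", linewidth=1)   # the cross-line reference
    plt.xlabel("Observed load (MW)")
    plt.ylabel("Forecast load (MW)")
    plt.legend()
    plt.tight_layout()
    plt.show()
```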


Table 13
ISO-NE dataset — Training set results for the 2015–2018 period, with 20-fold cross-validation, 360 steps ahead (15 days ahead) each fold.
Algorithm Decomposition IMFs Tuning RMSE (MW) MAE (MW) MAPE (%) Duration (s)
GBR CEEMDAN 1 No 932.23 (± 695.39) 816.07 (± 675.05) 5.89 (± 3.75) 50
GBR EEMD 3 No 952.70 (± 753.12) 829.10 (± 734.14) 5.97 (± 4.14) 24
GBR EMD 1 No 940.60 (± 682.84) 821.30 (± 671.46) 5.93 (± 3.68) 1
GBR EWT 2 No 921.57 (± 697.20) 797.20 (± 680.13) 5.82 (± 4.03) 1
GBR None – No 909.54 (± 662.46) 800.30 (± 654.21) 5.79 (± 3.67) 1
GBR STL-A – No 921.83 (± 761.04) 792.70 (± 738.59) 5.76 (± 4.28) 1
kNN CEEMDAN 1 No 1215.2 (± 789.63) 1045.9 (± 762.99) 7.77 (± 4.87) 49
kNN EEMD 1 No 1209.7 (± 786.92) 1038.6 (± 761.77) 7.71 (± 4.85) 9
kNN EMD 1 No 1215.2 (± 789.63) 1045.9 (± 762.99) 7.77 (± 4.87) 0a
kNN EWT 4 No 1267.2 (± 819.54) 1082.1 (± 791.67) 8.08 (± 5.20) 0a
kNN None – No 1215.2 (± 789.63) 1045.9 (± 762.99) 7.77 (± 4.87) 0a
kNN STL-A – No 1215.2 (± 789.63) 1045.9 (± 762.99) 7.77 (± 4.87) 0a
SVR CEEMDAN 2 No 2107.5 (± 896.57) 1819.9 (± 839.33) 14.3 (± 6.18) 51
SVR EEMD 3 No 2100.5 (± 865.46) 1810.5 (± 778.17) 14.2 (± 5.21) 22
SVR EMD 2 No 2102.3 (± 913.05) 1821.5 (± 844.71) 14.2 (± 6.00) 1
SVR EWT 2 No 2158.2 (± 861.28) 1824.0 (± 804.96) 14.6 (± 5.79) 0a
SVR None – No 2128.9 (± 800.70) 1798.0 (± 745.11) 14.4 (± 5.10) 0a
SVR STL-A – No 2219.7 (± 830.03) 1848.9 (± 823.55) 15.1 (± 6.47) 0a
XGBoost CEEMDAN 8 No 973.19 (± 681.51) 830.39 (± 611.81) 6.21 (± 3.85) 112
XGBoost EEMD 9 No 848.85 (± 687.57) 743.41 (± 668.53) 5.37 (± 3.69) 32
XGBoost EMD 9 No 974.08 (± 676.06) 845.62 (± 616.73) 6.27 (± 3.97) 3
XGBoost EWT 5 No 953.72 (± 689.29) 808.66 (± 650.02) 6.03 (± 3.93) 2
XGBoost None – No 851.55 (± 694.01) 736.10 (± 656.25) 5.36 (± 3.94) 1
XGBoost STL-A – No 892.79 (± 798.27) 781.42 (± 775.56) 5.70 (± 4.68) 1
Total 363
a Duration (s) with value zero means less than one second.

Table 14
ISO-NE dataset — Test set results for the 2019 period, with 20 sets of 24 steps ahead (1 day ahead).
Algorithm Decomposition IMFs Tuning RMSE (MW) MAE (MW) MAPE (%) Duration (s)
GBR CEEMDAN 1 No 1412.9 (± 414.91) 1089.8 (± 358.84) 7.69 (± 2.07) 52
GBR EEMD 3 No 1409.6 (± 428.64) 1088.4 (± 360.02) 7.76 (± 2.24) 27
GBR EMD 1 No 1399.1 (± 426.75) 1084.0 (± 376.45) 7.70 (± 2.36) 3
GBR EWT 2 No 1500.8 (± 369.28) 1186.7 (± 311.75) 8.48 (± 1.93) 4
GBR None – No 1375.0 (± 407.86) 1042.7 (± 355.13) 7.38 (± 2.22) 3
GBR STL-A – No 1520.8 (± 486.34) 1191.3 (± 456.28) 8.49 (± 3.01) 4
kNN CEEMDAN 1 No 2095.0 (± 912.16) 1653.3 (± 804.72) 11.7 (± 5.35) 5
kNN EEMD 1 No 2095.9 (± 918.92) 1653.7 (± 808.60) 11.7 (± 5.37) 4
kNN EMD 1 No 2095.0 (± 912.16) 1653.3 (± 804.72) 11.7 (± 5.35) 56
kNN EWT 4 No 2100.3 (± 904.45) 1661.9 (± 801.43) 11.8 (± 5.36) 30
kNN None – No 2095.0 (± 912.16) 1653.3 (± 804.72) 11.7 (± 5.35) 2
kNN STL-A – No 2095.0 (± 912.16) 1653.3 (± 804.72) 11.7 (± 5.35) 8
SVR CEEMDAN 2 No 2791.2 (± 1282.6) 2273.4 (± 1051.3) 16.4 (± 4.38) 3
SVR EEMD 3 No 2807.0 (± 1303.8) 2281.1 (± 1081.1) 16.5 (± 4.39) 3
SVR EMD 2 No 2783.9 (± 1328.4) 2267.3 (± 1097.1) 16.2 (± 4.45) 2
SVR EWT 2 No 2948.3 (± 1200.7) 2408.0 (± 991.79) 17.8 (± 4.69) 113
SVR None – No 2833.1 (± 1221.5) 2289.8 (± 1011.5) 16.7 (± 4.04) 4
SVR STL-A – No 3041.3 (± 1004.7) 2504.0 (± 835.30) 19.1 (± 5.23) 34
XGBoost CEEMDAN 8 No 1605.7 (± 374.06) 1240.8 (± 295.97) 8.93 (± 2.19) 3
XGBoost EEMD 9 No 1512.3 (± 416.44) 1156.0 (± 375.99) 8.19 (± 2.54) 51
XGBoost EMD 9 No 1683.6 (± 367.17) 1298.6 (± 323.66) 9.29 (± 2.11) 11
XGBoost EWT 5 No 1740.0 (± 492.59) 1368.6 (± 478.92) 9.76 (± 3.17) 1
XGBoost None – No 1643.3 (± 470.69) 1290.4 (± 449.28) 9.26 (± 3.11) 3
XGBoost STL-A – No 1647.9 (± 507.71) 1272.0 (± 454.99) 9.16 (± 3.18) 1
Total 425

6. Conclusion

Load forecasting using ML models combined with decomposition techniques may be an alternative way to improve model accuracy, since conventional forecasting approaches may have reached their limits. Even with new learning algorithms, some intrinsic components of the time series may remain embedded, and extracting these components is the key to the next steps in load forecasting. Signal decomposition opens new avenues for time series forecasting research.

Hyperparameter tuning must be done carefully to avoid wasting hours of model training. Overfitting may explain why untuned models outperformed tuned models after roughly 517 h of model tuning time; increasing the amount and variety of validation data and setting up cross-validation folds that cover more of the training data may solve this overfitting problem. For the ONS dataset, decomposition methods outperformed forecasting without decomposition. For the ISO-NE dataset, the test-set results showed that forecasting without decomposition outperformed the other models, although the error metrics were close. As mentioned, the tests were time-consuming due to the many techniques and setups involved, but they yielded significant results. With an optimal configuration, such as wider training-set coverage for hyperparameter tuning and different ensemble combinations of regression techniques for each decomposed IMF, subsequent tests may yield better results than those reported here.

Future research, informed by the literature review on very short-term load forecasting (VSTLF), suggests that investigating deep recurrent models could be a valuable next step, and the integration of these models into the proposed research framework is especially promising. Additionally, the literature indicates the potential benefits of leveraging stochastic multiobjective techniques for optimizing hyperparameters, which could enhance results. However, it is critical to note that both methods might significantly increase computational time. Lastly, to assess the versatility of this hybrid model, the proposed framework should be tested on various datasets.


Fig. 12. ISO-NE dataset bias verification: (a) training set, where each color of scattered data represents the considered time (given a two-week time window); (b) test set, where each color of scattered data represents 1 of the 20 days selected in 2019.

CRediT authorship contribution statement

Marcos Yamasaki Junior: Software, Conceptualization, Methodology, Formal analysis, Validation, Writing – original draft, Writing – review & editing. Roberto Zanetti Freire: Supervision, Conceptualization, Writing – review & editing. Laio Oriel Seman: Writing – review & editing. Stefano Frizzo Stefenon: Writing – review & editing. Viviana Cocco Mariani: Writing – review & editing. Leandro dos Santos Coelho: Supervision, Conceptualization, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The authors Mariani and Coelho would like to thank the National Council for Scientific and Technological Development of Brazil – CNPq (Grant numbers: 307958/2019-1-PQ, 307966/2019-4-PQ, and 408164/2021-2-Universal) and Fundação Araucária PRONEX Grant 042/2018 for their financial support of this work. The author Freire would like to thank CNPq (Grant number: 312688/2021-0-PQ). The author Seman would like to thank CNPq (Grant numbers: 404576/2021-4-Universal, 308361/2022-9-PQ). The author Yamasaki would like to thank the co-authors for their collaborative efforts and contributions to this paper. Acknowledgment is also due to the Pontifical Catholic University of Parana (PUCPR) for academic support and to Siemens for their support and flexibility, which made this work possible.
