
Received: 11 August 2022    Revised: 2 December 2022    Accepted: 8 January 2023

IET Generation, Transmission & Distribution

DOI: 10.1049/gtd2.12763

ORIGINAL RESEARCH

A deep LSTM-CNN based on self-attention mechanism with input data reduction for short-term load forecasting

Shiyan Yi1    Haichun Liu2    Tao Chen3,4    Jianwen Zhang5    Yibo Fan1

1 The State Key Laboratory of ASIC and System, Fudan University, Shanghai, China
2 Department of Automation, Shanghai Jiao Tong University, Shanghai, China
3 Department of Economics, Department of Statistics and Actuarial Science, Big Data Research Lab, University of Waterloo, Waterloo, Canada
4 Senior Research Fellow, Harvard University, Cambridge, Massachusetts, USA
5 The Key Laboratory of Control of Power Transmission and Conversion of Ministry of Education, Department of Electrical Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Minhang District, Shanghai, China

Correspondence
Jianwen Zhang, The Key Laboratory of Control of Power Transmission and Conversion of Ministry of Education, Department of Electrical Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Minhang District, Shanghai 200240, China. Email: [email protected]
Yibo Fan, The State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China. Email: [email protected]

Funding information
Fudan-ZTE Joint Lab; National Natural Science Foundation of China, Grant/Award Numbers: 52277193, 62031009; Pioneering Project of Academy for Engineering and Technology, Fudan University, Grant/Award Number: gyy2021-001; Fudan University-CIOMP Joint Fund, Grant/Award Number: FC2019-001; CCF-Alibaba Innovative Research Fund for Young Scholars; Alibaba Innovative Research (AIR) Program; Shanghai Committee of Science and Technology, Grant/Award Number: 19DZ1205403

Abstract
Numerous studies on short-term load forecasting (STLF) have used feature extraction methods to increase the model's accuracy by incorporating multidimensional features containing time, weather, and distance information. However, less attention has been paid to the input data size and output dimensions in STLF. To address these two issues, an STLF model is proposed based on output dimensions using only load data. First, the load data's long-term behavior (trend and seasonality) is extracted through the long short-term memory network (LSTM), followed by convolution to obtain the load data's non-stationarity. Then, using the self-attention mechanism (SAM), the crucial input load information is emphasized in the forecasting process. The calculation example shows that the proposed algorithm outperforms LSTM, LSTM-based SAM, and CNN-GRU-based SAM by more than 10% in eight different buildings, demonstrating its suitability for forecasting with only load data. Additionally, compared to earlier research utilizing two well-known public data sets, the MAPE is optimized by 2.2% and 5%, respectively. Also, the method has good prediction accuracy for a wide variety of time granularities and load aggregation levels, so it can be applied to various load forecasting scenarios and has good reference significance for load forecasting instrumentation.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2023 The Authors. IET Generation, Transmission & Distribution published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

1 INTRODUCTION

Short-term load forecasting (STLF) is a technique that forecasts the next few hours' electric loads based on the changing pattern of load data. STLF is the foundation for microgrid "source–grid–load–storage" operation and dispatch [1, 2]. STLF can coordinate flexible energy storage (ES) systems or advanced demand response (DR) technologies for peak regulation [3]. In addition, STLF can effectively reduce the uncertainty of distributed generation resources (DGRs), such as small and medium photovoltaic (PV) generators, and of flexible loads, such as renewable electric vehicles (EVs), in order to achieve amicable microgrid and grid access [4, 5]. For reliable load forecasting, the majority of existing load forecasting algorithms require multidimensional data inputs such as load, meteorological, and calendar data [6–9]. However, precise fine-grained meteorological data and long-term calendar data are challenging to obtain in microgrid applications [10, 11].
The short-term load data generated by distributed smart meters can represent meteorological and calendar information. Effective mining of real-time load data can produce accurate load forecasts without the need for meteorological and calendar data. This can effectively reduce the required data dimension and avoid the transmission and storage of large amounts of data, meeting the requirements of the independent operation and autonomous architecture of the microgrid, improving the overall performance of the microgrid, and easing system dispatch.

Conventional STLF approaches often rely on statistical models, such as the autoregressive moving average (ARMA) [12], the autoregressive integrated moving average (ARIMA) [13], and the exponential smoothing model [14]. These approaches are straightforward to implement and interpretable, but their ability to anticipate nonlinear series is low, so they cannot effectively handle the STLF problem. In comparison to conventional STLF approaches, machine learning has better nonlinear fitting capabilities. The back-propagation neural network [15] and support vector regression (SVR) [16] are commonly employed machine-learning time-series prediction algorithms. The back-propagation neural network has excellent parameter-fitting skills, but owing to its simple structure, it has low generalization ability and is prone to local optima. As SVR is based on the learning criterion of structural risk minimization, it has a high degree of generalizability. SVR performs well with small data samples, but it fails to converge with large sample sizes. Machine learning can learn nonlinear correlations between sequences better than conventional methods, but it typically has drawbacks including weak generalization capability, trouble establishing hyper-parameters, and prediction accuracy that frequently depends on the quality and quantity of the input data. Deep learning superimposes multilayer neural networks, which may realize multilayer nonlinear mapping. With their nonlinear mapping and adaptable capacities, systems based on deep learning provide a viable solution for load forecasting [17–20]. Table 1 shows the related works on STLF.

TABLE 1 Related works on short-term load forecasting

DNN:
- X. Kong [21], DBN: improved DBN for STLF. Advantage: can abstract nonlinear interactions layer by layer. Shortcoming: cannot extract temporal features.
- K. Chen [22], ResNet: an end-to-end model (same advantage and shortcoming as above).
- W. Kong [23], LSTM: extracts temporal features. Advantage: more accurate than other DNNs. Shortcoming: the structure is too simple.

Feature extraction:
- Kim [6], CNN-LSTM: accurate but has many parameters.
- Yao [24], CNN-GRU: GRU can reduce the parameters.
- H. Chen [25], SD-FOA, and Zheng [26], SD-XGBoost: each factor's weight is assigned by an optimization algorithm.
- Park [27], SD-RL: uses RL to choose similar days (SD).
Advantage: handles raw data at a higher degree of abstraction. Shortcoming: needs many input features such as temperature, humidity, and date information; besides, the SD information may not be suitable for STLF.

Load modeling:
- W. Kong [28], LSTM: can learn residential behavior. Shortcoming: needs the appliances' loads.
- Ji [29], HSMM: can adjust to different situations. Shortcoming: necessitates complex dynamical models.
Advantage: appliance-level forecasting techniques provide finer granularity and more accuracy.

Data decomposition:
- Liu and Zhao [30], K-means + EMD: uses K-means to classify IMFs and residuals.
- Deng [31], EEMD: reduces mode mixing in EMD.
Advantage: can handle low-dimensional data. Shortcoming: requires a massive amount of calculation and time.

Model selection:
- Feng [32], Q-learning, and Li [33], meta-learning: select the most appropriate model in the model pool. Advantage: able to handle complex situations. Shortcoming: the model pool affects accuracy, and the calculations lead to tremendous complexity.

X. Kong [21] proposed an improved deep belief network (DBN) method from the three aspects of model structure, parameter optimization, and data selection for better use in load forecasting. In [22], a prediction method using a novel end-to-end neural network model based on deep residual neural networks (ResNets) was proposed. Deep learning can abstract nonlinear interactions layer by layer, which can increase the accuracy of STLF in comparison to machine learning. However, a deep neural network (DNN), like DBN and ResNet, is insensitive to temporal correlation, making it difficult to obtain better prediction outcomes. Considering that energy consumption is a typical time series, the model's output should relate to the past input. LSTM, a variant of the recurrent neural network (RNN), is highly competitive in various fields and can capture the temporal dependencies of a time series. In [23], LSTM was employed to predict loads for individual users, and its superiority to traditional methods was demonstrated. In consideration of the temporal characteristics of the load data, LSTM is used as the basis for the model in this work. There are numerous STLF methods based on LSTM. The most common method is feature extraction. CNN is the most popular feature extraction method in STLF, as it can extract high-dimensional correlations of a feature set [34–36]. Kim [6] proposed a CNN-LSTM neural network, which can improve prediction accuracy by analyzing features that influence the load. In [24], LSTM is replaced by a gated recurrent unit (GRU) to reduce the number of parameters. Similar-day selection is another feature extraction method in STLF, used to discover the day most similar to the forecast day
among historical days based on multiple characteristics, including temperature, wind speed, and national holidays [7, 8, 37]. By weighting features manually, traditional similar-day selection cannot identify the main factors effectively. A fruit fly optimization algorithm (FOA) was developed to weight features in an intelligent manner in order to solve this problem [25]. Similarly, [26] proposed an XGBoost algorithm for evaluating feature importance. Unlike most similar-day selection models that utilize feature-weight learning, reinforcement learning (RL) has been used to select similar days directly [27]. Feature extraction mostly uses CNN and similar-day selection to handle raw data at a higher degree of abstraction. It can automatically extract internal features from data to reduce the error caused by manual feature determination. Modeling from the overall level can also be an efficient way of improving prediction performance [38, 39]. W. Kong et al. incorporated residential electricity consumption data (dryers, washing machines, dishwashers, heat pumps, TVs, and wall-hung boilers) to determine residents' behaviors, which could explain some of the forecasting variability [28]. Ji et al. [29] proposed a hidden semi-Markov model that provides a framework for dynamic statistical models of various household appliances to describe their demands. The load modeling method is appliance-level forecasting, which predicts the total load by simulating the load on the appliances. Compared to general forecasting approaches, appliance-level forecasting techniques provide finer granularity and more accuracy.

However, these approaches need electrical, meteorological, and calendar features. Due to the small size of the microgrid region, it is difficult to obtain fine-grained data [10, 11]. The majority of the present literature on electrical, meteorological, and calendar data presents data gathered in test sites. These testing regions are uncommon worldwide. In addition, the trials were developed expressly for microgrid research and collected data by distributing questionnaires to consumers and installing measuring tools. Obtaining adequate data for load forecasting by adapting measuring equipment is costly and subject to large-scale communication challenges in practice. Furthermore, access to fine-grained data is complex and limited for most microgrids. Consequently, the focus of this study is on how to conduct cost-effective and efficient load forecasting using the currently available electric load data. For predicting with small-dimensional data (electric data), the primary available hybrid forecasting methods include data decomposition, integrated learning, and other methods. Researchers have used empirical mode decomposition (EMD) to decompose complex nonlinear time series into intrinsic mode functions (IMFs) of different frequencies with excellent smoothness and regularity [40]. In contrast to wavelet decomposition, EMD does not require predefining any basis functions. In [30], the EMD results were classified using K-means to reduce the number of calculations. In order to solve mode mixing, Deng proposed the ensemble empirical mode decomposition (EEMD) algorithm [31]. The EMD approach can handle low-dimensional data without requiring any a priori information, which can mitigate the impact of complicated environmental factors on STLF. However, as every IMF after decomposition must be applied to the prediction model, the amount of calculation is several times larger than that of other methods. The model selection approach uses optimization algorithms to select relevant and effective models from model pools in order to manage various circumstances. In [32], Q-learning worked well when used for dynamic model selection. This algorithm enables the first choice to be deterministic load forecasting, followed by a second choice of probabilistic load forecasting. Meta-learning was also used to determine the accuracy of the candidate load forecast models [33]. The main disadvantage of this method is that the model pool plays a significant role, and every model's accuracy needs to be calculated, causing massive calculation and complexity.

In order to answer the challenge of attaining low-cost and efficient load forecasting utilizing accessible electric load data, after extracting long-term power consumption behavior using LSTM, we employ CNN to mine the data's nonlinear properties in depth. Although studies have investigated CNN for feature extraction from input data, few studies have explored CNN performance in the output dimension. Yin [41] posited that CNN has a strong nonlinear mapping ability and that using CNN in load forecasting can improve the model's accuracy. In [42], CNN was used to enhance the embedding instead of pooling after Bi-LSTM. Wang et al. demonstrated that the order of CNNs and LSTMs had a significant impact on the accuracy of their method [43]. Jiang et al. presented a method that utilizes LSTMs to monitor current random usage and CNNs to track long-term regular patterns [44]. Hence, this paper employs an innovative approach by focusing on output dimensions to reduce the input data size while keeping accuracy in mind. Compared to single models, hybrid models incorporate the strengths of several different models [43, 45]. In recent years, the self-attention mechanism (SAM) has become a hot spot in load forecasting. Zhou's work [46] and the work of IBM Watson [47] both used SAM after an LSTM neural network for one-dimensional inputs in natural language processing. SAM is also useful for load forecasting [48]. Using SAM, Zhao highlighted the important information by assigning different probabilities [49].

An innovative method of enhancing the LSTM output using CNN, focusing on the output dimension and called LSTM-CNN-based SAM, is presented in this paper. This paper makes the following main contributions:

1. We are the first to employ LSTM-CNN in STLF. The LSTM-CNN-based SAM model can reduce the size of the input data without compromising forecasting accuracy. This research presents, in consideration of the output dimension, a hybrid framework for forecasting that uses solely load data.
2. We innovatively use convolution kernels to extract user randomness, resolve non-stationarity characteristics, and circumvent SAM's local dependence issue.
3. We conduct exhaustive experiments at various aggregation levels and time periods to demonstrate the superiority of the model. Compared to the benchmarks, the proposed strategy is superior by more than 10%. In addition, the MAPE has been improved by 2.2% and 5%, respectively, compared to earlier research [23, 32].
This paper is organized as follows. The LSTM-CNN-based SAM method is analyzed in Sections 2.1, 2.2, and 2.3. The experimental results of the model are presented in Section 3, leading to the conclusion in Section 4.

2 THE PROPOSED METHOD

FIGURE 1 The basic structure of the LSTM-CNN-based SAM network

Figure 1 shows the overall architecture of the LSTM-CNN model based on SAM. First, we preprocess the actual energy consumption with Z-scores and construct the input data set using the sliding window algorithm. Second, the input data are scaled to the range (0,1) by a fully connected layer, because the LSTM is sensitive to the data scale. Third, the irregular time series is modeled by a two-layer LSTM model for prediction. Fourth, convolution captures the randomness parts of the LSTM output and solves the nonstationary problem. Finally, the prediction of the LSTM-CNN model is weighted to emphasize the essential characteristics, and long-range dependencies are modeled using SAM. The data are then passed to the linear fully connected layer for prediction.

Below is a detailed description of the LSTM-CNN model and the SAM.

2.1 LSTM-CNN neural networks

The LSTM-CNN approach is composed of a series connection of LSTM and CNN. Load data are typically separated into trend, seasonal, and residual categories [50]; we use LSTM to capture the dynamic time characteristics of the load data and CNN to capture its randomness.

The upper layer of LSTM-CNN consists of LSTM. Owing to their feedback connections, RNNs can capture temporal dynamics by conveying past information, thereby making them more suitable for time-series prediction than CNNs and feed-forward networks. However, RNNs cannot deal with long-range sequences owing to gradient vanishing and exploding. To overcome the problem of gradient vanishing, Hochreiter and Schmidhuber proposed LSTM in 1997 [51]. LSTM adds gate mechanisms in memory cells to store previous time steps, enabling it to perform better in extracting long-distance dependencies. The memory cell, composed of a single LSTM unit, is in charge of updating the hidden states that include past temporal information. As shown in Figure 2, the memory cell includes the forget gate, input gate, and output gate.

FIGURE 2 LSTM memory block

First, the sigmoid activation is applied to the forget gate to determine whether the cell states should be overlooked:

f_t = σ(W_f [h_{t−1}, x_t] + b_f).    (1)

Second, to decide how much new information should be stored in the cell state, we design the input gate. At the start, the "input layer" consists of a sigmoid layer that decides which values will be updated. Then, a vector of new candidate values, c̃_t, which can be added to the cell state, is created by a tanh layer:

i_t = σ(W_i [h_{t−1}, x_t] + b_i),    (2)

c̃_t = tanh(W_c [h_{t−1}, x_t] + b_c).    (3)

Next, combining f_t, i_t, and c̃_t, the old cell state c_{t−1} is updated to the new cell state c_t. We multiply f_t with c_{t−1} to forget the old information and add i_t ∗ c̃_t, which indicates what new information will be written to the new cell state:

c_t = f_t ∗ c_{t−1} + i_t ∗ c̃_t.    (4)

Last, the output is decided. We use a sigmoid layer named "the output layer" to decide which part of the cell state should be output. Then, we multiply it with the cell state after tanh to calculate the actual output:

o_t = σ(W_o [h_{t−1}, x_t] + b_o),    (5)

h_t = o_t ∗ tanh(c_t),    (6)

where x_t is the input at time t; h_{t−1} is the hidden state at time t − 1; W_f, W_i, W_c, and W_o are the weight matrices; and b_f, b_i, b_c, and b_o are the bias vectors.
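As a concrete illustration of Equations (1)–(6), the following minimal NumPy sketch performs a single memory-cell step; the stacked [h_{t−1}, x_t] layout and the dictionary of gate parameters are illustrative assumptions, not the authors' released implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the four gate parameters; e.g. W['f'] acts on [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate, Eq. (1)
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate, Eq. (2)
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate values, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde      # cell-state update, Eq. (4)
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                # hidden state, Eq. (6)
    return h_t, c_t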
The CNN, the lower layer of LSTM-CNN, can capture randomness. Residual components, which represent user unpredictability, can be captured by convolution [23]. Residual components in the sequence indicate randomness and noise [31]. As described in [50], blur can be used to extract trend features, while noise enhances local information. In [52], blur and noise were both used to achieve the enhancement effect. LSTM-CNN acts in this work as it did in [52]: we track the trend and seasonal features by LSTM, while convolution extracts the uncertainty of user behavior and enhances local information.

We first analyze CNN's capability of extracting uncertainty about user behavior. We separate the real load into trend, seasonal, and residual categories, as can be seen in Figure 4. We then perform the Fourier transform on these three components. The trend, seasonal, and residual spectra are displayed in Figure 5a; the horizontal axis denotes frequency, and the vertical axis represents amplitude. We investigate the frequency-domain characteristics of the CNN input sequence (the LSTM layer's output) and the output sequence (the CNN layer's output). In this model, the output sequences of LSTM and LSTM-CNN are Fourier transformed; the outcomes are depicted in Figure 5b,c.
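A minimal sketch of this decomposition-and-spectrum analysis, assuming an additive decomposition with a daily period of 24 via statsmodels (the paper does not name its decomposition tool) and NumPy's FFT:

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

def amplitude_spectrum(component):
    x = np.asarray(component.dropna())       # trim the NaN edges of the trend
    return np.abs(np.fft.rfft(x)) / len(x)   # amplitude vs. frequency, as in Figure 5a

def decompose_and_transform(load):           # load: pandas Series of hourly values
    parts = seasonal_decompose(load, model='additive', period=24)
    return {name: amplitude_spectrum(getattr(parts, name))
            for name in ('trend', 'seasonal', 'resid')}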
FIGURE 4 The decomposition of the load

It can be observed that the CNN value at 0 to 0.1 frequency is greater than the LSTM value in this interval. This portion corresponds to the residual component of the expected sequence, as determined by examination of Figure 5a. Consequently, the CNN can capture the residual component.

We next analyze the CNN filter applied to the trend data extracted by LSTM. As shown by the convolution theorem, time-domain convolution is equivalent to the frequency-domain product:

F[f_1(t) ∗ f_2(t)] = F_1(w) ⋅ F_2(w),    (7)

where f_1(t) and f_2(t) are the time series in the time domain, and F_1(w) and F_2(w) are their frequency-domain representations.

FIGURE 3 Frequency domain of the convolution kernel

Figure 3 shows the frequency domain of the convolution kernel, where frequency is horizontal and amplitude is vertical. As can be seen in Figure 5a, the amplitude of the trend component is only evident when the frequency is close to zero, indicating that the trend consists entirely of low frequencies; the frequency spectrum of the residual category likewise differs greatly. As shown in Figure 3, the amplitude of the kernel at zero frequency is not equal to zero. Hence, the convolution does not filter out the trend information extracted by LSTM.
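Equation (7) can be verified numerically; a short NumPy check, zero-padding both sequences to the full linear-convolution length so that the frequency-domain product matches np.convolve:

import numpy as np

f1, f2 = np.random.rand(32), np.random.rand(48)
n = len(f1) + len(f2) - 1                       # linear-convolution length
time_domain = np.convolve(f1, f2)               # f1(t) * f2(t)
freq_domain = np.fft.irfft(np.fft.rfft(f1, n) * np.fft.rfft(f2, n), n)
assert np.allclose(time_domain, freq_domain)    # F[f1 * f2] = F1(w) . F2(w)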
FIGURE 5 The frequency-domain and time-domain characteristics of the LSTM layer's output, the CNN layer's output, and the real load

Finally, we analyze the capacity of the LSTM-CNN. We employ LSTM to track trend and seasonal features and CNN to extract user-behavior uncertainty and augment local information. Figure 5d–f depicts, in the time domain, the CNN input sequence (Figure 5d), the CNN output sequence (Figure 5e), and the real load sequence (Figure 5f). The horizontal axis indicates the number of days, the vertical axis indicates the time point, and the color indicates the amplitude of the value. From the graph, we can see that LSTM is able to obtain the general tendency, whereas CNN obtains more specific information and is closer to the real load sequence.

2.2 Resolve the nonstationary problem

The following two conditions statistically determine whether a time series Y = {y_1, y_2, y_3, …, y_t, …, y_T} (y_t is the value of Y taken at time t) is stationary:

1) The expectation E(Y) of the time series Y is time-invariant.
2) The variance V(Y) of the sequence Y is time-invariant.
When these two conditions are met, the sequence is considered stationary. For a nonstationary time series, whose statistical characteristics (expectation and variance) vary over time, the data distribution changes continuously, and the value of the data at the current time is not related to values from long ago [53]. Prediction is made by determining the relationship between the past and the present under the premise of a stationary time series. Owing to the continually shifting distribution of the time series, the relationship between the past and the present for nonstationary sequences changes with time, making prediction challenging. Inspired by this behavior of nonstationary time series, this study resolves the nonstationary time-series problem by employing a convolution kernel. In contrast to fully connected networks, convolution has local connectivity; that is, it only extracts the connections between input elements within the filter size m [42]. This procedure can forecast current load data using recent historical data, which can reduce the dimension of the input data.

The role of the convolution operation on nonstationarity is then investigated. It is evident from Figure 4 that load data are typically nonstationary, having time-varying variances and expectations. The model's input is a sequential load over time T, that is, L = {l_1, l_2, l_3, …, l_t, …, l_T}.

For forward propagation, each output depends only on inputs within the filter size m, which is less than the time T:

l_cnnout = σ(Σ_m (l_cnnin ⊗ W_cnn + b_cnn)),    (8)

where l_cnnout is the output of the CNN layer, l_cnnin is the input of the CNN layer, W_cnn is the weight matrix, b_cnn is the bias vector, and σ is the sigmoid activation function.
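As a sketch of the forward pass in Equation (8) and of the local-connectivity argument, the following code produces each output from only the m most recent inputs inside the filter window (a plain valid-mode 1-D convolution; the single-filter shape is an illustrative assumption):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_forward(l_in, w_cnn, b_cnn):
    # l_in: load sequence of length T; w_cnn: filter of size m, with m << T
    m = len(w_cnn)
    return np.array([sigmoid(np.sum(l_in[t:t + m] * w_cnn) + b_cnn)
                     for t in range(len(l_in) - m + 1)])

In contrast to a fully connected layer, changing an input outside a given window leaves the corresponding output unchanged, which is why the extracted past-present relationship is confined to the last m points.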
For backward propagation, the work in [53] provides a novel optimization strategy to address the non-stationarity of the time series. It argues that, for a nonstationary time series, the parameters should not be updated with gradient values that are far from the present time, and it provides a time frame k, where k ≪ T, for updating the gradient:

θ_i = θ_{i−1} − α g_i,    (9)

where θ_i is the parameter at iteration step i, θ_{i−1} is the parameter at iteration step i − 1, α is the learning rate, and g_i is the calculated gradient of the parameter θ.

The gradient calculation proceeds as follows, both before and after adding the convolution operation.

1) Before adding the convolution, the gradients are equal at this stage, because the output of the LSTM layer is directly coupled to the input of the SAM layer:

δ_LSTM = δ_SAM,    (10)

where δ_LSTM is the gradient of the LSTM layer's output, and δ_SAM is the gradient of the SAM layer.

2) After adding the convolution,

δ_LSTM = δ_SAM ∗ ∂l_cnnout / ∂l_cnnin,    (11)

δ_LSTM = δ_SAM ∗ ∂σ(Σ_m (l_cnnin ⊗ W_cnn + b_cnn)) / ∂l_cnnin,    (12)

δ_LSTM = δ_SAM ∗ ROT180°(W_cnn) ⊙ σ′(l_cnnin),    (13)

where σ′(l_cnnin) represents the sigmoid's gradient at value l_cnnin.
In other words, the computed gradient is related exclusively to the filter size m. In this study, we employ the CNN operation and obtain results comparable to those in [53].

2.3 SAM optimization

SAM simulates the human brain's attention, focusing more on the needed information and less on irrelevant information through resource allocation. SAM accounts for the variation in the importance of hidden features that is not recognized by the LSTM network by allocating different weights to hidden features at different steps. Thus, SAM extracts advanced characteristics and improves long-term dependencies.

SAM can extract long-term sequence dependencies, whereas CNN-SAM can remedy SAM's deficiencies in extracting local dependencies. Sequence temporal information is extracted via the gating structure of LSTM. However, LSTM accuracy is affected by a lack of long time-series dependencies. This study applies LSTM-CNN-based SAM to enhance LSTM's applicability to STLF by adding SAM and addressing SAM's local-dependency learning deficiencies.

LSTM lacks long-term dependencies. LSTM's input and output equations are given in Equations (1) through (6). LSTM's forgetting gate is Equation (1). According to Equation (4), the forgetting gate determines how much of the state from the preceding instant can be kept. The forgetting gate of the LSTM, which sets it apart from other DNNs, allows it to retain historical data. According to Equation (14), the input at the beginning may no longer influence c_t by the final output; this is determined by the gate's value:

c_t = i_t ∗ c̃_t + f_t ∗ c_{t−1} = i_t ∗ c̃_t + f_t ∗ (i_{t−1} ∗ c̃_{t−1} + f_{t−1} ∗ c_{t−2}).    (14)
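Unrolling the recursion in Equation (14) back to the first step (taking c_0 = 0) makes the fading influence explicit:

c_t = Σ_{k=1}^{t} ( Π_{j=k+1}^{t} f_j ) ∗ i_k ∗ c̃_k,

so the coefficient multiplying the earliest candidate c̃_1 is the product f_2 ∗ f_3 ∗ ⋯ ∗ f_t. Since each forget-gate activation lies in (0,1), this product shrinks toward zero as t grows, which is the sense in which the initial input may no longer influence c_t.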
This reduces the output reliability of the LSTM. The fact that SAM can describe sequence dependency while ignoring distance makes it a suitable tool for STLF applications. Wang [54] placed the AM before the LSTM to filter the input data and aid LSTM prediction. The weight value of the current instant was calculated by factoring in the input value of this moment, the weight value, and the output value of the previous moment. The approach can utilize previously learned temporal characteristics to provide assistance at the current moment. However, applying the forecasted values from the preceding instant can result in error accumulation. Instead of being influenced by other sequences like AM, the SAM mechanism can concentrate solely on the sequence we are concerned about. Zang [48] sandwiched the SAM between two LSTM networks and optimized the prediction accuracy by weighting the LSTM output features in the SAM; there, SAM is capable of enhancing the crucial information that contributes to the following LSTM input. SAM was placed after the LSTM in [49] not only to highlight critical information but also to extract long-term dependencies. To improve key information and extract long-term dependencies, we likewise place SAM after LSTM in this paper.

Because all signals are analyzed concurrently via weighted averaging, SAM cannot capture the local interdependence of sequences in the attention distribution [55]. We can first extract the local dependencies and then compute the sequence dependencies using SAM. In this study, we add the convolution layer before the SAM layer, use the local-connection characteristic of convolution to gain the local sequence dependency, and then use SAM to obtain the long sequence dependence so as to achieve a more accurate prediction. In addition, employing an even number of filters can enhance the generalization capability of the model [56].

FIGURE 6 Structure of the self-attention model

Figure 6 shows the structure of SAM. First, calculate the correlation among hidden features at different time steps:

e_{t,d} = u tanh(w h_{t,d} + b).    (15)

Second, normalize the relevant scores into a probability by using the softmax function:

α_{t,d} = exp(e_{t,d}) / Σ_{d=1}^{n} exp(e_{t,d}).    (16)

Finally, multiply the calculated probability by the hidden features:

s_{t,d} = α_{t,d} h_{t,d},    (17)

where e_{t,d} is the correlation of the dth hidden feature at time step t, α_{t,d} is the relevant score of the dth hidden feature at time step t, s_{t,d} is the weighted value of the dth hidden feature at time step t, h_{t,d} is the dth hidden feature at time step t, u and w are weight vectors, and b is the bias vector.
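A minimal NumPy sketch of Equations (15)–(17), applied to the hidden-feature vector of one time step; treating u, w, and b as learned vectors that act elementwise is an illustrative reading of the equations:

import numpy as np

def self_attention(h_t, u, w, b):
    # h_t: hidden features of one time step, shape (n,)
    e = u * np.tanh(w * h_t + b)             # correlation scores, Eq. (15)
    alpha = np.exp(e) / np.sum(np.exp(e))    # softmax over the n features, Eq. (16)
    return alpha * h_t                       # re-weighted features, Eq. (17)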
2.4 Implementation details

The model's implementation details are shown in Figure 7. Through the sliding window, input data of size [256,1,24] are obtained, which consist of 24-hours-ahead load data, with a batch size of 256, a time step of 1, and a feature count of 24. To adapt to the sensitivity of LSTM to the data scale, we convert the data to the range (0,1) using the
FIGURE 7 The LSTM-CNN-based SAM network details

fully connected layer with a sigmoid activation function. With two layers of dropout LSTMs, the output dimension remains [256,1,128], where the hidden size is 128. The data are made four-dimensional [1,256,128,1], followed by a two-dimensional convolution with a kernel size of 8 and output dimensions of [256,1,128]. We use the SAM to obtain the outstanding features, and then we transform the output dimension to the predicted dimension using the linear fully connected layer, where the output size is 1. The proposed LSTM-CNN-based SAM architecture is shown in Table 2. The total number of parameters of the model is 52,416.

TABLE 2 The proposed LSTM-CNN-based SAM architecture

Module            Configuration                                #Param
Fully connected   Neurons = 128                                3072
Activation        Activation = 'Sigmoid'                       0
LSTM              Units1 = 128                                 16384
                  Units2 = 128                                 16384
Dropout           Dropout = 0.75                               0
CNN               Filters = 1; kernel_size = 8; stride = 1     64
Activation        Activation = 'Sigmoid'                       0
SAM               Neurons = 128                                16384
Fully connected   Neurons = 128                                128
Total number of parameters                                     52416
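The stack in Table 2 can be sketched in modern Keras as follows; the exact reshaping around the two-dimensional convolution and the use of Keras's dot-product Attention layer as a stand-in for the additive SAM of Equations (15)–(17) are assumptions for illustration (the original code targets TensorFlow 1.1.0):

import tensorflow as tf
from tensorflow.keras import layers

def build_model(window=24, hidden=128):
    inp = tf.keras.Input(shape=(1, window))              # sliding-window input [batch, 1, 24]
    x = layers.Dense(hidden, activation='sigmoid')(inp)  # scale into (0, 1)
    x = layers.LSTM(hidden, return_sequences=True)(x)
    x = layers.Dropout(0.75)(x)
    x = layers.LSTM(hidden, return_sequences=True)(x)
    x = layers.Dropout(0.75)(x)
    x = layers.Reshape((hidden, 1, 1))(x)                # make the LSTM output 4-D
    x = layers.Conv2D(1, kernel_size=(8, 1), padding='same',
                      activation='sigmoid')(x)           # one filter, kernel size 8
    x = layers.Reshape((1, hidden))(x)
    x = layers.Attention()([x, x])                       # self-attention over the features
    out = layers.Dense(1)(layers.Flatten()(x))           # linear layer to the forecast
    return tf.keras.Model(inp, out)

model = build_model()
model.compile(optimizer='adam', loss='mse')              # Adam with default parameters, MSE loss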
All models are implemented on an Nvidia RTX 2080 Ti GPU using TensorFlow 1.1.0 as the back end in the Python 3.6 environment [57]. Moreover, the models are trained by the Adam optimizer with default parameters to minimize the mean square error (MSE) [58]:

L = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)².    (18)

3 CASE STUDIES

3.1 Data preprocessing and evaluation index

Missing data pose a serious problem in load forecasting. Deleting or filling is one way to deal with missing data and outliers caused by objective factors. Considering that direct elimination results in discontinuity, we apply a judgment condition to select the filling method by checking whether less or more than 10% of a 24-hour period's data are missing.
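A sketch of that judgment condition, assuming pandas and assuming, since the paper does not name them, interpolation for sparse gaps and previous-day substitution for dense gaps:

import pandas as pd

def fill_day(day: pd.Series, prev_day: pd.Series) -> pd.Series:
    # day: 24 hourly readings for one day, NaN where data are missing
    if day.isna().mean() < 0.10:                       # less than 10% missing
        return day.interpolate(limit_direction='both')
    filler = pd.Series(prev_day.to_numpy(), index=day.index)
    return day.fillna(filler)                          # dense gaps: borrow yesterday's profile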
To reduce the effect of dimension and effectively improve accuracy, we use standardization methods such as min-max standardization and Z-score standardization. While min-max standardization is near zero when the data are concentrated, Z-score standardization avoids this problem by keeping the values in a normal distribution, showing how feature changes impact load forecasting. In this case, the raw data are processed using the Z-score standardization method. The formula is shown as (19):

x′ = (x − mean(x)) / std(x),    (19)

where x represents the original value, x′ represents the standardized value, mean(x) represents the average value of all data corresponding to x, and std(x) represents the standard deviation of all data corresponding to x.
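A minimal sketch of this preprocessing, combining the Z-score of Equation (19) with the sliding-window construction of Section 2.4 (the 24-point window is taken from the implementation details; raw_load is a hypothetical 1-D array of load readings):

import numpy as np

def zscore(x):
    return (x - np.mean(x)) / np.std(x)              # Eq. (19)

def sliding_window(load, window=24):
    # each sample is the previous `window` points; the target is the next point
    x = np.array([load[i:i + window] for i in range(len(load) - window)])
    y = load[window:]
    return x, y

x_all, y_all = sliding_window(zscore(raw_load))      # raw_load: hypothetical input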
This paper evaluates the prediction accuracy by using the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the root mean square error (RMSE). The calculation formula of each evaluation index is as follows:

MAPE = (100% / n) Σ_{i=1}^{n} |ŷ_i − y_i| / y_i,    (20)

MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|,    (21)

RMSE = √( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² ),    (22)

where n indicates the number of predicted points, and ŷ_i and y_i represent the predicted and real data for the ith load point in the load set, respectively.
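The three evaluation indices translate directly into code:

import numpy as np

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs(y_pred - y_true) / y_true)  # Eq. (20)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))                   # Eq. (21)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))           # Eq. (22)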
When LSTM is compared to LSTM-based SAM, it is clear
that the SAM method improves both the accuracy and sta-
3.2 Comparison with benchmarks bility of the model. The SAM model is better at learning
long-sequence dependencies than the LSTM model, which is
We use three fundamental deep-learning load prediction algo- prone to losing them. LSTM-based SAM can address the defi-
rithms as benchmarks in this article. The most often used LSTM ciency of LSTM in long time-sequence dependency extraction.
is presented first. By storing the information from the previ- CNN-GRU-based SAM is a regularly used model that outper-
ous instant, LSTM makes a relationship between the current forms LSTM-based SAM in B1, B5, B6, and B7 prediction but
and previous moments. Studies on time series prediction have performs worse than the LSTM model in B2, B3, B4, and B8
shown that LSTM is the most widely used model to can extract prediction. This is because of the model’s limited capacity for
long-term dependencies and temporal features. Hybrid mod- generalization, which is dependent on the dimension of the
els have outperformed signal models such as the LSTM in input data. Owing to the low dimensions of the input data and
recent years owing to the nonstationary nature of load data. the small amount of available data, connecting CNN before
The model’s accuracy is increased with the use of LSTM-based GRU-SAM does not improve accuracy but increases the train-
SAM. This strategy is useful for assigning weights to certain ing and inference time owing to the increase in the number
17518695, 2023, 7, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/gtd2.12763 by Cochrane Japan, Wiley Online Library on [09/12/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
of parameters. The LSTM-CNN-based SAM model provides a greater performance improvement in B1–B8 because even a single kernel of the CNN can increase the model's generalization ability and resolve the sequence's nonstationarity, while the connection with SAM enables the determination of local and long-term sequence relationships.

Additionally, as illustrated in the figures, our proposed technique offers the highest accuracy and stability, and its predictions are closer to the actual values, even at peaks and valleys. In general, the experimental results indicate that the proposed LSTM-CNN-based SAM model achieves superior performance to the other methods for power forecasting, and it is a competitive method for power consumption prediction.

TABLE 3 Comparison of performance with benchmarks. The experiments are performed in eight different buildings (B1 to B8); lower is better for all columns

Buildings  Model                MAPE (%)  MAE (kW)  RMSE (kW)  Train_time (s)  Test_time (s)
B1         LSTM                 0.047     11.862    15.962     22.959          0.184
           LSTM-based SAM       0.044     10.738    14.063     85.206          0.198
           CNN-GRU-based SAM    0.037     9.221     12.302     212.478         0.386
           LSTM-CNN-based SAM   0.020     5.235     8.092      90.714          0.214
B2         LSTM                 0.070     2.991     3.904      17.842          0.160
           LSTM-based SAM       0.067     2.884     3.826      63.377          0.176
           CNN-GRU-based SAM    0.079     3.370     4.323      164.457         0.331
           LSTM-CNN-based SAM   0.027     1.144     1.589      69.224          0.203
B3         LSTM                 0.058     14.784    21.830     23.896          0.187
           LSTM-based SAM       0.045     11.288    17.020     88.131          0.203
           CNN-GRU-based SAM    0.055     13.413    20.038     223.774         0.414
           LSTM-CNN-based SAM   0.026     6.456     9.563      92.730          0.214
B4         LSTM                 0.037     12.290    16.503     24.845          0.194
           LSTM-based SAM       0.026     8.505     12.089     90.195          0.197
           CNN-GRU-based SAM    0.046     15.701    19.712     218.794         0.396
           LSTM-CNN-based SAM   0.018     6.119     8.382      95.337          0.215
B5         LSTM                 0.063     1.887     2.706      25.015          0.193
           LSTM-based SAM       0.059     1.820     2.643      91.419          0.203
           CNN-GRU-based SAM    0.058     1.719     2.385      223.653         0.421
           LSTM-CNN-based SAM   0.027     0.850     1.463      97.628          0.214
B6         LSTM                 0.063     14.247    18.765     25.015          0.189
           LSTM-based SAM       0.060     13.541    18.080     90.697          0.206
           CNN-GRU-based SAM    0.049     12.476    16.752     224.640         0.396
           LSTM-CNN-based SAM   0.029     7.036     9.574      97.024          0.213
B7         LSTM                 0.096     14.699    18.308     24.834          0.193
           LSTM-based SAM       0.101     14.239    17.440     89.814          0.204
           CNN-GRU-based SAM    0.089     12.857    16.342     228.991         0.438
           LSTM-CNN-based SAM   0.061     9.411     11.819     97.303          0.207
B8         LSTM                 0.021     13.906    24.247     24.319          0.188
           LSTM-based SAM       0.017     11.219    19.589     91.623          0.204
           CNN-GRU-based SAM    0.021     13.768    20.962     218.607         0.409
           LSTM-CNN-based SAM   0.008     5.479     12.572     96.009          0.218

3.3 Performance on the different aggregations and time horizons

3.3.1 Performance on the industrial data set

This paper first validates the proposed method's performance on an industrial data set, which includes the consumption data of one factory in an actual area in China. It covers 1 July 2020 to
FIGURE 8 Load forecasting results. The closer a curve is to the original values (shown as closer to the black line in the graph), the better the prediction

31 December 2020, with a sampling interval of 15 min. We split the data into training, validating, and testing sets in a ratio of 0.6/0.2/0.2. The prediction results of the algorithm proposed in this paper are compared with those of the LSTM, LSTM-based SAM, and CNN-GRU-based SAM methods to demonstrate its superiority.

Table 4 summarizes the experimental findings on the industrial data set. Because load prediction is based on actual data, there is a significant amount of noise and outliers. As a result, the relative indicators MAE and RMSE are subject to increased values. The aggregation of plant load is greater than that of individual consumers, and because the plant operates on a schedule, holidays have a more significant impact on the load. As seen in the table, the proposed model outperforms the other three models by more than 15% on the indicators MAE and MAPE, which are prone to noise and outliers. As a result, the suggested model retains superior predictive ability on the actual data set.

TABLE 4 Performance on the industrial data set

Model                MAPE↓   MAE↓     RMSE↓
LSTM                 0.196   83.866   111.850
LSTM-based SAM       0.202   91.362   124.304
CNN-GRU-based SAM    0.209   94.659   123.279
LSTM-CNN-based SAM   0.164   70.154   94.030

3.3.2 Performance on the building data set

We conducted our second test to verify performance on the building data set, derived from the hourly load data provided by the UTD from 1 January 2014 to 31 December 2015. We split the data into training, validating, and testing sets in a ratio of 0.6/0.2/0.2. This paper compares the prediction results of the
FIGURE 9 The histogram of the normalized promotion percentages relative to LSTM

FIGURE 10 MAPE of the models. The horizontal line in the box plot represents the mean value, and the scattered points represent the convergence; the more scattered the points are, the more unstable the model is

proposed algorithm to the model selection method proposed in [32], concluding that the proposed algorithm is more precise. Specifically, we use only load data for the proposed model; in addition to load data, Feng [32] used weather data.

In [32], the model selection method uses reinforcement learning (RL) to choose models from a model pool that includes deterministic load forecasting methods like the artificial neural network (ANN), support vector machine (SVM), and gradient boosting machine (GBM), as well as probabilistic load forecasting methods like normal, Gamma, Laplace, and non-central-t distributions. Furthermore, that work makes use of weather and calendar data to aid in forecasting. According to Table 5, the proposed method performs well. The suggested technique has a 0.022 lower MAPE, a 1.696 lower MAE, and a 2.541 lower RMSE than the model selection method in [32], implying that the proposed method has a higher degree of precision. That literature makes use of RL to pick models, and it was the first study to propose this concept for load prediction, which is novel. However, the proposed model is better than the method in [32], whose model pool uses machine learning models, which have been confirmed to have poorer performance than deep learning methods.

TABLE 5 Performance on the building data set

Model                 MAPE↓   MAE↓    RMSE↓
LSTM                  0.070   2.991   3.904
LSTM-based SAM        0.067   2.884   3.826
CNN-GRU-based SAM     0.079   3.370   4.323
LSTM-CNN-based SAM    0.027   1.144   1.589
Model selection [32]  0.049   2.840   4.130

3.3.3 Performance on the residential data set

Third, we perform a test involving the residential data set obtained from the Smart Grid Smart City (SGSC) project, with 30-min intervals from 12 January to 31 December 2015 [60]. In order to keep the experimental split ratio consistent with W. Kong's work [23], we split the data into training, validating, and testing sets in a ratio of 0.7/0.2/0.1. Based on the comparisons between this paper and W. Kong's work, the present algorithm is significantly more effective.

Residential load forecasting is the most difficult forecasting because of its diversity and volatility. Residents' living patterns and work schedules can have a significant impact on their power consumption habits. Residential electricity consumption,
in particular, has a higher degree of randomness because the switching on and off of electrical appliances is entirely random, and the total amount of residential electricity consumption is small; thus, the use of individual high-powered electrical appliances can cause significant load fluctuations. W. Kong's work used a multilayer LSTM with weather data for residential load forecasting. The results, given in Table 6, show the performance of the methods. Compared with W. Kong's work, the suggested method has a 0.05 lower MAPE, suggesting that the proposed method is superior.

TABLE 6 Performance on the residential data set

Model                              MAPE↓
LSTM                               1.095
LSTM-based SAM                     1.180
CNN-GRU-based SAM                  1.356
LSTM-CNN-based SAM                 0.351
LSTM with weather data input [23]  0.401

3.4 Generalization capability of the proposed model

We verify the model's generalization capability by using training and test sets with different data distributions. The first 80% of the B1 building load data in the UTD data set is used as the training and verification sets at a ratio of 0.6/0.2, and the last 20% of the B2 building load data in the UTD data set is used as the test set. These results are compared to the earlier results, which used B2 building load data to divide the training and test sets. The experimental results are shown in Table 7. While the results are not as good as those of the LSTM-CNN-based SAM that uses B2 directly for training and testing, they still exceed the performance of the other three benchmarks. As a result, the model has a strong capacity for generalization.

TABLE 7 Generalization capability

Model                      MAPE↓   MAE↓    RMSE↓
LSTM                       0.070   2.991   3.904
LSTM-based SAM             0.067   2.884   3.826
CNN-GRU-based SAM          0.079   3.370   4.323
LSTM-CNN-based SAM_B2      0.027   1.144   1.589
LSTM-CNN-based SAM_B1_B2   0.040   1.748   2.460

4 CONCLUSION

An LSTM-CNN-based SAM model is proposed in this paper. The model is suitable for load forecasting at different aggregation levels, including residential, building, and industrial levels, and at different time horizons, including hourly, half-hourly, and 15-min intervals. The essential characteristic of this model is its ability to reduce data inputs while improving load forecasting accuracy. We develop an output-dimensioned load forecasting model: for the extraction of time features, non-stationarity, and long-term dependency, we use the LSTM, CNN, and SAM components, respectively, based on the characteristics of load data. First, the model is compared with LSTM, LSTM-based SAM, and CNN-GRU-based SAM at three different aggregation levels and time horizons, obtained from two widely known public data sets and one real area in China. The proposed method increases the performance of the fundamental load forecasting models by more than 10%, and by 2.2% and 5% compared to earlier research (Feng's work [32] and Kong's work [23]). After that, we conduct experiments to verify the model's generalization ability: we train the model on one data set and then test it on a set with a different distribution. The results demonstrate that the model is capable of generalizing well. In the future, we intend to use the proposed method to tackle other nonstationary forecasting issues, such as wind power forecasting and stock forecasting.

AUTHOR CONTRIBUTIONS
Shiyan Yi: Conceptualization, investigation, methodology, software, visualization, writing - original draft, writing - review and editing. Haichun Liu: Conceptualization, formal analysis, validation, writing - review and editing. Tao Chen: Validation, writing - review and editing. Jianwen Zhang: Conceptualization, data curation, funding acquisition, supervision, validation, writing - review and editing. Yibo Fan: Funding acquisition, project administration, resources, supervision, writing - review and editing.

ACKNOWLEDGEMENTS
This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 52277193 and 62031009), in part by the Shanghai Committee of Science and Technology (19DZ1205403), in part by the Alibaba Innovative Research (AIR) Program, in part by the Fudan University-CIOMP Joint Fund (FC2019-001), in part by the Fudan-ZTE Joint Lab, in part by the Pioneering Project of Academy for Engineering and Technology, Fudan University (gyy2021-001), and in part by the CCF-Alibaba Innovative Research Fund for Young Scholars.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available from the Australian Government Department of the Environment and Energy at https://data.gov.au/data/dataset/smart-grid-smart-city-customer-trial-data and in IEEE Dataport at https://doi.org/10.21227/jdw5-z996.

ORCID
Shiyan Yi https://orcid.org/0000-0002-0744-358X
YI ET AL. 1551

REFERENCES
1. Zhang, G., Guo, J.: A novel ensemble method for hourly residential electricity consumption forecasting by imaging time series. Energy 203, 117858 (2020)
2. Wang, Y., Chen, Q., Hong, T., Kang, C.: Review of smart meter data analytics: Applications, methodologies, and challenges. IEEE Trans. Smart Grid 10(3), 3125–3148 (2018)
3. Lu, W., Sun, Y., Li, B., Chang, L.: Research and application of "source-network-load-storage" coordination optimization technology based on cloud platform. In: 2021 4th International Conference on Energy, Electrical and Power Engineering (CEEPE), pp. 813–818. IEEE, Piscataway (2021)
4. Da Silva, P.G., Ilić, D., Karnouskos, S.: The impact of smart grid prosumer grouping on forecasting accuracy and its benefits for local electricity market trading. IEEE Trans. Smart Grid 5(1), 402–410 (2013)
5. Douglas, A.P., Breipohl, A.M., Lee, F.N., Adapa, R.: Risk due to load forecast uncertainty in short term power system planning. IEEE Trans. Power Syst. 13(4), 1493–1499 (1998)
6. Kim, T.-Y., Cho, S.-B.: Predicting residential energy consumption using cnn-lstm neural networks. Energy 182, 72–81 (2019)
7. Chen, Y., Luh, P.B., Guan, C., Zhao, Y., Michel, L.D., Coolbeth, M.A., Friedland, P.B., Rourke, S.J.: Short-term load forecasting: Similar day-based wavelet neural networks. IEEE Trans. Power Syst. 25(1), 322–330 (2009)
8. Barman, M., Choudhury, N.D., Sutradhar, S.: A regional hybrid goa-svm model based on similar day approach for short-term load forecasting in Assam, India. Energy 145, 710–720 (2018)
9. Massaoudi, M., Refaat, S.S., Chihi, I., Trabelsi, M., Oueslati, F.S., Abu-Rub, H.: A novel stacked generalization ensemble-based hybrid lgbm-xgb-mlp model for short-term load forecasting. Energy 214, 118874 (2021)
10. Cai, L., Gu, J., Jin, Z.: Two-layer transfer-learning-based architecture for short-term load forecasting. IEEE Trans. Ind. Inf. 16(3), 1722–1732 (2019)
11. Li, J., Deng, D., Zhao, J., Cai, D., Hu, W., Zhang, M., Huang, Q.: A novel hybrid short-term load forecasting method of smart grid using mlr and lstm neural network. IEEE Trans. Ind. Inf. 17(4), 2443–2452 (2020)
12. Vu, D.H., Muttaqi, K.M., Agalgaonkar, A.P., Bouzerdoum, A.: Short-term electricity demand forecasting using autoregressive based time varying model incorporating representative data adjustment. Appl. Energy 205, 790–801 (2017)
13. Fang, T., Lahdelma, R.: Evaluation of a multiple linear regression model and sarima model in forecasting heat demand for district heating system. Appl. Energy 179, 544–552 (2016)
14. Mi, J., Fan, L., Duan, X., Qiu, Y.: Short-term power load forecasting method based on improved exponential smoothing grey model. Math. Prob. Eng. 2018, 3894723 (2018)
15. Wang, D., Luo, H., Grunder, O., Lin, Y., Guo, H.: Multi-step ahead electricity price forecasting using a hybrid model based on two-layer decomposition technique and bp neural network optimized by firefly algorithm. Appl. Energy 190, 390–407 (2017)
16. Chen, Y., Xu, P., Chu, Y., Li, W., Wu, Y., Ni, L., Bao, Y., Wang, K.: Short-term electrical load forecasting using the support vector regression (svr) model to calculate the demand response baseline for office buildings. Appl. Energy 195, 659–670 (2017)
17. Cai, M., Pipattanasomporn, M., Rahman, S.: Day-ahead building-level load forecasts using deep learning vs. traditional time-series techniques. Appl. Energy 236, 1078–1088 (2019)
18. Bouktif, S., Fiaz, A., Ouni, A., Serhani, M.A.: Optimal deep learning lstm model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches. Energies 11(7), 1636 (2018)
19. Marino, D.L., Amarasinghe, K., Manic, M.: Building energy load forecasting using deep neural networks. In: IECON 2016 - 42nd Annual Conference of the IEEE Industrial Electronics Society, pp. 7046–7051. IEEE, Piscataway (2016)
20. Shi, H., Xu, M., Li, R.: Deep learning for household load forecasting - novel pooling deep rnn. IEEE Trans. Smart Grid 9(5), 5271–5280 (2017)
21. Kong, X., Li, C., Zheng, F., Wang, C.: Improved deep belief network for short-term load forecasting considering demand-side management. IEEE Trans. Power Syst. 35(2), 1531–1538 (2019)
22. Chen, K., Chen, K., Wang, Q., He, Z., Hu, J., He, J.: Short-term load forecasting with deep residual networks. IEEE Trans. Smart Grid 10(4), 3943–3952 (2018)
23. Kong, W., Dong, Z.Y., Jia, Y., Hill, D.J., Xu, Y., Zhang, Y.: Short-term residential load forecasting based on lstm recurrent neural network. IEEE Trans. Smart Grid 10(1), 841–851 (2017)
24. Yao, C., Yang, P., Liu, Z.: Load forecasting method based on cnn-gru hybrid neural network. Power Syst. Technol. 44, 3416–3424 (2020)
25. Hongchuan, C., Xu, C., Guoqi, S., Xiaobin, W., Yunfeng, C., Xuefeng, S., Hui, S., Lingyan, Z.: Similar day short-term load forecasting based on intelligent optimization method. Power Syst. Protect. Control 49, 121–127 (2021)
26. Zheng, H., Yuan, J., Chen, L.: Short-term load forecasting using emd-lstm neural networks with a xgboost algorithm for feature importance evaluation. Energies 10(8), 1168 (2017)
27. Park, R.-J., Song, K.-B., Kwon, B.-S.: Short-term load forecasting algorithm using a similar day selection method based on reinforcement learning. Energies 13(10), 2640 (2020)
28. Kong, W., Dong, Z.Y., Hill, D.J., Luo, F., Xu, Y.: Short-term residential load forecasting based on resident behaviour learning. IEEE Trans. Power Syst. 33(1), 1087–1088 (2017)
29. Ji, Y., Buechler, E., Rajagopal, R.: Data-driven load modeling and forecasting of residential appliances. IEEE Trans. Smart Grid 11(3), 2652–2661 (2019)
30. Yahui, L., Qian, Z.: Ultra-short-term power load forecasting method based on cluster empirical mode decomposition of cnn-lstm. Power Syst. Technol. 44, 1–8 (2021)
31. Deng, D., Li, J., Zhang, Z., Teng, Y., Huang, Q.: Short-term electric load forecasting based on eemd-gru-mlr. Power Syst. Technol. 44(2), 593–602 (2020)
32. Feng, C., Sun, M., Zhang, J.: Reinforced deterministic and probabilistic load forecasting via q-learning dynamic model selection. IEEE Trans. Smart Grid 11(2), 1377–1386 (2019)
33. Li, Y., Zhang, S., Hu, R., Lu, N.: A meta-learning based distribution system load forecasting model selection framework. Appl. Energy 294, 116991 (2021)
34. Somu, N., MR, G.R., Ramamritham, K.: A deep learning framework for building energy consumption forecast. Renew. Sustain. Energy Rev. 137, 110591 (2021)
35. Eskandari, H., Imani, M., Moghaddam, M.P.: Convolutional and recurrent neural network based model for short-term load forecasting. Electr. Power Syst. Res. 195, 107173 (2021)
36. Tian, C., Ma, J., Zhang, C., Zhan, P.: A deep neural network model for short-term load forecast based on long short-term memory network and convolutional neural network. Energies 11(12), 3493 (2018)
37. Senjyu, T., Takara, H., Uezato, K., Funabashi, T.: One-hour-ahead load forecasting using neural network. IEEE Trans. Power Syst. 17(1), 113–118 (2002)
38. Ziel, F.: Modeling public holidays in load forecasting: A German case study. J. Mod. Power Syst. Clean Energy 6(2), 191–207 (2018)
39. Amjady, N.: Short-term hourly load forecasting using time-series modeling with peak load estimation capability. IEEE Trans. Power Syst. 16(3), 498–505 (2001)
40. Boudraa, A.-O., Cexus, J.-C.: Emd-based signal filtering. IEEE Trans. Instrum. Meas. 56(6), 2196–2202 (2007)
41. Yin, L., Xie, J.: Multi-temporal-spatial-scale temporal convolution network for short-term load forecasting of power systems. Appl. Energy 283, 116328 (2021)
42. Tan, M., Santos, C.d., Xiang, B., Zhou, B.: Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 (2015)
43. Wang, K., Qi, X., Liu, H.: Photovoltaic power forecasting based lstm-convolutional network. Energy 189, 116225 (2019)
44. Jiang, L., Wang, X., Li, W., Wang, L., Yin, X., Jia, L.: Hybrid multitask multi-information fusion deep learning for household short-term load forecasting. IEEE Trans. Smart Grid 12(6), 5362–5372 (2021)
45. Aslam, S., Herodotou, H., Mohsin, S.M., Javaid, N., Ashraf, N., Aslam, S.: A survey on deep learning methods for power load and renewable energy forecasting in smart microgrids. Renew. Sustain. Energy Rev. 144, 110992 (2021)
46. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 2, pp. 207–212. ACM, New York (2016)
47. Lin, Z., Feng, M., Santos, C.N.d., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
48. Zang, H., Xu, R., Cheng, L., Ding, T., Liu, L., Wei, Z., Sun, G.: Residential load forecasting based on lstm fusing self-attention mechanism with pooling. Energy 229, 120682 (2021)
49. Zhao, B., Wang, Z., Ji, W., Gao, X., Li, X.: A short-term power load forecasting method based on attention mechanism of cnn-gru. Power Syst. Technol. 43(12), 4370–4376 (2019)
50. Hyndman, R.J., Athanasopoulos, G.: Forecasting: Principles and Practice. OTexts, Melbourne, Australia (2018)
51. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
52. Pan, Q., Hu, W., Chen, N.: Two birds with one stone: Series saliency for accurate and interpretable multivariate time series forecasting. In: International Joint Conference on Artificial Intelligence (IJCAI). Springer, Cham (2021)
53. Zhang, Y., Wang, Y., Luo, G.: A new optimization algorithm for non-stationary time series prediction based on recurrent neural networks. Future Gener. Comput. Syst. 102, 738–745 (2020)
54. Wang, S., Wang, X., Wang, S., Wang, D.: Bi-directional long short-term memory method based on attention mechanism and rolling update for short-term load forecasting. Int. J. Electr. Power Energy Syst. 109, 470–479 (2019)
55. Gao, C., Zhang, N., Li, Y., Bian, F., Wan, H.: Self-attention-based time-variant neural networks for multi-step time series forecasting. Neural Comput. Appl. 34(11), 8737–8754 (2022)
56. Wu, S., Wang, G., Tang, P., Chen, F., Shi, L.: Convolution with even-sized kernels and symmetric padding. In: Advances in Neural Information Processing Systems, vol. 32. MIT Press, Cambridge, MA (2019)
57. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. USENIX Association, Berkeley, CA (2016)
58. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
59. Zhang, J., Feng, C.: Short-term load forecasting data with hierarchical advanced metering infrastructure and weather features (2019). Available: https://doi.org/10.21227/jdw5-z996
60. Ausgrid: Smart-grid smart-city customer trial data (2015). Available: https://data.gov.au/data/dataset/smart-grid-smart-city-customer-trial-data

How to cite this article: Yi, S., Liu, H., Chen, T., Zhang, J., Fan, Y.: A deep LSTM-CNN based on self-attention mechanism with input data reduction for short-term load forecasting. IET Gener. Transm. Distrib. 17, 1538–1552 (2023). https://doi.org/10.1049/gtd2.12763