Article
Metaformer: A Transformer That Tends to Mine
Metaphorical-Level Information
Bo Peng 1,2, Yuanming Ding 1,2,* and Wei Kang 1,2
Abstract: Since its introduction, the Transformer model has dramatically influenced various fields of
machine learning. The field of time series prediction has also been significantly impacted, where
Transformer family models have flourished and many variants have been derived. These
Transformer models mainly use attention mechanisms to implement feature extraction and multi-
head attention mechanisms to enhance the strength of feature extraction. However, multi-head
attention is essentially a simple superposition of the same attention mechanism, so it does not guarantee that
the model can capture different features. Conversely, multi-head attention mechanisms may lead
to considerable information redundancy and wasted computational resources. To ensure that the
Transformer can capture information from multiple perspectives and increase the diversity of its
captured features, this paper proposes, for the first time, a hierarchical attention mechanism that
addresses two shortcomings of traditional multi-head attention: the insufficient diversity of the
captured information and the lack of information interaction among the heads. Additionally, global
feature aggregation using graph networks is used to mitigate inductive bias. Finally, we conducted
experiments on four benchmark datasets, and the experimental results show that the proposed model
outperforms the baseline models on several metrics.
Keywords: hierarchical attention; multi-head attention; graph neural networks; feature diversity;
information interaction
Sequential data are more suitable for processing with the Transformer than computer
vision data are. Traditional time series prediction has mostly relied on Recurrent
Neural Network (RNN) [23,24] models, among which the more influential ones include the
Gated Recurrent Unit (GRU) [25] and Long Short-Term Memory (LSTM) [26,27] networks.
For example, Mou et al. [28] proposed a Time-Aware LSTM (T-LSTM) with temporal
information enhancement, whose main idea is to divide memory states into short-term
memory and long-term memory, adjust the influence of short-term memory according to
the time interval between inputs (the longer the time interval, the smaller the influence of
short-term memory), and then recombine the adjusted short-term memory and the long-term
memory into a new memory state. However, the emergence of the Transformer soon shook
the dominance of RNN family models in time series prediction because of the
following bottlenecks of RNNs in long-horizon prediction problems.
(1) Parallelism bottleneck: The RNN family of models requires the input data to be arranged
in temporal order and computed sequentially in that order. This serial structure has the
advantage of inherently encoding positional relationships, but it also prevents the model from
being computed in parallel. Especially when facing long sequences, the inability to parallelise
means more time and computational cost.
(2) Gradient bottleneck [29]: One performance bottleneck of RNN networks is the
frequent problem of gradient disappearance or gradient explosion during training. Most
neural network models optimise model parameters by computing gradients. Gradient
disappearance or gradient explosion can cause the model to fail to converge or converge
too slowly, which means that for the RNN family of networks, it is difficult to make the
model better by increasing the number of iterations or increasing the size of the network.
(3) Memory bottleneck: At each time step, the RNN requires a positional input x_t and a
hidden input h_{t−1}, which are fused within the model according to its fixed rules to produce
a hidden state h_t. Therefore, when the sequence is too long, h_t retains almost nothing of the
earlier positional inputs; that is, the "forgetting" phenomenon occurs.
Compared with the RNN family of models, the Transformer portrays the positional rela-
tionships among sequence elements through positional encoding, without feeding the sequential
data in recursively. This makes the model more flexible and allows the greatest possible
parallelisation over time series data. The positional encoding also ensures that no forgetting
occurs: the information at each position has equal status for the Transformer. Additionally,
using an attention mechanism to extract internal features allows the model to choose to focus
on important information, and the problem of gradient disappearance or gradient explosion can
be mitigated by ignoring irrelevant and redundant information. Based on the above advantages
of Transformer models, many scholars are now trying to apply Transformer models to time
series tasks.
2. Research Background
Transformer is a typical encoder-decoder-based sequence-to-sequence [30] model, and
this structure is well suited for processing sequence data. Several researchers have tried
to improve the Transformer model to meet the needs of more complex applications. For
example, Kitaev et al. [31] proposed a Reformer model that uses Locality Sensitive Hashing
Attention (LSH) to reduce the complexity of the original model from O(L²) to O(L log L).
Zhou et al. [32] proposed an Informer model for Long Sequence Time Series Forecasting
(LSTF), which accurately captures the long-term dependence between output and input
and exhibits high predictive power. Wu et al. [33] proposed the Autoformer model, which
uses a deep decomposition architecture and an autocorrelation mechanism to improve
LSTF accuracy. The Autoformer model achieves desirable results even when the prediction
length is much longer than the input length, i.e., it can predict the longer-term
future based on limited information. Zhou et al. [34] proposed the FEDformer model,
which provides a way to apply the attention mechanism in the frequency domain and can
be used as an essential complement to the time domain analysis.
The Transformer models described above focus on reducing temporal and spatial
complexity, but they do little to enhance the diversity of the information they capture. The attention
mechanism is the core component the Transformer uses for feature extraction. It is designed to
let the model focus on the more important information, which means a certain amount of
information is inevitably lost. The multi-head attention mechanism is meant to compensate for this.
However, since every attention head captures features in the same way, there is no guarantee that each
attention head captures different vital features. Moreover, because the multi-head attention mechanism
essentially divides the representation into multiple mutually independent subspaces, this approach completely
cuts off the connections between these subspaces, which leads to a lack of interaction between
the information captured by the different heads. To address these problems, this paper proposes
a hierarchical attention mechanism in which each layer uses a different attention
mechanism to capture features. The higher layers reuse the information captured by the
lower layers, thus enhancing the Transformer's ability to perceive deeper information.
3. Research Methodology
3.1. Problem Description
Initially, the Transformer model was proposed by Vaswani et al. to solve the machine
translation problem, so the Vanilla Transformer is better suited to processing textual data.
For example, the primary processing unit of the Vanilla Transformer model is a word vector,
and each word vector is called a token. In contrast, in the time series prediction problem,
our basic processing unit becomes a timestamp. If we want to apply the Transformer to a time
series problem, the reasonable approach is to encode the multivariate sequence information of
each timestamp into a token vector. This modelling approach is also the treatment adopted by many
mainstream Transformer-like models.
Here, for the convenience of the subsequent description, we define the dimension of
the token as d, the input length of the model as I, and the output length as O. Further,
the model's input can be defined as X = {x_1, · · ·, x_I} ∈ R^{I×d}, and the model's output as
X̂ = {x̂_1, · · ·, x̂_O} ∈ R^{O×d}. Therefore, this paper aims to learn a mapping T(·) from the
input space to the output space:

X̂ = T(X)    (1)
3.2.1. Decomposer
The main difficulty of time series forecasting lies in discovering the hidden trend-
cyclical and seasonal components of the historical series. The trend-cyclical component records
the overall trend of the series, which has an essential influence on its long-term behaviour.
The seasonal component records the hidden cyclical pattern of the series,
which mainly shows its regular short-term fluctuation. It is generally
difficult to predict these two pieces of information simultaneously. The basic idea is
to decompose them: extract the trend-cyclical from the sequence using average
pooling and filter out the seasonal part using the trend-cyclical. This is how the Decomposer
implements the decomposition, as shown in Algorithm 1.
Algorithm 1 Decomposer
Require: X
Ensure: S , T
1: T ← avgpool(padding(X ))
2: S ← X − T
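For concreteness, the following is a minimal PyTorch sketch of Algorithm 1, assuming a moving-average window `kernel_size` and replication padding at both ends; the exact pooling window and padding scheme are implementation choices not specified above.

```python
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    """Sketch of Algorithm 1: split a series into seasonal and trend-cyclical parts."""
    def __init__(self, kernel_size: int = 25):  # kernel size is an assumed hyperparameter
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x: torch.Tensor):
        # x: (batch, length, d) -- one token of dimension d per timestamp
        # pad both ends by repeating boundary values so the pooled series keeps length L
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        back = x[:, -1:, :].repeat(1, self.kernel_size // 2, 1)
        padded = torch.cat([front, x, back], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)  # T <- avgpool(padding(X))
        seasonal = x - trend                                      # S <- X - T
        return seasonal, trend
```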
3.2.2. Encoder
The encoder is mainly responsible for encoding the input data, realising the
transformation from the input space to the feature space. The decomposer in the encoder
acts more like a filter because, in the encoder, we focus on the seasonal part of the
sequence and ignore the trend-cyclical. The input data are first passed through a hierarchical
attention layer for initial key-feature extraction, after which the decomposer extracts the
seasonal features of the sequence; these are then fed into the graph network to
mitigate inductive bias. After stacking N such layers, the resulting seasonal features
serve as auxiliary inputs to the decoder. Algorithm 2 describes the computation procedure.
Algorithm 2 Encoder
Require: X_en
Ensure: X_en^N
1: for l = 1, · · ·, N do
2:   if l = 1 then
3:     X_en^{l−1} ← X_en
4:   end if
5:   S_en^{l,1}, _ ← D(H(X_en^{l−1}) + X_en^{l−1})
6:   S_en^{l,2}, _ ← D(G(S_en^{l,1}) + S_en^{l,1})
7:   X_en^l ← S_en^{l,2}
8: end for
Here, X_en ∈ R^{I×d} denotes the historical observation sequence. N denotes the num-
ber of stacked layers of the encoder. X_en^N denotes the output of the N-th layer encoder.
D denotes the decomposer operator. G denotes the graph network operator and H de-
notes the hierarchical attention mechanism, the concrete implementation of which will be
described later.
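The following PyTorch-style sketch illustrates how one encoder layer (lines 5–7 of Algorithm 2) might be wired; `attn`, `graph`, and `decomp` are placeholders for the H, G, and D operators, and their interfaces are assumptions of this illustration.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Schematic sketch of lines 5-7 of Algorithm 2 (one encoder layer)."""
    def __init__(self, attn: nn.Module, graph: nn.Module, decomp: nn.Module):
        super().__init__()
        self.attn = attn      # H: hierarchical attention operator (interface assumed)
        self.graph = graph    # G: graph-network operator (interface assumed)
        self.decomp = decomp  # D: decomposer returning (seasonal, trend)

    def forward(self, x):
        # S^{l,1}, _ <- D(H(X^{l-1}) + X^{l-1}); the trend part is discarded in the encoder
        s1, _ = self.decomp(self.attn(x) + x)
        # S^{l,2}, _ <- D(G(S^{l,1}) + S^{l,1})
        s2, _ = self.decomp(self.graph(s1) + s1)
        return s2  # X^l <- S^{l,2}
```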
3.2.3. Decoder
The structure of the decoder is more complex than that of the encoder. Its
internal modules are identical to the encoder's, but it uses a multi-input structure and passes
through two hierarchical attention calculations and three sequence decompositions in turn.
If the model's encoder is viewed as a feature catcher, the decoder is a feature fuser
that fuses and corrects the inputs from different sources to obtain the correct prediction
sequence. The decoder has three primary input sources: the seasonal part X_des and the
trend-cyclical X_det extracted from the original series, and the seasonal features X_en^N captured
by the encoder. The computation of the trend-cyclical and seasonal parts is kept relatively
independent throughout the computation process. Only at the final output is a linear
layer used to fuse the two to obtain the final prediction X_pred. The computation process is
described in Algorithm 3.
Algorithm 3 Decoder
Require: X_en, X_en^N
Ensure: X_pred
1: X_ens, X_ent ← D(X_en^{I/2:I})
2: X_des ← X_ens ‖ 0^{0:I/2}
3: X_det ← X_ent ‖ X^{0:I/2}
4: for l = 1, · · ·, M do
5:   if l = 1 then
6:     X_de^{l−1} ← X_des
7:     T_de^{l−1} ← X_det
8:   end if
9:   S_de^{l,1}, T_de^{l,1} ← D(H(X_de^{l−1}) + X_de^{l−1})
10:  S_de^{l,2}, T_de^{l,2} ← D(H(S_de^{l,1}, X_en^N) + S_de^{l,1})
11:  S_de^{l,3}, T_de^{l,3} ← D(G(S_de^{l,2}) + S_de^{l,2})
12:  X_de^l ← S_de^{l,3}
13:  T_de^l ← T_de^{l−1} + W_{l,1} ∗ T_de^{l,1} + W_{l,2} ∗ T_de^{l,2} + W_{l,3} ∗ T_de^{l,3}
14: end for
15: X_pred ← W_S ∗ X_de^M + T_de^M
Here, X_en denotes the original sequence, which is also the input to the encoder. It is
decomposed into the seasonal part X_ens and the trend-cyclical part X_ent before being fed into
the decoder as the initial input.
Equation (2) calculates the multi-head attention mechanism, where L_{θq}, L_{θk}, L_{θv}, L_{θo}
denote linear layers with projection parameter matrices W^Q ∈ R^{d_m×d_k}, W^K ∈ R^{d_m×d_k},
W^V ∈ R^{d_m×d_v}, and W^O ∈ R^{h·d_v×d_m}, respectively. h denotes the number of heads of attention.
A denotes scaled dot-product attention. ‖ denotes sequential concatenation (cascade).
Here, L_{θq}, L_{θk}, L_{θv}, L_{θo} have the same meaning as in Equation (2). R denotes the GRU
unit. Y records the information of each layer and is finally mapped to the specified dimension
by a linear layer to form the module's output. A_i denotes the different attention calculation methods.
This paper mainly uses four common attention mechanisms: Vanilla Attention, ProbSparse
Attention, LSH Attention, and AutoCorrelation. Strictly speaking, AutoCorrelation is not
part of the attention mechanism family. However, its effect is similar to or even better than that of
attention mechanisms, so it is introduced into our model and involved in feature extraction.
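Because the defining equations of the hierarchical attention mechanism are not reproduced here, the following PyTorch sketch is only one possible reading of the description above: each level applies a different attention operator A_i to the previous level's output, a GRU cell R folds each level's features into a running state Y, and a linear layer maps the accumulated state to the required dimension. The module names and the exact wiring are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    """Sketch of the hierarchical attention idea: stacked, heterogeneous attention
    operators with a GRU carrying information between levels (wiring is assumed)."""
    def __init__(self, d_model: int, attentions: list):
        super().__init__()
        self.attentions = nn.ModuleList(attentions)  # A_i: e.g. vanilla, ProbSparse, LSH, AutoCorrelation
        self.rnn = nn.GRUCell(d_model, d_model)      # R: records each level's information
        self.proj = nn.Linear(d_model, d_model)      # maps the recorded state to the output dimension

    def forward(self, x):
        # x: (batch, length, d_model)
        b, L, d = x.shape
        y = torch.zeros(b * L, d, device=x.device)   # Y: accumulated per-token state
        out = x
        for attn in self.attentions:
            out = attn(out)                          # higher levels consume lower-level outputs
            y = self.rnn(out.reshape(b * L, d), y)   # fold this level's features into Y
        return self.proj(y).reshape(b, L, d)
```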
Attention is the core building block of Transformer and is considered an essential tool
for information capture in both CV and NLP domains. Many researchers have worked on
designing more efficient attention, so many variants based on Vanilla Attention have been
proposed in succession. The following briefly describes the four attention mechanisms
used in our model.
A(Q, K, V) = σ†(QK^⊤ / √d_k) V    (3)
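For reference, a minimal implementation of Equation (3), taking σ† to be the row-wise softmax:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Equation (3): A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (..., L_q, L_k)
    weights = torch.softmax(scores, dim=-1)            # sigma-dagger: row-wise softmax
    return weights @ V
```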
KL(q ‖ p) = ln ∑_{l=1}^{L_K} e^{q_i k_l^⊤ / √d} − (1/L_K) ∑_{j=1}^{L_K} q_i k_j^⊤ / √d − ln L_K    (5)
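A small sketch of Equation (5) as a per-query score; in ProbSparse attention such a score is used to keep only the queries that deviate most from a uniform attention distribution:

```python
import math
import torch

def query_sparsity_score(Q, K):
    """Equation (5): per-query sparsity measurement used by ProbSparse attention.
    Larger values indicate queries whose attention distribution is far from uniform."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (..., L_Q, L_K)
    L_K = scores.size(-1)
    # ln sum_l exp(.) - (1/L_K) sum_j (.) - ln L_K
    return torch.logsumexp(scores, dim=-1) - scores.mean(dim=-1) - math.log(L_K)
```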
A(q_i, K, V) = ∑_{j∈P_i} ( a(q_i, k_j) / ∑_{l∈P_i} a(q_i, k_l) ) v_j    (6)
3.3.6. AutoCorrelation
AutoCorrelation mechanisms are different from the types of attention mechanisms
above. Whereas the self-attentive family focuses on the correlation between points, the
AutoCorrelation mechanism focuses on the correlation between segments. Therefore,
AutoCorrelation mechanisms are an excellent complement to self-attentive mechanisms.
R_{Q,K}(τ) = (1/L) ∑_{t=1}^{L} Q_t K_{t−τ}    (9)
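Equation (9) can be evaluated for all lags at once with an FFT-based circular correlation, as Autoformer does for efficiency. A minimal sketch follows; treating the correlation per channel along the time axis is an assumption of this illustration.

```python
import torch

def autocorrelation(Q, K):
    """Equation (9): R_{Q,K}(tau) = (1/L) * sum_t Q_t * K_{t-tau}, computed for every
    lag tau at once via FFT (circular correlation along the time axis)."""
    # Q, K: (..., L, d)
    L = Q.size(-2)
    q_fft = torch.fft.rfft(Q, dim=-2)
    k_fft = torch.fft.rfft(K, dim=-2)
    corr = torch.fft.irfft(q_fft * torch.conj(k_fft), n=L, dim=-2)  # index tau = lag
    return corr / L
```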
heuristic or "value" for selecting hypotheses in a large hypothesis space. For example,
convolutional networks assume that information is spatially local, spatially invariant, and
translation-equivariant, so the parameter space can be reduced by sharing sliding convolu-
tional weights; recurrent neural networks assume that information is sequential and
invariant to temporal translation, so weight sharing is likewise possible. Similarly,
the attention mechanism also carries assumptions, such as the uselessness of some of the
information. If attention mechanisms are stacked, some critical information will be lost, so
adding an FFN layer can to some extent alleviate the accumulation of inductive bias and avoid
network collapse. Of course, the FFN layer is not the only structure with this mitigating effect: we
find that a similar effect can be achieved using a Graph Neural Network (GNN) [38–40].
Here, we use a two-layer GAT [41,42] network instead of the original FFN layer. The
graph network has the property of aggregating the information of neighbouring nodes;
that is, through the aggregation of the graph network, each node fuses some features of
its neighbouring nodes. Additionally, we use random sampling of neighbours to reduce the complexity,
because our goal is not feature aggregation itself but mitigating the loss of crucial
information. In particular, when the number of samples per node is 0, the graph network
can be considered to degenerate into an FFN layer with a role similar to that of the
original FFN.
Here, we model each token as a node in the graph and mine the dependencies
between nodes using the graph attention algorithm. The input to GAT is defined as
H = {h⃗_1, h⃗_2, · · ·, h⃗_N}. Here, h⃗_i ∈ R^F denotes the input vector of the i-th node, N denotes
the number of nodes in the graph, and F denotes the dimensionality of the input vector.
Through the computation of the GAT network, this layer generates a new set of node
features H′ = {h⃗′_1, h⃗′_2, · · ·, h⃗′_N}. Similarly, here h⃗′_i ∈ R^{F′} denotes the output vector of the
i-th node, and F′ denotes the dimensionality of the output vector.
Figure 3 gives the general flow of information aggregation for a single node.
Equation (11) is a concrete implementation of calculating the attention coefficient e_ij for
the i-th node and each of its neighbour nodes j in turn. Equation (12) is used to calculate the
normalised attention factor α_ij:

α_ij = σ†(e_ij) = exp(e_ij) / ∑_{k∈N_i} exp(e_ik)    (12)
Here, N_i denotes the set of all neighbouring nodes of the i-th node, and W is a shared
parameter matrix for the linear mapping of node features. F is a single-layer feedforward neural
network that maps the concatenated high-dimensional features to a real number e_ij. e_ij is the
attention coefficient of node j → i, and α_ij is its normalised value.
Finally, the new feature vector h⃗′_i of the current node i is obtained by weighting
and summing the feature vectors of its neighbouring nodes according to the calculated
attention coefficients, where h⃗′_i records the neighbourhood information of the current node:

h⃗′_i = σ( ∑_{j∈N_i} α_ij W h⃗_j )    (13)
Here, σ represents a non-linear activation function (the logistic sigmoid) applied at the end.
Furthermore, if information aggregation is accomplished through a K-head attention
mechanism, the final output vector can be obtained by taking the average:
h⃗′_i = σ( (1/K) ∑_{k=1}^{K} ∑_{j∈N_i} α^k_ij W^k h⃗_j )    (14)
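A single-head sketch of Equations (11)–(13) over a dense adjacency mask; Equation (11) is represented here by a learnable map `a` from the concatenated pair [W h⃗_i ‖ W h⃗_j] to e_ij, and the adjacency (including any self-loops or sampled neighbours) is assumed to be supplied by the caller.

```python
import torch
import torch.nn as nn

class GATLayer(nn.Module):
    """Single-head sketch of Equations (11)-(13); adj is a boolean (N, N) edge mask,
    expected to include self-loops (cf. the Meta-v1 self-loop variant)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)  # shared linear map W
        self.a = nn.Linear(2 * out_features, 1, bias=False)        # F: [Wh_i || Wh_j] -> e_ij

    def forward(self, h, adj):
        Wh = self.W(h)                                   # (N, F')
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = self.a(pairs).squeeze(-1)                    # attention coefficients e_ij
        e = e.masked_fill(~adj, float('-inf'))           # keep only neighbours j in N_i
        alpha = torch.softmax(e, dim=-1)                 # Equation (12)
        return torch.sigmoid(alpha @ Wh)                 # Equation (13), sigma = logistic sigmoid
```

Equation (14) would correspond to running K such layers in parallel and averaging their outputs.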
4. Experiment
4.1. Dataset Description
To evaluate the Metaformer model, we conducted experiments on four popular real-
world datasets encompassing energy, economy, disease, and transportation domains. The
Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014,
accessed on 24 February 2023) dataset describes the hourly electricity consumption of 321
customers; the Exchange [43] dataset describes the daily exchange rates of eight countries;
the Illness (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html, accessed on 24
February 2023) dataset is the weekly data of influenza-like illnesses recorded by the Centers
for Disease Control; and the Traffic (http://pems.dot.ca.gov/, accessed on 24 February
2023) dataset describes the occupancy rate of roads in the San Francisco Bay area.
Table 1 shows detailed dataset statistics, where #Sample is the total number of samples,
#Features is the number of features acquired per sampling, Period is the sampling period,
and Span is the sampling time span.
Since the scale of each element in the dataset is not uniform, we need to normalise the
data before formal training for the model to treat different features equally during training.
Equations (15) and (16) are the normalisation and denormalisation calculation methods,
respectively, where X denotes the original sampled dataset and X ∗ denotes the normalised
dataset. Figure 4 shows the variation of the four normalised features randomly selected
from the four data sets.
X* = (X − E(X)) / √D(X)    (15)

X = X* · √D(X) + E(X)    (16)
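Equations (15) and (16) amount to standard z-score normalisation; a minimal sketch, assuming per-feature statistics computed over the training split:

```python
import torch

def normalise(X):
    """Equation (15): subtract the mean and divide by the standard deviation per feature."""
    mean, std = X.mean(dim=0), X.std(dim=0)
    return (X - mean) / std, mean, std

def denormalise(X_norm, mean, std):
    """Equation (16): map normalised values (or predictions) back to the original scale."""
    return X_norm * std + mean
```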
Figure 4. Variation of the four randomly selected normalised features in the dataset. (a) Electricity.
(b) Exchange. (c) Illness. (d) Traffic.
decoder layers is set to M = 1. We use MSE as the loss function and Adam as the optimiser
with a learning rate of 0.0001. We train the model for 20 iterations, but employ an early
termination strategy with a tolerance of 3.
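A hedged sketch of this training configuration as a function; `model`, `train_loader`, `val_loader`, and `evaluate` are caller-supplied placeholders:

```python
import torch

def train(model, train_loader, val_loader, evaluate):
    """Training loop matching the stated setup: MSE loss, Adam (lr = 1e-4),
    up to 20 iterations, early termination with a tolerance of 3."""
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_val, tolerance, bad_rounds = float('inf'), 3, 0
    for epoch in range(20):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        val_loss = evaluate(model, val_loader)  # validation metric, e.g. mean MSE
        if val_loss < best_val:
            best_val, bad_rounds = val_loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= tolerance:         # early termination
                break
    return model
```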
Table 2. Multivariate long-term series prediction results for four datasets with an input length of
I = 96 and prediction length of O ∈ {96, 192, 336, 720}. Lower MSE and MAE values indicate better
results, and the best results are highlighted in bold.
Figure 5. The MSE loss plots for the Metaformer model on the training and test sets for the four
datasets. (a) Electricity (train), (b) Electricity (vali), (c) Exchange (train), (d) Exchange (vali), (e) Traffic
(train), (f) Traffic (vali), (g) ILI (train), (h) ILI (vali).
Table 3. Three variants of Metaformer. ✓ and ✗ indicate that the specified structure was or was not
used, respectively.
As shown in Table 4, the Meta-v1 variant of the model, which uses only the self-
loop graph, generally outperforms the other variants across multiple measures. This
phenomenon may be because the self-loop edges are self-weighted, which more effectively
reduces the inductive bias of the attention mechanism in the Metaformer model by
reinforcing the features of specific nodes. Conversely, adding a fully connected mech-
anism may further exacerbate the information perturbation. However, due to limited
experimental resources, we could not conduct a more in-depth study. In future work, we
will further investigate how random sampling of neighbouring nodes, the inclusion of more
attention mechanisms, and the stacking order of these attention mechanisms affect the
model's performance.
5. Conclusions
This paper presents a redesigned sequence-to-sequence model based on the Trans-
former architecture. We draw inspiration from the sequence decomposition of Auto-
former and introduce a similar approach to separate the trend and seasonal components. Additionally,
we propose a hierarchical attention mechanism to address the incomplete and in-
sufficient information mining performed by the multi-head attention mechanism in the Vanilla Transformer
model. Our hierarchical attention mechanism employs different attention mechanisms
simultaneously to ensure diversity in information mining. The hierarchical structure recur-
sively passes information captured by lower-level attention upward, enabling interaction
between the multiple attention mechanisms and deepening the network's understanding of
more profound information. This mechanism is also beneficial for capturing the metaphorical
information present in both text and images. We further add a graph attention network to the
model, allowing it to aggregate information from a global, high-dimensional perspective and mitigate the
inductive bias. Our experimental results demonstrate that the proposed
model outperforms the baseline models across multiple datasets and improves significantly
on all evaluation metrics.
Author Contributions: Conceptualization, B.P. and Y.D.; data curation, B.P. and W.K.; formal analysis,
B.P.; funding acquisition, Y.D.; investigation, B.P.; methodology, B.P.; project administration, Y.D.;
writing—review and editing, B.P., Y.D. and W.K. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the National Science Foundation of China and General
Project Fund in the Field of Equipment Development Department, grant number No. 61901079,
No. 61403110308. The APC was funded by Dalian University.
Institutional Review Board Statement: Not applicable.
References
1. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
2. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. arXiv
2019, arXiv:1906.01787.
3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
2017. Available online: http://arxiv.org/abs/1706.03762 (accessed on 23 January 2021).
4. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022. [CrossRef]
5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg,
Germany, 2020; pp. 213–229.
6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
7. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the
International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064.
8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
9. Chen, X.; Wu, Y.; Wang, Z.; Liu, S.; Li, J. Developing real-time streaming transformer transducer for speech recognition on
large-scale dataset. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Toronto, ON, Canada, 6–11 July 2021; pp. 5904–5908.
10. Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings
of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20
April 2018; pp. 5884–5888.
11. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-
augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
12. Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck,
D. Music transformer. arXiv 2018, arXiv:1809.04281.
13. Huang, Y.S.; Yang, Y.H. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In
Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1180–1188.
14. Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C.A.; Bekas, C.; Lee, A.A. Molecular transformer: A model for uncertainty-
calibrated chemical reaction prediction. ACS Cent. Sci. 2019, 5, 1572–1583. [CrossRef]
15. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function
emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
[CrossRef] [PubMed]
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 July 2016; pp. 770–778.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 1, 91–99. [CrossRef]
18. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten
zip code recognition. Neural Comput. 1989, 1, 541–551. [CrossRef]
19. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. Adv.
Neural Inf. Process. Syst. 2019, 32, 68–80.
20. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the
International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1691–1703.
21. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
22. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation
from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
23. Van Der Westhuizen, J.; Lasenby, J. The unreasonable effectiveness of the forget gate. arXiv 2018, arXiv:1804.04849.
24. Graves, A.; Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer:
Berlin/Heidelberg, Germany, 2012; pp. 37–45.
25. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
26. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
27. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
[CrossRef]
28. Mou, L.; Zhao, P.; Xie, H.; Chen, Y. T-LSTM: A long short-term memory neural network enhanced by temporal information for
traffic flow prediction. IEEE Access 2019, 7, 98053–98060. [CrossRef]
29. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International
Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318.
30. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27,
3104–3112.
31. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
32. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence
time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35,
pp. 11106–11115.
33. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.
Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
34. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term
series forecasting. arXiv 2022, arXiv:2201.12740.
35. Chollet, F. On the measure of intelligence. arXiv 2019, arXiv:1911.01547.
36. Baxter, J. A model of inductive bias learning. J. Artif. Intell. Res. 2000, 12, 149–198. [CrossRef]
37. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87. [CrossRef]
38. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
39. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
40. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30,
1025–1035.
41. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
42. Chen, H.; Hong, P.; Han, W.; Majumder, N.; Poria, S. Dialogue relation extraction with document-level heterogeneous graph
attention networks. Cogn. Comput. 2023, 15, 793–802. [CrossRef]
43. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long-and short-term temporal patterns with deep neural networks. In
Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor,
MI, USA, 8–12 July 2018; pp. 95–104.
44. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of
transformer on time series forecasting. Adv. Neural Inf. Process. Syst. 2019, 32, 5243–5253.
45. Hao, H.; Wang, Y.; Xia, Y.; Zhao, J.; Shen, F. Temporal convolutional attention-based network for sequence modeling. arXiv 2020,
arXiv:2002.12530.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.