
sensors

Article
Metaformer: A Transformer That Tends to Mine
Metaphorical-Level Information
Bo Peng 1,2, Yuanming Ding 1,2,* and Wei Kang 1,2

1 Communication and Network Laboratory, Dalian University, Dalian 116622, China
2 School of Information Engineering, Dalian University, Dalian 116622, China
* Correspondence: [email protected]

Abstract: Since its introduction, the Transformer model has dramatically influenced many fields of machine learning. The field of time series prediction has also been significantly affected: Transformer-family models have flourished and many variants have emerged. These models mainly use attention mechanisms for feature extraction and multi-head attention to strengthen it. However, multi-head attention is essentially a simple superposition of the same attention, so it does not guarantee that the model captures different features. Conversely, multi-head attention may lead to considerable information redundancy and wasted computational resources. To ensure that the Transformer can capture information from multiple perspectives and increase the diversity of its captured features, this paper proposes, for the first time, a hierarchical attention mechanism that addresses the insufficient diversity of the information captured by traditional multi-head attention and the lack of information interaction among the heads. Additionally, global feature aggregation using graph networks is used to mitigate inductive bias. Finally, we conducted experiments on four benchmark datasets, and the results show that the proposed model outperforms the baseline models on several metrics.

Keywords: hierarchical attention; multi-head attention; graph neural networks; feature diversity;
information interaction

Citation: Peng, B.; Ding, Y.; Kang, W. Metaformer: A Transformer That Tends to Mine Metaphorical-Level Information. Sensors 2023, 23, 5093. https://doi.org/10.3390/s23115093

Academic Editors: Zahir M. Hussain and Shih-Chia Huang

Received: 25 February 2023; Revised: 22 May 2023; Accepted: 23 May 2023; Published: 26 May 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Originally designed to solve machine translation problems [1,2], the Transformer [3,4] model has been widely adopted in computer vision (CV) [5–7], natural language processing (NLP) [8], speech processing [9–11], audio processing [12,13], chemistry [14], and the life sciences [15] owing to its powerful modelling capability and broad applicability, and it has contributed significantly to the development of these fields.

In computer vision, Convolutional Neural Networks (CNNs) [16–18] have traditionally been the primary processing tool. Convolution is well suited to regular, high-dimensional data and performs automatic feature extraction. However, convolution suffers from an obvious locality constraint: it assumes that points in space are associated only with their neighbouring grid cells, while distant cells are treated as unrelated. Although this limitation can be alleviated to some extent by enlarging the convolution kernel, it cannot be solved fundamentally. After the Transformer was introduced, some researchers tried to bring its architecture into computer vision. The Transformer has a larger receptive field than a CNN, so it captures rich global information and can better understand the whole image. Ramachandran et al. [19] constructed a vision model without convolution, using a full-attention mechanism instead to relieve the locality constraint. In addition, the Transformer has shown excellent performance in other CV areas such as image classification [6,20], object detection [5,21], semantic segmentation [22], image processing [22], and video understanding [5].

Sequential data are even better suited to processing with the Transformer than computer vision data. Traditional time series prediction relies mostly on Recurrent Neural Network (RNN) [23,24] models, among which the more influential ones are the Gated Recurrent Unit (GRU) [25] and Long Short-Term Memory (LSTM) [26,27] networks. For example, Mou et al. [28] proposed a Time-Aware LSTM (T-LSTM) with temporal information enhancement. Its main idea is to split the memory state into short-term and long-term memory, adjust the influence of the short-term memory according to the time interval between inputs (the longer the interval, the smaller the influence), and then recombine the adjusted short-term memory with the long-term memory into a new memory state. However, the emergence of the Transformer soon shook the dominance of RNN-family models in time series prediction because of the following bottlenecks that RNNs face in long-horizon prediction:
(1) Parallelism bottleneck: The RNN family of models requires the input data to be ar-
ranged in temporal order and computed sequentially according to the order of arrangement.
This serial structure has the advantage that it inherently contains the portrayal of positional
relationships, but it also constrains the model from being computed in parallel. Especially
when facing long sequences, the inability to parallelise means more time and cost.
(2) Gradient bottleneck [29]: One performance bottleneck of RNN networks is the
frequent problem of gradient disappearance or gradient explosion during training. Most
neural network models optimise model parameters by computing gradients. Gradient
disappearance or gradient explosion can cause the model to fail to converge or converge
too slowly, which means that for the RNN family of networks, it is difficult to make the
model better by increasing the number of iterations or increasing the size of the network.
(3) Memory bottleneck: At each time step, an RNN takes a positional input x_t and a hidden input h_{t−1}, which are fused inside the model according to its update rules to produce a hidden state h_t. When the sequence is too long, h_t retains almost nothing of the earliest positional inputs; that is, a "forgetting" phenomenon occurs.
Compared with the RNN family of models, Transformer portrays the positional rela-
tionships between sequences by positional encoding without recursively feeding sequential
data. This processing makes the model more flexible and provides the maximum possible
parallelisation for time series data. The positional encoding also ensures that no forgetting
occurs. The information at each location has an equal status for the Transformer. Addition-
ally, using an attention mechanism to extract internal features allows the model to choose
to focus on important information. The problem of gradient disappearance or gradient
explosion can be avoided by ignoring irrelevant and redundant information. Therefore,
based on the above advantages of Transformer models, many scholars are now trying to
use Transformer models for time series tasks.
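To make the contrast with recurrence concrete, the following is a minimal PyTorch sketch of the sinusoidal positional encoding used by the Vanilla Transformer [3]; the function name and the illustrative dimensions are ours, not the authors'.

```python
import torch


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return the (seq_len, d_model) sinusoidal encoding from Vaswani et al. [3]."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)             # (L, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                                              # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe


# Every timestamp gets a unique code that is simply added to its token embedding,
# so the whole sequence can be processed in parallel and no position is "forgotten".
tokens = torch.randn(96, 512)                      # I = 96 timestamps, d_m = 512
tokens = tokens + sinusoidal_positional_encoding(96, 512)
```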

2. Research Background
Transformer is a typical encoder-decoder-based sequence-to-sequence [30] model, and
this structure is well suited for processing sequence data. Several researchers have tried
to improve the Transformer model to meet the needs of more complex applications. For
example, Kitaev et al. [31] proposed the Reformer model, which uses Locality-Sensitive Hashing (LSH) attention to reduce the complexity of the original model from O(L^2) to O(L log L).
Zhou et al. [32] proposed an Informer model for Long Sequence Time Series Forecasting
(LSTF), which accurately captures the long-term dependence between output and input
and exhibits high predictive power. Wu et al. [33] proposed the Autoformer model, which
uses a deep decomposition architecture and an autocorrelation mechanism to improve
LSTF accuracy. The Autoformer model achieves desirable results even when the series is
predicted much longer than the length of the input series, i.e., it can predict the longer-term
future based on limited information. Zhou et al. [34] proposed the FEDformer model,
which provides a way to apply the attention mechanism in the frequency domain and can
be used as an essential complement to the time domain analysis.

The Transformer variants described above focus on reducing temporal and spatial complexity, but do little to enhance the diversity of the information they capture. The attention mechanism is the core component the Transformer uses for feature extraction. It is designed to let the model concentrate on the most important information, which inevitably entails a certain amount of information loss. The multi-head attention mechanism is meant to compensate for this. However, since every head performs the same kind of capture, there is no way to ensure that each head extracts different vital features. Moreover, because the multi-head attention mechanism essentially partitions the representation into mutually independent subspaces, it completely cuts off the connections between those subspaces, leaving the information captured by the individual heads without any interaction. Motivated by these problems, this paper proposes a hierarchical attention mechanism in which each layer uses a different attention mechanism to capture features, and higher layers reuse the information captured by lower layers, thus enhancing the Transformer's ability to perceive deeper information.

3. Research Methodology
3.1. Problem Description
The Transformer model was initially proposed by Vaswani et al. to solve the machine translation problem, so the Vanilla Transformer is better suited to textual data. Its primary processing unit is a word vector, and each word vector is called a token. In the time series prediction problem, in contrast, the basic processing unit becomes a timestamp. If we want to apply the Transformer to a time series problem, a reasonable approach is to encode the multivariate observations of each timestamp into a token vector. This modelling approach is also adopted by many mainstream Transformer-like models.
For convenience in the subsequent description, we define the dimension of a token as d, the input length of the model as I, and the output length as O. The model's input can then be defined as X = {x_1, ..., x_I} ∈ R^{I×d}, and the model's output as X̂ = {x̂_1, ..., x̂_O} ∈ R^{O×d}. Therefore, this paper aims to learn a mapping T(·) from the input space to the output space:

X̂ = T(X)    (1)
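As a shape-level illustration of Equation (1), the sketch below fixes I, O, and d to values of the kind used later in the experiments and checks that a placeholder mapping produces a forecast of the expected size; the linear map merely stands in for the full model and is purely illustrative.

```python
import torch

# Shape-level view of Equation (1): learn a mapping T(.) from an input window of
# I timestamps to a forecast of O timestamps, each timestamp being a d-dimensional
# token (d = number of variates, e.g. 321 for the Electricity dataset used later).
I, O, d = 96, 96, 321

X = torch.randn(1, I, d)                       # historical window (batch of one)
T = torch.nn.Linear(I, O)                      # stand-in for the learned mapping, purely illustrative

X_hat = T(X.transpose(1, 2)).transpose(1, 2)   # mix along the time axis
assert X_hat.shape == (1, O, d)                # predicted sequence in R^{O x d}
```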

3.2. Model Architecture


Our model (Figure 1) retains the Transformer architecture in its main body, and we add a decomposer by following Autoformer's sequence decomposition module. The decomposer separates the trend-cyclical and seasonal parts. Removing the trend from the series allows the model to focus better on the hidden periodic information, and Wu et al. [33] have shown this decomposition to be effective. The model uses an encoder–decoder structure, where the encoder maps the information from the input space to the feature space and the decoder maps it from the feature space to the target space. Since both the input and the output are sequences, the model is a typical sequence-to-sequence model. In addition, inside the encoder and decoder we use a hierarchical attention mechanism instead of the original multi-head attention mechanism and a graph network instead of the original feedforward network, which improve the diversity of the captured information and mitigate the token-uniformity inductive bias [35,36] of the model, respectively.

Figure 1. Model Body Structure.

3.2.1. Decomposer
The main difficulty of time series forecasting lies in discovering the hidden trend-cyclical and seasonal information in the historical series. The trend-cyclical part records the overall trend of the series and largely determines its long-term behaviour. The seasonal part records the hidden cyclical pattern, which mainly appears as regular short-term fluctuations. It is generally difficult to predict these two components simultaneously. The basic idea is therefore to decompose them: the trend-cyclical is extracted from the sequence using average pooling, and the seasonal part is obtained by filtering out the trend-cyclical. This is how the Decomposer implements the decomposition, as shown in Algorithm 1.

Algorithm 1 Decomposer
Require: X
Ensure: S , T
1: T ← avgpool(padding(X ))
2: S ← X − T

Here, X ∈ R^{L×d} is the input sequence of length L, and T, S ∈ R^{L×d} are the decomposed trend-cyclical and seasonal parts, respectively. The role of padding is to ensure that the decomposed series remains equal in dimension to the input sequence.
The decomposer module has a relatively simple structure. However, it can decompose
the forecasting task into two subtasks, i.e., mining hidden periodic patterns and forecasting
overall trends. This decomposition can reduce the difficulty of prediction to a certain extent
and, thus, improve the final prediction results.
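The following is a minimal PyTorch sketch of Algorithm 1. We assume the padding repeats the boundary values of the series (the common Autoformer-style implementation) so that the moving average keeps the original length; the kernel size of 25 matches the moving-average window reported in Section 4.2.2, and all names are ours.

```python
import torch
import torch.nn as nn


class Decomposer(nn.Module):
    """A sketch of Algorithm 1: split a series into trend-cyclical and seasonal parts."""

    def __init__(self, kernel_size: int = 25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x: torch.Tensor):
        # x: (batch, L, d)
        pad = (self.kernel_size - 1) // 2
        front = x[:, :1, :].repeat(1, pad, 1)                           # repeat the first value
        end = x[:, -1:, :].repeat(1, self.kernel_size - 1 - pad, 1)     # ...and the last value
        padded = torch.cat([front, x, end], dim=1)                      # keeps the output length equal to L
        trend = self.avg(padded.permute(0, 2, 1)).permute(0, 2, 1)      # T = avgpool(padding(X))
        seasonal = x - trend                                            # S = X - T
        return seasonal, trend


seasonal, trend = Decomposer(kernel_size=25)(torch.randn(8, 96, 7))
```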

3.2.2. Encoder
The encoder is mainly responsible for encoding the input data, realising the transformation from the input space to the feature space. The decomposer in the encoder acts more like a filter because, in the encoder, we focus on the seasonal part of the sequence and discard the trend-cyclical. The input data first pass through a hierarchical attention layer for initial key-feature extraction; the decomposer then extracts the seasonal features of the sequence, which are further fed into the graph network to mitigate inductive bias. After stacking N such layers, the resulting seasonal features serve as auxiliary inputs to the decoder. Algorithm 2 describes the computation procedure.

Algorithm 2 Encoder
Require: X_en
Ensure: X_en^N
1: for l = 1, · · · , N do
2:   if l = 1 then
3:     X_en^{l−1} ← X_en
4:   end if
5:   S_en^{l,1}, _ ← D(H(X_en^{l−1}) + X_en^{l−1})
6:   S_en^{l,2}, _ ← D(G(S_en^{l,1}) + S_en^{l,1})
7:   X_en^l ← S_en^{l,2}
8: end for

Here, X_en ∈ R^{I×d} denotes the historical observation sequence, N denotes the number of stacked encoder layers, and X_en^N denotes the output of the N-th encoder layer. D denotes the decomposer operator, G denotes the graph network operator, and H denotes the hierarchical attention mechanism; their concrete implementations are described later.
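A compact PyTorch sketch of Algorithm 2 is given below. The hierarchical attention H, graph network G, and decomposer D are passed in as modules (their internals are sketched in their own sections); the class and argument names are ours.

```python
import torch.nn as nn


class EncoderLayer(nn.Module):
    """One encoder layer of Algorithm 2: attention, decomposition, graph network, decomposition."""

    def __init__(self, attention: nn.Module, graph: nn.Module, decomp: nn.Module):
        super().__init__()
        self.attention, self.graph, self.decomp = attention, graph, decomp

    def forward(self, x):
        s1, _ = self.decomp(self.attention(x) + x)   # keep the seasonal part, drop the trend
        s2, _ = self.decomp(self.graph(s1) + s1)
        return s2


class Encoder(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)          # N stacked layers

    def forward(self, x_en):
        for layer in self.layers:
            x_en = layer(x_en)
        return x_en                                  # X_en^N, fed to the decoder as an auxiliary input
```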

3.2.3. Decoder
The structure of the decoder is more complex than that of the encoder. However, its
internal modules are identical to the encoder’s, but use a multi-input structure. It goes
through two hierarchical attention calculations and three sequence decompositions in turn.
Assuming that the model’s encoder is a feature catcher, the decoder is a feature fuser
that fuses and corrects the inputs from different sources to obtain the correct prediction
sequence. The decoder has three primary input sources: the seasonal parts Xdes and the
N captured
trend-cyclical Xdet extracted from the original series, and the seasonal parts Xen
by the decoder. The computation of the trend-cyclical and seasonal parts is kept relatively
independent throughout the computation process. Only at the final output is a linear
layer used to fuse the two to obtain the final prediction Xpred . The computation process is
described in Algorithm 3.

Algorithm 3 Decoder
Require: X_en, X_en^N
Ensure: X_pred
1: X_ens, X_ent ← D(X_en^{I/2:I})
2: X_des ← X_ens ‖ 0_{0:I/2}
3: X_det ← X_ent ‖ X̄_{0:I/2}
4: for l = 1, · · · , M do
5:   if l = 1 then
6:     X_de^{l−1} ← X_des
7:     T_de^{l−1} ← X_det
8:   end if
9:   S_de^{l,1}, T_de^{l,1} ← D(H(X_de^{l−1}) + X_de^{l−1})
10:  S_de^{l,2}, T_de^{l,2} ← D(H(S_de^{l,1}, X_en^N) + S_de^{l,1})
11:  S_de^{l,3}, T_de^{l,3} ← D(G(S_de^{l,2}) + S_de^{l,2})
12:  X_de^l ← S_de^{l,3}
13:  T_de^l ← T_de^{l−1} + W_{l,1} ∗ T_de^{l,1} + W_{l,2} ∗ T_de^{l,2} + W_{l,3} ∗ T_de^{l,3}
14: end for
15: X_pred ← W_S ∗ X_de^M + T_de^M

Here, X_en denotes the original sequence, which is also the input to the encoder. It is decomposed into the trend-cyclical and seasonal parts X_ens, X_ent before being fed into the decoder as the initial input.
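Below is a PyTorch sketch of the decoder loop of Algorithm 3, under simplifying assumptions: all tensors stay in the model dimension, the self- and cross-attention blocks are the hierarchical attention module used with one or two inputs, and the final projection to the target variate dimension is omitted. Module and argument names are ours.

```python
import torch.nn as nn


class DecoderLayer(nn.Module):
    """One pass through lines 9-13 of Algorithm 3: self attention, cross attention with
    the encoder output X_en^N, a graph network, three decompositions, and a running
    trend estimate accumulated through learned projections W_{l,1..3}."""

    def __init__(self, self_attn, cross_attn, graph, decomp, d_model):
        super().__init__()
        self.self_attn, self.cross_attn = self_attn, cross_attn
        self.graph, self.decomp = graph, decomp
        self.w = nn.ModuleList(nn.Linear(d_model, d_model, bias=False) for _ in range(3))

    def forward(self, x, x_en_n, trend):
        s1, t1 = self.decomp(self.self_attn(x) + x)
        s2, t2 = self.decomp(self.cross_attn(s1, x_en_n) + s1)
        s3, t3 = self.decomp(self.graph(s2) + s2)
        trend = trend + self.w[0](t1) + self.w[1](t2) + self.w[2](t3)
        return s3, trend


class Decoder(nn.Module):
    def __init__(self, layers, d_model):
        super().__init__()
        self.layers = nn.ModuleList(layers)                  # M stacked layers
        self.w_s = nn.Linear(d_model, d_model, bias=False)   # W_S in line 15 of Algorithm 3

    def forward(self, x_des, x_det, x_en_n):
        x, trend = x_des, x_det                              # seasonal and trend initialisations
        for layer in self.layers:
            x, trend = layer(x, x_en_n, trend)
        return self.w_s(x) + trend                           # X_pred = W_S * X_de^M + T_de^M
```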

3.3. Hierarchical Attention Mechanism


The hierarchical attention mechanism, as the first feature-capture unit of Metaformer, lies at the model's core and therefore has a significant impact on everything that follows. Most Transformer-like models use the multi-head attention mechanism to perform the first step of feature extraction. However, the multi-head attention mechanism has two significant drawbacks: (1) every head uses exactly the same attention computation, which cannot guarantee diversity in the captured information and may even miss some critical information; (2) each head operates in a separate subspace, and the lack of information interaction between heads hinders the model's deep understanding of the information. We therefore propose, for the first time, a hierarchical attention mechanism. First, a hierarchical structure is used in which each layer captures features with a different attention mechanism, ensuring diversity in the information circulating through the network; second, a cascading interaction is used in which the information captured by lower layers is reused by upper layers, deepening the model's understanding of the information. When we humans understand language, we not only grasp the surface meaning of words but also the metaphors behind them. Inspired by this, we use a hierarchical structure to model this phenomenon and thereby improve the network's ability to perceive information at multiple levels.

3.3.1. Traditional Multi-Head Attention Mechanism


The multi-head attention mechanism uses only one type of attention computation, scaled dot-product attention. It first takes as input the three vectors of queries, keys, and values of dimension d_m, and each head projects them to d_k, d_k, and d_v dimensions using linear layers. The attention function is then computed to produce a d_v-dimensional output. Finally, the outputs of all attention heads are concatenated and passed through a linear layer to obtain the final output:

H(Q, K, V) = L_{θ_o}( ‖_{i=1}^{h} A(L_{θ_i^q}(Q), L_{θ_i^k}(K), L_{θ_i^v}(V)) )    (2)

Equation (2) defines the multi-head attention mechanism, where L_{θ_q}, L_{θ_k}, L_{θ_v}, L_{θ_o} denote linear layers with projection parameter matrices W^Q ∈ R^{d_m×d_k}, W^K ∈ R^{d_m×d_k}, W^V ∈ R^{d_m×d_v}, and W^O ∈ R^{hd_v×d_m}, respectively, h denotes the number of attention heads, A denotes scaled dot-product attention, and ‖ denotes sequential concatenation.
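For reference, the following is a minimal PyTorch sketch of the standard multi-head attention of Equation (2); every head applies the same scaled dot-product attention A, which is exactly the homogeneity the hierarchical mechanism below is designed to avoid. Names are illustrative.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """A compact sketch of Equation (2): h parallel scaled dot-product attentions whose
    outputs are concatenated and mapped back by a final linear layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)   # L_{theta_q}
        self.k_proj = nn.Linear(d_model, d_model)   # L_{theta_k}
        self.v_proj = nn.Linear(d_model, d_model)   # L_{theta_v}
        self.o_proj = nn.Linear(d_model, d_model)   # L_{theta_o}

    def forward(self, q, k, v):
        B, L, _ = q.shape

        def split(x):  # (B, L, d_model) -> (B, h, L, d_k)
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5       # every head runs the same A(.)
        out = torch.softmax(scores, dim=-1) @ v                  # (B, h, L, d_k)
        out = out.transpose(1, 2).contiguous().view(B, L, -1)    # sequential cascade (concatenation)
        return self.o_proj(out)
```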

3.3.2. Hierarchical Attention Mechanism


We propose a hierarchical attention mechanism to address the shortcomings in the
multi-head attention mechanism, aiming to enhance the model’s deep understanding of
the information. Figure 2 depicts the central architecture of the hierarchical attention
mechanism, and Algorithm 4 describes its implementation.

Figure 2. Hierarchical attention mechanism.



Algorithm 4 Hierarchical Attention
Require: Q, K, V
Ensure: Y
1: for i = 1, · · · , N do
2:   if i = 1 then
3:     H ← Random Initialisation
4:   end if
5:   H ← R(A_i(L_{θ_i^q}(Q), L_{θ_i^k}(K), L_{θ_i^v}(V)), H)
6:   Y ← Y ‖ H
7: end for
8: Y ← L_{θ_o}(Y)

Here, L_{θ_q}, L_{θ_k}, L_{θ_v}, L_{θ_o} have the same meaning as in Equation (2). R denotes a GRU unit. Y records the information of each layer and is finally mapped to the specified dimension by a linear layer to form the module's output. A_i denotes the attention computation used at layer i. This paper mainly uses four common attention mechanisms: Vanilla Attention, ProbSparse Attention, LSH Attention, and AutoCorrelation. Strictly speaking, AutoCorrelation is not part of the attention family; however, its effect is similar to, or even better than, that of attention mechanisms, so it is introduced into our model and takes part in feature extraction.
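A PyTorch sketch of Algorithm 4 is given below, under the assumption that the GRU cell R fuses the attention output with the running state position-wise; the four attention modules are passed in from outside, and all names are ours.

```python
import torch
import torch.nn as nn


class HierarchicalAttention(nn.Module):
    """A sketch of Algorithm 4. Each layer i applies a *different* attention mechanism A_i,
    and a GRU cell R fuses its output with the state H passed up from the layer below.
    The per-layer states are concatenated and mapped back to d_model by a final linear layer."""

    def __init__(self, attentions, d_model: int):
        super().__init__()
        self.attentions = nn.ModuleList(attentions)           # e.g. [AutoCorrelation, Vanilla, LSH, ProbSparse]
        n = len(attentions)
        self.q_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n))
        self.k_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n))
        self.v_proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n))
        self.gru = nn.GRUCell(d_model, d_model)                # R: cascading interaction between layers
        self.out = nn.Linear(n * d_model, d_model)             # L_{theta_o}

    def forward(self, q, k, v):
        B, L, d = q.shape
        h = torch.randn(B * L, d, device=q.device) * 0.02      # random initialisation of H (line 3)
        ys = []
        for i, attn in enumerate(self.attentions):
            a = attn(self.q_proj[i](q), self.k_proj[i](k), self.v_proj[i](v))   # (B, L, d)
            h = self.gru(a.reshape(B * L, d), h)                # H <- R(A_i(...), H), applied position-wise
            ys.append(h.view(B, L, d))
        return self.out(torch.cat(ys, dim=-1))                  # Y <- Y || H, then L_{theta_o}(Y)
```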
Attention is the core building block of Transformer and is considered an essential tool
for information capture in both CV and NLP domains. Many researchers have worked on
designing more efficient attention, so many variants based on Vanilla Attention have been
proposed in succession. The following briefly describes the four attention mechanisms
used in our model.

3.3.3. Vanilla Attention


Vanilla Attention was first proposed in the Transformer [3]. Its input consists of three vectors, queries, keys, and values (Q, K, V), whose dimensions are d_k, d_k, and d_v, respectively. Vanilla Attention is also known as scaled dot-product attention because it computes the dot product of Q and K and then scales it by √d_k. The specific calculation is given in Equation (3):

A(Q, K, V) = σ†(QK^⊤ / √d_k) V    (3)

Here, A denotes the attention or autocorrelation mechanism, and σ† denotes the softmax activation function.
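A direct PyTorch rendering of Equation (3); the helper name is ours.

```python
import torch


def vanilla_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Equation (3): A(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5     # (..., L_q, L_k)
    return torch.softmax(scores, dim=-1) @ v          # weighted sum of the values


out = vanilla_attention(torch.randn(2, 96, 64), torch.randn(2, 96, 64), torch.randn(2, 96, 64))
```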

3.3.4. ProbSparse Attention


This attention mechanism, first proposed in Informer, exploits the sparsity of the attention coefficients and sparsifies the query matrix Q using an explicit query sparsity measurement (Algorithm 5). Equation (4) gives the ProbSparse Attention calculation:

A(Q, K, V) = σ†(Q̃K^⊤ / √d_k) V    (4)

Here, Q̃ is the sparse matrix obtained by the sparsity measurement. The prototype of M̃(q_i, K) is the Kullback–Leibler (KL) divergence; see Equation (5):

KL(q ‖ p) = ln Σ_{l=1}^{L_K} e^{q_i k_l^⊤/√d} − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^⊤/√d − ln L_K    (5)

Algorithm 5 Explicit Query Sparsity Measurement
Require: Q, K
Ensure: Q̃
1: Define M̃(q_i, K) = max_j { q_i k_j^⊤ / √d_K } − (1/L_K) Σ_{j=1}^{L_K} q_i k_j^⊤ / √d_K
2: Define U = argTop-u_{i ∈ [1, · · · , L_Q]} ( M̃(q_i, K) )
3: for u ∈ [1, · · · , L_Q] do
4:   if u ∈ U then
5:     Q̃_{u,:} ← Q_{u,:}
6:   else
7:     Q̃_{u,:} ← 0
8:   end if
9: end for
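A simplified PyTorch sketch of Algorithm 5 follows. For clarity it computes all query-key scores before measuring sparsity, whereas Informer samples keys so the measurement stays near O(L log L); names are ours.

```python
import torch


def sparsify_queries(q: torch.Tensor, k: torch.Tensor, u: int) -> torch.Tensor:
    """A sketch of Algorithm 5: keep the u most "active" queries according to the
    max-minus-mean measurement M~ and zero out the rest, giving the sparse matrix Q~
    used in Equation (4)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (B, L_q, L_k)
    m = scores.max(dim=-1).values - scores.mean(dim=-1)      # M~(q_i, K) for every query
    top = m.topk(u, dim=-1).indices                          # U = argTop-u(M~)
    idx = top.unsqueeze(-1).expand(-1, -1, q.size(-1))       # (B, u, d_k)
    q_sparse = torch.zeros_like(q)
    q_sparse.scatter_(1, idx, q.gather(1, idx))              # Q~_{u,:} = Q_{u,:} for u in U, else 0
    return q_sparse


q_tilde = sparsify_queries(torch.randn(2, 96, 64), torch.randn(2, 96, 64), u=12)
```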

3.3.5. LSH Attention


Like ProbSparse Attention, LSH Attention also uses a sparsification method to reduce
the complexity of Vanilla Attention. The main idea is that for each query, only the nearest
keys are focused on, where the nearest neighbour selection is achieved by locally sensitive
hashing. The specific attentional process of LSH Attention is given in Equation (6), where
the hash function used is Equation (7):

A(q_i, K, V) = Σ_{j ∈ P_i} ( a(q_i, k_j) / Σ_{l ∈ P_i} a(q_i, k_l) ) v_j    (6)

h(x) = arg max([xR ‖ −xR])    (7)

where P_i = { j : h(q_i) = h(k_j) } denotes the set of keys that the i-th query attends to, and a(q_i, k_j) = exp(q_i k_j^⊤ / √d) measures the association between positions i and j.
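The bucketing step of Equation (7) can be sketched as below: a shared random projection R hashes queries and keys, and only positions that land in the same bucket attend to each other. This is an illustrative sketch of the hashing idea, not the full Reformer implementation (which adds chunking and multiple hash rounds).

```python
import torch


def lsh_hash(x: torch.Tensor, n_buckets: int, seed: int = 0) -> torch.Tensor:
    """Equation (7): h(x) = argmax([xR || -xR]) with a shared random projection R."""
    torch.manual_seed(seed)                                 # queries and keys must use the same R
    r = torch.randn(x.size(-1), n_buckets // 2)             # random projection R
    xr = x @ r                                              # (..., L, n_buckets / 2)
    return torch.cat([xr, -xr], dim=-1).argmax(dim=-1)      # bucket id for every position


queries = torch.randn(2, 96, 64)
keys = torch.randn(2, 96, 64)
# Positions whose query and key share a bucket form the candidate sets P_i.
same_bucket = lsh_hash(queries, 8).unsqueeze(-1) == lsh_hash(keys, 8).unsqueeze(-2)
```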

3.3.6. AutoCorrelation
AutoCorrelation mechanisms are different from the types of attention mechanisms
above. Whereas the self-attentive family focuses on the correlation between points, the
AutoCorrelation mechanism focuses on the correlation between segments. Therefore,
AutoCorrelation mechanisms are an excellent complement to self-attentive mechanisms.

A(Q, K, V) = Σ_{τ ∈ T} roll(V, τ) · σ†(R_{Q,K}(τ))    (8)

R_{Q,K}(τ) = (1/L) Σ_{t=1}^{L} Q_t K_{t−τ}    (9)

T = {τ_1, τ_2, · · · , τ_k} = argTopk_{τ ∈ {1, · · · , L}} (R_{Q,K}(τ))    (10)


Equation (8) gives the procedure of calculating the AutoCorrelation mechanism, where
Equation (9) is used to measure the correlation between two sequences, and τ denotes the
order of the lag term. roll(V, τ ) denotes the vector of τ-order lagged terms of vector V
obtained in a self-looping manner. Equation (10) is the Topk algorithm used to filter the set
T of k lagged terms with the highest correlation.
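The following is a direct (non-FFT) PyTorch sketch of Equations (8)–(10); Autoformer itself computes the autocorrelation scores via FFT for efficiency, and the scalar per-lag score here is obtained by averaging the lagged dot products over time. Names are ours.

```python
import torch


def autocorrelation(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, top_k: int) -> torch.Tensor:
    """Score every lag tau with R_{Q,K}(tau), keep the top-k lags, and aggregate rolled
    copies of V weighted by a softmax over the selected scores (Equations (8)-(10))."""
    B, L, d = q.shape
    # R_{Q,K}(tau) = (1/L) * sum_t Q_t . K_{t-tau}; the dot product runs over channels
    # and the average over time, giving one scalar score per lag.
    scores = torch.stack(
        [(q * torch.roll(k, shifts=tau, dims=1)).sum(dim=-1).mean(dim=-1) for tau in range(L)],
        dim=-1,
    )                                                        # (B, L)
    weights, lags = torch.topk(scores, top_k, dim=-1)        # T = argTopk(R), Equation (10)
    weights = torch.softmax(weights, dim=-1)                 # softmax over the selected lags only
    out = torch.zeros_like(v)
    for b in range(B):
        for i in range(top_k):                               # sum_{tau in T} roll(V, tau) * weight
            out[b] += weights[b, i] * torch.roll(v[b], shifts=int(lags[b, i]), dims=0)
    return out


out = autocorrelation(torch.randn(2, 96, 16), torch.randn(2, 96, 16), torch.randn(2, 96, 16), top_k=3)
```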

3.4. GAT Network


The Vanilla Transformer model embeds a Feedforward Network (FFN) [37] layer at
the end of each encoder–decoder layer. The FFN plays a crucial role in mitigating token-
uniformity inductive bias. Inductive bias can be regarded as the heuristic, or "preference", that a learning algorithm uses to select hypotheses from a large hypothesis space. For example, convolutional networks assume that information is spatially local, spatially invariant, and translation equivariant, so the parameter space can be reduced by sharing sliding convolutional weights; recurrent neural networks assume that information is sequential and invariant to temporal shifts, so weight sharing is likewise possible. Similarly, the attention mechanism carries assumptions of its own, such as the uselessness of part of the information. If attention layers are simply stacked, some critical information is lost, so adding an FFN layer can to some extent alleviate the accumulation of inductive bias and avoid network collapse. The FFN layer is not the only way to achieve this mitigation; we find that a similar effect can be achieved with a Graph Neural Network (GNN) [38–40].
Here, we use a two-layer GAT [41,42] network instead of the original FFN layer. The
graph network has the property of aggregating the information of neighbouring nodes,
i.e., through the aggregation of the graph network, each node will fuse some features of
its neighbouring nodes. Additionally, we use random sampling to reduce the complexity.
The reason is that our goal is not feature aggregation, but to mitigate the loss of crucial
information. In particular, when the number of samples per node is 0, the graph network
can be considered to ultimately degenerate into an FFN layer with a similar role to the
original FFN.
Here, we model each token as a node in the graph and mine the dependencies between nodes using the graph attention algorithm. The input to GAT is defined as H = {h⃗_1, h⃗_2, · · · , h⃗_N}, where h⃗_i ∈ R^F denotes the input vector of the i-th node, N denotes the number of nodes in the graph, and F denotes the dimensionality of the input vector. Through the computation of the GAT layer, a new set of node features H′ = {h⃗′_1, h⃗′_2, · · · , h⃗′_N} is generated, where h⃗′_i ∈ R^{F′} denotes the output vector of the i-th node and F′ denotes its dimensionality.
Figure 3 gives the general flow of information aggregation for a single node. Equation (11) computes the attention coefficient e_ij between the i-th node and each of its neighbour nodes j, and Equation (12) computes the normalised attention factor α_ij:

e_ij = F(W h⃗_i ‖ W h⃗_j)    (11)

α_ij = σ†(e_ij) = exp(e_ij) / Σ_{k ∈ N_i} exp(e_ik)    (12)
Here, N_i denotes the set of all neighbouring nodes of the i-th node, and W is a shared parameter matrix for the linear mapping of node features. F is a single-layer feedforward neural network that maps the concatenated high-dimensional features to a real number e_ij; e_ij is the attention coefficient of the edge j → i, and α_ij is its normalised value.
Finally, the new feature vector h⃗′_i of the current node i is obtained by weighting and summing the feature vectors of its neighbouring nodes according to the computed attention coefficients, so that h⃗′_i records the neighbourhood information of the current node.
h⃗′_i = σ( Σ_{j ∈ N_i} α_ij W h⃗_j )    (13)

Here, σ represents a non-linear activation function (the logistic sigmoid) applied at the end. Furthermore, if information aggregation is performed with a K-head attention mechanism, the final output vector is obtained by averaging over the heads:

h⃗′_i = σ( (1/K) Σ_{k=1}^{K} Σ_{j ∈ N_i} α_ij^k W^k h⃗_j )    (14)
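A single-head PyTorch sketch of Equations (11)–(13) is given below; the adjacency mask determines the neighbourhood N_i, so an identity matrix reproduces the self-loop graph used by the Meta-v1 variant in Section 4.3. Class and argument names are ours.

```python
import torch
import torch.nn as nn


class GATLayer(nn.Module):
    """A single-head sketch of Equations (11)-(13): a shared linear map W, a one-layer
    scorer F on concatenated node pairs, softmax-normalised coefficients, and a weighted sum."""

    def __init__(self, f_in: int, f_out: int):
        super().__init__()
        self.W = nn.Linear(f_in, f_out, bias=False)
        self.F = nn.Linear(2 * f_out, 1, bias=False)          # maps [Wh_i || Wh_j] to a scalar e_ij

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, F_in), adj: (N, N) with 1 where an edge j -> i exists
        wh = self.W(h)                                        # (N, F_out)
        N = wh.size(0)
        pairs = torch.cat(
            [wh.unsqueeze(1).expand(N, N, -1), wh.unsqueeze(0).expand(N, N, -1)], dim=-1
        )                                                     # (N, N, 2*F_out): [Wh_i || Wh_j]
        e = self.F(pairs).squeeze(-1)                         # Equation (11)
        e = e.masked_fill(adj == 0, float("-inf"))            # restrict to the neighbourhood N_i
        alpha = torch.softmax(e, dim=-1)                      # Equation (12)
        return torch.sigmoid(alpha @ wh)                      # Equation (13), sigma = logistic sigmoid


# Self-loop graph (Meta-v1): each token only reinforces its own features.
tokens = torch.randn(96, 512)
out = GATLayer(512, 512)(tokens, torch.eye(96))
```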

Figure 3. Feature aggregation process for node i in GAT network.

4. Experiment
4.1. Dataset Description
To evaluate the Metaformer model, we conducted experiments on four popular real-world datasets encompassing the energy, economy, disease, and transportation domains. The Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, accessed on 24 February 2023) dataset describes the hourly electricity consumption of 321 customers; the Exchange [43] dataset describes the daily exchange rates of eight countries; the Illness (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html, accessed on 24 February 2023) dataset is the weekly data of influenza-like illnesses recorded by the Centers for Disease Control; and the Traffic (http://pems.dot.ca.gov/, accessed on 24 February 2023) dataset describes the occupancy rate of roads in the San Francisco Bay area.
Table 1 shows detailed dataset statistics, where #Sample is the total number of samples,
#Features is the number of features acquired per sampling, Period is the sampling period,
and Span is the sampling time span.

Table 1. Overall statistical indicators for the selected datasets.

Dataset       #Sample   #Features   Period   Span
Electricity   26,304    321         hourly   2012–2014
Exchange      7588      8           daily    1990–2016
Illness       966       7           weekly   2002–2021
Traffic       17,544    862         hourly   2016–2018

Since the scale of each element in the dataset is not uniform, we need to normalise the
data before formal training for the model to treat different features equally during training.
Equations (15) and (16) are the normalisation and denormalisation calculation methods,
respectively, where X denotes the original sampled dataset and X ∗ denotes the normalised
dataset. Figure 4 shows the variation of the four normalised features randomly selected
from the four data sets.
X* = (X − E(X)) / √(D(X))    (15)

X = X* · √(D(X)) + E(X)    (16)
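Equations (15) and (16) correspond to standard z-score normalisation and its inverse; a small NumPy sketch with placeholder data:

```python
import numpy as np

# Equations (15) and (16): standardise every feature to zero mean and unit variance,
# then invert the transform to report predictions on the original scale.
X = np.random.rand(1000, 7)              # placeholder data with 7 features

mean, std = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mean) / std                # X* = (X - E(X)) / sqrt(D(X))
X_back = X_norm * std + mean             # X  = X* . sqrt(D(X)) + E(X)
assert np.allclose(X_back, X)
```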

Figure 4. Variation of the four randomly selected normalised features in the dataset. (a) Electricity. (b) Exchange. (c) Illness. (d) Traffic.

4.2. Comparison Experiments


4.2.1. Baseline Models
To validate the predictive performance of our proposed model, we thoroughly compare
it with some state-of-the-art time series prediction models, including Autoformer [33],
Informer [32], Reformer [31], LogTrans [44], LSTNet [43], LSTM [24], and TCN [45]. Among
them, Autoformer, Informer, Reformer, and LogTrans are all improved models based on
Transformer. Autoformer uses a deep decomposition architecture and an autocorrelation mechanism, and handles long sequences well. LogTrans is an autoregressive model that handles nonlinear and non-stationary data with good robustness through a logarithmic transformation of the input data. LSTM is a classical recurrent neural network model with a gating mechanism
that can effectively deal with the forgetting problem of long-series data prediction. TCN
is a convolutional neural network model that can handle the long-term dependence and
nonlinear variation of long series by adding residual connections between the convolutional
layers, and has high efficiency, good robustness, and small memory occupation.

4.2.2. Experimental Setup


For comparability, we standardise the sequence input length to I = 96. We split the Electricity, Exchange, and Traffic datasets into training, validation, and test sets with a 7:1:2 ratio and set the prediction length to O ∈ {96, 192, 336, 720}; for the ILI dataset, we use a 6:2:2 split and set the prediction length to O ∈ {24, 36, 48, 60}. We set the dimensionality of the model to d_m = 512 and use a hierarchical
attention mechanism with four layers, which stacks AutoCorrelation, Vanilla Attention,
LSH Attention, and ProbSparse Attention from top to bottom. The number of attention
heads is set to 2. Additionally, to ensure comparability, we uniformly set the number of
heads to 8 for the multi-headed attention mechanism in the other Transformer families
involved in the comparison. In the GAT network, we use a two-layer architecture with
a middle hidden layer dimension of 1024, and each node is assigned to have only one
edge pointing to itself (self-loop graph). The sliding window size of the decoder’s moving
average is set to 25, the number of encoder layers is set to N = 2, and the number of
decoder layers is set to M = 1. We use MSE as the loss function and Adam as the optimiser
with a learning rate of 0.0001. We train the model for 20 iterations, but employ an early
termination strategy with a tolerance of 3.
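For reproducibility, the settings listed in this subsection can be gathered into a single configuration; the dictionary below is ours (the key names are illustrative and not taken from the authors' code).

```python
# Hyperparameters reported in Section 4.2.2, collected into one configuration dictionary.
config = {
    "input_len": 96,                                   # I
    "pred_lens": [96, 192, 336, 720],                  # O for Electricity/Exchange/Traffic; [24, 36, 48, 60] for ILI
    "split_ratio": (0.7, 0.1, 0.2),                    # (0.6, 0.2, 0.2) for ILI
    "d_model": 512,
    "hierarchical_attention": ["AutoCorrelation", "Vanilla", "LSH", "ProbSparse"],  # top to bottom
    "n_heads": 2,
    "gat": {"layers": 2, "hidden_dim": 1024, "graph": "self-loop"},
    "moving_avg": 25,                                  # decoder moving-average window
    "encoder_layers": 2,                               # N
    "decoder_layers": 1,                               # M
    "loss": "MSE",
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "train_iterations": 20,
    "early_stopping_patience": 3,
}
```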

4.2.3. Experimental Result


Figure 5 shows how the loss values of our model on the training and validation sets decrease during training. Table 2 presents an overall comparison between our model and the baseline models. The table shows that Transformer-based models deliver significantly better predictions than the other models. Autoformer performs well on several datasets and exhibits lower MAE and MSE values than most other baselines. Informer is also a strong model but does not perform as well as Autoformer on some datasets, while LSTM and TCN generally exhibit higher MAE and MSE values. In contrast, our model achieves optimal or near-optimal accuracy for different prediction lengths on different datasets. Its overall performance is better than that of the other baseline models, indicating that our model can satisfy most sequence prediction tasks.

Table 2. Multivariate long-term series prediction results for four datasets with an input length of
I = 96 and prediction length of O ∈ {96, 192, 336, 720}. Lower MSE and MAE values indicate better
results, and the best results are highlighted in bold.

Models      Metric | Electricity (96, 192, 336, 720) | Exchange (96, 192, 336, 720) | Traffic (96, 192, 336, 720) | ILI (24, 36, 48, 60)
Metaformer  MSE    | 0.184  0.194  0.203  0.243 | 0.157  0.265  0.432  1.387 | 0.569  0.561  0.592  0.611 | 3.004  2.852  2.653  2.769
Metaformer  MAE    | 0.297  0.305  0.319  0.347 | 0.29   0.38   0.483  0.924 | 0.345  0.347  0.363  0.368 | 1.193  1.142  1.088  1.085
Autoformer  MSE    | 0.201  0.222  0.231  0.254 | 0.197  0.3    0.509  1.447 | 0.613  0.616  0.622  0.66  | 3.483  3.103  2.669  2.77
Autoformer  MAE    | 0.317  0.334  0.338  0.361 | 0.323  0.369  0.524  0.941 | 0.388  0.382  0.337  0.408 | 1.287  1.148  1.085  1.125
Informer    MSE    | 0.274  0.296  0.3    0.373 | 0.847  1.204  1.672  2.478 | 0.719  0.696  0.777  0.864 | 5.764  4.755  4.763  5.264
Informer    MAE    | 0.368  0.386  0.394  0.439 | 0.752  0.895  1.036  1.31  | 0.391  0.379  0.42   0.472 | 1.677  1.467  1.469  1.564
Reformer    MSE    | 0.312  0.348  0.35   0.34  | 1.065  1.188  1.357  1.51  | 0.732  0.733  0.742  0.755 | 4.4    4.783  4.832  4.882
Reformer    MAE    | 0.402  0.433  0.433  0.42  | 0.829  0.906  0.976  1.016 | 0.423  0.42   0.42   0.423 | 1.382  1.448  1.465  1.483
LogTrans    MSE    | 0.258  0.266  0.28   0.283 | 0.968  1.04   1.659  1.941 | 0.684  0.685  0.733  0.717 | 4.48   4.799  4.8    5.278
LogTrans    MAE    | 0.357  0.368  0.38   0.376 | 0.812  0.851  1.081  1.127 | 0.384  0.39   0.408  0.396 | 1.444  1.467  1.468  1.56
LSTNet      MSE    | 0.68   0.725  0.828  0.957 | 1.551  1.477  1.507  2.285 | 1.107  1.157  1.216  1.481 | 6.026  5.34   6.08   5.548
LSTNet      MAE    | 0.645  0.676  0.727  0.811 | 1.058  1.028  1.031  1.243 | 0.685  0.706  0.73   0.805 | 1.77   1.668  1.787  1.72
LSTM        MSE    | 0.375  0.442  0.439  0.98  | 1.453  1.846  2.136  2.984 | 0.843  0.847  0.853  1.5   | 5.914  6.631  6.736  6.87
LSTM        MAE    | 0.437  0.473  0.473  0.814 | 1.049  1.179  1.231  1.427 | 0.453  0.453  0.455  0.805 | 1.734  1.845  1.857  1.879
TCN         MSE    | 0.985  0.996  1      1.438 | 3.004  3.048  3.113  3.15  | 1.438  1.463  1.479  1.499 | 6.624  6.858  6.968  7.127
TCN         MAE    | 0.813  0.821  0.824  0.784 | 1.432  1.444  1.459  1.458 | 0.784  0.794  0.799  0.804 | 1.83   1.879  1.892  1.918

Figure 5. The MSE loss plots for the Metaformer model on the training and validation sets for the four datasets. (a) Electricity (train), (b) Electricity (vali), (c) Exchange (train), (d) Exchange (vali), (e) Traffic (train), (f) Traffic (vali), (g) ILI (train), (h) ILI (vali).

4.3. Ablation Experiments


Additional ablation experiments were conducted to further investigate the impact of different graph structures on alleviating the inductive bias. Table 3 lists three graph structures: in Meta-v1, every node in the graph has only a self-loop; in Meta-v2, the nodes are fully bi-directionally connected; and in Meta-v3, every node has a self-loop in addition to full bi-directional connectivity. Table 4 displays the performance of the three variants of the Metaformer model on the four datasets.

Table 3. Three variants of Metaformer. ✓ and ✗ indicate that the specified structure was or was not used, respectively.

Models    Self-Loop   Full Connection
Meta-v1   ✓           ✗
Meta-v2   ✗           ✓
Meta-v3   ✓           ✓

Table 4. Performance of three variants of the Metaformer model on four datasets.

Models   Metric | Electricity (96, 192, 336, 720) | Exchange (96, 192, 336, 720) | Traffic (96, 192, 336, 720) | ILI (24, 36, 48, 60)
Meta-v1  MSE    | 0.184  0.194  0.203  0.243 | 0.157  0.265  0.432  1.387 | 0.569  0.561  0.592  0.611 | 3.004  2.852  2.653  2.769
Meta-v1  MAE    | 0.297  0.305  0.319  0.347 | 0.29   0.38   0.483  0.924 | 0.345  0.347  0.363  0.368 | 1.193  1.142  1.088  1.085
Meta-v2  MSE    | 0.195  0.197  0.212  0.233 | 0.245  0.702  0.973  1.222 | 0.607  0.588  0.59   0.617 | 3.195  3.438  3.318  3.075
Meta-v2  MAE    | 0.309  0.309  0.324  0.34  | 0.375  0.647  0.767  0.896 | 0.397  0.37   0.369  0.383 | 1.231  1.321  1.301  1.161
Meta-v3  MSE    | 0.196  0.199  0.203  0.236 | 0.268  0.706  0.963  1.768 | 0.615  0.596  0.593  0.621 | 3.643  3.289  2.967  3.317
Meta-v3  MAE    | 0.309  0.312  0.316  0.344 | 0.375  0.65   0.756  1.074 | 0.403  0.379  0.37   0.385 | 1.375  1.271  1.207  1.246

As shown in Table 4, the Meta-v1 variant of the model, which uses only the self-
loop graph, generally outperforms the other variants across multiple measures. This
phenomenon may be because the self-loop edges are self-weighted, which is more effective
in reducing the inductive bias of the attention mechanism in the Metaformer model by
reinforcing the features of specific nodes. Conversely, adding a fully connected mech-
anism may further exacerbate the information perturbation. However, due to limited
experimental resources, we cannot conduct a more in-depth study. In future work, we
will further investigate how random sampling of neighbouring nodes, including more
attention mechanisms, and the stacking order of these attention mechanisms affect the
model’s performance.

5. Conclusions
This paper presents a redesigned sequence-to-sequence model based on the Transformer architecture. We draw inspiration from the sequence decomposition module of Autoformer and introduce a similar approach to separate the trend and seasonal components. Additionally, we propose a hierarchical attention mechanism to address the incomplete and insufficient information mining of the multi-head attention mechanism in the Vanilla Transformer. Our hierarchical attention mechanism employs several different attention mechanisms simultaneously to ensure diversity in information mining, and its hierarchical structure recursively passes the information captured by lower-level attention upward, enabling interaction between the attention mechanisms and deepening the network's understanding of more profound information. This mechanism is beneficial for capturing the metaphorical information present in both text and images. We also add a graph attention network to the model, allowing it to aggregate information from a higher-dimensional perspective and mitigate inductive bias. Our experimental results demonstrate that the proposed model outperforms the baseline models across multiple datasets and significantly improves all evaluation metrics.

Author Contributions: Conceptualization, B.P. and Y.D.; data curation, B.P. and W.K.; formal analysis,
B.P.; funding acquisition, Y.D.; investigation, B.P.; methodology, B.P.; project administration, Y.D.;
writing—review and editing, B.P., Y.D. and W.K. All authors have read and agreed to the published
version of the manuscript.
Funding: This research was funded by the National Science Foundation of China and General
Project Fund in the Field of Equipment Development Department, grant number No. 61901079,
No. 61403110308. The APC was funded by Dalian University.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.


Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
2. Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning deep transformer models for machine translation. arXiv
2019, arXiv:1906.01787.
3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. 2017. Available online: http://arxiv.org/abs/1706.03762 (accessed on 23 January 2021).
4. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022. [CrossRef]
5. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In
Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg,
Germany, 2020; pp. 213–229.
6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
7. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the
International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064.
8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
9. Chen, X.; Wu, Y.; Wang, Z.; Liu, S.; Li, J. Developing real-time streaming transformer transducer for speech recognition on
large-scale dataset. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Toronto, ON, Canada, 6–11 July 2021; pp. 5904–5908.
10. Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings
of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20
April 2018; pp. 5884–5888.
11. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-
augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
12. Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck,
D. Music transformer. arXiv 2018, arXiv:1809.04281.
13. Huang, Y.S.; Yang, Y.H. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In
Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1180–1188.
14. Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Hunter, C.A.; Bekas, C.; Lee, A.A. Molecular transformer: A model for uncertainty-
calibrated chemical reaction prediction. ACS Cent. Sci. 2019, 5, 1572–1583. [CrossRef]
15. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function
emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
[CrossRef] [PubMed]
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 July 2016; pp. 770–778.
17. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 1, 91–99. [CrossRef]
18. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten
zip code recognition. Neural Comput. 1989, 1, 541–551. [CrossRef]
19. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. Adv.
Neural Inf. Process. Syst. 2019, 32, 68–80.
20. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the
International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1691–1703.
21. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv
2020, arXiv:2010.04159.
22. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation
from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
23. Van Der Westhuizen, J.; Lasenby, J. The unreasonable effectiveness of the forget gate. arXiv 2018, arXiv:1804.04849.
24. Graves, A.; Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer:
Berlin/Heidelberg, Germany, 2012; pp. 37–45.
25. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
26. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]

27. Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955.
[CrossRef]
28. Mou, L.; Zhao, P.; Xie, H.; Chen, Y. T-LSTM: A long short-term memory neural network enhanced by temporal information for
traffic flow prediction. IEEE Access 2019, 7, 98053–98060. [CrossRef]
29. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International
Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318.
30. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27,
3104–3112.
31. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
32. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence
time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35,
pp. 11106–11115.
33. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting.
Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
34. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term
series forecasting. arXiv 2022, arXiv:2201.12740.
35. Chollet, F. On the measure of intelligence. arXiv 2019, arXiv:1911.01547.
36. Baxter, J. A model of inductive bias learning. J. Artif. Intell. Res. 2000, 12, 149–198. [CrossRef]
37. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87. [CrossRef]
38. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
39. Li, Y.; Tarlow, D.; Brockschmidt, M.; Zemel, R. Gated graph sequence neural networks. arXiv 2015, arXiv:1511.05493.
40. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 2017, 30,
1025–1035.
41. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
42. Chen, H.; Hong, P.; Han, W.; Majumder, N.; Poria, S. Dialogue relation extraction with document-level heterogeneous graph
attention networks. Cogn. Comput. 2023, 15, 793–802. [CrossRef]
43. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long-and short-term temporal patterns with deep neural networks. In
Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor,
MI, USA, 8–12 July 2018; pp. 95–104.
44. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of
transformer on time series forecasting. Adv. Neural Inf. Process. Syst. 2019, 32, 5243–5253.
45. Hao, H.; Wang, Y.; Xia, Y.; Zhao, J.; Shen, F. Temporal convolutional attention-based network for sequence modeling. arXiv 2020,
arXiv:2002.12530.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
