
Are Transformers Effective for Time Series Forecasting?

Ailing Zeng1*, Muxi Chen1*, Lei Zhang2, Qiang Xu1

1 The Chinese University of Hong Kong
2 International Digital Economy Academy (IDEA)
{alzeng, mxchen21, qxu}@cse.cuhk.edu.hk
{leizhang}@idea.edu.cn

* Equal contribution
arXiv:2205.13504v3 [cs.AI] 17 Aug 2022

Abstract

Recently, there has been a surge of Transformer-based solutions for the long-term time series forecasting (LTSF) task. Despite the growing performance over the past few years, we question the validity of this line of research in this work. Specifically, Transformers are arguably the most successful solution for extracting the semantic correlations among the elements in a long sequence. However, in time series modeling, we are to extract the temporal relations in an ordered set of continuous points. While employing positional encoding and using tokens to embed sub-series in Transformers facilitates preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. To validate our claim, we introduce a set of embarrassingly simple one-layer linear models named LTSF-Linear for comparison. Experimental results on nine real-life datasets show that LTSF-Linear surprisingly outperforms existing sophisticated Transformer-based LTSF models in all cases, and often by a large margin. Moreover, we conduct comprehensive empirical studies to explore the impacts of various design elements of LTSF models on their temporal relation extraction capability. We hope this surprising finding opens up new research directions for the LTSF task. We also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future. Code is available at: https://github.com/cure-lab/LTSF-Linear.

1. Introduction

Time series are ubiquitous in today's data-driven world. Given historical data, time series forecasting (TSF) is a long-standing task with a wide range of applications, including but not limited to traffic flow estimation, energy management, and financial investment. Over the past several decades, TSF solutions have undergone a progression from traditional statistical methods (e.g., ARIMA [1]) and machine learning techniques (e.g., GBRT [11]) to deep learning-based solutions, e.g., Recurrent Neural Networks [15] and Temporal Convolutional Networks [3, 17].

Transformer [26] is arguably the most successful sequence modeling architecture, demonstrating unparalleled performance in various applications, such as natural language processing (NLP) [7], speech recognition [8], and computer vision [19, 29]. Recently, there has also been a surge of Transformer-based solutions for time series analysis, as surveyed in [27]. The most notable models, which focus on the less explored and challenging long-term time series forecasting (LTSF) problem, include LogTrans [16] (NeurIPS 2019), Informer [30] (AAAI 2021 Best Paper), Autoformer [28] (NeurIPS 2021), Pyraformer [18] (ICLR 2022 Oral), Triformer [5] (IJCAI 2022), and the recent FEDformer [31] (ICML 2022).

The main working power of Transformers comes from the multi-head self-attention mechanism, which has a remarkable capability of extracting semantic correlations among elements in a long sequence (e.g., words in texts or 2D patches in images). However, self-attention is permutation-invariant and "anti-order" to some extent. While using various types of positional encoding techniques can preserve some ordering information, it is still inevitable to have temporal information loss after applying self-attention on top of them. This is usually not a serious concern for semantic-rich applications such as NLP, e.g., the semantic meaning of a sentence is largely preserved even if we reorder some words in it. However, when analyzing time series data, there is usually a lack of semantics in the numerical data itself, and we are mainly interested in modeling the temporal changes among a continuous set of points. That is, the order itself plays the most crucial role. Consequently, we pose the following intriguing question: Are Transformers really effective for long-term time series forecasting?

Moreover, while existing Transformer-based LTSF solutions have demonstrated considerable prediction accuracy improvements over traditional methods, in their experiments all the compared (non-Transformer) baselines perform autoregressive or iterated multi-step (IMS) forecasting [1, 2, 22, 24], which is known to suffer from significant error accumulation effects for the LTSF problem. Therefore, in this work, we challenge Transformer-based LTSF solutions with direct multi-step (DMS) forecasting strategies to validate their real performance.

Not all time series are predictable, let alone with long-term forecasting (e.g., for chaotic systems). We hypothesize that long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (and are sometimes even increased) when the look-back window size grows. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.

To sum up, the contributions of this work include:

• To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.

• To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can be a new baseline for the LTSF problem.

• We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.

With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves better prediction accuracy compared to existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks in the future.

2. Preliminaries: TSF Problem Formulation

For time series containing C variates, given historical data X = {X_1^t, ..., X_C^t}_{t=1}^{L}, wherein L is the look-back window size and X_i^t is the value of the i-th variate at the t-th time step, the time series forecasting task is to predict the values X̂ = {X̂_1^t, ..., X̂_C^t}_{t=L+1}^{L+T} at the T future time steps. When T > 1, iterated multi-step (IMS) forecasting [23] learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting [4] directly optimizes the multi-step forecasting objective at once.

Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when there is a highly accurate single-step forecaster and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an unbiased single-step forecasting model, or when T is large.
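To make the IMS/DMS distinction above concrete, the following is a minimal sketch of the two prediction loops. The stand-in models (one_step_model, dms_model) are hypothetical placeholders for illustration, not the forecasters evaluated in this paper.

```python
# Sketch of the two forecasting strategies from Section 2 (illustrative only).
import numpy as np

def ims_forecast(history, one_step_model, T):
    """Iterated multi-step: apply a single-step forecaster T times,
    feeding each prediction back as input (errors can accumulate)."""
    window = list(history)
    preds = []
    for _ in range(T):
        y_next = one_step_model(np.array(window))   # predict one step ahead
        preds.append(y_next)
        window = window[1:] + [y_next]              # roll the look-back window
    return np.array(preds)

def dms_forecast(history, dms_model, T):
    """Direct multi-step: one call maps the whole look-back window
    to all T future values at once."""
    return dms_model(np.array(history), T)

# Toy stand-ins: persistence for the one-step model, repeat-last for the DMS model.
one_step_model = lambda w: w[-1]
dms_model = lambda w, T: np.full(T, w[-1])

history = np.sin(np.linspace(0, 10, 96))
print(ims_forecast(history, one_step_model, 5))
print(dms_forecast(history, dms_model, 5))
```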
3. Transformer-Based LTSF Solutions

Transformer-based models [26] have achieved unparalleled performance in many long-standing AI tasks in the natural language processing and computer vision fields, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered a great deal of research interest in Transformer-based time series modeling techniques [20, 27]. In particular, a large number of research works are dedicated to the LTSF task (e.g., [16, 18, 28, 30, 31]). Considering the ability of Transformer models to capture long-range dependencies, most of them focus on the less-explored long-term forecasting problem (T ≫ 1). (Due to the page limit, we leave the discussion of non-Transformer forecasting solutions to the Appendix.)

When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and error accumulation caused by the autoregressive decoder design. Informer [30] addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later, more Transformer variants introduce various time series features into their models for performance or efficiency improvements [18, 28, 31]. We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).

Figure 1. The pipeline of existing Transformer-based TSF solutions. In (a) and (b), the solid boxes are essential operations, and the dotted boxes are applied optionally; (c) and (d) are distinct for different methods [16, 18, 28, 30, 31]. (a) Preprocessing: normalization; seasonal-trend decomposition. (b) Embedding: channel projection; fixed position; local timestamp; global timestamp. (c) Encoder: LogSparse and convolutional self-attention (LogTrans); ProbSparse and distilling self-attention (Informer); series auto-correlation with decomposition (Autoformer); multi-resolution pyramidal attention (Pyraformer); frequency enhanced block with decomposition (FEDformer). (d) Decoder: IMS (LogTrans); DMS (Informer); DMS with auto-correlation and decomposition (Autoformer); DMS along the spatio-temporal dimension (Pyraformer); DMS with frequency attention and decomposition (FEDformer).
Time series decomposition: For data preprocessing, normalization with zero mean is common in TSF. Besides, Autoformer [28] first applies seasonal-trend decomposition behind each neural block, which is a standard method in time series analysis to make raw data more predictable [6, 13]. Specifically, they use a moving average kernel on the input sequence to extract the trend-cyclical component of the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer [31] further proposes a mixture-of-experts strategy to mix the trend components extracted by moving average kernels with various kernel sizes.

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative [30]. To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings into the input sequence, such as a fixed positional encoding, a channel projection embedding, and learnable temporal embeddings. Moreover, temporal embeddings with a temporal convolution layer [16] or learnable timestamps [28] are introduced.

Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the O(L^2) time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a LogSparse mask to reduce the computational complexity to O(L log L), while Pyraformer adopts pyramidal attention that captures hierarchically multi-scale temporal dependencies with O(L) time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property of the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to O(L log L), and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain O(L) complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.

Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in slow inference speed and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating the spatio-temporal axes as the decoder. Autoformer sums up two refined decomposed features, from the trend-cyclical components and the stacked auto-correlation mechanism for the seasonal components, to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.

The premise of Transformer models is the semantic correlations between paired elements, while the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on the positional encodings associated with the input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements, instead of the paired relationships, plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitate preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.
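The permutation-invariance argument above can be checked numerically: a toy single-head self-attention layer (without positional encoding) maps a shuffled input to the correspondingly shuffled output, so its output carries no information about the original temporal order. This is an illustrative sketch, not code from any of the discussed models.

```python
# Toy single-head self-attention in NumPy to illustrate permutation invariance.
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 4                       # sequence length, model dimension
X = rng.normal(size=(L, d))       # a "time series" of L tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

perm = rng.permutation(L)         # shuffle the temporal order
out_ori = self_attention(X)
out_shuf = self_attention(X[perm])

# Without positional encoding, shuffling the inputs merely shuffles the outputs:
# the set of outputs is identical, so the temporal order is not used at all.
print(np.allclose(out_shuf, out_ori[perm]))   # True
```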
4. An Embarrassingly Simple Baseline

In the experiments of existing Transformer-based LTSF solutions (T ≫ 1), all the compared (non-Transformer) baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements reported in these works are largely due to the DMS strategy used in them.

Figure 2. Illustration of the basic linear model: one linear layer maps the history (L time steps) directly to the forecasting output (the future T time steps).

To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is X̂_i = W X_i, where W ∈ R^{T×L} is a linear layer along the temporal axis, and X̂_i and X_i are the prediction and input for the i-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.

LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy), we further introduce two variants with two preprocessing methods, named DLinear and NLinear.

• Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes the raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. By explicitly handling trend, DLinear enhances the performance of a vanilla Linear when there is a clear trend in the data.

• Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the last value of the sequence from the input. Then, the input goes through a linear layer, and the subtracted part is added back before making the final prediction. The subtraction and addition in NLinear are a simple normalization of the input sequence.
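The three LTSF-Linear variants described above are small enough to sketch directly. The following is a minimal PyTorch-style sketch under our own assumptions (e.g., replicate padding for the moving average); it is not the authors' released implementation.

```python
# A minimal sketch of the three LTSF-Linear variants (assumptions noted in comments).
import torch
import torch.nn as nn

class Linear(nn.Module):
    """Vanilla Linear: X_hat = W X along the temporal axis, W in R^{T x L},
    shared across all C variates."""
    def __init__(self, L, T):
        super().__init__()
        self.proj = nn.Linear(L, T)

    def forward(self, x):                     # x: (batch, L, C)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)   # (batch, T, C)

class NLinear(Linear):
    """Subtract the last value of the look-back window, apply Linear, add it back."""
    def forward(self, x):
        last = x[:, -1:, :]                   # (batch, 1, C)
        return super().forward(x - last) + last

class DLinear(nn.Module):
    """Decompose the input into trend (moving average) + remainder, apply one
    Linear to each component, and sum the two predictions."""
    def __init__(self, L, T, kernel=25):      # kernel size 25 follows the text
        super().__init__()
        self.kernel = kernel
        self.trend, self.seasonal = Linear(L, T), Linear(L, T)

    def forward(self, x):                     # x: (batch, L, C)
        pad = (self.kernel - 1) // 2
        xt = x.transpose(1, 2)                # (batch, C, L)
        # Replicate padding at both ends is our assumption for the moving average.
        xt = nn.functional.pad(xt, (pad, self.kernel - 1 - pad), mode="replicate")
        trend = nn.functional.avg_pool1d(xt, self.kernel, stride=1).transpose(1, 2)
        return self.trend(trend) + self.seasonal(x - trend)

x = torch.randn(8, 336, 7)                    # batch of 8, L=336, C=7 (e.g., ETTh1)
print(DLinear(336, 96)(x).shape)              # torch.Size([8, 96, 7])
```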
5. Experiments

5.1. Experimental Settings

Dataset. We conduct extensive experiments on nine widely-used real-world datasets, including ETT (Electricity Transformer Temperature) [30] (ETTh1, ETTh2, ETTm1, ETTm2), Traffic, Electricity, Weather, ILI, and Exchange-Rate [15]. All of them are multivariate time series. We leave the data descriptions to the Appendix.

Evaluation metric. Following previous works [28, 30, 31], we use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the core metrics to compare performance.

Compared methods. We include five recent Transformer-based methods: FEDformer [31], Autoformer [28], Informer [30], Pyraformer [18], and LogTrans [16]. Besides, we include a naive DMS method, Closest Repeat (Repeat), which repeats the last value in the look-back window, as another simple baseline. Since there are two variants of FEDformer, we compare against the one with better accuracy (FEDformer-f, via Fourier transform).

Datasets | ETTh1&ETTh2 | ETTm1&ETTm2 | Traffic | Electricity | Exchange-Rate | Weather | ILI
Variates | 7 | 7 | 862 | 321 | 8 | 21 | 7
Timesteps | 17,420 | 69,680 | 17,544 | 26,304 | 7,588 | 52,696 | 966
Granularity | 1 hour | 5 min | 1 hour | 1 hour | 1 day | 10 min | 1 week

Table 1. The statistics of the nine popular datasets for the LTSF problem.
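For reference, the naive Repeat baseline and the two reported metrics can be written in a few lines; the array shapes (windows, horizon, variates) are an assumption for illustration.

```python
# Sketch of the Repeat baseline and the MSE/MAE metrics used for comparison.
import numpy as np

def repeat_forecast(history, T):
    """Repeat: copy the last value of the look-back window T times."""
    return np.repeat(history[:, -1:, :], T, axis=1)   # history: (N, L, C)

def mse(pred, true):
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    return np.mean(np.abs(pred - true))

history = np.random.randn(32, 96, 7)     # N=32 windows, L=96, C=7
true = np.random.randn(32, 720, 7)       # T=720 targets
pred = repeat_forecast(history, 720)
print(mse(pred, true), mae(pred, true))
```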
5.2. Comparison with Transformers

Quantitative results. In Table 2, we extensively evaluate all the mentioned Transformers on nine benchmarks, following the experimental setting of previous work [28, 30, 31]. Surprisingly, the performance of LTSF-Linear surpasses the SOTA FEDformer in most cases by 20% ∼ 50% improvements on multivariate forecasting, even though LTSF-Linear does not model correlations among variates. For different time series benchmarks, NLinear and DLinear show their superiority in handling distribution shift and trend-seasonality features. We also provide results for univariate forecasting on the ETT datasets in the Appendix, where LTSF-Linear still consistently outperforms Transformer-based LTSF solutions by a large margin.

Table 2 (data rows follow, grouped by dataset). Column order per row: forecasting horizon T, IMP., then the MSE and MAE of each of Linear*, NLinear*, DLinear*, FEDformer, Autoformer, Informer, Pyraformer*, LogTrans, and Repeat*.

Electricity:
96 27.40% 0.140 0.237 0.141 0.237 0.140 0.237 0.193 0.308 0.201 0.317 0.274 0.368 0.386 0.449 0.258 0.357 1.588 0.946
192 23.88% 0.153 0.250 0.154 0.248 0.153 0.249 0.201 0.315 0.222 0.334 0.296 0.386 0.386 0.443 0.266 0.368 1.595 0.950
336 21.02% 0.169 0.268 0.171 0.265 0.169 0.267 0.214 0.329 0.231 0.338 0.300 0.394 0.378 0.443 0.280 0.380 1.617 0.961
720 17.47% 0.203 0.301 0.210 0.297 0.203 0.301 0.246 0.355 0.254 0.361 0.373 0.439 0.376 0.445 0.283 0.376 1.647 0.975
Exchange-Rate:
96 45.27% 0.082 0.207 0.089 0.208 0.081 0.203 0.148 0.278 0.197 0.323 0.847 0.752 0.376 1.105 0.968 0.812 0.081 0.196
192 42.06% 0.167 0.304 0.180 0.300 0.157 0.293 0.271 0.380 0.300 0.369 1.204 0.895 1.748 1.151 1.040 0.851 0.167 0.289
336 33.69% 0.328 0.432 0.331 0.415 0.305 0.414 0.460 0.500 0.509 0.524 1.672 1.036 1.874 1.172 1.659 1.081 0.305 0.396
720 46.19% 0.964 0.750 1.033 0.780 0.643 0.601 1.195 0.841 1.447 0.941 2.478 1.310 1.943 1.206 1.941 1.127 0.823 0.681
Traffic:
96 30.15% 0.410 0.282 0.410 0.279 0.410 0.282 0.587 0.366 0.613 0.388 0.719 0.391 2.085 0.468 0.684 0.384 2.723 1.079
192 29.96% 0.423 0.287 0.423 0.284 0.423 0.287 0.604 0.373 0.616 0.382 0.696 0.379 0.867 0.467 0.685 0.390 2.756 1.087
336 29.95% 0.436 0.295 0.435 0.290 0.436 0.296 0.621 0.383 0.622 0.337 0.777 0.420 0.869 0.469 0.734 0.408 2.791 1.095
720 25.87% 0.466 0.315 0.464 0.307 0.466 0.315 0.626 0.382 0.660 0.408 0.864 0.472 0.881 0.473 0.717 0.396 2.811 1.097
Weather:
96 18.89% 0.176 0.236 0.182 0.232 0.176 0.237 0.217 0.296 0.266 0.336 0.300 0.384 0.896 0.556 0.458 0.490 0.259 0.254
192 21.01% 0.218 0.276 0.225 0.269 0.220 0.282 0.276 0.336 0.307 0.367 0.598 0.544 0.622 0.624 0.658 0.589 0.309 0.292
336 22.71% 0.262 0.312 0.271 0.301 0.265 0.319 0.339 0.380 0.359 0.395 0.578 0.523 0.739 0.753 0.797 0.652 0.377 0.338
720 19.85% 0.326 0.365 0.338 0.348 0.323 0.362 0.403 0.428 0.419 0.428 1.059 0.741 1.004 0.934 0.869 0.675 0.465 0.394
ILI:
24 47.86% 1.947 0.985 1.683 0.858 2.215 1.081 3.228 1.260 3.483 1.287 5.764 1.677 1.420 2.012 4.480 1.444 6.587 1.701
36 36.43% 2.182 1.036 1.703 0.859 1.963 0.963 2.679 1.080 3.103 1.148 4.755 1.467 7.394 2.031 4.799 1.467 7.130 1.884
48 34.43% 2.256 1.060 1.719 0.884 2.130 1.024 2.622 1.078 2.669 1.085 4.763 1.469 7.551 2.057 4.800 1.468 6.575 1.798
60 34.33% 2.390 1.104 1.819 0.917 2.368 1.096 2.857 1.157 2.770 1.125 5.264 1.564 7.662 2.100 5.278 1.560 5.893 1.677
ETTh1:
96 0.80% 0.375 0.397 0.374 0.394 0.375 0.399 0.376 0.419 0.449 0.459 0.865 0.713 0.664 0.612 0.878 0.740 1.295 0.713
192 3.57% 0.418 0.429 0.408 0.415 0.405 0.416 0.420 0.448 0.500 0.482 1.008 0.792 0.790 0.681 1.037 0.824 1.325 0.733
336 6.54% 0.479 0.476 0.429 0.427 0.439 0.443 0.459 0.465 0.521 0.496 1.107 0.809 0.891 0.738 1.238 0.932 1.323 0.744
720 13.04% 0.624 0.592 0.440 0.453 0.472 0.490 0.506 0.507 0.514 0.512 1.181 0.865 0.963 0.782 1.135 0.852 1.339 0.756
ETTh2:
96 19.94% 0.288 0.352 0.277 0.338 0.289 0.353 0.346 0.388 0.358 0.397 3.755 1.525 0.645 0.597 2.116 1.197 0.432 0.422
192 19.81% 0.377 0.413 0.344 0.381 0.383 0.418 0.429 0.439 0.456 0.452 5.602 1.931 0.788 0.683 4.315 1.635 0.534 0.473
336 25.93% 0.452 0.461 0.357 0.400 0.448 0.465 0.496 0.487 0.482 0.486 4.721 1.835 0.907 0.747 1.124 1.604 0.591 0.508
720 14.25% 0.698 0.595 0.394 0.436 0.605 0.551 0.463 0.474 0.515 0.511 3.647 1.625 0.963 0.783 3.188 1.540 0.588 0.517
ETTm1:
96 21.10% 0.308 0.352 0.306 0.348 0.299 0.343 0.379 0.419 0.505 0.475 0.672 0.571 0.543 0.510 0.600 0.546 1.214 0.665
192 21.36% 0.340 0.369 0.349 0.375 0.335 0.365 0.426 0.441 0.553 0.496 0.795 0.669 0.557 0.537 0.837 0.700 1.261 0.690
336 17.07% 0.376 0.393 0.375 0.388 0.369 0.386 0.445 0.459 0.621 0.537 1.212 0.871 0.754 0.655 1.124 0.832 1.283 0.707
720 21.73% 0.440 0.435 0.433 0.422 0.425 0.421 0.543 0.490 0.671 0.561 1.166 0.823 0.908 0.724 1.153 0.820 1.319 0.729
ETTm2:
96 17.73% 0.168 0.262 0.167 0.255 0.167 0.260 0.203 0.287 0.255 0.339 0.365 0.453 0.435 0.507 0.768 0.642 0.266 0.328
192 17.84% 0.232 0.308 0.221 0.293 0.224 0.303 0.269 0.328 0.281 0.340 0.533 0.563 0.730 0.673 0.989 0.757 0.340 0.371
336 15.69% 0.320 0.373 0.274 0.327 0.281 0.342 0.325 0.366 0.339 0.372 1.363 0.887 1.201 0.845 1.334 0.872 0.412 0.410
720 12.58% 0.413 0.435 0.368 0.384 0.397 0.421 0.421 0.415 0.433 0.432 3.379 1.338 3.625 1.451 3.048 1.328 0.521 0.465
- Methods marked with * are implemented by us; the other results are taken from FEDformer [31].

Table 2. Multivariate long-term forecasting errors in terms of MSE and MAE; the lower, the better. The ILI dataset uses forecasting horizons T ∈ {24, 36, 48, 60}; the others use T ∈ {96, 192, 336, 720}. Repeat repeats the last value in the look-back window. In the original paper, the best results are highlighted in bold and the best results among Transformers are underlined. IMP. is the improvement of the best linear model over the best Transformer-based result.

FEDformer achieves competitive forecasting accuracy on ETTh1. This is because FEDformer employs classical time series analysis techniques such as frequency processing, which bring in a time series inductive bias and benefit the ability of temporal feature extraction. In summary, these results reveal that existing complex Transformer-based LTSF solutions are seemingly not effective on the existing nine benchmarks, while LTSF-Linear can be a powerful baseline.

Another interesting observation is that even though the naive Repeat method shows worse results when predicting long-term seasonal data (e.g., Electricity and Traffic), it surprisingly outperforms all Transformer-based methods on Exchange-Rate (by around 45%). This is mainly caused by the wrong prediction of trends in Transformer-based solutions, which may overfit to sudden change noises in the training data, resulting in significant accuracy degradation (see Figure 3(b)). Instead, Repeat does not have this bias.

Qualitative results. As shown in Figure 3, we plot the prediction results of Transformer-based solutions and LTSF-Linear on three selected time series: Electricity (Sequence 1951, Variate 36), Exchange-Rate (Sequence 676, Variate 3), and ETTh2 (Sequence 1241, Variate 2), which have different temporal patterns. When the input length is 96 steps and the output horizon is 336 steps, Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.

5.3. More Analyses on LTSF-Transformers

Can existing LTSF-Transformers extract temporal relations well from longer input sequences? The size of the look-back window greatly impacts forecasting accuracy, as it determines how much we can learn from the historical data. Generally speaking, a powerful TSF model with a strong temporal relation extraction capability should be able to achieve better results with larger look-back window sizes. To study the impact of input look-back window sizes, we conduct experiments with L ∈ {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720} for long-term forecasting (T=720). Figure 4 shows the MSE results on two datasets. Similar to the observations from previous studies [27, 30], the performance of existing Transformer-based models deteriorates or stays stable when the look-back window size increases. In contrast, the performance of all LTSF-Linear models is significantly boosted as the look-back window size grows. Thus, existing solutions tend to overfit temporal noises instead of extracting temporal information when given a longer sequence, and the input size 96 is exactly suitable for most Transformers.
Figure 3. Illustration of the long-term forecasting output (Y-axis) of five models with an input length L=96 and output length T=192 (X-axis) on (a) Electricity, (b) Exchange-Rate, and (c) ETTh2, respectively. The plotted curves are the ground truth, Autoformer, Informer, FEDformer, and DLinear.

Additionally, we provide more quantitative results in the Appendix, and our conclusion holds in almost all cases.

Figure 4. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (T=720) on (a) the Traffic and (b) the Electricity datasets. The compared models are the vanilla Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear.

What can be learned for long-term forecasting? While the temporal dynamics in the look-back window significantly impact the forecasting accuracy of short-term time series forecasting, we hypothesize that long-term forecasting depends only on whether models can capture the trend and periodicity well. That is, the farther the forecasting horizon, the less impact the look-back window itself has.

To validate the above hypothesis, in Table 3 we compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close), and (ii) the far input L=96 setting (called Far) that lies before the original 96 time steps. From the experimental results, the performance of the SOTA Transformers drops only slightly, indicating that these models capture only similar temporal information from adjacent time series sequences. Capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters, i.e., one parameter can represent the periodicity. Using too many parameters will even cause overfitting, which partially explains why LTSF-Linear performs better than Transformer-based methods.

Dataset | FEDformer (Close) | FEDformer (Far) | Autoformer (Close) | Autoformer (Far)
Electricity | 0.251 | 0.265 | 0.255 | 0.287
Traffic | 0.631 | 0.645 | 0.677 | 0.675

Table 3. Comparison of different input sequences under the MSE metric to explore what LTSF-Transformers depend on. If the input is Close, we use the 96th, ..., 191st time steps as the input sequence. If the input is Far, we use the 0th, ..., 95th time steps. Both of them forecast the 192nd, ..., (192+720)th time steps.

Are the self-attention schemes effective for LTSF? We verify whether the complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer into Linear. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights change dynamically. Furthermore, we discard the other auxiliary designs (e.g., FFN) in Informer to leave only the embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. Surprisingly, the performance of Informer grows with this gradual simplification, indicating that the self-attention scheme and other complex modules are unnecessary, at least for the existing LTSF benchmarks.

Dataset | T | Informer | Att.-Linear | Embed + Linear | Linear
Exchange | 96 | 0.847 | 1.003 | 0.173 | 0.084
Exchange | 192 | 1.204 | 0.979 | 0.443 | 0.155
Exchange | 336 | 1.672 | 1.498 | 1.288 | 0.301
Exchange | 720 | 2.478 | 2.102 | 2.026 | 0.763
ETTh1 | 96 | 0.865 | 0.613 | 0.454 | 0.400
ETTh1 | 192 | 1.008 | 0.759 | 0.686 | 0.438
ETTh1 | 336 | 1.107 | 0.921 | 0.821 | 0.479
ETTh1 | 720 | 1.181 | 0.902 | 1.051 | 0.515

Table 4. The MSE comparisons of gradually transforming Informer into a Linear, from the left to the right columns. Att.-Linear is a structure that replaces each attention layer with a linear layer. Embed + Linear drops the other designs and only keeps the embedding layers and a linear layer. The look-back window size is 96.
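As a concrete reading of the Close/Far protocol defined in Table 3, the two inputs can be sliced as follows; the index conventions are our assumption based on the caption.

```python
# Sketch of the Close/Far input construction from Table 3 (illustrative only).
import numpy as np

series = np.random.randn(1000, 862)      # a (time, variates) array, e.g., Traffic

far    = series[0:96]                    # "Far":   the 0th ... 95th time steps
close  = series[96:192]                  # "Close": the 96th ... 191st time steps
target = series[192:192 + 720]           # both settings forecast steps 192 onward

print(far.shape, close.shape, target.shape)
```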
Methods: Linear | FEDformer | Autoformer | Informer (each under the three input settings Ori., Shuf., Half-Ex.)
Predict Length | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex.

Exchange:
96 0.080 0.133 0.169 0.161 0.160 0.162 0.152 0.158 0.160 0.952 1.004 0.959
192 0.162 0.208 0.243 0.274 0.275 0.275 0.278 0.271 0.277 1.012 1.023 1.014
336 0.286 0.320 0.345 0.439 0.439 0.439 0.435 0.430 0.435 1.177 1.181 1.177
720 0.806 0.819 0.836 1.122 1.122 1.122 1.113 1.113 1.113 1.198 1.210 1.196
Average Drop N/A 27.26% 46.81% N/A -0.09% 0.20% N/A 0.09% 1.12% N/A -0.12% -0.18%
ETTh1:
96 0.395 0.824 0.431 0.376 0.753 0.405 0.455 0.838 0.458 0.974 0.971 0.971
192 0.447 0.824 0.471 0.419 0.730 0.436 0.486 0.774 0.491 1.233 1.232 1.231
336 0.490 0.825 0.505 0.447 0.736 0.453 0.496 0.752 0.497 1.693 1.693 1.691
720 0.520 0.846 0.528 0.468 0.720 0.470 0.525 0.696 0.524 2.720 2.716 2.715
Average Drop N/A 81.06% 4.78% N/A 73.28% 3.44% N/A 56.91% 0.46% N/A 1.98% 0.18%
Table 5. The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the input sequence. Half-Ex. randomly exchanges the first half of the input sequence with the second half. Average Drop is the average performance drop over all forecasting lengths after shuffling. All results are the average test MSE of five runs.
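A sketch of the two shuffling protocols used in Table 5, assuming inputs shaped (L, C); this mirrors the caption's description rather than the authors' exact code.

```python
# Sketch of the Shuf. and Half-Ex. input perturbations from Table 5.
import numpy as np

def shuf(x, rng):
    """Shuf.: randomly shuffle the whole input sequence along the time axis."""
    return x[rng.permutation(len(x))]

def half_ex(x):
    """Half-Ex.: exchange the first half of the input sequence with the second half."""
    half = len(x) // 2
    return np.concatenate([x[half:], x[:half]], axis=0)

rng = np.random.default_rng(0)
x = np.arange(8)[:, None].repeat(2, axis=1)   # a tiny (L=8, C=2) example
print(shuf(x, rng)[:, 0], half_ex(x)[:, 0])
```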

Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-invariant, i.e., regardless of the order. However, in time-series forecasting, the sequence order often plays a crucial role. We argue that, even with positional and temporal embeddings, existing Transformer-based methods still suffer from temporal information loss. In Table 5, we shuffle the raw input before the embedding strategies. Two shuffling strategies are presented: Shuf. randomly shuffles the whole input sequence, and Half-Ex. exchanges the first half of the input sequence with the second half. Interestingly, compared with the original setting (Ori.) on Exchange-Rate, the performance of all Transformer-based methods does not fluctuate even when the input sequence is randomly shuffled. By contrast, the performance of LTSF-Linear is damaged significantly. These results indicate that LTSF-Transformers with different positional and temporal embeddings preserve quite limited temporal relations and are prone to overfitting on noisy financial data, while LTSF-Linear can model the order naturally and avoid overfitting with fewer parameters.

For the ETTh1 dataset, FEDformer and Autoformer introduce a time series inductive bias into their models, enabling them to extract certain temporal information when the dataset has clearer temporal patterns (e.g., periodicity) than Exchange-Rate. Therefore, the average drops of the two Transformers are 73.28% and 56.91% under the Shuf. setting, where the whole order information is lost. Moreover, Informer suffers less from both the Shuf. and Half-Ex. settings since it has no such temporal inductive bias. Overall, the average drops of LTSF-Linear are larger than those of the Transformer-based methods in all cases, indicating that the existing Transformers do not preserve temporal order well.

How effective are different embedding strategies? We study the benefits of the position and timestamp embeddings used in Transformer-based methods. In Table 6, the forecasting errors of Informer largely increase without positional embeddings (wo/Pos.). Without timestamp embeddings (wo/Temp.), the performance of Informer gradually degrades as the forecasting length increases. Since Informer uses a single time step for each token, it is necessary to introduce temporal information in its tokens.

Methods | Embedding | Traffic T=96 | T=192 | T=336 | T=720
FEDformer | All | 0.597 | 0.606 | 0.627 | 0.649
FEDformer | wo/Pos. | 0.587 | 0.604 | 0.621 | 0.626
FEDformer | wo/Temp. | 0.613 | 0.623 | 0.650 | 0.677
FEDformer | wo/Pos.-Temp. | 0.613 | 0.622 | 0.648 | 0.663
Autoformer | All | 0.629 | 0.647 | 0.676 | 0.638
Autoformer | wo/Pos. | 0.613 | 0.616 | 0.622 | 0.660
Autoformer | wo/Temp. | 0.681 | 0.665 | 0.908 | 0.769
Autoformer | wo/Pos.-Temp. | 0.672 | 0.811 | 1.133 | 1.300
Informer | All | 0.719 | 0.696 | 0.777 | 0.864
Informer | wo/Pos. | 1.035 | 1.186 | 1.307 | 1.472
Informer | wo/Temp. | 0.754 | 0.780 | 0.903 | 1.259
Informer | wo/Pos.-Temp. | 1.038 | 1.351 | 1.491 | 1.512

Table 6. The MSE comparisons of different embedding strategies on Transformer-based methods on Traffic, with look-back window size 96 and forecasting lengths {96, 192, 336, 720}.

Rather than using a single time step in each token, FEDformer and Autoformer input a sequence of timestamps to embed the temporal information. Hence, they can achieve comparable or even better performance without fixed positional embeddings. However, without timestamp embeddings, the performance of Autoformer declines rapidly because of the loss of global temporal information. Instead, thanks to the frequency-enhanced module proposed in FEDformer to introduce a temporal inductive bias, it suffers less from removing any position/timestamp embeddings.

Is training data size a limiting factor for existing LTSF-Transformers? Some may argue that the poor performance of Transformer-based solutions is due to the small sizes of the benchmark datasets. Unlike computer vision or natural language processing tasks, TSF is performed on collected time series, and it is difficult to scale up the training data size. In fact, the size of the training data would indeed have a significant impact on model performance. Accordingly, we conduct experiments on Traffic, comparing the performance of the model trained on the full dataset (17,544×0.7 hours), named Ori., with that trained on a shortened dataset (8,760 hours, i.e., 1 year), called Short.
Unexpectedly, Table 7 shows that the prediction errors with reduced training data are lower in most cases. This might be because the whole-year data maintains clearer temporal features than a longer but incomplete data span. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting reason for the performance of Autoformer and FEDformer.

Methods | FEDformer (Ori.) | FEDformer (Short) | Autoformer (Ori.) | Autoformer (Short)
96 | 0.587 | 0.568 | 0.613 | 0.594
192 | 0.604 | 0.584 | 0.616 | 0.621
336 | 0.621 | 0.601 | 0.622 | 0.621
720 | 0.626 | 0.608 | 0.660 | 0.650

Table 7. The MSE comparison of two training data sizes.

Is efficiency really a top-level priority? Existing LTSF-Transformers claim that the O(L^2) complexity of the vanilla Transformer is unaffordable for the LTSF problem. Although they prove to be able to improve the theoretical time and memory complexity from O(L^2) to O(L), it is unclear whether 1) the actual inference time and memory cost on devices are improved, and 2) the memory issue is unacceptable and urgent for today's GPUs (e.g., an NVIDIA Titan XP here). In Table 8, we compare the average practical efficiencies over 5 runs. Interestingly, compared with the vanilla Transformer (with the same DMS decoder), most Transformer variants incur similar or even worse inference time and parameter counts in practice. These follow-ups introduce more additional design elements that make practical costs high. Moreover, the memory cost of the vanilla Transformer is practically acceptable, even for the output length T=720, which weakens the importance of developing memory-efficient Transformers, at least for the existing benchmarks.

Method | MACs | Parameters | Time | Memory
DLinear | 0.04G | 139.7K | 0.4ms | 687MiB
Transformer× | 4.03G | 13.61M | 26.8ms | 6091MiB
Informer | 3.93G | 14.39M | 49.3ms | 3869MiB
Autoformer | 4.41G | 14.91M | 164.1ms | 7607MiB
Pyraformer | 0.80G | 241.4M* | 3.4ms | 7017MiB
FEDformer | 4.41G | 20.68M | 40.5ms | 4143MiB
- ×: modified to use the same one-step (DMS) decoder, as implemented in the source code of Autoformer.
- *: 236.7M of Pyraformer's parameters come from its linear decoder.

Table 8. Comparison of the practical efficiency of LTSF-Transformers under L=96 and T=720 on Electricity. MACs are the number of multiply-accumulate operations. We use DLinear for the comparison since it has double the cost of the other LTSF-Linear models. The inference time is averaged over 5 runs.

6. Conclusion and Future Work

Conclusion. This work questions the effectiveness of the emerging favored Transformer-based solutions for the long-term time series forecasting problem. We use an embarrassingly simple linear model, LTSF-Linear, as a DMS forecasting baseline to verify our claims. Note that our contributions do not come from proposing a linear model but rather from raising an important question, showing surprising comparisons, and demonstrating from various perspectives why LTSF-Transformers are not as effective as claimed in these works. We sincerely hope our comprehensive studies can benefit future work in this area.

Future work. LTSF-Linear has a limited model capacity, and it merely serves as a simple yet competitive baseline with strong interpretability for future research. For example, it is hard for the one-layer linear network to capture the temporal dynamics caused by change points [25]. Consequently, we believe there is great potential for new model designs, data processing, and benchmarks to tackle the challenging LTSF problem.
Appendix:
Are Transformers Effective for Time Series Forecasting?

In this Appendix, we provide descriptions of non-Transformer-based TSF solutions, detailed experimental settings, more comparisons under different look-back window sizes, and the visualization of LTSF-Linear on all datasets. We also append our code to reproduce the results shown in the paper.

A. Related Work: Non-Transformer-Based TSF Solutions

As a long-standing problem with a wide range of applications, statistical approaches for time series forecasting (e.g., autoregressive integrated moving average (ARIMA) [1], exponential smoothing [12], and structural models [14]) have been used from the 1970s onward. Generally speaking, the parametric models used in statistical methods require significant domain expertise to build.

To relieve this burden, many machine learning techniques, such as gradient boosting regression trees (GBRT) [10, 11], gained popularity; they learn the temporal dynamics of time series in a data-driven manner. However, these methods still require manual feature engineering and model designs. With the powerful representation learning capability of deep neural networks (DNNs) on abundant data, various deep learning-based TSF solutions have been proposed in the literature, achieving better forecasting accuracy than traditional techniques in many cases.

Besides Transformers, two other popular DNN architectures are also applied to time series forecasting:

• Recurrent neural network (RNN)-based methods (e.g., [21]) summarize the past information compactly in internal memory states and recursively update themselves for forecasting.

• Convolutional neural network (CNN)-based methods (e.g., [3]), wherein convolutional filters are used to capture local temporal features.

RNN-based TSF methods belong to the IMS forecasting techniques. Depending on whether the decoder is implemented in an autoregressive manner, there are either IMS or DMS forecasting techniques for CNN-based TSF methods [3, 17].

B. Experimental Details

B.1. Data Descriptions

We use nine widely-used datasets in the main paper. The details are listed in the following.

• ETT (Electricity Transformer Temperature) [30] (https://github.com/zhouhaoyi/ETDataset) consists of two hourly-level datasets (ETTh) and two 15-minute-level datasets (ETTm). Each of them contains seven oil and load features of electricity transformers from July 2016 to July 2018.

• Traffic (http://pems.dot.ca.gov) describes road occupancy rates. It contains the hourly data recorded by the sensors of San Francisco freeways from 2015 to 2016.

• Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) collects the hourly electricity consumption of 321 clients from 2012 to 2014.

• Exchange-Rate [15] (https://github.com/laiguokun/multivariate-time-series-data) collects the daily exchange rates of 8 countries from 1990 to 2016.

• Weather (https://www.bgc-jena.mpg.de/wetter/) includes 21 weather indicators, such as air temperature and humidity. Its data is recorded every 10 min for 2020 in Germany.

• ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html) describes the ratio of patients seen with influenza-like illness and the number of patients. It includes weekly data from the Centers for Disease Control and Prevention of the United States from 2002 to 2021.

B.2. Implementation Details

For existing Transformer-based TSF solutions: the implementations of Autoformer [28], Informer [30], and the vanilla Transformer [26] are all taken from the Autoformer work [28]; the implementations of FEDformer [31] and Pyraformer [18] are from their respective code repositories. We also adopt their default hyper-parameters to train the models. For DLinear, the moving average kernel size for decomposition is 25, the same as in Autoformer. The total parameter count of a vanilla Linear model and of NLinear is TL; the total parameter count of DLinear is 2TL. Since LTSF-Linear will underfit when the input length is short, and LTSF-Transformers tend to overfit on a long look-back window size, to compare the best performance of existing LTSF-Transformers with LTSF-Linear we report L=336 for LTSF-Linear and L=96 for Transformers by default. For more hyper-parameters of LTSF-Linear, please refer to our code.
C. Additional Comparison with Transformers

We further compare LTSF-Linear with LTSF-Transformers for univariate forecasting on the four ETT datasets. Moreover, in Figure 4 of the main paper, we demonstrate with two examples that existing Transformers fail to exploit large look-back window sizes. Here, we give comprehensive comparisons between LTSF-Linear and Transformer-based TSF solutions under various look-back window sizes on all benchmarks.

C.1. Comparison of Univariate Forecasting

We present the univariate forecasting results on the four ETT datasets in Table 9. Similarly, LTSF-Linear, and especially NLinear, consistently outperforms all Transformer-based methods by a large margin most of the time. We find that there are serious distribution shifts between the training and test sets (as shown in Figure 5(a), (b)) on the ETTh1 and ETTh2 datasets. Simple normalization via the last value of the look-back window can greatly relieve this distribution shift problem.

C.2. Comparison under Different Look-back Windows

In Figure 6, we provide the MSE comparisons of five LTSF-Transformers with LTSF-Linear under different look-back window sizes, to explore whether existing Transformers can extract temporal relations well from longer input sequences. For the hourly granularity datasets (ETTh1, ETTh2, Traffic, and Electricity), the increasing look-back window sizes are {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which represent {1, 2, 3, 4, 5, 6, 7, 8, 14, 21, 28, 30} days. The forecasting steps are {24, 720}, which mean {1, 30} days. For the 5-minute granularity datasets (ETTm1 and ETTm2), we set the look-back window size as {24, 36, 48, 60, 72, 144, 288}, which represent {2, 3, 4, 5, 6, 12, 24} hours. For the 10-minute granularity dataset (Weather), we set the look-back window size as {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which mean {4, 8, 12, 16, 20, 24, 28, 32, 56, 84, 112, 120} hours. The forecasting steps are {24, 720}, i.e., {4, 120} hours. For the weekly granularity dataset (ILI), we set the look-back window size as {26, 52, 78, 104, 130, 156, 208}, which represent {0.5, 1, 1.5, 2, 2.5, 3, 4} years. The corresponding forecasting steps are {26, 208}, meaning {0.5, 4} years.

As shown in Figure 6, with increased look-back window sizes, the performance of LTSF-Linear is significantly boosted for most datasets (e.g., ETTm1 and Traffic), while this is not the case for Transformer-based TSF solutions. Most of their performance fluctuates or gets worse as the input length increases. In particular, the results on Exchange-Rate do not improve with a longer look-back window (see Figure 6(m) and (n)), and we attribute this to the low information-to-noise ratio in such financial data.

D. Ablation Study on LTSF-Linear

D.1. Motivation of NLinear

If we normalize the test data by the mean and variance of the training data, there could be a distribution shift in the testing data, i.e., the mean value of the testing data is not 0. If the model makes a prediction that is out of the distribution of the true value, a large error occurs; for example, there is a large error between the true value and the true value minus/plus one. Therefore, in NLinear, we use the subtraction and addition to shift the model prediction toward the distribution of the true value. Then, large errors are avoided, and the model performance can be improved. Figure 5 illustrates histograms of the training-set and test-set distributions, where each bar represents the number of data points. Clear distribution shifts between training and testing data can be observed in ETTh1, ETTh2, and ILI. Accordingly, from Table 9 and Table 2 in the main paper, we can observe great improvements on these three datasets when comparing NLinear to Linear, showing the effectiveness of NLinear in relieving distribution shifts. Moreover, for datasets without obvious distribution shifts, like Electricity in Figure 5(c), using the vanilla Linear can be enough, as demonstrated by its similar performance to NLinear and DLinear.

D.2. The Features of LTSF-Linear

Although LTSF-Linear is simple, it has some compelling characteristics:

• An O(1) maximum signal traversal path length: the shorter the path, the better the dependencies are captured [18], making LTSF-Linear capable of capturing both short-range and long-range temporal relations.

• High efficiency: as LTSF-Linear is a linear model with at most two linear layers, it costs much less memory, has fewer parameters, and has a faster inference speed than existing Transformers (see Table 8 in the main paper).

• Interpretability: after training, we can visualize the weights of the seasonality and trend branches to gain some insights into the predicted values [9].

• Ease of use: LTSF-Linear can be obtained easily without tuning model hyper-parameters.

D.3. Interpretability of LTSF-Linear

Because LTSF-Linear is a set of linear models, the weights of the linear layers can directly reveal how LTSF-Linear works.
Methods: Linear | NLinear | DLinear | FEDformer-f | FEDformer-w | Autoformer | Informer | LogTrans
Metric: each method column reports MSE followed by MAE; each row starts with the forecasting horizon T.

ETTh1:
96 0.189 0.359 0.053 0.177 0.056 0.180 0.079 0.215 0.080 0.214 0.071 0.206 0.193 0.377 0.283 0.468
192 0.078 0.212 0.069 0.204 0.071 0.204 0.104 0.245 0.105 0.256 0.114 0.262 0.217 0.395 0.234 0.409
336 0.091 0.237 0.081 0.226 0.098 0.244 0.119 0.270 0.120 0.269 0.107 0.258 0.202 0.381 0.386 0.546
720 0.172 0.340 0.080 0.226 0.189 0.359 0.142 0.299 0.127 0.280 0.126 0.283 0.183 0.355 0.475 0.629
ETTh2:
96 0.133 0.283 0.129 0.278 0.131 0.279 0.128 0.271 0.156 0.306 0.153 0.306 0.213 0.373 0.217 0.379
192 0.176 0.330 0.169 0.324 0.176 0.329 0.185 0.330 0.238 0.380 0.204 0.351 0.227 0.387 0.281 0.429
336 0.213 0.371 0.194 0.355 0.209 0.367 0.231 0.378 0.271 0.412 0.246 0.389 0.242 0.401 0.293 0.437
720 0.292 0.440 0.225 0.381 0.276 0.426 0.278 0.420 0.288 0.438 0.268 0.409 0.291 0.439 0.218 0.387
ETTm1:
96 0.028 0.125 0.026 0.122 0.028 0.123 0.033 0.140 0.036 0.149 0.056 0.183 0.109 0.277 0.049 0.171
192 0.043 0.154 0.039 0.149 0.045 0.156 0.058 0.186 0.069 0.206 0.081 0.216 0.151 0.310 0.157 0.317
336 0.059 0.180 0.052 0.172 0.061 0.182 0.084 0.231 0.071 0.209 0.076 0.218 0.427 0.591 0.289 0.459
720 0.080 0.211 0.073 0.207 0.080 0.210 0.102 0.250 0.105 0.248 0.110 0.267 0.438 0.586 0.430 0.579
ETTm2:
96 0.066 0.189 0.063 0.182 0.063 0.183 0.067 0.198 0.063 0.189 0.065 0.189 0.088 0.225 0.075 0.208
192 0.094 0.230 0.090 0.223 0.092 0.227 0.102 0.245 0.110 0.252 0.118 0.256 0.132 0.283 0.129 0.275
336 0.120 0.263 0.117 0.259 0.119 0.261 0.130 0.279 0.147 0.301 0.154 0.305 0.180 0.336 0.154 0.302
720 0.175 0.320 0.170 0.318 0.175 0.320 0.178 0.325 0.219 0.368 0.182 0.335 0.300 0.435 0.160 0.321

Table 9. Univariate long-term time series forecasting results on the full ETT benchmark. In the original paper, the best results are highlighted in bold and the best results among Transformers are underlined.

Figure 5. Distribution of the ETTh1, ETTh2, Electricity, and ILI datasets: (a) ETTh1 channel 6, (b) ETTh2 channel 3, (c) Electricity channel 3, (d) ILI channel 6. A clear distribution shift between training and testing data can be observed in ETTh1, ETTh2, and ILI.
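A sketch of the train/test histogram comparison shown in Figure 5; the channel index, the 0.7 train split, and the synthetic data are assumptions for illustration.

```python
# Sketch of the distribution-shift inspection behind Figure 5.
import numpy as np
import matplotlib.pyplot as plt

def plot_shift(series, channel, train_ratio=0.7):
    """Overlay value histograms of one channel for the train and test portions."""
    split = int(len(series) * train_ratio)
    train, test = series[:split, channel], series[split:, channel]
    plt.hist(train, bins=50, alpha=0.5, label="train")
    plt.hist(test, bins=50, alpha=0.5, label="test")
    plt.legend()
    plt.show()

# Example with synthetic data that drifts over time (mimicking ETTh1/ETTh2/ILI).
t = np.arange(10000)
data = np.stack([np.sin(t / 100) + t / 5000, np.cos(t / 100)], axis=1)
plot_shift(data, channel=0)
```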

The weight visualization of LTSF-Linear can also reveal certain characteristics of the data used for forecasting. Here we take DLinear as an example. Accordingly, we visualize the trend and remainder weights on all datasets with a fixed input length of 96 and four different forecasting horizons. To obtain smooth weights with a clear pattern in the visualization, we initialize the weights of the linear layers in DLinear as 1/L rather than using random initialization; that is, we use the same weight for every forecasting time step of the look-back window at the start of training.

How the model works: Figure 7(c) visualizes the weights of the trend and remainder layers on the Exchange-Rate dataset. Due to the lack of periodicity and seasonality in financial data, it is hard to observe clear patterns, but the trend layer reveals greater weights for information closer to the outputs, representing their larger contributions to the predicted values.

Periodicity of data: for the Traffic data, as shown in Figure 7(d), the model gives high weights to the latest time step of the look-back window for the 0, 23, 47, ..., 719 forecasting steps. Among these forecasting time steps, the 0, 167, 335, 503, and 671 time steps have higher weights. Note that 24 time steps are a day and 168 time steps are a week. This indicates that Traffic has both a daily periodicity and a weekly periodicity.
[Figure 6: sixteen line plots of MSE (Y-axis) versus look-back window size (X-axis), comparing Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear. Panels: (a) 24 steps-ETTh1, (b) 720 steps-ETTh1, (c) 24 steps-ETTh2, (d) 720 steps-ETTh2, (e) 24 steps-ETTm1, (f) 576 steps-ETTm1, (g) 24 steps-ETTm2, (h) 576 steps-ETTm2, (i) 24 steps-Weather, (j) 720 steps-Weather, (k) 24 steps-Traffic, (l) 720 steps-Traffic, (m) 24 steps-Exchange, (n) 720 steps-Exchange, (o) 24 steps-ILI, (p) 60 steps-ILI.]
Figure 6. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (e.g., 720 time steps) and short-term forecasting (e.g., 24 time steps) on different benchmarks.
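For readers who want to reproduce the shape of such a sweep without the full benchmark pipeline, the sketch below varies the look-back window size L for a fixed horizon T and fits a single linear map by least squares on a synthetic hourly-like series. Dataset loading, normalization, and the Transformer baselines from Figure 6 are intentionally omitted; all names and values are illustrative, and the closed-form fit merely stands in for gradient-based training of the Linear model.

```python
# Illustrative look-back-window sweep (not the paper's pipeline).
import numpy as np

def make_windows(series, L, T):
    """Slice a 1-D series into (look-back, horizon) training pairs."""
    n = len(series) - L - T + 1
    X = np.stack([series[i:i + L] for i in range(n)])
    Y = np.stack([series[i + L:i + L + T] for i in range(n)])
    return X, Y

rng = np.random.default_rng(0)
series = np.sin(2 * np.pi * np.arange(4000) / 24) + 0.1 * rng.standard_normal(4000)
train, test = series[:3000], series[3000:]

T = 24
for L in (24, 48, 96, 192, 336):
    Xtr, Ytr = make_windows(train, L, T)
    Xte, Yte = make_windows(test, L, T)
    W, *_ = np.linalg.lstsq(Xtr, Ytr, rcond=None)   # (L, T) weight matrix
    mse = np.mean((Xte @ W - Yte) ** 2)
    print(f"look-back {L:4d} -> test MSE {mse:.4f}")
```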

[Figure 7: heatmaps of the remainder and trend layer weights of LTSF-Linear for each dataset and forecasting setting. Rows: ETTh1 (In-96, Out-96/168/336/720), Electricity (In-96, Out-96/192/336/720), Exchange-Rate (In-96, Out-96/192/336/720), Traffic (In-96, Out-96/192/336/720), Weather (In-96, Out-96/192/336/720), and ILI (In-36, Out-24/36/48/60); each setting shows a Remainder panel and a Trend panel.]
Figure 7. Visualization of the weights (T×L) of LTSF-Linear on several benchmarks. Models are trained with a look-back window of size L (X-axis) and different numbers of forecasting time steps T (Y-axis). We show the weights of the remainder and trend layers.
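A heatmap like those in Figure 7 can be produced from any trained LTSF-Linear checkpoint, since the weight of a linear layer mapping L inputs to T outputs is exactly the T×L matrix being visualized. The sketch below is illustrative only: it plots an untrained nn.Linear as a stand-in for the trained trend (or remainder) layer, so that the snippet runs on its own.

```python
# Illustrative Figure-7-style weight heatmap; `layer` stands in for a trained
# trend/remainder layer of an LTSF-Linear model.
import matplotlib.pyplot as plt
import torch

L, T = 96, 720
layer = torch.nn.Linear(L, T)                    # weight shape: (T, L)
weights = layer.weight.detach().cpu().numpy()    # rows = forecasting steps

plt.figure(figsize=(6, 4))
plt.imshow(weights, aspect="auto")
plt.xlabel("position in look-back window (L)")
plt.ylabel("forecasting time step (T)")
plt.colorbar(label="weight value")
plt.title("Linear-layer weights (T x L), illustrative")
plt.tight_layout()
plt.savefig("weights_heatmap.png", dpi=150)
```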

References

[1] Adebiyi A. Ariyo, Adewumi O. Adewumi, and Charles K. Ayo. Stock price prediction using the ARIMA model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pages 106–112. IEEE, 2014.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv: Computation and Language, 2014.

[3] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[4] Guillaume Chevillon. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007.

[5] Razvan-Gabriel Cirstea, Chenjuan Guo, Bin Yang, Tung Kieu, Xuanyi Dong, and Shirui Pan. Triformer: Triangular, variable-specific attentions for long sequence multivariate time series forecasting–full version. arXiv preprint arXiv:2204.13767, 2022.

[6] R. B. Cleveland. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 1990.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[8] Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.

[9] Ruijun Dong and Witold Pedrycz. A granular time series approach to long-term forecasting and trend forecasting. Physica A: Statistical Mechanics and its Applications, 387(13):3253–3270, 2008.

[10] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, and Lars Schmidt-Thieme. Do we really need deep learning models for time series forecasting? arXiv preprint arXiv:2101.02118, 2021.

[11] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[12] Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985.

[13] James Douglas Hamilton. Time Series Analysis. Princeton University Press, 2020.

[14] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. 1990.

[15] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

[16] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.

[17] Minhao Liu, Ailing Zeng, Zhijian Xu, Qiuxia Lai, and Qiang Xu. Time series is a special sequence: Forecasting with sample convolution and interaction. arXiv preprint arXiv:2106.09305, 2021.

[18] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.

[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[20] Minhao Liu, Ailing Zeng, Qiuxia Lai, Ruiyuan Gao, Min Li, Jing Qin, and Qiang Xu. T-WaveNet: A tree-structured wavelet neural network for time series signal analysis. In International Conference on Learning Representations, 2021.

[21] Gábor Petneházi. Recurrent neural networks for time series forecasting. arXiv preprint arXiv:1901.00069, 2019.

[22] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2017.

[23] Souhaib Ben Taieb, Rob J. Hyndman, et al. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer, 2012.

[24] Sean J. Taylor and Benjamin Letham. Forecasting at scale. PeerJ Preprints, 2017.

[25] Gerrit J. J. van den Burg and Christopher K. I. Williams. An evaluation of change point detection algorithms. arXiv preprint arXiv:2003.06222, 2020.

[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[27] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.

[28] Jiehui Xu, Jianmin Wang, Mingsheng Long, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 2021.

[29] Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, and Qiang Xu. DeciWatch: A simple baseline for 10x efficient 2D and 3D pose estimation. arXiv preprint arXiv:2203.08713, 2022.

[30] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient Transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021.

[31] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, 2022.
