
Are Transformers Effective for Time Series Forecasting?

Ailing Zeng1*, Muxi Chen1*, Lei Zhang2, Qiang Xu1

1 The Chinese University of Hong Kong
2 International Digital Economy Academy (IDEA)
{alzeng, mxchen21, qxu}@cse.cuhk.edu.hk
{leizhang}@idea.edu.cn

* Equal contribution
arXiv:2205.13504v3 [cs.AI] 17 Aug 2022

Abstract

Recently, there has been a surge of Transformer-based solutions for the long-term time series forecasting (LTSF) task. Despite the growing performance over the past few years, we question the validity of this line of research in this work. Specifically, Transformers are arguably the most successful solution for extracting the semantic correlations among the elements in a long sequence. However, in time series modeling, we are to extract the temporal relations in an ordered set of continuous points. While employing positional encoding and using tokens to embed sub-series in Transformers facilitates preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. To validate our claim, we introduce a set of embarrassingly simple one-layer linear models named LTSF-Linear for comparison. Experimental results on nine real-life datasets show that LTSF-Linear surprisingly outperforms existing sophisticated Transformer-based LTSF models in all cases, and often by a large margin. Moreover, we conduct comprehensive empirical studies to explore the impacts of various design elements of LTSF models on their temporal relation extraction capability. We hope this surprising finding opens up new research directions for the LTSF task. We also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks (e.g., anomaly detection) in the future. Code is available at: https://github.com/cure-lab/LTSF-Linear.

1. Introduction

Time series are ubiquitous in today's data-driven world. Given historical data, time series forecasting (TSF) is a long-standing task with a wide range of applications, including but not limited to traffic flow estimation, energy management, and financial investment. Over the past several decades, TSF solutions have undergone a progression from traditional statistical methods (e.g., ARIMA [1]) and machine learning techniques (e.g., GBRT [11]) to deep learning-based solutions, e.g., Recurrent Neural Networks [15] and Temporal Convolutional Networks [3, 17].

Transformer [26] is arguably the most successful sequence modeling architecture, demonstrating unparalleled performance in various applications, such as natural language processing (NLP) [7], speech recognition [8], and computer vision [19, 29]. Recently, there has also been a surge of Transformer-based solutions for time series analysis, as surveyed in [27]. The most notable models, which focus on the less explored and challenging long-term time series forecasting (LTSF) problem, include LogTrans [16] (NeurIPS 2019), Informer [30] (AAAI 2021 Best Paper), Autoformer [28] (NeurIPS 2021), Pyraformer [18] (ICLR 2022 Oral), Triformer [5] (IJCAI 2022), and the recent FEDformer [31] (ICML 2022).

The main working power of Transformers comes from the multi-head self-attention mechanism, which has a remarkable capability of extracting semantic correlations among elements in a long sequence (e.g., words in texts or 2D patches in images). However, self-attention is permutation-invariant and "anti-order" to some extent. While using various types of positional encoding techniques can preserve some ordering information, it is still inevitable to have temporal information loss after applying self-attention on top of them. This is usually not a serious concern for semantic-rich applications such as NLP, e.g., the semantic meaning of a sentence is largely preserved even if we reorder some words in it. However, when analyzing time series data, there is usually a lack of semantics in the numerical data itself, and we are mainly interested in modeling the temporal changes among a continuous set of points. That is, the order itself plays the most crucial role. Consequently, we pose the following intriguing question: Are Transformers really effective for long-term time series forecasting?

Moreover, while existing Transformer-based LTSF solutions have demonstrated considerable prediction accuracy improvements over traditional methods, in their experiments all the compared (non-Transformer) baselines perform autoregressive or iterated multi-step (IMS) forecasting [1, 2, 22, 24], which is known to suffer from significant error accumulation effects for the LTSF problem. Therefore, in this work, we challenge Transformer-based LTSF solutions with direct multi-step (DMS) forecasting strategies to validate their real performance.

Not all time series are predictable, let alone with long-term forecasting (e.g., for chaotic systems). We hypothesize that long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (and are sometimes even increased) when the look-back window size grows. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.

To sum up, the contributions of this work include:

• To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.

• To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can be a new baseline for the LTSF problem.

• We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.

With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves better prediction accuracy compared to existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks in the future.

2. Preliminaries: TSF Problem Formulation

For time series containing C variates, given historical data X = {X_1^t, ..., X_C^t}_{t=1}^{L}, wherein L is the look-back window size and X_i^t is the value of the i-th variate at the t-th time step, the time series forecasting task is to predict the values X̂ = {X̂_1^t, ..., X̂_C^t}_{t=L+1}^{L+T} at the T future time steps. When T > 1, iterated multi-step (IMS) forecasting [23] learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting [4] directly optimizes the multi-step forecasting objective at once.

Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when there is a highly accurate single-step forecaster and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an unbiased single-step forecasting model, or when T is large.
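To make the IMS/DMS distinction above concrete, the following is a minimal sketch of the two prediction loops. The stand-in models (one_step_model, dms_model) are hypothetical placeholders for illustration, not the forecasters evaluated in this paper.

```python
# Sketch of the two forecasting strategies from Section 2 (illustrative only).
import numpy as np

def ims_forecast(history, one_step_model, T):
    """Iterated multi-step: apply a single-step forecaster T times,
    feeding each prediction back as input (errors can accumulate)."""
    window = list(history)
    preds = []
    for _ in range(T):
        y_next = one_step_model(np.array(window))   # predict one step ahead
        preds.append(y_next)
        window = window[1:] + [y_next]              # roll the look-back window
    return np.array(preds)

def dms_forecast(history, dms_model, T):
    """Direct multi-step: one call maps the whole look-back window
    to all T future values at once."""
    return dms_model(np.array(history), T)

# Toy stand-ins: persistence for the one-step model, repeat-last for the DMS model.
one_step_model = lambda w: w[-1]
dms_model = lambda w, T: np.full(T, w[-1])

history = np.sin(np.linspace(0, 10, 96))
print(ims_forecast(history, one_step_model, 5))
print(dms_forecast(history, dms_model, 5))
```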
3. Transformer-Based LTSF Solutions

Transformer-based models [26] have achieved unparalleled performance in many long-standing AI tasks in the natural language processing and computer vision fields, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered a great deal of research interest in Transformer-based time series modeling techniques [20, 27]. In particular, a large number of research works are dedicated to the LTSF task (e.g., [16, 18, 28, 30, 31]). Considering the ability of Transformer models to capture long-range dependencies, most of them focus on the less-explored long-term forecasting problem (T ≫ 1). (Due to the page limit, we leave the discussion of non-Transformer forecasting solutions to the Appendix.)

When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and error accumulation caused by the autoregressive decoder design. Informer [30] addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later, more Transformer variants introduce various time series features into their models for performance or efficiency improvements [18, 28, 31]. We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).

Figure 1. The pipeline of existing Transformer-based TSF solutions. In (a) and (b), the solid boxes are essential operations, and the dotted boxes are applied optionally; (c) and (d) are distinct for different methods [16, 18, 28, 30, 31]. (a) Preprocessing: normalization; seasonal-trend decomposition. (b) Embedding: channel projection; fixed position; local timestamp; global timestamp. (c) Encoder: LogSparse and convolutional self-attention (LogTrans); ProbSparse and distilling self-attention (Informer); series auto-correlation with decomposition (Autoformer); multi-resolution pyramidal attention (Pyraformer); frequency enhanced block with decomposition (FEDformer). (d) Decoder: IMS (LogTrans); DMS (Informer); DMS with auto-correlation and decomposition (Autoformer); DMS along the spatio-temporal dimension (Pyraformer); DMS with frequency attention and decomposition (FEDformer).
Time series decomposition: For data preprocessing, normalization with zero mean is common in TSF. Besides, Autoformer [28] first applies seasonal-trend decomposition behind each neural block, which is a standard method in time series analysis to make raw data more predictable [6, 13]. Specifically, they use a moving average kernel on the input sequence to extract the trend-cyclical component of the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer [31] further proposes a mixture-of-experts strategy to mix the trend components extracted by moving average kernels with various kernel sizes.

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative [30]. To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings into the input sequence, such as a fixed positional encoding, a channel projection embedding, and learnable temporal embeddings. Moreover, temporal embeddings with a temporal convolution layer [16] or learnable timestamps [28] are introduced.

Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the O(L^2) time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a LogSparse mask to reduce the computational complexity to O(L log L), while Pyraformer adopts pyramidal attention that captures hierarchically multi-scale temporal dependencies with O(L) time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property of the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to O(L log L), and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain O(L) complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.

Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in slow inference speed and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating the spatio-temporal axes as the decoder. Autoformer sums up two refined decomposed features, from the trend-cyclical components and the stacked auto-correlation mechanism for the seasonal components, to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.

The premise of Transformer models is the semantic correlations between paired elements, while the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on the positional encodings associated with the input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements, instead of the paired relationships, plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitate preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.
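The permutation-invariance argument above can be checked numerically: a toy single-head self-attention layer (without positional encoding) maps a shuffled input to the correspondingly shuffled output, so its output carries no information about the original temporal order. This is an illustrative sketch, not code from any of the discussed models.

```python
# Toy single-head self-attention in NumPy to illustrate permutation invariance.
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 4                       # sequence length, model dimension
X = rng.normal(size=(L, d))       # a "time series" of L tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

perm = rng.permutation(L)         # shuffle the temporal order
out_ori = self_attention(X)
out_shuf = self_attention(X[perm])

# Without positional encoding, shuffling the inputs merely shuffles the outputs:
# the set of outputs is identical, so the temporal order is not used at all.
print(np.allclose(out_shuf, out_ori[perm]))   # True
```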
4. An Embarrassingly Simple Baseline

In the experiments of existing Transformer-based LTSF solutions (T ≫ 1), all the compared (non-Transformer) baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements reported in these works are largely due to the DMS strategy used in them.

Figure 2. Illustration of the basic linear model: one linear layer maps the history (L time steps) directly to the forecasting output (the future T time steps).

To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is X̂_i = W X_i, where W ∈ R^{T×L} is a linear layer along the temporal axis, and X̂_i and X_i are the prediction and input for the i-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.

LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy), we further introduce two variants with two preprocessing methods, named DLinear and NLinear.

• Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes the raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied to each component, and we sum up the two features to get the final prediction. By explicitly handling trend, DLinear enhances the performance of a vanilla Linear when there is a clear trend in the data.

• Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the last value of the sequence from the input. Then, the input goes through a linear layer, and the subtracted part is added back before making the final prediction. The subtraction and addition in NLinear are a simple normalization of the input sequence.
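The three LTSF-Linear variants described above are small enough to sketch directly. The following is a minimal PyTorch-style sketch under our own assumptions (e.g., replicate padding for the moving average); it is not the authors' released implementation.

```python
# A minimal sketch of the three LTSF-Linear variants (assumptions noted in comments).
import torch
import torch.nn as nn

class Linear(nn.Module):
    """Vanilla Linear: X_hat = W X along the temporal axis, W in R^{T x L},
    shared across all C variates."""
    def __init__(self, L, T):
        super().__init__()
        self.proj = nn.Linear(L, T)

    def forward(self, x):                     # x: (batch, L, C)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)   # (batch, T, C)

class NLinear(Linear):
    """Subtract the last value of the look-back window, apply Linear, add it back."""
    def forward(self, x):
        last = x[:, -1:, :]                   # (batch, 1, C)
        return super().forward(x - last) + last

class DLinear(nn.Module):
    """Decompose the input into trend (moving average) + remainder, apply one
    Linear to each component, and sum the two predictions."""
    def __init__(self, L, T, kernel=25):      # kernel size 25 follows the text
        super().__init__()
        self.kernel = kernel
        self.trend, self.seasonal = Linear(L, T), Linear(L, T)

    def forward(self, x):                     # x: (batch, L, C)
        pad = (self.kernel - 1) // 2
        xt = x.transpose(1, 2)                # (batch, C, L)
        # Replicate padding at both ends is our assumption for the moving average.
        xt = nn.functional.pad(xt, (pad, self.kernel - 1 - pad), mode="replicate")
        trend = nn.functional.avg_pool1d(xt, self.kernel, stride=1).transpose(1, 2)
        return self.trend(trend) + self.seasonal(x - trend)

x = torch.randn(8, 336, 7)                    # batch of 8, L=336, C=7 (e.g., ETTh1)
print(DLinear(336, 96)(x).shape)              # torch.Size([8, 96, 7])
```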
5. Experiments

5.1. Experimental Settings

Dataset. We conduct extensive experiments on nine widely-used real-world datasets, including ETT (Electricity Transformer Temperature) [30] (ETTh1, ETTh2, ETTm1, ETTm2), Traffic, Electricity, Weather, ILI, and Exchange-Rate [15]. All of them are multivariate time series. We leave the data descriptions to the Appendix.

Evaluation metric. Following previous works [28, 30, 31], we use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the core metrics to compare performance.

Compared methods. We include five recent Transformer-based methods: FEDformer [31], Autoformer [28], Informer [30], Pyraformer [18], and LogTrans [16]. Besides, we include a naive DMS method, Closest Repeat (Repeat), which repeats the last value in the look-back window, as another simple baseline. Since there are two variants of FEDformer, we compare against the one with better accuracy (FEDformer-f, via Fourier transform).

Datasets | ETTh1&ETTh2 | ETTm1&ETTm2 | Traffic | Electricity | Exchange-Rate | Weather | ILI
Variates | 7 | 7 | 862 | 321 | 8 | 21 | 7
Timesteps | 17,420 | 69,680 | 17,544 | 26,304 | 7,588 | 52,696 | 966
Granularity | 1 hour | 5 min | 1 hour | 1 hour | 1 day | 10 min | 1 week

Table 1. The statistics of the nine popular datasets for the LTSF problem.
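For reference, the naive Repeat baseline and the two reported metrics can be written in a few lines; the array shapes (windows, horizon, variates) are an assumption for illustration.

```python
# Sketch of the Repeat baseline and the MSE/MAE metrics used for comparison.
import numpy as np

def repeat_forecast(history, T):
    """Repeat: copy the last value of the look-back window T times."""
    return np.repeat(history[:, -1:, :], T, axis=1)   # history: (N, L, C)

def mse(pred, true):
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    return np.mean(np.abs(pred - true))

history = np.random.randn(32, 96, 7)     # N=32 windows, L=96, C=7
true = np.random.randn(32, 720, 7)       # T=720 targets
pred = repeat_forecast(history, 720)
print(mse(pred, true), mae(pred, true))
```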
5.2. Comparison with Transformers

Quantitative results. In Table 2, we extensively evaluate all the mentioned Transformers on nine benchmarks, following the experimental setting of previous work [28, 30, 31]. Surprisingly, the performance of LTSF-Linear surpasses the SOTA FEDformer in most cases by 20% ∼ 50% improvements on multivariate forecasting, even though LTSF-Linear does not model correlations among variates. For different time series benchmarks, NLinear and DLinear show their superiority in handling distribution shift and trend-seasonality features. We also provide results for univariate forecasting on the ETT datasets in the Appendix, where LTSF-Linear still consistently outperforms Transformer-based LTSF solutions by a large margin.

Table 2 (data rows follow, grouped by dataset). Column order per row: forecasting horizon T, IMP., then the MSE and MAE of each of Linear*, NLinear*, DLinear*, FEDformer, Autoformer, Informer, Pyraformer*, LogTrans, and Repeat*.

Electricity:
96 27.40% 0.140 0.237 0.141 0.237 0.140 0.237 0.193 0.308 0.201 0.317 0.274 0.368 0.386 0.449 0.258 0.357 1.588 0.946
192 23.88% 0.153 0.250 0.154 0.248 0.153 0.249 0.201 0.315 0.222 0.334 0.296 0.386 0.386 0.443 0.266 0.368 1.595 0.950
336 21.02% 0.169 0.268 0.171 0.265 0.169 0.267 0.214 0.329 0.231 0.338 0.300 0.394 0.378 0.443 0.280 0.380 1.617 0.961
720 17.47% 0.203 0.301 0.210 0.297 0.203 0.301 0.246 0.355 0.254 0.361 0.373 0.439 0.376 0.445 0.283 0.376 1.647 0.975
Exchange-Rate:
96 45.27% 0.082 0.207 0.089 0.208 0.081 0.203 0.148 0.278 0.197 0.323 0.847 0.752 0.376 1.105 0.968 0.812 0.081 0.196
192 42.06% 0.167 0.304 0.180 0.300 0.157 0.293 0.271 0.380 0.300 0.369 1.204 0.895 1.748 1.151 1.040 0.851 0.167 0.289
336 33.69% 0.328 0.432 0.331 0.415 0.305 0.414 0.460 0.500 0.509 0.524 1.672 1.036 1.874 1.172 1.659 1.081 0.305 0.396
720 46.19% 0.964 0.750 1.033 0.780 0.643 0.601 1.195 0.841 1.447 0.941 2.478 1.310 1.943 1.206 1.941 1.127 0.823 0.681
Traffic:
96 30.15% 0.410 0.282 0.410 0.279 0.410 0.282 0.587 0.366 0.613 0.388 0.719 0.391 2.085 0.468 0.684 0.384 2.723 1.079
192 29.96% 0.423 0.287 0.423 0.284 0.423 0.287 0.604 0.373 0.616 0.382 0.696 0.379 0.867 0.467 0.685 0.390 2.756 1.087
336 29.95% 0.436 0.295 0.435 0.290 0.436 0.296 0.621 0.383 0.622 0.337 0.777 0.420 0.869 0.469 0.734 0.408 2.791 1.095
720 25.87% 0.466 0.315 0.464 0.307 0.466 0.315 0.626 0.382 0.660 0.408 0.864 0.472 0.881 0.473 0.717 0.396 2.811 1.097
Weather:
96 18.89% 0.176 0.236 0.182 0.232 0.176 0.237 0.217 0.296 0.266 0.336 0.300 0.384 0.896 0.556 0.458 0.490 0.259 0.254
192 21.01% 0.218 0.276 0.225 0.269 0.220 0.282 0.276 0.336 0.307 0.367 0.598 0.544 0.622 0.624 0.658 0.589 0.309 0.292
336 22.71% 0.262 0.312 0.271 0.301 0.265 0.319 0.339 0.380 0.359 0.395 0.578 0.523 0.739 0.753 0.797 0.652 0.377 0.338
720 19.85% 0.326 0.365 0.338 0.348 0.323 0.362 0.403 0.428 0.419 0.428 1.059 0.741 1.004 0.934 0.869 0.675 0.465 0.394
ILI:
24 47.86% 1.947 0.985 1.683 0.858 2.215 1.081 3.228 1.260 3.483 1.287 5.764 1.677 1.420 2.012 4.480 1.444 6.587 1.701
36 36.43% 2.182 1.036 1.703 0.859 1.963 0.963 2.679 1.080 3.103 1.148 4.755 1.467 7.394 2.031 4.799 1.467 7.130 1.884
48 34.43% 2.256 1.060 1.719 0.884 2.130 1.024 2.622 1.078 2.669 1.085 4.763 1.469 7.551 2.057 4.800 1.468 6.575 1.798
60 34.33% 2.390 1.104 1.819 0.917 2.368 1.096 2.857 1.157 2.770 1.125 5.264 1.564 7.662 2.100 5.278 1.560 5.893 1.677
ETTh1:
96 0.80% 0.375 0.397 0.374 0.394 0.375 0.399 0.376 0.419 0.449 0.459 0.865 0.713 0.664 0.612 0.878 0.740 1.295 0.713
192 3.57% 0.418 0.429 0.408 0.415 0.405 0.416 0.420 0.448 0.500 0.482 1.008 0.792 0.790 0.681 1.037 0.824 1.325 0.733
336 6.54% 0.479 0.476 0.429 0.427 0.439 0.443 0.459 0.465 0.521 0.496 1.107 0.809 0.891 0.738 1.238 0.932 1.323 0.744
720 13.04% 0.624 0.592 0.440 0.453 0.472 0.490 0.506 0.507 0.514 0.512 1.181 0.865 0.963 0.782 1.135 0.852 1.339 0.756
ETTh2:
96 19.94% 0.288 0.352 0.277 0.338 0.289 0.353 0.346 0.388 0.358 0.397 3.755 1.525 0.645 0.597 2.116 1.197 0.432 0.422
192 19.81% 0.377 0.413 0.344 0.381 0.383 0.418 0.429 0.439 0.456 0.452 5.602 1.931 0.788 0.683 4.315 1.635 0.534 0.473
336 25.93% 0.452 0.461 0.357 0.400 0.448 0.465 0.496 0.487 0.482 0.486 4.721 1.835 0.907 0.747 1.124 1.604 0.591 0.508
720 14.25% 0.698 0.595 0.394 0.436 0.605 0.551 0.463 0.474 0.515 0.511 3.647 1.625 0.963 0.783 3.188 1.540 0.588 0.517
ETTm1:
96 21.10% 0.308 0.352 0.306 0.348 0.299 0.343 0.379 0.419 0.505 0.475 0.672 0.571 0.543 0.510 0.600 0.546 1.214 0.665
192 21.36% 0.340 0.369 0.349 0.375 0.335 0.365 0.426 0.441 0.553 0.496 0.795 0.669 0.557 0.537 0.837 0.700 1.261 0.690
336 17.07% 0.376 0.393 0.375 0.388 0.369 0.386 0.445 0.459 0.621 0.537 1.212 0.871 0.754 0.655 1.124 0.832 1.283 0.707
720 21.73% 0.440 0.435 0.433 0.422 0.425 0.421 0.543 0.490 0.671 0.561 1.166 0.823 0.908 0.724 1.153 0.820 1.319 0.729
ETTm2:
96 17.73% 0.168 0.262 0.167 0.255 0.167 0.260 0.203 0.287 0.255 0.339 0.365 0.453 0.435 0.507 0.768 0.642 0.266 0.328
192 17.84% 0.232 0.308 0.221 0.293 0.224 0.303 0.269 0.328 0.281 0.340 0.533 0.563 0.730 0.673 0.989 0.757 0.340 0.371
336 15.69% 0.320 0.373 0.274 0.327 0.281 0.342 0.325 0.366 0.339 0.372 1.363 0.887 1.201 0.845 1.334 0.872 0.412 0.410
720 12.58% 0.413 0.435 0.368 0.384 0.397 0.421 0.421 0.415 0.433 0.432 3.379 1.338 3.625 1.451 3.048 1.328 0.521 0.465
- Methods marked with * are implemented by us; the other results are taken from FEDformer [31].

Table 2. Multivariate long-term forecasting errors in terms of MSE and MAE; the lower, the better. The ILI dataset uses forecasting horizons T ∈ {24, 36, 48, 60}; the others use T ∈ {96, 192, 336, 720}. Repeat repeats the last value in the look-back window. In the original paper, the best results are highlighted in bold and the best results among Transformers are underlined. IMP. is the improvement of the best linear model over the best Transformer-based result.

FEDformer achieves competitive forecasting accuracy on ETTh1. This is because FEDformer employs classical time series analysis techniques such as frequency processing, which bring in a time series inductive bias and benefit the ability of temporal feature extraction. In summary, these results reveal that existing complex Transformer-based LTSF solutions are seemingly not effective on the existing nine benchmarks, while LTSF-Linear can be a powerful baseline.

Another interesting observation is that even though the naive Repeat method shows worse results when predicting long-term seasonal data (e.g., Electricity and Traffic), it surprisingly outperforms all Transformer-based methods on Exchange-Rate (by around 45%). This is mainly caused by the wrong prediction of trends in Transformer-based solutions, which may overfit to sudden change noises in the training data, resulting in significant accuracy degradation (see Figure 3(b)). Instead, Repeat does not have this bias.

Qualitative results. As shown in Figure 3, we plot the prediction results of Transformer-based solutions and LTSF-Linear on three selected time series: Electricity (Sequence 1951, Variate 36), Exchange-Rate (Sequence 676, Variate 3), and ETTh2 (Sequence 1241, Variate 2), which have different temporal patterns. When the input length is 96 steps and the output horizon is 336 steps, Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.

5.3. More Analyses on LTSF-Transformers

Can existing LTSF-Transformers extract temporal relations well from longer input sequences? The size of the look-back window greatly impacts forecasting accuracy, as it determines how much we can learn from the historical data. Generally speaking, a powerful TSF model with a strong temporal relation extraction capability should be able to achieve better results with larger look-back window sizes. To study the impact of input look-back window sizes, we conduct experiments with L ∈ {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720} for long-term forecasting (T=720). Figure 4 shows the MSE results on two datasets. Similar to the observations from previous studies [27, 30], the performance of existing Transformer-based models deteriorates or stays stable when the look-back window size increases. In contrast, the performance of all LTSF-Linear models is significantly boosted as the look-back window size grows. Thus, existing solutions tend to overfit temporal noises instead of extracting temporal information when given a longer sequence, and the input size 96 is exactly suitable for most Transformers.
Figure 3. Illustration of the long-term forecasting output (Y-axis) of five models with an input length L=96 and output length T=192 (X-axis) on (a) Electricity, (b) Exchange-Rate, and (c) ETTh2, respectively. The plotted curves are the ground truth, Autoformer, Informer, FEDformer, and DLinear.

Additionally, we provide more quantitative results in the Appendix, and our conclusion holds in almost all cases.

Figure 4. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (T=720) on (a) the Traffic and (b) the Electricity datasets. The compared models are the vanilla Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear.

What can be learned for long-term forecasting? While the temporal dynamics in the look-back window significantly impact the forecasting accuracy of short-term time series forecasting, we hypothesize that long-term forecasting depends only on whether models can capture the trend and periodicity well. That is, the farther the forecasting horizon, the less impact the look-back window itself has.

To validate the above hypothesis, in Table 3 we compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close), and (ii) the far input L=96 setting (called Far) that lies before the original 96 time steps. From the experimental results, the performance of the SOTA Transformers drops only slightly, indicating that these models capture only similar temporal information from adjacent time series sequences. Capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters, i.e., one parameter can represent the periodicity. Using too many parameters will even cause overfitting, which partially explains why LTSF-Linear performs better than Transformer-based methods.

Dataset | FEDformer (Close) | FEDformer (Far) | Autoformer (Close) | Autoformer (Far)
Electricity | 0.251 | 0.265 | 0.255 | 0.287
Traffic | 0.631 | 0.645 | 0.677 | 0.675

Table 3. Comparison of different input sequences under the MSE metric to explore what LTSF-Transformers depend on. If the input is Close, we use the 96th, ..., 191st time steps as the input sequence. If the input is Far, we use the 0th, ..., 95th time steps. Both of them forecast the 192nd, ..., (192+720)th time steps.

Are the self-attention schemes effective for LTSF? We verify whether the complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer into Linear. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights change dynamically. Furthermore, we discard the other auxiliary designs (e.g., FFN) in Informer to leave only the embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. Surprisingly, the performance of Informer grows with this gradual simplification, indicating that the self-attention scheme and other complex modules are unnecessary, at least for the existing LTSF benchmarks.

Dataset | T | Informer | Att.-Linear | Embed + Linear | Linear
Exchange | 96 | 0.847 | 1.003 | 0.173 | 0.084
Exchange | 192 | 1.204 | 0.979 | 0.443 | 0.155
Exchange | 336 | 1.672 | 1.498 | 1.288 | 0.301
Exchange | 720 | 2.478 | 2.102 | 2.026 | 0.763
ETTh1 | 96 | 0.865 | 0.613 | 0.454 | 0.400
ETTh1 | 192 | 1.008 | 0.759 | 0.686 | 0.438
ETTh1 | 336 | 1.107 | 0.921 | 0.821 | 0.479
ETTh1 | 720 | 1.181 | 0.902 | 1.051 | 0.515

Table 4. The MSE comparisons of gradually transforming Informer into a Linear, from the left to the right columns. Att.-Linear is a structure that replaces each attention layer with a linear layer. Embed + Linear drops the other designs and only keeps the embedding layers and a linear layer. The look-back window size is 96.
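As a concrete reading of the Close/Far protocol defined in Table 3, the two inputs can be sliced as follows; the index conventions are our assumption based on the caption.

```python
# Sketch of the Close/Far input construction from Table 3 (illustrative only).
import numpy as np

series = np.random.randn(1000, 862)      # a (time, variates) array, e.g., Traffic

far    = series[0:96]                    # "Far":   the 0th ... 95th time steps
close  = series[96:192]                  # "Close": the 96th ... 191st time steps
target = series[192:192 + 720]           # both settings forecast steps 192 onward

print(far.shape, close.shape, target.shape)
```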
Methods: Linear | FEDformer | Autoformer | Informer (each under the three input settings Ori., Shuf., Half-Ex.)
Predict Length | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex. | Ori. Shuf. Half-Ex.

Exchange:
96 0.080 0.133 0.169 0.161 0.160 0.162 0.152 0.158 0.160 0.952 1.004 0.959
192 0.162 0.208 0.243 0.274 0.275 0.275 0.278 0.271 0.277 1.012 1.023 1.014
336 0.286 0.320 0.345 0.439 0.439 0.439 0.435 0.430 0.435 1.177 1.181 1.177
720 0.806 0.819 0.836 1.122 1.122 1.122 1.113 1.113 1.113 1.198 1.210 1.196
Average Drop N/A 27.26% 46.81% N/A -0.09% 0.20% N/A 0.09% 1.12% N/A -0.12% -0.18%
ETTh1:
96 0.395 0.824 0.431 0.376 0.753 0.405 0.455 0.838 0.458 0.974 0.971 0.971
192 0.447 0.824 0.471 0.419 0.730 0.436 0.486 0.774 0.491 1.233 1.232 1.231
336 0.490 0.825 0.505 0.447 0.736 0.453 0.496 0.752 0.497 1.693 1.693 1.691
720 0.520 0.846 0.528 0.468 0.720 0.470 0.525 0.696 0.524 2.720 2.716 2.715
Average Drop N/A 81.06% 4.78% N/A 73.28% 3.44% N/A 56.91% 0.46% N/A 1.98% 0.18%
Table 5. The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the input sequence. Half-Ex. randomly exchanges the first half of the input sequence with the second half. Average Drop is the average performance drop over all forecasting lengths after shuffling. All results are the average test MSE of five runs.
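A sketch of the two shuffling protocols used in Table 5, assuming inputs shaped (L, C); this mirrors the caption's description rather than the authors' exact code.

```python
# Sketch of the Shuf. and Half-Ex. input perturbations from Table 5.
import numpy as np

def shuf(x, rng):
    """Shuf.: randomly shuffle the whole input sequence along the time axis."""
    return x[rng.permutation(len(x))]

def half_ex(x):
    """Half-Ex.: exchange the first half of the input sequence with the second half."""
    half = len(x) // 2
    return np.concatenate([x[half:], x[:half]], axis=0)

rng = np.random.default_rng(0)
x = np.arange(8)[:, None].repeat(2, axis=1)   # a tiny (L=8, C=2) example
print(shuf(x, rng)[:, 0], half_ex(x)[:, 0])
```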

Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-invariant, i.e., regardless of the order. However, in time-series forecasting, the sequence order often plays a crucial role. We argue that, even with positional and temporal embeddings, existing Transformer-based methods still suffer from temporal information loss. In Table 5, we shuffle the raw input before the embedding strategies. Two shuffling strategies are presented: Shuf. randomly shuffles the whole input sequence, and Half-Ex. exchanges the first half of the input sequence with the second half. Interestingly, compared with the original setting (Ori.) on Exchange-Rate, the performance of all Transformer-based methods does not fluctuate even when the input sequence is randomly shuffled. By contrast, the performance of LTSF-Linear is damaged significantly. These results indicate that LTSF-Transformers with different positional and temporal embeddings preserve quite limited temporal relations and are prone to overfitting on noisy financial data, while LTSF-Linear can model the order naturally and avoid overfitting with fewer parameters.

For the ETTh1 dataset, FEDformer and Autoformer introduce a time series inductive bias into their models, enabling them to extract certain temporal information when the dataset has clearer temporal patterns (e.g., periodicity) than Exchange-Rate. Therefore, the average drops of the two Transformers are 73.28% and 56.91% under the Shuf. setting, where the whole order information is lost. Moreover, Informer suffers less from both the Shuf. and Half-Ex. settings since it has no such temporal inductive bias. Overall, the average drops of LTSF-Linear are larger than those of the Transformer-based methods in all cases, indicating that the existing Transformers do not preserve temporal order well.

How effective are different embedding strategies? We study the benefits of the position and timestamp embeddings used in Transformer-based methods. In Table 6, the forecasting errors of Informer largely increase without positional embeddings (wo/Pos.). Without timestamp embeddings (wo/Temp.), the performance of Informer gradually degrades as the forecasting length increases. Since Informer uses a single time step for each token, it is necessary to introduce temporal information in its tokens.

Methods | Embedding | Traffic T=96 | T=192 | T=336 | T=720
FEDformer | All | 0.597 | 0.606 | 0.627 | 0.649
FEDformer | wo/Pos. | 0.587 | 0.604 | 0.621 | 0.626
FEDformer | wo/Temp. | 0.613 | 0.623 | 0.650 | 0.677
FEDformer | wo/Pos.-Temp. | 0.613 | 0.622 | 0.648 | 0.663
Autoformer | All | 0.629 | 0.647 | 0.676 | 0.638
Autoformer | wo/Pos. | 0.613 | 0.616 | 0.622 | 0.660
Autoformer | wo/Temp. | 0.681 | 0.665 | 0.908 | 0.769
Autoformer | wo/Pos.-Temp. | 0.672 | 0.811 | 1.133 | 1.300
Informer | All | 0.719 | 0.696 | 0.777 | 0.864
Informer | wo/Pos. | 1.035 | 1.186 | 1.307 | 1.472
Informer | wo/Temp. | 0.754 | 0.780 | 0.903 | 1.259
Informer | wo/Pos.-Temp. | 1.038 | 1.351 | 1.491 | 1.512

Table 6. The MSE comparisons of different embedding strategies on Transformer-based methods on Traffic, with look-back window size 96 and forecasting lengths {96, 192, 336, 720}.

Rather than using a single time step in each token, FEDformer and Autoformer input a sequence of timestamps to embed the temporal information. Hence, they can achieve comparable or even better performance without fixed positional embeddings. However, without timestamp embeddings, the performance of Autoformer declines rapidly because of the loss of global temporal information. Instead, thanks to the frequency-enhanced module proposed in FEDformer to introduce a temporal inductive bias, it suffers less from removing any position/timestamp embeddings.

Is training data size a limiting factor for existing LTSF-Transformers? Some may argue that the poor performance of Transformer-based solutions is due to the small sizes of the benchmark datasets. Unlike computer vision or natural language processing tasks, TSF is performed on collected time series, and it is difficult to scale up the training data size. In fact, the size of the training data would indeed have a significant impact on model performance. Accordingly, we conduct experiments on Traffic, comparing the performance of the model trained on the full dataset (17,544×0.7 hours), named Ori., with that trained on a shortened dataset (8,760 hours, i.e., 1 year), called Short.
Unexpectedly, Table 7 shows that the prediction errors with reduced training data are lower in most cases. This might be because the whole-year data maintains clearer temporal features than a longer but incomplete data span. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting reason for the performance of Autoformer and FEDformer.

Methods | FEDformer (Ori.) | FEDformer (Short) | Autoformer (Ori.) | Autoformer (Short)
96 | 0.587 | 0.568 | 0.613 | 0.594
192 | 0.604 | 0.584 | 0.616 | 0.621
336 | 0.621 | 0.601 | 0.622 | 0.621
720 | 0.626 | 0.608 | 0.660 | 0.650

Table 7. The MSE comparison of two training data sizes.

Is efficiency really a top-level priority? Existing LTSF-Transformers claim that the O(L^2) complexity of the vanilla Transformer is unaffordable for the LTSF problem. Although they prove to be able to improve the theoretical time and memory complexity from O(L^2) to O(L), it is unclear whether 1) the actual inference time and memory cost on devices are improved, and 2) the memory issue is unacceptable and urgent for today's GPUs (e.g., an NVIDIA Titan XP here). In Table 8, we compare the average practical efficiencies over 5 runs. Interestingly, compared with the vanilla Transformer (with the same DMS decoder), most Transformer variants incur similar or even worse inference time and parameter counts in practice. These follow-ups introduce more additional design elements that make practical costs high. Moreover, the memory cost of the vanilla Transformer is practically acceptable, even for the output length T=720, which weakens the importance of developing memory-efficient Transformers, at least for the existing benchmarks.

Method | MACs | Parameters | Time | Memory
DLinear | 0.04G | 139.7K | 0.4ms | 687MiB
Transformer× | 4.03G | 13.61M | 26.8ms | 6091MiB
Informer | 3.93G | 14.39M | 49.3ms | 3869MiB
Autoformer | 4.41G | 14.91M | 164.1ms | 7607MiB
Pyraformer | 0.80G | 241.4M* | 3.4ms | 7017MiB
FEDformer | 4.41G | 20.68M | 40.5ms | 4143MiB
- ×: modified to use the same one-step (DMS) decoder, as implemented in the source code of Autoformer.
- *: 236.7M of Pyraformer's parameters come from its linear decoder.

Table 8. Comparison of the practical efficiency of LTSF-Transformers under L=96 and T=720 on Electricity. MACs are the number of multiply-accumulate operations. We use DLinear for the comparison since it has double the cost of the other LTSF-Linear models. The inference time is averaged over 5 runs.

6. Conclusion and Future Work

Conclusion. This work questions the effectiveness of the emerging favored Transformer-based solutions for the long-term time series forecasting problem. We use an embarrassingly simple linear model, LTSF-Linear, as a DMS forecasting baseline to verify our claims. Note that our contributions do not come from proposing a linear model but rather from raising an important question, showing surprising comparisons, and demonstrating from various perspectives why LTSF-Transformers are not as effective as claimed in these works. We sincerely hope our comprehensive studies can benefit future work in this area.

Future work. LTSF-Linear has a limited model capacity, and it merely serves as a simple yet competitive baseline with strong interpretability for future research. For example, it is hard for the one-layer linear network to capture the temporal dynamics caused by change points [25]. Consequently, we believe there is great potential for new model designs, data processing, and benchmarks to tackle the challenging LTSF problem.
Appendix:
Are Transformers Effective for Time Series Forecasting?

In this Appendix, we provide descriptions of non-Transformer-based TSF solutions, detailed experimental settings, more comparisons under different look-back window sizes, and the visualization of LTSF-Linear on all datasets. We also append our code to reproduce the results shown in the paper.

A. Related Work: Non-Transformer-Based TSF Solutions

As a long-standing problem with a wide range of applications, statistical approaches for time series forecasting (e.g., autoregressive integrated moving average (ARIMA) [1], exponential smoothing [12], and structural models [14]) have been used from the 1970s onward. Generally speaking, the parametric models used in statistical methods require significant domain expertise to build.

To relieve this burden, many machine learning techniques, such as gradient boosting regression trees (GBRT) [10, 11], gained popularity; they learn the temporal dynamics of time series in a data-driven manner. However, these methods still require manual feature engineering and model designs. With the powerful representation learning capability of deep neural networks (DNNs) on abundant data, various deep learning-based TSF solutions have been proposed in the literature, achieving better forecasting accuracy than traditional techniques in many cases.

Besides Transformers, two other popular DNN architectures are also applied to time series forecasting:

• Recurrent neural network (RNN)-based methods (e.g., [21]) summarize the past information compactly in internal memory states and recursively update themselves for forecasting.

• Convolutional neural network (CNN)-based methods (e.g., [3]), wherein convolutional filters are used to capture local temporal features.

RNN-based TSF methods belong to the IMS forecasting techniques. Depending on whether the decoder is implemented in an autoregressive manner, there are either IMS or DMS forecasting techniques for CNN-based TSF methods [3, 17].

B. Experimental Details

B.1. Data Descriptions

We use nine widely-used datasets in the main paper. The details are listed in the following.

• ETT (Electricity Transformer Temperature) [30] (https://github.com/zhouhaoyi/ETDataset) consists of two hourly-level datasets (ETTh) and two 15-minute-level datasets (ETTm). Each of them contains seven oil and load features of electricity transformers from July 2016 to July 2018.

• Traffic (http://pems.dot.ca.gov) describes road occupancy rates. It contains the hourly data recorded by the sensors of San Francisco freeways from 2015 to 2016.

• Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) collects the hourly electricity consumption of 321 clients from 2012 to 2014.

• Exchange-Rate [15] (https://github.com/laiguokun/multivariate-time-series-data) collects the daily exchange rates of 8 countries from 1990 to 2016.

• Weather (https://www.bgc-jena.mpg.de/wetter/) includes 21 weather indicators, such as air temperature and humidity. Its data is recorded every 10 min for 2020 in Germany.

• ILI (https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html) describes the ratio of patients seen with influenza-like illness and the number of patients. It includes weekly data from the Centers for Disease Control and Prevention of the United States from 2002 to 2021.

B.2. Implementation Details

For existing Transformer-based TSF solutions: the implementations of Autoformer [28], Informer [30], and the vanilla Transformer [26] are all taken from the Autoformer work [28]; the implementations of FEDformer [31] and Pyraformer [18] are from their respective code repositories. We also adopt their default hyper-parameters to train the models. For DLinear, the moving average kernel size for decomposition is 25, the same as in Autoformer. The total parameter count of a vanilla Linear model and of NLinear is TL; the total parameter count of DLinear is 2TL. Since LTSF-Linear will underfit when the input length is short, and LTSF-Transformers tend to overfit on a long look-back window size, to compare the best performance of existing LTSF-Transformers with LTSF-Linear we report L=336 for LTSF-Linear and L=96 for Transformers by default. For more hyper-parameters of LTSF-Linear, please refer to our code.
C. Additional Comparison with Transformers

We further compare LTSF-Linear with LTSF-Transformers for univariate forecasting on the four ETT datasets. Moreover, in Figure 4 of the main paper, we demonstrate with two examples that existing Transformers fail to exploit large look-back window sizes. Here, we give comprehensive comparisons between LTSF-Linear and Transformer-based TSF solutions under various look-back window sizes on all benchmarks.

C.1. Comparison of Univariate Forecasting

We present the univariate forecasting results on the four ETT datasets in Table 9. Similarly, LTSF-Linear, and especially NLinear, consistently outperforms all Transformer-based methods by a large margin most of the time. We find that there are serious distribution shifts between the training and test sets (as shown in Figure 5(a), (b)) on the ETTh1 and ETTh2 datasets. Simple normalization via the last value of the look-back window can greatly relieve this distribution shift problem.

C.2. Comparison under Different Look-back Windows

In Figure 6, we provide the MSE comparisons of five LTSF-Transformers with LTSF-Linear under different look-back window sizes, to explore whether existing Transformers can extract temporal relations well from longer input sequences. For the hourly granularity datasets (ETTh1, ETTh2, Traffic, and Electricity), the increasing look-back window sizes are {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which represent {1, 2, 3, 4, 5, 6, 7, 8, 14, 21, 28, 30} days. The forecasting steps are {24, 720}, which mean {1, 30} days. For the 5-minute granularity datasets (ETTm1 and ETTm2), we set the look-back window size as {24, 36, 48, 60, 72, 144, 288}, which represent {2, 3, 4, 5, 6, 12, 24} hours. For the 10-minute granularity dataset (Weather), we set the look-back window size as {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which mean {4, 8, 12, 16, 20, 24, 28, 32, 56, 84, 112, 120} hours. The forecasting steps are {24, 720}, i.e., {4, 120} hours. For the weekly granularity dataset (ILI), we set the look-back window size as {26, 52, 78, 104, 130, 156, 208}, which represent {0.5, 1, 1.5, 2, 2.5, 3, 4} years. The corresponding forecasting steps are {26, 208}, meaning {0.5, 4} years.

As shown in Figure 6, with increased look-back window sizes, the performance of LTSF-Linear is significantly boosted for most datasets (e.g., ETTm1 and Traffic), while this is not the case for Transformer-based TSF solutions. Most of their performance fluctuates or gets worse as the input length increases. In particular, the results on Exchange-Rate do not improve with a longer look-back window (see Figure 6(m) and (n)), and we attribute this to the low information-to-noise ratio in such financial data.

D. Ablation Study on LTSF-Linear

D.1. Motivation of NLinear

If we normalize the test data by the mean and variance of the training data, there could be a distribution shift in the testing data, i.e., the mean value of the testing data is not 0. If the model makes a prediction that is out of the distribution of the true value, a large error occurs; for example, there is a large error between the true value and the true value minus/plus one. Therefore, in NLinear, we use the subtraction and addition to shift the model prediction toward the distribution of the true value. Then, large errors are avoided, and the model performance can be improved. Figure 5 illustrates histograms of the training-set and test-set distributions, where each bar represents the number of data points. Clear distribution shifts between training and testing data can be observed in ETTh1, ETTh2, and ILI. Accordingly, from Table 9 and Table 2 in the main paper, we can observe great improvements on these three datasets when comparing NLinear to Linear, showing the effectiveness of NLinear in relieving distribution shifts. Moreover, for datasets without obvious distribution shifts, like Electricity in Figure 5(c), using the vanilla Linear can be enough, as demonstrated by its similar performance to NLinear and DLinear.

D.2. The Features of LTSF-Linear

Although LTSF-Linear is simple, it has some compelling characteristics:

• An O(1) maximum signal traversal path length: the shorter the path, the better the dependencies are captured [18], making LTSF-Linear capable of capturing both short-range and long-range temporal relations.

• High efficiency: as LTSF-Linear is a linear model with at most two linear layers, it costs much less memory, has fewer parameters, and has a faster inference speed than existing Transformers (see Table 8 in the main paper).

• Interpretability: after training, we can visualize the weights of the seasonality and trend branches to gain some insights into the predicted values [9].

• Ease of use: LTSF-Linear can be obtained easily without tuning model hyper-parameters.

D.3. Interpretability of LTSF-Linear

Because LTSF-Linear is a set of linear models, the weights of the linear layers can directly reveal how LTSF-Linear works.
Methods: Linear | NLinear | DLinear | FEDformer-f | FEDformer-w | Autoformer | Informer | LogTrans
Metric: each method column reports MSE followed by MAE; each row starts with the forecasting horizon T.

ETTh1:
96 0.189 0.359 0.053 0.177 0.056 0.180 0.079 0.215 0.080 0.214 0.071 0.206 0.193 0.377 0.283 0.468
192 0.078 0.212 0.069 0.204 0.071 0.204 0.104 0.245 0.105 0.256 0.114 0.262 0.217 0.395 0.234 0.409
336 0.091 0.237 0.081 0.226 0.098 0.244 0.119 0.270 0.120 0.269 0.107 0.258 0.202 0.381 0.386 0.546
720 0.172 0.340 0.080 0.226 0.189 0.359 0.142 0.299 0.127 0.280 0.126 0.283 0.183 0.355 0.475 0.629
ETTh2:
96 0.133 0.283 0.129 0.278 0.131 0.279 0.128 0.271 0.156 0.306 0.153 0.306 0.213 0.373 0.217 0.379
192 0.176 0.330 0.169 0.324 0.176 0.329 0.185 0.330 0.238 0.380 0.204 0.351 0.227 0.387 0.281 0.429
336 0.213 0.371 0.194 0.355 0.209 0.367 0.231 0.378 0.271 0.412 0.246 0.389 0.242 0.401 0.293 0.437
720 0.292 0.440 0.225 0.381 0.276 0.426 0.278 0.420 0.288 0.438 0.268 0.409 0.291 0.439 0.218 0.387
ETTm1:
96 0.028 0.125 0.026 0.122 0.028 0.123 0.033 0.140 0.036 0.149 0.056 0.183 0.109 0.277 0.049 0.171
192 0.043 0.154 0.039 0.149 0.045 0.156 0.058 0.186 0.069 0.206 0.081 0.216 0.151 0.310 0.157 0.317
336 0.059 0.180 0.052 0.172 0.061 0.182 0.084 0.231 0.071 0.209 0.076 0.218 0.427 0.591 0.289 0.459
720 0.080 0.211 0.073 0.207 0.080 0.210 0.102 0.250 0.105 0.248 0.110 0.267 0.438 0.586 0.430 0.579
ETTm2:
96 0.066 0.189 0.063 0.182 0.063 0.183 0.067 0.198 0.063 0.189 0.065 0.189 0.088 0.225 0.075 0.208
192 0.094 0.230 0.090 0.223 0.092 0.227 0.102 0.245 0.110 0.252 0.118 0.256 0.132 0.283 0.129 0.275
336 0.120 0.263 0.117 0.259 0.119 0.261 0.130 0.279 0.147 0.301 0.154 0.305 0.180 0.336 0.154 0.302
720 0.175 0.320 0.170 0.318 0.175 0.320 0.178 0.325 0.219 0.368 0.182 0.335 0.300 0.435 0.160 0.321

Table 9. Univariate long-term time series forecasting results on the full ETT benchmark. In the original paper, the best results are highlighted in bold and the best results among Transformers are underlined.

Figure 5. Distribution of the ETTh1, ETTh2, Electricity, and ILI datasets: (a) ETTh1 channel 6, (b) ETTh2 channel 3, (c) Electricity channel 3, (d) ILI channel 6. A clear distribution shift between training and testing data can be observed in ETTh1, ETTh2, and ILI.
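A sketch of the train/test histogram comparison shown in Figure 5; the channel index, the 0.7 train split, and the synthetic data are assumptions for illustration.

```python
# Sketch of the distribution-shift inspection behind Figure 5.
import numpy as np
import matplotlib.pyplot as plt

def plot_shift(series, channel, train_ratio=0.7):
    """Overlay value histograms of one channel for the train and test portions."""
    split = int(len(series) * train_ratio)
    train, test = series[:split, channel], series[split:, channel]
    plt.hist(train, bins=50, alpha=0.5, label="train")
    plt.hist(test, bins=50, alpha=0.5, label="test")
    plt.legend()
    plt.show()

# Example with synthetic data that drifts over time (mimicking ETTh1/ETTh2/ILI).
t = np.arange(10000)
data = np.stack([np.sin(t / 100) + t / 5000, np.cos(t / 100)], axis=1)
plot_shift(data, channel=0)
```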

The weight visualization of LTSF-Linear can also reveal certain characteristics of the data used for forecasting. Here we take DLinear as an example. Accordingly, we visualize the trend and remainder weights on all datasets with a fixed input length of 96 and four different forecasting horizons. To obtain smooth weights with a clear pattern in the visualization, we initialize the weights of the linear layers in DLinear as 1/L rather than using random initialization; that is, we use the same weight for every forecasting time step of the look-back window at the start of training.

How the model works: Figure 7(c) visualizes the weights of the trend and remainder layers on the Exchange-Rate dataset. Due to the lack of periodicity and seasonality in financial data, it is hard to observe clear patterns, but the trend layer reveals greater weights for information closer to the outputs, representing their larger contributions to the predicted values.

Periodicity of data: for the Traffic data, as shown in Figure 7(d), the model gives high weights to the latest time step of the look-back window for the 0, 23, 47, ..., 719 forecasting steps. Among these forecasting time steps, the 0, 167, 335, 503, and 671 time steps have higher weights. Note that 24 time steps are a day and 168 time steps are a week. This indicates that Traffic has both a daily periodicity and a weekly periodicity.
[Figure 6: sixteen line plots of MSE (Y-axis) versus look-back window size (X-axis), comparing Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear. Panels: (a) 24 steps-ETTh1, (b) 720 steps-ETTh1, (c) 24 steps-ETTh2, (d) 720 steps-ETTh2, (e) 24 steps-ETTm1, (f) 576 steps-ETTm1, (g) 24 steps-ETTm2, (h) 576 steps-ETTm2, (i) 24 steps-Weather, (j) 720 steps-Weather, (k) 24 steps-Traffic, (l) 720 steps-Traffic, (m) 24 steps-Exchange, (n) 720 steps-Exchange, (o) 24 steps-ILI, (p) 60 steps-ILI.]
Figure 6. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) for long-term forecasting (e.g., 720 time steps) and short-term forecasting (e.g., 24 time steps) on different benchmarks.
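For readers who want to reproduce the shape of such a sweep without the full benchmark pipeline, the sketch below varies the look-back window size L for a fixed horizon T and fits a single linear map by least squares on a synthetic hourly-like series. Dataset loading, normalization, and the Transformer baselines from Figure 6 are intentionally omitted; all names and values are illustrative, and the closed-form fit merely stands in for gradient-based training of the Linear model.

```python
# Illustrative look-back-window sweep (not the paper's pipeline).
import numpy as np

def make_windows(series, L, T):
    """Slice a 1-D series into (look-back, horizon) training pairs."""
    n = len(series) - L - T + 1
    X = np.stack([series[i:i + L] for i in range(n)])
    Y = np.stack([series[i + L:i + L + T] for i in range(n)])
    return X, Y

rng = np.random.default_rng(0)
series = np.sin(2 * np.pi * np.arange(4000) / 24) + 0.1 * rng.standard_normal(4000)
train, test = series[:3000], series[3000:]

T = 24
for L in (24, 48, 96, 192, 336):
    Xtr, Ytr = make_windows(train, L, T)
    Xte, Yte = make_windows(test, L, T)
    W, *_ = np.linalg.lstsq(Xtr, Ytr, rcond=None)   # (L, T) weight matrix
    mse = np.mean((Xte @ W - Yte) ** 2)
    print(f"look-back {L:4d} -> test MSE {mse:.4f}")
```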

[Figure 7: heatmaps of the remainder and trend layer weights of LTSF-Linear for each dataset and forecasting setting. Rows: ETTh1 (In-96, Out-96/168/336/720), Electricity (In-96, Out-96/192/336/720), Exchange-Rate (In-96, Out-96/192/336/720), Traffic (In-96, Out-96/192/336/720), Weather (In-96, Out-96/192/336/720), and ILI (In-36, Out-24/36/48/60); each setting shows a Remainder panel and a Trend panel.]
Figure 7. Visualization of the weights (T×L) of LTSF-Linear on several benchmarks. Models are trained with a look-back window of size L (X-axis) and different numbers of forecasting time steps T (Y-axis). We show the weights of the remainder and trend layers.
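A heatmap like those in Figure 7 can be produced from any trained LTSF-Linear checkpoint, since the weight of a linear layer mapping L inputs to T outputs is exactly the T×L matrix being visualized. The sketch below is illustrative only: it plots an untrained nn.Linear as a stand-in for the trained trend (or remainder) layer, so that the snippet runs on its own.

```python
# Illustrative Figure-7-style weight heatmap; `layer` stands in for a trained
# trend/remainder layer of an LTSF-Linear model.
import matplotlib.pyplot as plt
import torch

L, T = 96, 720
layer = torch.nn.Linear(L, T)                    # weight shape: (T, L)
weights = layer.weight.detach().cpu().numpy()    # rows = forecasting steps

plt.figure(figsize=(6, 4))
plt.imshow(weights, aspect="auto")
plt.xlabel("position in look-back window (L)")
plt.ylabel("forecasting time step (T)")
plt.colorbar(label="weight value")
plt.title("Linear-layer weights (T x L), illustrative")
plt.tight_layout()
plt.savefig("weights_heatmap.png", dpi=150)
```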

References

[1] Adebiyi A. Ariyo, Adewumi O. Adewumi, and Charles K. Ayo. Stock price prediction using the ARIMA model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pages 106–112. IEEE, 2014.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv: Computation and Language, 2014.

[3] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[4] Guillaume Chevillon. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007.

[5] Razvan-Gabriel Cirstea, Chenjuan Guo, Bin Yang, Tung Kieu, Xuanyi Dong, and Shirui Pan. Triformer: Triangular, variable-specific attentions for long sequence multivariate time series forecasting–full version. arXiv preprint arXiv:2204.13767, 2022.

[6] R. B. Cleveland. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 1990.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[8] Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.

[9] Ruijun Dong and Witold Pedrycz. A granular time series approach to long-term forecasting and trend forecasting. Physica A: Statistical Mechanics and its Applications, 387(13):3253–3270, 2008.

[10] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, and Lars Schmidt-Thieme. Do we really need deep learning models for time series forecasting? arXiv preprint arXiv:2101.02118, 2021.

[11] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[12] Everette S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985.

[13] James Douglas Hamilton. Time Series Analysis. Princeton University Press, 2020.

[14] Andrew C. Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. 1990.

[15] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017.

[16] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019.

[17] Minhao Liu, Ailing Zeng, Zhijian Xu, Qiuxia Lai, and Qiang Xu. Time series is a special sequence: Forecasting with sample convolution and interaction. arXiv preprint arXiv:2106.09305, 2021.

[18] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021.

[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[20] Minhao Liu, Ailing Zeng, Qiuxia Lai, Ruiyuan Gao, Min Li, Jing Qin, and Qiang Xu. T-WaveNet: A tree-structured wavelet neural network for time series signal analysis. In International Conference on Learning Representations, 2021.

[21] Gábor Petneházi. Recurrent neural networks for time series forecasting. arXiv preprint arXiv:1901.00069, 2019.

[22] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2017.

[23] Souhaib Ben Taieb, Rob J. Hyndman, et al. Recursive and direct multi-step forecasting: the best of both worlds, volume 19. Citeseer, 2012.

[24] Sean J. Taylor and Benjamin Letham. Forecasting at scale. PeerJ Preprints, 2017.

[25] Gerrit J. J. van den Burg and Christopher K. I. Williams. An evaluation of change point detection algorithms. arXiv preprint arXiv:2003.06222, 2020.

[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[27] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.

[28] Jiehui Xu, Jianmin Wang, Mingsheng Long, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 2021.

[29] Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, and Qiang Xu. DeciWatch: A simple baseline for 10x efficient 2D and 3D pose estimation. arXiv preprint arXiv:2203.08713, 2022.

[30] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient Transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021.

[31] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, 2022.
