Are Transformers Effective For Time Series Forecasting?
lutions have demonstrated considerable prediction accuracy improvements over traditional methods, in their experiments, all the compared (non-Transformer) baselines perform autoregressive or iterated multi-step (IMS) forecasting [1, 2, 22, 24], which are known to suffer from significant error accumulation effects for the LTSF problem. Therefore, in this work, we challenge Transformer-based LTSF solutions with direct multi-step (DMS) forecasting strategies to validate their real performance.
Not all time series are predictable, let alone long-term forecasting (e.g., for chaotic systems). We hypothesize that long-term forecasting is only feasible for those time series with a relatively clear trend and periodicity. As linear models can already extract such information, we introduce a set of embarrassingly simple models named LTSF-Linear as a new baseline for comparison. LTSF-Linear regresses historical time series with a one-layer linear model to forecast future time series directly. We conduct extensive experiments on nine widely-used benchmark datasets that cover various real-life applications: traffic, energy, economics, weather, and disease predictions. Surprisingly, our results show that LTSF-Linear outperforms existing complex Transformer-based models in all cases, and often by a large margin (20% ∼ 50%). Moreover, we find that, in contrast to the claims in existing Transformers, most of them fail to extract temporal relations from long sequences, i.e., the forecasting errors are not reduced (and are sometimes even increased) with the increase of look-back window sizes. Finally, we conduct various ablation studies on existing Transformer-based TSF solutions to study the impact of various design elements in them.
To sum up, the contributions of this work include:

• To the best of our knowledge, this is the first work to challenge the effectiveness of the booming Transformers for the long-term time series forecasting task.

• To validate our claims, we introduce a set of embarrassingly simple one-layer linear models, named LTSF-Linear, and compare them with existing Transformer-based LTSF solutions on nine benchmarks. LTSF-Linear can be a new baseline for the LTSF problem.

• We conduct comprehensive empirical studies on various aspects of existing Transformer-based solutions, including the capability of modeling long inputs, the sensitivity to time series order, the impact of positional encoding and sub-series embedding, and efficiency comparisons. Our findings would benefit future research in this area.
With the above, we conclude that the temporal modeling capabilities of Transformers for time series are exaggerated, at least for the existing LTSF benchmarks. At the same time, while LTSF-Linear achieves a better prediction accuracy compared to existing works, it merely serves as a simple baseline for future research on the challenging long-term TSF problem. With our findings, we also advocate revisiting the validity of Transformer-based solutions for other time series analysis tasks in the future.

2. Preliminaries: TSF Problem Formulation

For time series containing C variates, given historical data X = {X_1^t, ..., X_C^t}_{t=1}^{L}, where L is the look-back window size and X_i^t is the value of the i-th variate at the t-th time step, the time series forecasting task is to predict the values X̂ = {X̂_1^t, ..., X̂_C^t}_{t=L+1}^{L+T} at the T future time steps. When T > 1, iterated multi-step (IMS) forecasting [23] learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting [4] directly optimizes the multi-step forecasting objective at once.

Compared to DMS forecasting results, IMS predictions have smaller variance thanks to the autoregressive estimation procedure, but they inevitably suffer from error accumulation effects. Consequently, IMS forecasting is preferable when there is a highly accurate single-step forecaster and T is relatively small. In contrast, DMS forecasting generates more accurate predictions when it is hard to obtain an unbiased single-step forecasting model, or when T is large.
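As an illustration (ours, not part of the original formulation), the two strategies can be sketched as follows in Python, where `one_step_model` and `multi_step_model` are placeholder callables rather than models defined in this paper:

```python
import numpy as np

def ims_forecast(one_step_model, history, T):
    """Iterated multi-step (IMS): apply a single-step forecaster T times,
    feeding each prediction back as input; errors can accumulate."""
    window = list(history)
    preds = []
    for _ in range(T):
        y = one_step_model(np.array(window))   # predict one step ahead
        preds.append(y)
        window = window[1:] + [y]              # roll the window forward
    return np.array(preds)

def dms_forecast(multi_step_model, history, T):
    """Direct multi-step (DMS): one model maps the whole look-back window
    to all T future values at once."""
    return multi_step_model(np.array(history), T)
```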
3. Transformer-Based LTSF Solutions

Transformer-based models [26] have achieved unparalleled performances in many long-standing AI tasks in natural language processing and computer vision fields, thanks to the effectiveness of the multi-head self-attention mechanism. This has also triggered lots of research interest in Transformer-based time series modeling techniques [20, 27]. In particular, a large amount of research works are dedicated to the LTSF task (e.g., [16, 18, 28, 30, 31]). Considering the ability to capture long-range dependencies with Transformer models, most of them focus on the less-explored long-term forecasting problem (T ≫ 1).¹

When applying the vanilla Transformer model to the LTSF problem, it has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and the error accumulation caused by the autoregressive decoder design. Informer [30] addresses these issues and proposes a novel Transformer architecture with reduced complexity and a DMS forecasting strategy. Later, more Transformer variants introduce various time series features into their models for performance or efficiency improvements [18, 28, 31]. We summarize the design elements of existing Transformer-based LTSF solutions as follows (see Figure 1).

¹ Due to the page limit, we leave the discussion of non-Transformer forecasting solutions to the Appendix.

[Figure 1: (a) input preparation (channel projection, normalization, seasonal-trend decomposition); (b) embedding (fixed position, local timestamp, global timestamp); (c) self-attention schemes (LogSparse and convolutional self-attention @LogTrans, ProbSparse and distilling self-attention @Informer, series auto-correlation with decomposition @Autoformer, multi-resolution pyramidal attention @Pyraformer, frequency enhanced block with decomposition @FEDformer); (d) output strategies (IMS @LogTrans, DMS @Informer, DMS with auto-correlation and decomposition @Autoformer, DMS along the spatio-temporal dimension @Pyraformer, DMS with frequency attention and decomposition @FEDformer).]
Figure 1. The pipeline of existing Transformer-based TSF solutions. In (a) and (b), the solid boxes are essential operations, and the dotted boxes are applied optionally. (c) and (d) are distinct for different methods [16, 18, 28, 30, 31].
Time series decomposition: For data preprocessing, normalization with zero mean is common in TSF. Besides, Autoformer [28] first applies seasonal-trend decomposition behind each neural block, which is a standard method in time series analysis to make raw data more predictable [6, 13]. Specifically, it uses a moving average kernel on the input sequence to extract the trend-cyclical component of the time series. The difference between the original sequence and the trend component is regarded as the seasonal component. On top of the decomposition scheme of Autoformer, FEDformer [31] further proposes a mixture-of-experts strategy to mix the trend components extracted by moving average kernels with various kernel sizes.

Input embedding strategies: The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series. However, local positional information, i.e., the ordering of the time series, is important. Besides, global temporal information, such as hierarchical timestamps (week, month, year) and agnostic timestamps (holidays and events), is also informative [30]. To enhance the temporal context of time-series inputs, a practical design in the SOTA Transformer-based methods is injecting several embeddings, such as a fixed positional encoding, a channel projection embedding, and learnable temporal embeddings, into the input sequence. Moreover, temporal embeddings with a temporal convolution layer [16] or learnable timestamps [28] are introduced.

Self-attention schemes: Transformers rely on the self-attention mechanism to extract the semantic dependencies between paired elements. Motivated by reducing the O(L²) time and memory complexity of the vanilla Transformer, recent works propose two strategies for efficiency. On the one hand, LogTrans and Pyraformer explicitly introduce a sparsity bias into the self-attention scheme. Specifically, LogTrans uses a Logsparse mask to reduce the computational complexity to O(L log L), while Pyraformer adopts pyramidal attention that captures hierarchically multi-scale temporal dependencies with O(L) time and memory complexity. On the other hand, Informer and FEDformer use the low-rank property of the self-attention matrix. Informer proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to decrease the complexity to O(L log L), and FEDformer designs a Fourier enhanced block and a wavelet enhanced block with random selection to obtain O(L) complexity. Lastly, Autoformer designs a series-wise auto-correlation mechanism to replace the original self-attention layer.

Decoders: The vanilla Transformer decoder outputs sequences in an autoregressive manner, resulting in slow inference speed and error accumulation effects, especially for long-term predictions. Informer designs a generative-style decoder for DMS forecasting. Other Transformer variants employ similar DMS strategies. For instance, Pyraformer uses a fully-connected layer concatenating spatio-temporal axes as the decoder. Autoformer sums up two refined decomposed features, from the trend-cyclical components and the stacked auto-correlation mechanism for the seasonal components, to get the final prediction. FEDformer also uses a decomposition scheme with the proposed frequency attention block to decode the final results.

The premise of Transformer models is the semantic correlations between paired elements, whereas the self-attention mechanism itself is permutation-invariant, and its capability of modeling temporal relations largely depends on the positional encodings associated with input tokens. Considering the raw numerical data in time series (e.g., stock prices or electricity values), there are hardly any point-wise semantic correlations between them. In time series modeling, we are mainly interested in the temporal relations among a continuous set of points, and the order of these elements, instead of the paired relationship, plays the most crucial role. While employing positional encoding and using tokens to embed sub-series facilitate preserving some ordering information, the nature of the permutation-invariant self-attention mechanism inevitably results in temporal information loss. Due to the above observations, we are interested in revisiting the effectiveness of Transformer-based LTSF solutions.
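The permutation property is easy to verify numerically. Below is a small self-contained check (our illustration, not taken from any of the compared models) of single-head scaled dot-product attention without positional encoding: permuting the input tokens only permutes the outputs in the same way, so the mechanism itself carries no ordering information.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
L, d = 8, 4                                              # tokens, model width
X = rng.normal(size=(L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(L)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
# Permuting the inputs only permutes the outputs: no temporal order is used.
assert np.allclose(out[perm], out_perm)
```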
4. An Embarrassingly Simple Baseline

In the experiments of existing Transformer-based LTSF solutions (T ≫ 1), all the compared (non-Transformer)
baselines are IMS forecasting techniques, which are known to suffer from significant error accumulation effects. We hypothesize that the performance improvements in these works are largely due to the DMS strategy used in them.

[Figure 2: (b) One Linear Layer — the history of L time steps is mapped directly to the forecasting output of T future time steps.]
Figure 2. Illustration of the basic linear model.
To validate this hypothesis, we present the simplest DMS model via a temporal linear layer, named LTSF-Linear, as a baseline for comparison. The basic formulation of LTSF-Linear directly regresses historical time series for future prediction via a weighted sum operation (as illustrated in Figure 2). The mathematical expression is X̂_i = W X_i, where W ∈ R^{T×L} is a linear layer along the temporal axis, and X̂_i and X_i are the prediction and input for the i-th variate. Note that LTSF-Linear shares weights across different variates and does not model any spatial correlations.
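As a concrete reference, here is a minimal PyTorch sketch of this formulation (our illustration; the released code may differ in details such as naming or bias handling):

```python
import torch
import torch.nn as nn

class LTSFLinear(nn.Module):
    """Vanilla Linear: one linear layer along the temporal axis, shared across variates."""
    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.linear = nn.Linear(seq_len, pred_len)   # W ∈ R^{T×L} (plus a bias term)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, L, C] -> [batch, T, C]; one weighted sum per variate
        return self.linear(x.permute(0, 2, 1)).permute(0, 2, 1)
```

Applied channel by channel in this way, the same W is reused for every variate, which matches the weight sharing described above.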
LTSF-Linear is a set of linear models. Vanilla Linear is a one-layer linear model. To handle time series across different domains (e.g., finance, traffic, and energy domains), we further introduce two variants with two preprocessing methods, named DLinear and NLinear (a combined sketch of both follows this list).

• Specifically, DLinear is a combination of the decomposition scheme used in Autoformer and FEDformer with linear layers. It first decomposes a raw data input into a trend component, extracted by a moving average kernel, and a remainder (seasonal) component. Then, two one-layer linear layers are applied, one to each component, and we sum up the two features to get the final prediction. By explicitly handling the trend, DLinear enhances the performance of a vanilla linear model when there is a clear trend in the data.

• Meanwhile, to boost the performance of LTSF-Linear when there is a distribution shift in the dataset, NLinear first subtracts the last value of the sequence from the input. Then, the input goes through a linear layer, and the subtracted part is added back before making the final prediction. The subtraction and addition in NLinear are a simple normalization of the input sequence.
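A minimal sketch of the two variants (our illustration, assuming replicate padding and an odd kernel size so the moving-average trend has the same length as the input; the official implementation may handle such details differently):

```python
import torch
import torch.nn as nn

class DLinear(nn.Module):
    """Decompose the input into trend (moving average) + remainder, one linear layer per part."""
    def __init__(self, seq_len: int, pred_len: int, kernel_size: int = 25):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps the trend aligned with the input"
        self.pad = (kernel_size - 1) // 2
        self.avg = nn.AvgPool1d(kernel_size, stride=1)
        self.linear_trend = nn.Linear(seq_len, pred_len)
        self.linear_remainder = nn.Linear(seq_len, pred_len)

    def forward(self, x):                       # x: [batch, L, C]
        x = x.permute(0, 2, 1)                  # [batch, C, L]
        padded = torch.cat([x[:, :, :1].repeat(1, 1, self.pad), x,
                            x[:, :, -1:].repeat(1, 1, self.pad)], dim=-1)
        trend = self.avg(padded)                # moving-average trend, length L
        remainder = x - trend
        out = self.linear_trend(trend) + self.linear_remainder(remainder)
        return out.permute(0, 2, 1)             # [batch, T, C]

class NLinear(nn.Module):
    """Subtract the last value, apply one linear layer, then add the last value back."""
    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.linear = nn.Linear(seq_len, pred_len)

    def forward(self, x):                       # x: [batch, L, C]
        last = x[:, -1:, :]
        out = self.linear((x - last).permute(0, 2, 1)).permute(0, 2, 1)
        return out + last                       # shift back toward the last observed value
```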
5. Experiments

5.1. Experimental Settings

Dataset. We conduct extensive experiments on nine widely-used real-world datasets, including ETT (Electricity Transformer Temperature) [30] (ETTh1, ETTh2, ETTm1, ETTm2), Traffic, Electricity, Weather, ILI, and Exchange-Rate [15]. All of them are multivariate time series. We leave the data descriptions to the Appendix.

Evaluation metric. Following previous works [28, 30, 31], we use Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the core metrics to compare performance.

Compared methods. We include five recent Transformer-based methods: FEDformer [31], Autoformer [28], Informer [30], Pyraformer [18], and LogTrans [16]. Besides, we include a naive DMS method, Closest Repeat (Repeat), which repeats the last value in the look-back window, as another simple baseline. Since there are two variants of FEDformer, we compare with the one that has better accuracy (FEDformer-f, via Fourier transform).

Datasets ETTh1&ETTh2 ETTm1&ETTm2 Traffic Electricity Exchange-Rate Weather ILI
Variates 7 7 862 321 8 21 7
Timesteps 17,420 69,680 17,544 26,304 7,588 52,696 966
Granularity 1hour 5min 1hour 1hour 1day 10min 1week
Table 1. The statistics of the nine popular datasets for the LTSF problem.
Methods IMP. Linear* NLinear* DLinear* FEDformer Autoformer Informer Pyraformer* LogTrans Repeat*
Metric MSE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Electricity
96 27.40% 0.140 0.237 0.141 0.237 0.140 0.237 0.193 0.308 0.201 0.317 0.274 0.368 0.386 0.449 0.258 0.357 1.588 0.946
192 23.88% 0.153 0.250 0.154 0.248 0.153 0.249 0.201 0.315 0.222 0.334 0.296 0.386 0.386 0.443 0.266 0.368 1.595 0.950
336 21.02% 0.169 0.268 0.171 0.265 0.169 0.267 0.214 0.329 0.231 0.338 0.300 0.394 0.378 0.443 0.280 0.380 1.617 0.961
720 17.47% 0.203 0.301 0.210 0.297 0.203 0.301 0.246 0.355 0.254 0.361 0.373 0.439 0.376 0.445 0.283 0.376 1.647 0.975
Exchange
96 45.27% 0.082 0.207 0.089 0.208 0.081 0.203 0.148 0.278 0.197 0.323 0.847 0.752 0.376 1.105 0.968 0.812 0.081 0.196
192 42.06% 0.167 0.304 0.180 0.300 0.157 0.293 0.271 0.380 0.300 0.369 1.204 0.895 1.748 1.151 1.040 0.851 0.167 0.289
336 33.69% 0.328 0.432 0.331 0.415 0.305 0.414 0.460 0.500 0.509 0.524 1.672 1.036 1.874 1.172 1.659 1.081 0.305 0.396
720 46.19% 0.964 0.750 1.033 0.780 0.643 0.601 1.195 0.841 1.447 0.941 2.478 1.310 1.943 1.206 1.941 1.127 0.823 0.681
Traffic
96 30.15% 0.410 0.282 0.410 0.279 0.410 0.282 0.587 0.366 0.613 0.388 0.719 0.391 2.085 0.468 0.684 0.384 2.723 1.079
192 29.96% 0.423 0.287 0.423 0.284 0.423 0.287 0.604 0.373 0.616 0.382 0.696 0.379 0.867 0.467 0.685 0.390 2.756 1.087
336 29.95% 0.436 0.295 0.435 0.290 0.436 0.296 0.621 0.383 0.622 0.337 0.777 0.420 0.869 0.469 0.734 0.408 2.791 1.095
720 25.87% 0.466 0.315 0.464 0.307 0.466 0.315 0.626 0.382 0.660 0.408 0.864 0.472 0.881 0.473 0.717 0.396 2.811 1.097
Weather
96 18.89% 0.176 0.236 0.182 0.232 0.176 0.237 0.217 0.296 0.266 0.336 0.300 0.384 0.896 0.556 0.458 0.490 0.259 0.254
192 21.01% 0.218 0.276 0.225 0.269 0.220 0.282 0.276 0.336 0.307 0.367 0.598 0.544 0.622 0.624 0.658 0.589 0.309 0.292
336 22.71% 0.262 0.312 0.271 0.301 0.265 0.319 0.339 0.380 0.359 0.395 0.578 0.523 0.739 0.753 0.797 0.652 0.377 0.338
720 19.85% 0.326 0.365 0.338 0.348 0.323 0.362 0.403 0.428 0.419 0.428 1.059 0.741 1.004 0.934 0.869 0.675 0.465 0.394
ILI
24 47.86% 1.947 0.985 1.683 0.858 2.215 1.081 3.228 1.260 3.483 1.287 5.764 1.677 1.420 2.012 4.480 1.444 6.587 1.701
36 36.43% 2.182 1.036 1.703 0.859 1.963 0.963 2.679 1.080 3.103 1.148 4.755 1.467 7.394 2.031 4.799 1.467 7.130 1.884
48 34.43% 2.256 1.060 1.719 0.884 2.130 1.024 2.622 1.078 2.669 1.085 4.763 1.469 7.551 2.057 4.800 1.468 6.575 1.798
60 34.33% 2.390 1.104 1.819 0.917 2.368 1.096 2.857 1.157 2.770 1.125 5.264 1.564 7.662 2.100 5.278 1.560 5.893 1.677
ETTh1
96 0.80% 0.375 0.397 0.374 0.394 0.375 0.399 0.376 0.419 0.449 0.459 0.865 0.713 0.664 0.612 0.878 0.740 1.295 0.713
192 3.57% 0.418 0.429 0.408 0.415 0.405 0.416 0.420 0.448 0.500 0.482 1.008 0.792 0.790 0.681 1.037 0.824 1.325 0.733
336 6.54% 0.479 0.476 0.429 0.427 0.439 0.443 0.459 0.465 0.521 0.496 1.107 0.809 0.891 0.738 1.238 0.932 1.323 0.744
720 13.04% 0.624 0.592 0.440 0.453 0.472 0.490 0.506 0.507 0.514 0.512 1.181 0.865 0.963 0.782 1.135 0.852 1.339 0.756
ETTh2
96 19.94% 0.288 0.352 0.277 0.338 0.289 0.353 0.346 0.388 0.358 0.397 3.755 1.525 0.645 0.597 2.116 1.197 0.432 0.422
192 19.81% 0.377 0.413 0.344 0.381 0.383 0.418 0.429 0.439 0.456 0.452 5.602 1.931 0.788 0.683 4.315 1.635 0.534 0.473
336 25.93% 0.452 0.461 0.357 0.400 0.448 0.465 0.496 0.487 0.482 0.486 4.721 1.835 0.907 0.747 1.124 1.604 0.591 0.508
720 14.25% 0.698 0.595 0.394 0.436 0.605 0.551 0.463 0.474 0.515 0.511 3.647 1.625 0.963 0.783 3.188 1.540 0.588 0.517
ETTm1
96 21.10% 0.308 0.352 0.306 0.348 0.299 0.343 0.379 0.419 0.505 0.475 0.672 0.571 0.543 0.510 0.600 0.546 1.214 0.665
192 21.36% 0.340 0.369 0.349 0.375 0.335 0.365 0.426 0.441 0.553 0.496 0.795 0.669 0.557 0.537 0.837 0.700 1.261 0.690
336 17.07% 0.376 0.393 0.375 0.388 0.369 0.386 0.445 0.459 0.621 0.537 1.212 0.871 0.754 0.655 1.124 0.832 1.283 0.707
720 21.73% 0.440 0.435 0.433 0.422 0.425 0.421 0.543 0.490 0.671 0.561 1.166 0.823 0.908 0.724 1.153 0.820 1.319 0.729
ETTm2
96 17.73% 0.168 0.262 0.167 0.255 0.167 0.260 0.203 0.287 0.255 0.339 0.365 0.453 0.435 0.507 0.768 0.642 0.266 0.328
192 17.84% 0.232 0.308 0.221 0.293 0.224 0.303 0.269 0.328 0.281 0.340 0.533 0.563 0.730 0.673 0.989 0.757 0.340 0.371
336 15.69% 0.320 0.373 0.274 0.327 0.281 0.342 0.325 0.366 0.339 0.372 1.363 0.887 1.201 0.845 1.334 0.872 0.412 0.410
720 12.58% 0.413 0.435 0.368 0.384 0.397 0.421 0.421 0.415 0.433 0.432 3.379 1.338 3.625 1.451 3.048 1.328 0.521 0.465
- Methods marked with * are implemented by us; other results are taken from FEDformer [31].
Table 2. Multivariate long-term forecasting errors in terms of MSE and MAE; the lower, the better. The ILI dataset uses forecasting horizons T ∈ {24, 36, 48, 60}; for the others, T ∈ {96, 192, 336, 720}. Repeat repeats the last value in the look-back window. The best results are highlighted in bold, and the best results of Transformers are highlighted with an underline. IMP. is the improvement of the best linear-model result over the best Transformer-based result in terms of MSE.
5.2. Comparison with Transformers

Quantitative results. In Table 2, we extensively evaluate all the mentioned Transformers on nine benchmarks, following the experimental setting of previous works [28, 30, 31]. Surprisingly, the performance of LTSF-Linear surpasses the SOTA FEDformer in most cases by 20% ∼ 50% on multivariate forecasting, even though LTSF-Linear does not model correlations among variates. For different time series benchmarks, NLinear and DLinear show superiority in handling distribution shift and trend-seasonality features. We also provide results for univariate forecasting of the ETT datasets in the Appendix, where LTSF-Linear still consistently outperforms Transformer-based LTSF solutions by a large margin.

FEDformer achieves competitive forecasting accuracy on ETTh1. This is because FEDformer employs classical time series analysis techniques such as frequency processing, which brings in a time series inductive bias and benefits the ability of temporal feature extraction. In summary, these results reveal that existing complex Transformer-based LTSF solutions are seemingly not effective on the existing nine benchmarks, while LTSF-Linear can be a powerful baseline.

Another interesting observation is that even though the naive Repeat method shows worse results when predicting long-term seasonal data (e.g., Electricity and Traffic), it surprisingly outperforms all Transformer-based methods on Exchange-Rate (by around 45%). This is mainly caused by the wrong prediction of trends in Transformer-based solutions, which may overfit to sudden change noises in the training data, resulting in significant accuracy degradation (see Figure 3(b)). Instead, Repeat does not have this bias.

Qualitative results. As shown in Figure 3, we plot the prediction results on three selected time series datasets with Transformer-based solutions and LTSF-Linear: Electricity (Sequence 1951, Variate 36), Exchange-Rate (Sequence 676, Variate 3), and ETTh2 (Sequence 1241, Variate 2), where these datasets have different temporal patterns. When the input length is 96 steps and the output horizon is 336 steps, Transformers [28, 30, 31] fail to capture the scale and bias of the future data on Electricity and ETTh2. Moreover, they can hardly predict a proper trend on aperiodic data such as Exchange-Rate. These phenomena further indicate the inadequacy of existing Transformer-based solutions for the LTSF task.
[Figure 3: forecasts of GroundTruth, Autoformer, Informer, FEDformer, and DLinear on (a) Electricity, (b) Exchange-Rate, and (c) ETTh2.]
Figure 3. Illustration of the long-term forecasting output (Y-axis) of five models with an input length L=96 and output length T=192 (X-axis) on Electricity, Exchange-Rate, and ETTh2, respectively.
5.3. More Analyses on LTSF-Transformers

Can existing LTSF-Transformers extract temporal relations well from longer input sequences? The size of the look-back window greatly impacts forecasting accuracy, as it determines how much we can learn from historical data. Generally speaking, a powerful TSF model with a strong temporal relation extraction capability should be able to achieve better results with larger look-back window sizes.

To study the impact of input look-back window sizes, we conduct experiments with L ∈ {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720} for long-term forecasting (T=720). Figure 4 demonstrates the MSE results on two datasets. Similar to the observations from previous studies [27, 30], existing Transformer-based models' performance deteriorates or stays stable when the look-back window size increases. In contrast, the performance of all LTSF-Linear variants is significantly boosted with the increase of the look-back window size. Thus, existing solutions tend to overfit temporal noises instead of extracting temporal information when given a longer sequence, and the input size 96 is exactly suitable for most Transformers. Additionally, we provide more quantitative results in the Appendix, and our conclusion holds in almost all cases.

[Figure 4: MSE vs. look-back window size (24–720) for Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear; panels (a) 720 steps-Traffic and (b) 720 steps-Electricity.]
Figure 4. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) of long-term forecasting (T=720) on the Traffic and Electricity datasets.

What can be learned for long-term forecasting? While the temporal dynamics in the look-back window significantly impact the forecasting accuracy of short-term time series forecasting, we hypothesize that long-term forecasting depends only on whether models can capture the trend and periodicity well. That is, the farther the forecasting horizon, the less impact the look-back window itself has.

To validate the above hypothesis, in Table 3, we compare the forecasting accuracy for the same future 720 time steps with data from two different look-back windows: (i) the original input L=96 setting (called Close) and (ii) the far input L=96 setting (called Far) that is before the original 96 time steps. From the experimental results, the performance of the SOTA Transformers drops slightly, indicating that these models only capture similar temporal information from the adjacent time series sequence. Capturing the intrinsic characteristics of a dataset generally does not require a large number of parameters; e.g., one parameter can represent the periodicity. Using too many parameters even causes overfitting, which partially explains why LTSF-Linear performs better than Transformer-based methods.

Table 3. Comparison of different input sequences under the MSE metric to explore what LTSF-Transformers depend on. If the input is Close, we use the 96th, ..., 191st time steps as the input sequence. If the input is Far, we use the 0th, ..., 95th time steps. Both of them forecast the 192nd, ..., (192+720)th time steps.

Is the self-attention scheme effective for LTSF? We verify whether these complex designs in the existing Transformers (e.g., Informer) are essential. In Table 4, we gradually transform Informer into a Linear model. First, we replace each self-attention layer with a linear layer, called Att.-Linear, since a self-attention layer can be regarded as a fully-connected layer whose weights are dynamically changed. Furthermore, we discard the other auxiliary designs (e.g., FFN) in Informer to leave only embedding layers and linear layers, named Embed + Linear. Finally, we simplify the model to one linear layer. Surprisingly, the performance of Informer improves with the gradual simplification, indicating that the self-attention scheme and other complex modules are unnecessary, at least for existing LTSF benchmarks.

Methods Informer Att.-Linear Embed + Linear Linear
Exchange
96 0.847 1.003 0.173 0.084
192 1.008 0.759 0.686 0.438
336 1.107 0.921 0.821 0.479
720 1.181 0.902 1.051 0.515
Table 4. The MSE comparisons of gradually transforming Informer into a Linear model from the left to right columns. Att.-Linear replaces each attention layer with a linear layer. Embed + Linear drops the other designs and only keeps the embedding layers and a linear layer. The look-back window size is 96.
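For intuition only, a hedged sketch of the idea behind the Att.-Linear variant (Informer's actual blocks differ; this merely replaces the dynamic attention mixing with a fixed, learned mixing over the token dimension):

```python
import torch
import torch.nn as nn

class AttLinear(nn.Module):
    """Illustrative stand-in for a self-attention layer: instead of computing
    attention weights from the input, mix the L tokens with a learned L×L map."""
    def __init__(self, seq_len: int):
        super().__init__()
        self.mix = nn.Linear(seq_len, seq_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, L, d]; mix across the token dimension instead of attending
        return self.mix(x.transpose(1, 2)).transpose(1, 2)
```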
Can existing LTSF-Transformers preserve temporal order well? Self-attention is inherently permutation-
Methods Linear FEDformer Autoformer Informer
Predict Length Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex. Ori. Shuf. Half-Ex.
Exchange
96 0.080 0.133 0.169 0.161 0.160 0.162 0.152 0.158 0.160 0.952 1.004 0.959
192 0.162 0.208 0.243 0.274 0.275 0.275 0.278 0.271 0.277 1.012 1.023 1.014
336 0.286 0.320 0.345 0.439 0.439 0.439 0.435 0.430 0.435 1.177 1.181 1.177
720 0.806 0.819 0.836 1.122 1.122 1.122 1.113 1.113 1.113 1.198 1.210 1.196
Average Drop N/A 27.26% 46.81% N/A -0.09% 0.20% N/A 0.09% 1.12% N/A -0.12% -0.18%
ETTh1
96 0.395 0.824 0.431 0.376 0.753 0.405 0.455 0.838 0.458 0.974 0.971 0.971
192 0.447 0.824 0.471 0.419 0.730 0.436 0.486 0.774 0.491 1.233 1.232 1.231
336 0.490 0.825 0.505 0.447 0.736 0.453 0.496 0.752 0.497 1.693 1.693 1.691
720 0.520 0.846 0.528 0.468 0.720 0.470 0.525 0.696 0.524 2.720 2.716 2.715
Average Drop N/A 81.06% 4.78% N/A 73.28% 3.44% N/A 56.91% 0.46% N/A 1.98% 0.18%
Table 5. The MSE comparisons of models when shuffling the raw input sequence. Shuf. randomly shuffles the input sequence. Half-Ex. exchanges the first half of the input sequence with the second half. Average Drop is the average performance drop under all forecasting lengths after shuffling. All results are the average test MSE of five runs.
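For reference, a small sketch of the two input perturbations used in Table 5 (our illustration of the described setups, not the authors' script; Half-Ex. is written here as a plain swap of the two halves):

```python
import numpy as np

def shuffle_input(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Shuf.: randomly permute the time steps of the look-back window (axis 0)."""
    return x[rng.permutation(x.shape[0])]

def half_exchange(x: np.ndarray) -> np.ndarray:
    """Half-Ex.: swap the first and second halves of the look-back window."""
    half = x.shape[0] // 2
    return np.concatenate([x[half:], x[:half]], axis=0)

# x has shape [L, C]; a model whose error barely changes under these
# perturbations is effectively ignoring the temporal order of its input.
```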
with reduced training data are lower in most cases. This might be because the whole-year data maintains clearer temporal features than a longer but incomplete data size. While we cannot conclude that we should use less data for training, it demonstrates that the training data scale is not the limiting reason for the performances of Autoformer and FEDformer.

contributions do not come from proposing a linear model but rather from throwing out an important question, showing surprising comparisons, and demonstrating why LTSF-Transformers are not as effective as claimed in these works through various perspectives. We sincerely hope our comprehensive studies can benefit future work in this area.
Appendix:
Are Transformers Effective for Time Series Forecasting?
In this Appendix, we provide descriptions of non-Transformer-based TSF solutions, detailed experimental settings, more comparisons under different look-back window sizes, and the visualization of LTSF-Linear on all datasets. We also append our code to reproduce the results shown in the paper.

A. Related Work: Non-Transformer-Based TSF Solutions

As a long-standing problem with a wide range of applications, statistical approaches (e.g., autoregressive integrated moving average (ARIMA) [1], exponential smoothing [12], and structural models [14]) for time series forecasting have been used from the 1970s onward. Generally speaking, the parametric models used in statistical methods require significant domain expertise to build.

To relieve this burden, many machine learning techniques, such as gradient boosting regression trees (GBRT) [10, 11], gained popularity; they learn the temporal dynamics of time series in a data-driven manner. However, these methods still require manual feature engineering and model designs. With the powerful representation learning capability of deep neural networks (DNNs) on abundant data, various deep learning-based TSF solutions have been proposed in the literature, achieving better forecasting accuracy than traditional techniques in many cases.

Besides Transformers, two other popular DNN architectures are also applied to time series forecasting:

• Recurrent neural network (RNN) based methods (e.g., [21]) summarize the past information compactly in internal memory states and recursively update themselves for forecasting.

• Convolutional neural network (CNN) based methods (e.g., [3]), wherein convolutional filters are used to capture local temporal features.

RNN-based TSF methods belong to IMS forecasting techniques. Depending on whether the decoder is implemented in an autoregressive manner, there are either IMS or DMS forecasting techniques for CNN-based TSF methods [3, 17].

B. Experimental Details

B.1. Data Descriptions

We use nine widely-used datasets in the main paper. They are described as follows:

• ETT (Electricity Transformer Temperature) [30] (https://github.com/zhouhaoyi/ETDataset) consists of two hourly-level datasets (ETTh) and two 15-minute-level datasets (ETTm). Each of them contains seven oil and load features of electricity transformers from July 2016 to July 2018.

• Traffic (http://pems.dot.ca.gov) describes the road occupancy rates. It contains the hourly data recorded by the sensors of San Francisco freeways from 2015 to 2016.

• Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014) collects the hourly electricity consumption of 321 clients from 2012 to 2014.

• Exchange-Rate [15] (https://github.com/laiguokun/multivariate-time-series-data) collects the daily exchange rates of 8 countries from 1990 to 2016.

• Weather (https://www.bgc-jena.mpg.de/wetter/) includes 21 weather indicators, such as air temperature and humidity. Its data is recorded every 10 min for 2020 in Germany.

• ILI (https://gis.cdc.gov/grasp/fluview/) describes the ratio of patients seen with influenza-like illness to the number of patients. It includes weekly data from the Centers for Disease Control and Prevention of the United States from 2002 to 2021.

B.2. Implementation Details

For existing Transformer-based TSF solutions, the implementations of Autoformer [28], Informer [30], and the vanilla Transformer [26] are all taken from the Autoformer work [28]; the implementations of FEDformer [31] and Pyraformer [18] are from their respective code repositories. We also adopt their default hyper-parameters to train the models. For DLinear, the moving average kernel size for decomposition is 25, which is the same as in Autoformer. The total number of parameters of a vanilla Linear model and an NLinear is TL; the total number of parameters of DLinear is 2TL. Since LTSF-Linear tends to underfit when the input length is short, and LTSF-Transformers tend to overfit on a long look-back window, we report L=336 for LTSF-Linear and L=96 for Transformers by default to compare the best performance of each. For more hyper-parameters of LTSF-Linear, please refer to our code.
C. Additional Comparison with Transformers

We further compare LTSF-Linear with LTSF-Transformers for univariate forecasting on the four ETT datasets. Moreover, in Figure 4 of the main paper, we demonstrate with two examples that existing Transformers fail to exploit large look-back window sizes. Here, we give comprehensive comparisons between LTSF-Linear and Transformer-based TSF solutions under various look-back window sizes on all benchmarks.

C.1. Comparison of Univariate Forecasting

We present the univariate forecasting results on the four ETT datasets in Table 9. Similarly, LTSF-Linear, especially NLinear, consistently outperforms all Transformer-based methods by a large margin most of the time. We find that there are serious distribution shifts between the training and test sets on the ETTh1 and ETTh2 datasets (as shown in Figure 5(a), (b)). Simply normalizing via the last value of the look-back window can greatly relieve the distribution shift problem.

C.2. Comparison under Different Look-back Windows

In Figure 6, we provide the MSE comparisons of five LTSF-Transformers with LTSF-Linear under different look-back window sizes to explore whether existing Transformers can extract temporal relations well from longer input sequences. For the hourly granularity datasets (ETTh1, ETTh2, Traffic, and Electricity), the increasing look-back window sizes are {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which represent {1, 2, 3, 4, 5, 6, 7, 8, 14, 21, 28, 30} days. The forecasting steps are {24, 720}, which mean {1, 30} days. For the 5-minute granularity datasets (ETTm1 and ETTm2), we set the look-back window sizes as {24, 36, 48, 60, 72, 144, 288}, which represent {2, 3, 4, 5, 6, 12, 24} hours. For the 10-minute granularity dataset (Weather), we set the look-back window sizes as {24, 48, 72, 96, 120, 144, 168, 192, 336, 504, 672, 720}, which mean {4, 8, 12, 16, 20, 24, 28, 32, 56, 84, 112, 120} hours. The forecasting steps are {24, 720}, which are {4, 120} hours. For the weekly granularity dataset (ILI), we set the look-back window sizes as {26, 52, 78, 104, 130, 156, 208}, which represent {0.5, 1, 1.5, 2, 2.5, 3, 4} years. The corresponding forecasting steps are {26, 208}, meaning {0.5, 4} years.

As shown in Figure 6, with increased look-back window sizes, the performance of LTSF-Linear is significantly boosted for most datasets (e.g., ETTm1 and Traffic), while this is not the case for Transformer-based TSF solutions. Most of their performance fluctuates or gets worse as the input length increases. In particular, the results on Exchange-Rate do not improve with a long look-back window (Figure 6(m) and (n)), and we attribute this to the low information-to-noise ratio in such financial data.

D. Ablation Study on LTSF-Linear

D.1. Motivation of NLinear

If we normalize the test data by the mean and variance of the training data, there could be a distribution shift in the testing data, i.e., the mean value of the testing data is not 0. If the model makes a prediction that is out of the distribution of the true values, a large error occurs. For example, there is a large error between the true value and the true value minus/plus one. Therefore, in NLinear, we use the subtraction and addition to shift the model prediction toward the distribution of the true values. Then, large errors are avoided, and the model performance is improved. Figure 5 illustrates histograms of the training-set and test-set distributions, where each bar represents the number of data points. Clear distribution shifts between training and testing data can be observed in ETTh1, ETTh2, and ILI. Accordingly, from Table 9 and Table 2 in the main paper, we can observe great improvements on these three datasets when comparing NLinear to Linear, showing the effectiveness of NLinear in relieving distribution shifts. Moreover, for the datasets without obvious distribution shifts, like Electricity in Figure 5(c), the vanilla Linear can be enough, demonstrating similar performance to NLinear and DLinear.

D.2. The Features of LTSF-Linear

Although LTSF-Linear is simple, it has some compelling characteristics:

• An O(1) maximum signal traversing path length: The shorter the path, the better the dependencies are captured [18], making LTSF-Linear capable of capturing both short-range and long-range temporal relations.

• High efficiency: As LTSF-Linear is a linear model with at most two linear layers, it costs much lower memory and fewer parameters and has a faster inference speed than existing Transformers (see Table 8 in the main paper).

• Interpretability: After training, we can visualize the weights of the seasonality and trend branches to gain some insights into the predicted values [9].

• Easy to use: LTSF-Linear can be obtained easily without tuning model hyper-parameters.

D.3. Interpretability of LTSF-Linear

Because LTSF-Linear is a set of linear models, the weights of the linear layers can directly reveal how LTSF-Linear works. The weight visualization of LTSF-Linear can also reveal certain characteristics in the data used for forecasting.
Methods Linear NLinear DLinear FEDformer-f FEDformer-w Autoformer Informer LogTrans
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1
96 0.189 0.359 0.053 0.177 0.056 0.180 0.079 0.215 0.080 0.214 0.071 0.206 0.193 0.377 0.283 0.468
192 0.078 0.212 0.069 0.204 0.071 0.204 0.104 0.245 0.105 0.256 0.114 0.262 0.217 0.395 0.234 0.409
336 0.091 0.237 0.081 0.226 0.098 0.244 0.119 0.270 0.120 0.269 0.107 0.258 0.202 0.381 0.386 0.546
720 0.172 0.340 0.080 0.226 0.189 0.359 0.142 0.299 0.127 0.280 0.126 0.283 0.183 0.355 0.475 0.629
ETTh2
96 0.133 0.283 0.129 0.278 0.131 0.279 0.128 0.271 0.156 0.306 0.153 0.306 0.213 0.373 0.217 0.379
192 0.176 0.330 0.169 0.324 0.176 0.329 0.185 0.330 0.238 0.380 0.204 0.351 0.227 0.387 0.281 0.429
336 0.213 0.371 0.194 0.355 0.209 0.367 0.231 0.378 0.271 0.412 0.246 0.389 0.242 0.401 0.293 0.437
720 0.292 0.440 0.225 0.381 0.276 0.426 0.278 0.420 0.288 0.438 0.268 0.409 0.291 0.439 0.218 0.387
ETTm1
96 0.028 0.125 0.026 0.122 0.028 0.123 0.033 0.140 0.036 0.149 0.056 0.183 0.109 0.277 0.049 0.171
192 0.043 0.154 0.039 0.149 0.045 0.156 0.058 0.186 0.069 0.206 0.081 0.216 0.151 0.310 0.157 0.317
336 0.059 0.180 0.052 0.172 0.061 0.182 0.084 0.231 0.071 0.209 0.076 0.218 0.427 0.591 0.289 0.459
720 0.080 0.211 0.073 0.207 0.080 0.210 0.102 0.250 0.105 0.248 0.110 0.267 0.438 0.586 0.430 0.579
ETTm2
96 0.066 0.189 0.063 0.182 0.063 0.183 0.067 0.198 0.063 0.189 0.065 0.189 0.088 0.225 0.075 0.208
192 0.094 0.230 0.090 0.223 0.092 0.227 0.102 0.245 0.110 0.252 0.118 0.256 0.132 0.283 0.129 0.275
336 0.120 0.263 0.117 0.259 0.119 0.261 0.130 0.279 0.147 0.301 0.154 0.305 0.180 0.336 0.154 0.302
720 0.175 0.320 0.170 0.318 0.175 0.320 0.178 0.325 0.219 0.368 0.182 0.335 0.300 0.435 0.160 0.321
Table 9. Univariate long sequence time-series forecasting results on the full ETT benchmark. The best results are highlighted in bold, and the best results of Transformers are highlighted with an underline.
(a) ETTh1 channel6 (b) ETTh2 channel3 (c) Electricity channel3 (d) ILI channel6
Figure 5. Distribution of ETTh1, ETTh2, Electricity, and ILI dataset. A clear distribution shift between training and testing data can be
observed in ETTh1, ETTh2, and ILI.
Here we take DLinear as an example. Accordingly, we visualize the trend and remainder weights on all datasets with a fixed input length of 96 and four different forecasting horizons. To obtain smooth weights with a clear pattern in the visualization, we initialize the weights of the linear layers in DLinear as 1/L rather than using random initialization; that is, we use the same weight for every forecasting time step in the look-back window at the start of training.

How the model works: Figure 7(c) visualizes the weights of the trend and remainder layers on the Exchange-Rate dataset. Due to the lack of periodicity and seasonality in financial data, it is hard to observe clear patterns, but the trend layer reveals greater weights for the information closer to the outputs, representing their larger contributions to the predicted values.

Periodicity of data: For the Traffic data, as shown in Figure 7(d), the model gives high weights to the latest time step of the look-back window for the 0, 23, 47, ..., 719 forecasting steps. Among these forecasting time steps, the 0, 167, 335, 503, and 671 time steps have higher weights. Note that 24 time steps are a day, and 168 time steps are a week. This indicates that Traffic has a daily periodicity and a weekly periodicity.
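A minimal sketch of this kind of inspection (our illustration; `model.linear_trend` follows the hypothetical DLinear sketch shown earlier, and matplotlib is assumed to be available):

```python
import matplotlib.pyplot as plt

def plot_weights(model, title="DLinear trend weights"):
    """Visualize the learned T×L weight matrix of a linear forecasting layer."""
    w = model.linear_trend.weight.detach().cpu().numpy()   # shape [T, L]
    plt.imshow(w, aspect="auto", cmap="viridis")
    plt.xlabel("look-back time step (L)")
    plt.ylabel("forecasting time step (T)")
    plt.title(title)
    plt.colorbar()
    plt.show()
```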
[Figure 6: MSE vs. look-back window size for Transformer, Informer, Autoformer, FEDformer, Pyraformer, Linear, NLinear, and DLinear. Panels: (a) 24 steps-ETTh1, (b) 720 steps-ETTh1, (c) 24 steps-ETTh2, (d) 720 steps-ETTh2, (e) 24 steps-ETTm1, (f) 576 steps-ETTm1, (g) 24 steps-ETTm2, (h) 576 steps-ETTm2, (i) 24 steps-Weather, (j) 720 steps-Weather, (k) 24 steps-Traffic, (l) 720 steps-Traffic, (m)-(n) Exchange-Rate, and the remaining panels ILI (look-back sizes 26-208).]
Figure 6. The MSE results (Y-axis) of models with different look-back window sizes (X-axis) of the long-term forecasting (e.g., 720 time steps) and the short-term forecasting (e.g., 24 time steps) on different benchmarks.
[Figure 7: visualizations of the remainder and trend layer weights of LTSF-Linear on ETTh1 (In-96, Out-96/168/336/720), Electricity, Exchange-Rate, Traffic, and Weather (In-96, Out-96/192/336/720), and ILI (In-36, Out-24/36/48/60).]
Figure 7. Visualization of the weights (T×L) of LTSF-Linear on several benchmarks. Models are trained with a look-back window L (X-axis) and different forecasting time steps T (Y-axis). We show weights in the remainder and trend layer.
References

[1] Adebiyi A Ariyo, Adewumi O Adewumi, and Charles K Ayo. Stock price prediction using the ARIMA model. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, pages 106–112. IEEE, 2014. 1, 2, 9
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv: Computation and Language, 2014. 2
[3] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018. 1, 9
[4] Guillaume Chevillon. Direct multi-step estimation and forecasting. Journal of Economic Surveys, 21(4):746–785, 2007. 2
[5] Razvan-Gabriel Cirstea, Chenjuan Guo, Bin Yang, Tung Kieu, Xuanyi Dong, and Shirui Pan. Triformer: Triangular, variable-specific attentions for long sequence multivariate time series forecasting–full version. arXiv preprint arXiv:2204.13767, 2022. 1
[6] R. B. Cleveland. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 1990. 3
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 1
[8] Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018. 1
[9] Ruijun Dong and Witold Pedrycz. A granular time series approach to long-term forecasting and trend forecasting. Physica A: Statistical Mechanics and its Applications, 387(13):3253–3270, 2008. 10
[10] Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, and Lars Schmidt-Thieme. Do we really need deep learning models for time series forecasting? arXiv preprint arXiv:2101.02118, 2021. 9
[11] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001. 1, 9
[12] Everette S Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985. 9
[13] James Douglas Hamilton. Time Series Analysis. Princeton University Press, 2020. 3
[14] Andrew C Harvey. Forecasting, Structural Time Series Models and the Kalman Filter. 1990. 9
[15] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017. 1, 4, 9
[16] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems, 32, 2019. 1, 2, 3, 4
[17] Minhao Liu, Ailing Zeng, Zhijian Xu, Qiuxia Lai, and Qiang Xu. Time series is a special sequence: Forecasting with sample convolution and interaction. arXiv preprint arXiv:2106.09305, 2021. 1, 9
[18] Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, 2021. 1, 2, 3, 4, 9, 10
[19] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 1
[20] LIU Minhao, Ailing Zeng, LAI Qiuxia, Ruiyuan Gao, Min Li, Jing Qin, and Qiang Xu. T-WaveNet: A tree-structured wavelet neural network for time series signal analysis. In International Conference on Learning Representations, 2021. 2
[21] Gábor Petneházi. Recurrent neural networks for time series forecasting. arXiv preprint arXiv:1901.00069, 2019. 9
[22] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2017. 2
[23] Souhaib Ben Taieb, Rob J Hyndman, et al. Recursive and Direct Multi-step Forecasting: The Best of Both Worlds, volume 19. Citeseer, 2012. 2
[24] Sean J. Taylor and Benjamin Letham. Forecasting at scale. PeerJ Preprints, 2017. 2
[25] Gerrit JJ van den Burg and Christopher KI Williams. An evaluation of change point detection algorithms. arXiv preprint arXiv:2003.06222, 2020. 8
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. 1, 2, 9
[27] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022. 1, 2, 5
[28] Jiehui Xu, Jianmin Wang, Mingsheng Long, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34, 2021. 1, 2, 3, 4, 5, 9
[29] Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, and Qiang Xu. DeciWatch: A simple baseline for 10x efficient 2D and 3D pose estimation. arXiv preprint arXiv:2203.08713, 2022. 1
[30] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pages 11106–11115. AAAI Press, 2021. 1, 2, 3, 4, 5, 9
[31] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, 2022. 1, 2, 3, 4, 5, 9