[Figure 1: a non-stationary time series x1, ..., xT and N base models Pθ1(x), Pθ2(x), ..., PθN(x); Q1: What is the next value? Q2: Which base model will be good at predicting the next value?]
Figure 1: For non-stationary time series, different base models usually perform well on different data regimes. Compared with
predicting the next value directly (Q1), it could be easier to predict which base model is more likely to perform well (Q2).
Therefore, our goal is to find the appropriate models to make prediction with fast adaptation to the changing data distribution.
series model combination problem (Prudêncio and Ludermir 2004; Lemke and Gabrys 2010; Talagala et al. 2018; Montero-Manso et al. 2020). For example, FFORMS (Talagala et al. 2018) formulates the problem from a classification perspective to select the best model. FFORMA (Montero-Manso et al. 2020) later extends FFORMS to output continuous weights that form a weighted forecast combination. Other lines of research include using heuristics to weight base models according to their recent performances (Cerqueira et al. 2017; Sánchez 2008) and applying reinforcement learning based methods (Feng and Zhang 2019; Feng, Sun, and Zhang 2019).

In this work, we propose to tackle the model weight determination problem for time series prediction as a reinforcement learning (RL) problem. Recently, RL has been shown to be effective at selecting teacher models for knowledge distillation (Yuan et al. 2021) and at selecting suitable samples to accelerate the training process (Fan et al. 2017). RL is appealing for this type of weight determination problem for several reasons:
• Model-free RL methods enable us to learn a policy purely from logged data without knowing the underlying complex system dynamics.
• Compared with supervised learning based methods, RL is able to explore the search space more effectively to optimize the policy.

In this paper, we propose a Reinforcement Learning based Model Combination (RLMC) method that learns complex patterns from raw time series data with deep learning approaches. We aim to select the most suitable base model based on the observation of the time series data. The main contributions of this paper can be summarized as follows:
• We propose a general reinforcement learning based model combination method which outputs dynamic weights for time series forecasting problems with non-stationary data distributions.
• The model combination problem is analyzed from the reinforcement learning perspective with some insights.
• Our model achieves state-of-the-art performance on various public benchmarks.

The rest of this paper is organized as follows. In Section 2 and Section 3, we provide the background and introduce some related work. We then introduce the proposed RLMC framework in Section 4. Experiments are conducted in Section 5 on several real-world datasets. We conclude the paper in Section 6.

Background

Time Series Forecasting

We consider the time series forecasting problem with discrete time points. At time t, the task is to use a length-T observed series history $X^t = \{x^t_1, \cdots, x^t_T \mid x^t_i \in \mathbb{R}^{d_x}\}$ to predict a vector of future values $Y^t = \{y^t_1, \cdots, y^t_H \mid y^t_i \in \mathbb{R}^{d_y}\}$, where $d_x$ and $d_y$ are the dimensions of the input and output data, and H is the forecast horizon. Time series forecasting tasks can be further categorized into two fundamentally different cases: the time series data task and the panel data task. In the time series data task, we always predict for the same time series, while in the panel data task, we need to predict for multiple different time series at the same time. Furthermore, depending on whether the prediction output is a point forecast or a distribution, the time series forecasting task can also be divided into point forecasting and probabilistic forecasting.
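To make this input/output construction concrete, the following is a minimal sketch (an illustration only, not the paper's code; the series, window length T, and horizon H below are placeholders) of building $(X^t, Y^t)$ pairs by sliding a length-T window over a raw series and taking the next H values as the target.

```python
import numpy as np

def make_windows(series: np.ndarray, T: int, H: int):
    """Slice a (time, d_x) array into (X^t, Y^t) pairs:
    X^t is a length-T observed history, Y^t holds the next H values."""
    X, Y = [], []
    for t in range(T, len(series) - H + 1):
        X.append(series[t - T:t])   # observed history of length T
        Y.append(series[t:t + H])   # the H future values to forecast
    return np.stack(X), np.stack(Y)

# toy univariate example: 200 steps, T = 24 history, H = 6 horizon
series = np.sin(np.linspace(0, 20, 200)).reshape(-1, 1)
X, Y = make_windows(series, T=24, H=6)
print(X.shape, Y.shape)  # (171, 24, 1) (171, 6, 1)
```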
Reinforcement Learning

Reinforcement learning is usually formulated as a Markov Decision Process (MDP), which can be defined as a tuple $\mathcal{M} := \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}(s'|s, a)$ represents the dynamics function, $r(s, a)$ represents the reward function, and $\gamma \in [0, 1]$ is the discount factor. The goal of an RL agent is to learn a policy $\pi(a|s)$ that maximizes the cumulative discounted reward $R_t = \sum_{k=0}^{L} \gamma^k r_{t+k}$, where L is the length of the
horizon. Depending on whether we have access to the dynamics model $\mathcal{P}(s'|s, a)$ and the reward function $r(s, a)$, RL algorithms can be classified into model-free methods (Silver et al. 2014) and model-based methods (Sutton and Barto 2018). Model-free RL algorithms learn a policy purely from the transitions collected by interacting with the environment, while model-based RL algorithms can leverage the dynamics model to generate transitions to optimize the policy. Based on whether the control policy is modeled directly, RL algorithms can also be categorized into value-based methods (Mnih et al. 2013) and policy-based methods (Schulman et al. 2015). Value-based methods aim to approximate the optimal value functions to select actions, while policy-based methods search directly for the optimal policy parameters.

Related Work

Traditional time-series methods based on linear auto-regression or exponential smoothing (Hyndman and Athanasopoulos 2018) usually work on a small number of time series at a time. However, these methods can hardly scale to real-world problems with a large number of training samples. Due to their ability to model non-linear temporal patterns, deep neural networks, especially recurrent neural networks (RNNs) (Lai et al. 2018), dilated convolutions (Oord et al. 2016), and Transformers (Zhou et al. 2021), have gained popularity in time-series forecasting problems.

To solve the model combination problem for time-series data, many early works (Collopy and Armstrong 1992; Armstrong, Adya, and Collopy 2001) use hand-crafted rules to select models for prediction. However, rule-based methods rely heavily on expert knowledge and are limited to specific tasks. Moreover, previous methods mostly use either the model performance or the time-series data features to select models. For example, (Cerqueira et al. 2017; Sánchez 2008) use heuristics to weight base models according to their recent performances. On the other hand, FFORMS (Talagala et al. 2018) formulates the model selection problem as a classification problem, in which a random forest is trained to select the best model based on the input time-series features. FFORMA (Montero-Manso et al. 2020) later extends FFORMS to output continuous weights that form a weighted forecast combination. Another line of research formulates the model combination problem as an RL problem. For example, DMS (Feng, Sun, and Zhang 2019) learns a Q-learning agent to solve the model selection problem, in which the performance ranking improvement is used as the reward. However, DMS needs to learn a tabular Q-table for each rolling window with the model index as the input state, which requires learning thousands of individual Q-tables for large-scale problems and fails to leverage the information from the time series data.

Methods

In this section, we first present the MDP formulation for the model combination problem. Then, we discuss the insights about the model combination problem from the reinforcement learning perspective. Lastly, we present the proposed RL-based dynamic model combination method.

MDP Setting for Model Combination Problem

Determining the weights for the base models in a dynamic way can be treated as a sequential decision-making problem. The MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma \rangle$ for the model selection problem can be summarized as follows (a minimal code sketch of this setting is given after the list):
• State space $\mathcal{S}$. State $s_t \in \mathbb{R}^{T \times d_s}$ describes the information about the time series at timestep t, where T is the input sequence length and $d_s$ is the input dimension.
• Action space $\mathcal{A}$. Action $a_t \in \mathbb{R}^N$ is the vector of non-negative model weights that sum to one for the N base models at timestep t.
• Transition dynamics $\mathcal{P}(s_{t+1}|s_t, a_t) = \mathcal{P}(s_{t+1}|s_t)$ is actually independent of the output model weights, which means that our action $a_t$ will not affect the next state $s_{t+1}$ in the model combination problem.
• Reward function $r(s, a)$. Reward $r_t$ is defined as the prediction performance, i.e., the prediction error or rank performance at timestep t.
• Discount factor $\gamma \in [0, 1]$ describes how much we weigh future performance. If we only care about one-step prediction, then we can set $\gamma = 0$.
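The following is a minimal, self-contained sketch of this MDP as a simple environment class (an illustration only, not the paper's implementation; the precomputed base-model forecasts and the negative-absolute-error reward are assumptions made for the example).

```python
import numpy as np

class ModelCombinationEnv:
    """Toy MDP for dynamic model combination.
    states[t]    : (T, d_s) window of time series features at timestep t
    forecasts[t] : (N,) one-step forecasts of the N base models at t
    targets[t]   : true value at t
    The action is a weight vector over the N base models."""
    def __init__(self, states, forecasts, targets, gamma=0.0):
        self.states, self.forecasts, self.targets = states, forecasts, targets
        self.gamma, self.t = gamma, 0

    def reset(self, t0=0):
        self.t = t0
        return self.states[self.t]

    def step(self, weights):
        weights = np.asarray(weights, float) / np.sum(weights)   # project onto the simplex
        pred = float(np.dot(weights, self.forecasts[self.t]))    # weighted ensemble forecast
        reward = -abs(pred - float(self.targets[self.t]))        # illustrative reward: -|error|
        self.t += 1                                              # next state ignores the action
        done = self.t >= len(self.states)
        next_state = None if done else self.states[self.t]
        return next_state, reward, done
```

Note that the transition here ignores the action entirely, which is exactly the decoupling exploited by Insight 1 below.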
The value of a state s is defined as:
$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{L} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi_{\phi}(a_t|s_t)\Big], \quad (1)$$
where L is the horizon length. The objective is to find a policy $\pi_{\phi}(a|s)$ that maximizes the expected value of the states from the initial state distribution $\mu$. In our problem, $\mu$ is the uniform distribution across the samples in the test dataset:
$$J(\pi) = \mathbb{E}_{s \sim \mu}\left[V^{\pi}(s)\right]. \quad (2)$$

Insights from an RL Perspective

Unlike prevalent RL testbeds, such as playing video games, the model combination problem has some unique properties. Here, we provide two insights from the reinforcement learning perspective.

Insight 1: Model-based exploration. Our first insight comes from noticing the decoupling of state and action in the transition dynamics $\mathcal{P}(s_{t+1}|s_t, a_t) = \mathcal{P}(s_{t+1}|s_t)$. If we treat the transition dynamics $\mathcal{P}(s_{t+1}|s_t)$ as deterministic by only using samples from the training set, then we are actually in a model-based setting where the user-defined reward function $r(s_t, a_t)$ is also known. This insight indicates that we can generate arbitrary transitions $(s_t, a_t, r_t, s_{t+1})$ with $\mathcal{P}(s_{t+1}|s_t)$ and $r(s_t, a_t)$ to facilitate exploration.
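Assuming the ModelCombinationEnv sketch above, a hedged sketch of this model-based transition generation: because both $\mathcal{P}(s_{t+1}|s_t)$ and $r(s_t, a_t)$ are known on the training set, any training timestep can be paired with any candidate weight vector and the full transition computed offline (the Dirichlet action sampling is an illustrative choice, not taken from the paper).

```python
import numpy as np

def generate_transitions(env, num, n_models, rng=None):
    """Sample arbitrary (s, a, r, s') tuples without step-by-step interaction:
    pick a random training timestep and a random weight vector, then query the
    known reward function and the (action-independent) next state."""
    rng = np.random.default_rng(0) if rng is None else rng
    transitions = []
    for _ in range(num):
        t = int(rng.integers(0, len(env.states) - 1))
        a = rng.dirichlet(np.ones(n_models))   # random point on the weight simplex
        env.reset(t0=t)
        s = env.states[t]
        s_next, r, _ = env.step(a)             # reward is computable offline
        transitions.append((s, a, r, s_next))
    return transitions
```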
Insight 2: Sparse optimal actions. Let us consider a special case of the model combination problem, where we only select one model at a time. Then, the optimal policy is deterministic and outputs a one-hot vector unless there are multiple models that are equally good. In this case, a policy-based method might be problematic: we need to compute an estimator $\hat{g}$ of the policy gradient and optimize it with a gradient ascent algorithm (Schulman et al. 2015):
$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_{\phi} \log \pi_{\phi}(a_t|s_t)\, \hat{A}_t\right], \quad (3)$$
where $\hat{A}_t$ is an estimator of the advantage function at timestep t. The optimal policy $\pi^*$ is nearly deterministic, since it always selects the best model. Hence, most action probabilities $\pi(a_t|s_t)$ will be close to zero except for the optimal action $a^*$. When we use samples to approximate Eq. 3 during training, the log-probability is likely to be close to minus infinity. Although we may have multiple optimal actions in the common model combination setting, the potential explosive log-probability problem still exists due to the sparse optimal actions. Fig. 2 demonstrates this problem by plotting $|\log \pi_{\phi}(a|s)|$ while training a policy-gradient agent on the ETT dataset (Zhou et al. 2021).

Figure 2: Because the optimal model combination policy only selects a few optimal actions, a policy-based method may suffer from an explosive $\log \pi_{\phi}(a_t|s_t)$ for bad actions $a_t$, which ...
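A tiny numerical illustration of this failure mode (the near-one-hot probabilities below are made up for the example and are not the paper's measurements): as the policy concentrates mass on one model, the log-probabilities of the other actions explode in magnitude, so the sampled gradient estimator of Eq. 3 becomes extremely high-variance.

```python
import numpy as np

# A near-deterministic policy over N = 4 base models (illustrative numbers).
probs = np.array([0.9997, 1e-4, 1e-4, 1e-4])
print(np.abs(np.log(probs)))   # [~0.0003, ~9.2, ~9.2, ~9.2]

# As the policy gets sharper, |log pi(a|s)| for non-optimal actions grows without bound.
for eps in [1e-4, 1e-8, 1e-16]:
    p = np.array([1 - 3 * eps, eps, eps, eps])
    print(eps, np.abs(np.log(p[1])))   # roughly 9.2, 18.4, 36.8
```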
where the target $y = r + \gamma Q_{\theta'}(s', a')|_{a' = \pi_{\phi}(s')}$ is computed by the target network $Q_{\theta'}(s, a)$. For the actor, we perform gradient ascent to maximize the expected return:
$$\max_{\phi}\ \mathbb{E}_{s \sim \mathcal{D}}\left[Q_{\theta}(s, \pi_{\phi}(s))\right]. \quad (5)$$
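The critic loss itself (Eq. 4) is not recoverable from this extraction, but the target above implies a standard DDPG-style TD update (Lillicrap et al. 2015). Below is a minimal PyTorch sketch under that assumption; the network objects, optimizers, and the replay batch are placeholders, and the use of a target actor for a' follows standard DDPG rather than anything stated in the paper.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic: regress Q_theta(s, a) onto the TD target y = r + gamma * Q_theta'(s', a');
    # standard DDPG takes a' from the target actor.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on E[Q_theta(s, pi_phi(s))] (Eq. 5), i.e. minimize its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-average the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```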
RL Based Model Combination (RLMC)

RLMC agent. Inspired by previous works (Feng, Sun, and Zhang 2019; Montero-Manso et al. 2020), at each timestep t the RLMC model takes a combination of the time series data $X^t = \{x^t_1, \cdots, x^t_T\}$ and the history of model performance $L^t = \{L^1_{t-1}, \cdots, L^N_{t-1}\}$ as the model input $s_t = (X^t, L^t)$, where $X^t$ is the length-T observed series history at timestep t, and $L^t$ is the base model performance at the previous timestep. Specifically, the actor $\pi_{\phi}(s_t)$ adopts dilated causal convolutions (Franceschi, Dieuleveut, and Jaggi 2019) as the basic encoder structure to extract latent time series features, and uses a rank embedding table to extract the base model features. A softmax function is then used in the actor $\pi_{\phi}$ as the output layer to generate the combination weights for the base models in the ensemble. An illustration of the RLMC actor is shown in Fig. 3. The critic $Q_{\theta}(s_t, a_t)$ uses a similar dilated-CNN-based encoder structure as the actor $\pi_{\phi}(s_t)$.

[Figure 3: The RLMC actor $\pi_{\phi}(s_t)$: the time series is encoded by a dilated CNN, combined with embeddings of the base model performances $L^1_{t-1}, \cdots, L^N_{t-1}$, and passed through a softmax to output the combination weights $a_t = (w^1_t, \cdots, w^N_t)$.]
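A hedged sketch of such an actor in PyTorch, consistent with the description above but not the released RLMC code; the layer sizes, the two-layer dilated convolution stack, and the mean-pooling of the rank embedding are all assumptions.

```python
import torch
import torch.nn as nn

class RLMCActor(nn.Module):
    """Maps s_t = (X^t, L^t) to softmax weights over N base models."""
    def __init__(self, d_x, n_models, hidden=64, n_ranks=10):
        super().__init__()
        # Dilated convolutions over the time dimension; extra padding is trimmed to keep causality.
        self.conv1 = nn.Conv1d(d_x, hidden, kernel_size=3, dilation=1, padding=2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=4)
        # Rank embedding table for the previous-step performance ranks of the N base models.
        self.rank_emb = nn.Embedding(n_ranks, hidden)
        self.head = nn.Linear(2 * hidden, n_models)

    def forward(self, x, ranks):
        # x: (batch, T, d_x) float tensor; ranks: (batch, N) long tensor with values in [0, n_ranks).
        h = x.transpose(1, 2)                         # (batch, d_x, T)
        h = torch.relu(self.conv1(h))[..., :x.size(1)]
        h = torch.relu(self.conv2(h))[..., :x.size(1)]
        ts_feat = h[..., -1]                          # latent time series feature at the last step
        rank_feat = self.rank_emb(ranks).mean(dim=1)  # pooled base-model feature
        logits = self.head(torch.cat([ts_feat, rank_feat], dim=-1))
        return torch.softmax(logits, dim=-1)          # combination weights summing to one
```

In the same spirit, the critic would reuse such an encoder and additionally take the weight vector $a_t$ as input.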
where $\delta_t$ is the sMAPE at timestep t, $\tau(\delta_t) \in \{0, \cdots, 9\}$ is the quantile of $\delta_t$ w.r.t. the base model prediction errors in the training set, and $r^p_t \in \{0, \cdots, N-1\}$ is the rank of the prediction error w.r.t. the other N base model prediction errors at timestep t.
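The reward equation itself (Eq. 6) is not recoverable from this extraction. The sketch below only illustrates how the quantities named above could be computed from a given sMAPE value; the final combination in toy_reward is a placeholder and is not the paper's reward.

```python
import numpy as np

def reward_components(delta_t, base_errors_t, train_errors):
    """delta_t      : sMAPE of the chosen forecast at timestep t
    base_errors_t: the N base models' prediction errors at timestep t
    train_errors : base model prediction errors collected on the training set
    Returns the decile bucket tau(delta_t) and the rank of delta_t among the base errors."""
    deciles = np.quantile(train_errors, np.linspace(0.1, 0.9, 9))
    tau = int(np.searchsorted(deciles, delta_t))              # decile bucket in {0, ..., 9}
    rank = int(np.sum(np.asarray(base_errors_t) < delta_t))   # position among the N base errors
    return tau, rank

# Placeholder combination only -- NOT the paper's Eq. 6: penalize both quantile and rank.
def toy_reward(tau, rank):
    return -(tau + rank)
```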
Overall pipeline. We apply the standard DQN training pipeline (Mnih et al. 2013) to train the RLMC agent with a replay buffer. For each trajectory, we first randomly select a timestep t from the training set as the initial state $s_0$. Then we use the RLMC agent to interact with the environment by selecting model weights for a horizon of H steps. The reward $r_t$ for each state-action pair $(s_t, a_t)$ is computed according to the self-defined reward function (Eq. 7). We store the collected transitions in a replay buffer for later training. An overview of the RLMC training pipeline is shown in Fig. 4. Alg. 1 describes the details of the RLMC approach.

[Figure 4: Overview of the RLMC training pipeline: the actor $\pi_{\phi}(s_t)$ outputs the combination weights $a_t = (w^1_t, \cdots, w^N_t)$, the ensemble prediction $\hat{y}_t = \sum_i w^i_t \hat{y}^i_t$ yields a prediction error $\delta_t$ that defines the reward, and the critic $Q_{\theta}(s_t, a_t)$ evaluates the chosen weights.]

Algorithm 1: RL-based Model Combination (RLMC)
1: Input: pretrained base models $\mathcal{M} = \langle M_1, \cdots, M_N \rangle$, training set $D_{train}$, total training steps T, exploration parameter $\epsilon$, trajectory horizon H, reward threshold $\hat{r}_t$, update frequency d, Polyak update parameter $\tau$.
2: Initialize the critic network $Q_{\theta}$ and the actor network $\pi_{\phi}$ with random parameters $\theta$, $\phi$.
3: Initialize the target networks $\theta' \leftarrow \theta$, $\phi' \leftarrow \phi$.
4: Initialize the replay buffer $\mathcal{B}$ and the extra buffer $\mathcal{B}'$.
5: Pre-train the actor model $\pi_{\phi}(s)$ on $D_{train}$ for a multi-class classification task w.r.t. the cross-entropy loss.
6: for i in $\{1, \cdots, T\}$ do
7:   Randomly select a timestep j from the training set; use the time series $X^t$ and base model performance $L^t$ at timesteps j and j+1 as the state s and next state s′. Roll out this trajectory for H steps.
8:   Select the action $a = \pi_{\phi}(s)$ with the $\epsilon$-greedy exploration strategy with sparsity inductive bias (Eq. 8).
9:   Compute the final time series prediction with the output combination weights: $\hat{y}_t = \sum_{i=1}^{N} w^i_t \hat{y}^i_t$.
10:  Compute the reward r with the pre-defined reward function (Eq. 6).
11:  Store the transition (s, a, r, s′) in the replay buffer $\mathcal{B}$.
12:  Store the transition in the extra buffer $\mathcal{B}'$ if the reward r is lower than the threshold $\hat{r}_t$.
13:  if t mod d then
14:    Sample a mini-batch of transitions (s, a, r, s′) from $\mathcal{B}$ and $\mathcal{B}'$.
where i is a random index, and $\epsilon$ is the random noise.
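Eq. 8 itself is lost in this extraction, so the following is only an illustrative guess at what an $\epsilon$-greedy exploration rule with a sparsity inductive bias could look like given the surrounding description (a random index i, a random noise term, and near-one-hot optimal actions); it should not be read as the paper's actual rule.

```python
import numpy as np

def explore(actor_weights, epsilon, rng=None):
    """With probability epsilon, move the actor output toward a random one-hot vertex e_i
    of the weight simplex; otherwise keep the actor output. Illustrative only, not Eq. 8."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(actor_weights, dtype=float)
    if rng.random() < epsilon:
        i = rng.integers(len(w))            # random index i
        vertex = np.zeros_like(w)
        vertex[i] = 1.0
        noise = rng.uniform(0.0, 0.2)       # small random noise around the vertex
        w = (1.0 - noise) * vertex + noise * w
        w = w / w.sum()
    return w
```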
Thirdly, we find that the RL-based agent can suffer from an overfitting problem if there exist a few superior base models that win all the time. This imbalanced data distribution problem would cause the RL agent to degrade to simply copying the best model in the training set. Therefore, we propose to use a second replay buffer to store the hard samples with low rewards, and we train the RLMC agent with transitions sampled from both buffers to mitigate overfitting.
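A minimal sketch of this two-buffer scheme (an illustration, not the paper's implementation; the capacity, threshold, and the 50/50 mixing ratio are assumptions): transitions whose reward falls below a threshold also go into a second "hard sample" buffer, and each mini-batch mixes transitions from both buffers.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Main buffer B plus an extra buffer B' for hard (low-reward) transitions."""
    def __init__(self, capacity=100_000, reward_threshold=0.0, hard_ratio=0.5):
        self.main = deque(maxlen=capacity)
        self.hard = deque(maxlen=capacity)
        self.reward_threshold = reward_threshold
        self.hard_ratio = hard_ratio  # fraction of each mini-batch drawn from B'

    def add(self, transition):
        s, a, r, s_next = transition
        self.main.append(transition)
        if r < self.reward_threshold:          # hard sample: also keep it in B'
            self.hard.append(transition)

    def sample(self, batch_size):
        n_hard = min(int(batch_size * self.hard_ratio), len(self.hard))
        batch = random.sample(list(self.hard), n_hard)
        batch += random.sample(list(self.main), min(batch_size - n_hard, len(self.main)))
        return batch
```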
[Figure: the proposed exploration techniques, (1) pre-training the actor $\pi_{\phi}(s)$ and (2) searching around the vertex $a^* = (0, 1)$.]

graphic, finance, industry, macro and micro indicators. We use the daily subset in the experiment.

Dataset    #Train   #Dev    #Test
M4-Daily    3,381     845    4,276
Climate    49,063  14,018    7,010
GEFCOM     18,710   4,677    2,488
ETT         8,448   2,816    2,816

Table 1: Statistics of the datasets used in the experiments.

Experimental Details
         Uniform         Single          FFORMS          FFORMA          DMS             M3              RLMC
         σ1      σ2      σ1      σ2      σ1      σ2      σ1      σ2      σ1      σ2      σ1      σ2      σ1      σ2
D1       5.22   79.83    4.59   73.69    4.89   75.98    5.42   80.77    5.14   78.12    4.63   74.06    4.39   72.11
D2       1.70   26.44    1.67   25.80    1.78   26.65    1.69   25.78    1.70   25.89    1.66   25.66    1.47   25.50
D3     112.65    3.51   89.09    2.75   95.03    2.94   86.93    2.69   90.10    2.79   86.85    2.70   88.25    2.73
D4     250.39    3.93  147.61    2.39  183.90    3.12  161.52    2.72  150.36    2.48  148.60    2.45  145.78    2.31

Table 2: Metrics σ1 and σ2 denote the MAE and sMAPE losses. D1, D2, D3, and D4 denote the ETT, Climate, GEFCOM, and M4 datasets.
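For reference, a small sketch of the two metrics reported in Table 2 (standard definitions; the factor of 100 and the epsilon guard are common implementation choices, not taken from the paper):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error (sigma_1 in Table 2)."""
    return float(np.mean(np.abs(np.asarray(y_pred, float) - np.asarray(y_true, float))))

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric MAPE in percent (sigma_2 in Table 2)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(2.0 * np.abs(y_pred - y_true) /
                                 (np.abs(y_true) + np.abs(y_pred) + eps)))

print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]), smape([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```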
each rolling window, which is not scalable. In the experiment, we extend the original DMS to a deep learning based variant by learning a Q-network instead of a Q-table.

M3. We also compare RLMC to an NN-based sequential expert selection method, named M3 (Tang et al. 2019), which uses a gating mechanism to compute a weighted sequence representation for prediction. The original M3 method requires learning the base models at the same time. To keep consistent with the other baselines, in the experiment we adopt a variant of the M3 model with a similar gating mechanism built on the learned base models.

basic deep learning models including GRU, LSTM, dilated CNN, and Transformers. Since our objective is to showcase the ability to select appropriate models in the ensemble for

Results and Analysis

Table 2 summarizes the time series forecasting results of all methods. We can observe that the proposed RLMC model consistently outperforms the other baselines on all 4 datasets. We note that DMS, M3, and RLMC usually perform better than FFORMS and FFORMA, which implies that the proposed DL-based encoder has a superior ability to capture useful time series features. Further, we can observe that the other baseline methods usually fail to beat the single best baseline. We attribute this failure to the overfitting problem caused by the imbalanced best-model distribution. We implement a case study on the ETT dataset by training a DQN agent in DMS. In this experiment, the third base model $M^{(3)}$ is the best model in the training set. Fig. 6 shows the percentage of transitions that select the third model as the best one. We can find that the model starts to overfit after around three epochs; then the agent just selects the best training model.

We further conduct ablation studies on the Climate dataset to validate the effectiveness of the proposed techniques (Fig. 7). We find that the proposed exploration strategy is the most important ingredient for the success of RLMC. This

Figure 6: A case study of the overfitting problem. [Panels (a) and (b): percentage (%) of transitions selecting $M^{(3)}$ and MAE of the DQN agent over training epochs.]

Figure 7: Ablation study of the three proposed techniques for better exploration. [Axes: MAE and sMAPE (%).]

Conclusion

Accurate analysis and forecasting of time series data can be of significant importance for real-world systems. In this paper, we proposed a general reinforcement learning (RL) based model combination framework, named RLMC, for time series forecasting. We first analyzed the problem from the RL perspective and provided two insights. We then designed an off-policy method based on DDPG with some special strategies for effective exploration. The effectiveness of RLMC is validated by experiments on real-world data. One limitation of our method is that the RLMC agent only outputs a deterministic policy due to the use of DDPG. This may limit the full potential of RLMC when the dataset is extremely imbalanced. In the future, we plan to investigate how to combine unsupervised time series representation learning methods with our framework.
References

Armstrong, J. S.; Adya, M.; and Collopy, F. 2001. Rule-based forecasting: Using judgment in time-series extrapolation. In Principles of Forecasting, 259–282. Springer.

Barta, G.; Nagy, G. B. G.; Kazi, S.; and Henk, T. 2017. Gefcom 2014—probabilistic electricity price forecasting. In International Conference on Intelligent Decision Technologies, 67–76. Springer.

Binkowski, M.; Marti, G.; and Donnat, P. 2018. Autoregressive convolutional neural networks for asynchronous time series. In International Conference on Machine Learning, 580–589. PMLR.

Cerqueira, V.; Torgo, L.; Oliveira, M.; and Pfahringer, B. 2017. Dynamic and heterogeneous ensembles for time series forecasting. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 242–251. IEEE.

Chollet, F. 2017. Deep learning with Python. Simon and Schuster.

Collopy, F.; and Armstrong, J. S. 1992. Rule-based forecasting: Development and validation of an expert systems approach to combining time series extrapolations. Management Science, 38(10): 1394–1414.

Dietterich, T. G.; et al. 2002. Ensemble learning. The Handbook of Brain Theory and Neural Networks, 2(1): 110–125.

Durbin, J.; and Koopman, S. J. 2012. Time series analysis by state space methods. Oxford University Press.

Fan, Y.; Tian, F.; Qin, T.; Bian, J.; and Liu, T.-Y. 2017. Learning what data to learn. arXiv preprint arXiv:1702.08635.

Fawaz, H. I.; Forestier, G.; Weber, J.; Idoumghar, L.; and Muller, P.-A. 2019. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4): 917–963.

Feng, C.; Sun, M.; and Zhang, J. 2019. Reinforced deterministic and probabilistic load forecasting via Q-learning dynamic model selection. IEEE Transactions on Smart Grid, 11(2): 1377–1386.

Feng, C.; and Zhang, J. 2019. Reinforcement learning based dynamic model selection for short-term load forecasting. In 2019 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), 1–5. IEEE.

Franceschi, J.-Y.; Dieuleveut, A.; and Jaggi, M. 2019. Unsupervised scalable representation learning for multivariate time series. arXiv preprint arXiv:1901.10738.

Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 1587–1596. PMLR.

Gowrisankaran, G.; Reynolds, S. S.; and Samano, M. 2016. Intermittency and the value of renewable energy. Journal of Political Economy, 124(4): 1187–1234.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1861–1870. PMLR.

Hyndman, R.; Koehler, A. B.; Ord, J. K.; and Snyder, R. D. 2008. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media.

Hyndman, R. J.; and Athanasopoulos, G. 2018. Forecasting: principles and practice. OTexts.

Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30: 3146–3154.

Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2018. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 95–104.

Lemke, C.; and Gabrys, B. 2010. Meta-learning for time series forecasting and forecast combination. Neurocomputing, 73(10-12): 2006–2016.

Liang, Y.; Ke, S.; Zhang, J.; Yi, X.; and Zheng, Y. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In IJCAI, volume 2018, 3428–3434.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lim, B.; and Zohren, S. 2021. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194): 20200209.

Makridakis, S.; Spiliotis, E.; and Assimakopoulos, V. 2018. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4): 802–808.

Miller, C.; Arjunan, P.; Kathirgamanathan, A.; Fu, C.; Roth, J.; Park, J. Y.; Balbach, C.; Gowri, K.; Nagy, Z.; Fontanini, A. D.; et al. 2020. The ASHRAE Great Energy Predictor III competition: Overview and results. Science and Technology for the Built Environment, 26(10): 1427–1447.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Montero-Manso, P.; Athanasopoulos, G.; Hyndman, R. J.; and Talagala, T. S. 2020. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1): 86–92.

Nichol, A.; Achiam, J.; and Schulman, J. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999.

Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Oreshkin, B. N.; Carpov, D.; Chapados, N.; and Bengio, Y. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.

Pai, P.-F.; and Lin, C.-S. 2005. A hybrid ARIMA and support vector machines model in stock price forecasting. Omega, 33(6): 497–505.

Prudêncio, R. B.; and Ludermir, T. B. 2004. Meta-learning approaches to selecting time series models. Neurocomputing, 61: 121–137.

Rangapuram, S. S.; Seeger, M. W.; Gasthaus, J.; Stella, L.; Wang, Y.; and Januschowski, T. 2018. Deep state space models for time series forecasting. Advances in Neural Information Processing Systems, 31: 7785–7794.

Sánchez, I. 2008. Adaptive combination of forecasts with application to wind energy. International Journal of Forecasting, 24(4): 679–693.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Seeger, M.; Salinas, D.; and Flunkert, V. 2016. Bayesian intermittent demand forecasting for large inventories. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 4653–4661.

Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 387–395. PMLR.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.

Talagala, T. S.; Hyndman, R. J.; Athanasopoulos, G.; et al. 2018. Meta-learning how to forecast time series. Monash Econometrics and Business Statistics Working Papers, 6: 18.

Tanaka, K. 2017. Time series analysis: Nonstationary and noninvertible distribution theory, volume 4. John Wiley & Sons.

Tang, J.; Belletti, F.; Jain, S.; Chen, M.; Beutel, A.; Xu, C.; and H. Chi, E. 2019. Towards neural mixture recommender for long range dependent user sequences. In The World Wide Web Conference, 1782–1793.

Taylor, J. W.; McSharry, P. E.; and Buizza, R. 2009. Wind power density forecasting using ensemble predictions and time series models. IEEE Transactions on Energy Conversion, 24(3): 775–782.

Vilalta, R.; and Drissi, Y. 2002. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2): 77–95.

Weigel, A. P.; Liniger, M.; and Appenzeller, C. 2008. Can multi-model combination really enhance the prediction skill of probabilistic ensemble forecasts? Quarterly Journal of the Royal Meteorological Society, 134(630): 241–260.

Wolpert, D. H. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7): 1341–1390.

Yuan, F.; Shou, L.; Pei, J.; Lin, W.; Gong, M.; Fu, Y.; and Jiang, D. 2021. Reinforced multi-teacher selection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 14284–14291.

Zhang, G. P. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50: 159–175.

Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI.

Zhou, Z.-H. 2021. Ensemble learning. In Machine Learning, 181–210. Springer.