A Support Vector Machine for Model Selection in Demand Forecasting Applications
Keywords: Demand forecasting; Supply chain; SVM; Time series analysis; Model selection

Abstract

Time series forecasting has been an active research area for decades, receiving considerable attention from very different domains, such as econometrics, statistics, engineering, mathematics, medicine and social sciences. Moreover, with the emergence of the big data era, the automatic identification of the appropriate techniques remains a compulsory intermediate stage of any big data implementation with predictive analytics purposes. Extensive research on model selection and combination has revealed the benefits of such techniques in terms of forecast accuracy and reliability. Several criteria for model selection have been proposed and used for decades with very good results. The Akaike information criterion and the Schwarz Bayesian criterion are two of the most popular. However, research on the combination of several criteria along with other sources of information in a unified methodology remains scarce.

This study proposes a new model selection approach that combines different criteria using a support vector machine (SVM). Given a set of candidate models, rather than considering any individual criterion, an SVM is trained at each forecasting origin to select the best model. This methodology will be particularly interesting for scenarios with highly volatile demand because it allows changing the model when it does not fit the data sufficiently well, thereby reducing the risk of misusing modeling techniques in the automatic processing of large datasets.

The effects of the proposed approach are empirically explored using a set of representative forecasting methods and a dataset of 229 weekly demand series from a leading household and personal care manufacturer in the UK. Our findings suggest that the proposed approach results in more robust predictions, with lower mean forecasting error and bias than the base forecasts.
⁎ Corresponding author. E-mail address: [email protected] (M.A. Villegas).
https://doi.org/10.1016/j.cie.2018.04.042
Received 1 February 2017; Received in revised form 3 February 2018; Accepted 21 April 2018
Available online 04 May 2018
0360-8352/ © 2018 Elsevier Ltd. All rights reserved.
M.A. Villegas et al. Computers & Industrial Engineering 121 (2018) 1–7
individual selection outperforms aggregate selection, although with an associated higher complexity level and computational burden (Fildes & Petropoulos, 2015).

Regarding individual selection, different criteria for selecting the most adequate model can be found in the literature. For instance, information criteria such as the Akaike information criterion (AIC) or the Schwarz Bayesian criterion (SBC) are typically used (Liang, 2014). These information criteria produce a value that represents the tradeoff between goodness of fit and the number of parameters. Billah, King, Snyder, and Koehler (2006) compared different information criteria to select the most appropriate exponential smoothing model on simulated data and a subset of the time series from the well-known M3 forecasting competition, where the AIC slightly outperformed the remaining information criteria considered (Makridakis & Hibon, 2000).

The identification of the best forecasting model has also been addressed depending on the time series features. Initially, Pegels (1969) presented nine possible exponential smoothing methods in graphical form, considering all combinations of trend and cyclical effects in additive and multiplicative form. Collopy and Armstrong (1992) developed a rule-based forecasting (RBF) selection procedure consisting of 99 rules for selecting and combining methods according to 18 time series features. To automate this procedure, Adya, Collopy, Armstrong, and Kennedy (2001) developed automated heuristics to detect six features that had previously been judgmentally identified in RBF using simple statistics, achieving a similar performance in terms of forecasting accuracy. Petropoulos et al. (2014) analyzed via regression analysis the main determinants of forecasting accuracy involving 14 popular forecasting methods (and combinations of them), seven time series features and the forecasting horizon as a strategic decision. Wang, Pedrycz, and Liu (2015) proposed a rather different approach for long-term forecasting based on dynamic time warping of information granules. Yu, Dai, and Tang (2016) focused on finding an empirical decomposition (intrinsic mode functions) to aggregate the individually forecast components later into an ensemble result as the final prediction.

An alternative for selecting among forecasts is evaluating the performance of the methods in a hold-out sample (Fildes & Petropoulos, 2015; Poler & Mula, 2011), where forecasts are computed for single or multiple origins (cross-validation), typically using a rolling-origin process (Tashman, 2000). Some pragmatic approaches have also been used, such as the forward selection applied by Kim, Dekker, and Heij (2017) for determining the order of an AR model.

Finally, another option is to explore combination procedures (Clemen, 1989). In fact, Fildes and Petropoulos (2015) concluded that a combination could outperform individual or aggregate selection for non-trended data. Different combination operators (mode, median and mean) to compute neural network ensembles were analyzed by Kourentzes, Barrow, and Crone (2014), where the mode was found to provide the most accurate forecasts. In addition to forecasting models based on the time series alone, the automatic identification algorithms developed for causal models should be mentioned. For instance, marketing analytics models to forecast sales under the presence of promotions were analyzed by Trapero, Kourentzes, and Fildes (2015). Additionally, models capable of incorporating data from other companies in a supply chain collaboration context with information sharing were explored by Trapero, Kourentzes, and Fildes (2012).

In addition to traditional time series modeling techniques, artificial intelligence (AI) algorithms have proven to be quite effective as a means to build higher-level methodologies to face big data challenges in an effective manner, relying upon both traditional and AI low-level techniques. An initial attempt was conducted by Garcia, Villalba, and Portela (2012), where multiple time series were classified according to the simple autocorrelation function (SACF) and partial autocorrelation function (PACF) to reduce the number of forecasting ARIMA models to be fitted. However, the forecasting implications of that procedure in terms of out-of-sample accuracy were not described.

Despite the abundant literature on model selection, the possibility of combining several selection criteria has been under-investigated, as well as the benefits of taking advantage of additional sources of information (SACF and PACF values, fitted parameters, unit root tests, and so forth) in an integrated approach that takes them all into account.

This study proposes a new model selection approach that combines different criteria with additional information on the time series itself, as well as the responses and fitted parameters of the alternative models. Given a set of candidate models, rather than considering any individual criterion, a support vector machine (SVM) is trained at each forecasting origin to select the best model using all this information.

The effects of this approach are explored for the 229 stock keeping units (SKUs) of a leading household and personal care manufacturer in the UK. The data are highly volatile and have little serial correlation. The experiment was developed by fitting a set of exponential smoothing and ARIMA models with different levels of complexity. After a feature selection process, 19 variables were included in the SVM training dataset as the most relevant variables. The results show that the proposed approach improves the out-of-sample forecasting accuracy with respect to a single model selection criterion. This methodology will be particularly interesting for scenarios with highly volatile demand because it allows the model to be changed when it does not fit the data sufficiently well, thereby reducing the risk of misusing modeling techniques in the automatic processing of large datasets.

The key contributions of this paper are as follows: (i) propose a novel model selection approach for time series forecasting based on SVM classification, (ii) compare base and ensemble forecast error characteristics out-of-sample, and (iii) investigate the effects of the ensemble on forecasting errors, as measured in terms of median, mean, bias and variance.

The remainder of this paper is organized as follows. Section 2 introduces the forecasting models and the use of the SVM for automatic model selection. Section 3 presents an empirical evaluation of the approach in a demand planning case study with real data. Section 4 analyzes the results, followed by some final considerations and afterthoughts.

2. Methods

2.1. Forecasting models

Let z_t be the mean-corrected output demand data sampled at a weekly rate, a_t be a white noise sequence (i.e., serially uncorrelated with zero mean and constant variance), θ_i be a set of parameters to estimate, and B be the backshift operator in the sense that B^l z_t = z_{t−l}. Then, considering that no seasonality is present in the data, the forecasting models considered in this paper are the following:

M1: z_t = a_t,  (1)

M2: z_t = (1 + θ_1 B + θ_2 B²) a_t,  (2)

M3 (ETS): (1 − B) z_t = (1 + θ_1 B) a_t,  (3)

M4: (1 − B) z_t = (1 + θ_1 B + θ_2 B²) a_t,  (4)

Mean: mean of forecasts M1 to M4,  (5)

Median: median of forecasts M1 to M4.  (6)

Model M1 is white noise, model M2 is an MA(2), model M3 is an IMA(1,1) that is actually treated as a simple exponential smoothing model, or an ETS(A,N,N) in the nomenclature of Hyndman, Koehler, Ord, and Snyder (2008) (where E, T, S, A and N stand for error, trend, seasonal, additive and none, respectively), M4 is an IMA(1,2), and Mean and Median are combination methods. In essence, two stationary models, three non-stationary models and two combinations of models are considered.

Note that some models are nested versions of other models.
confidence limit. This result means that, depending on the level of confidence, for 93.89%, 97.82% and 100% of the SKU series, at least one of the models is correctly specified in the sense of not leaving significant serial correlation beyond the confidence limits mentioned above. This means that the models fulfill the requirement of being a sufficient representation of the data while simultaneously preserving parsimony.

2.2. Support vector machines

The forecasting models shown in the previous section are data specific and should be changed for each particular application. The novelty of this paper relies on the use of SVMs as a means to select the best model among the candidates for each time series at each forecasting horizon.

The SVM classifier is basically a binary classifier algorithm that searches for an optimal hyperplane as a decision function in a high-dimensional feature space (Shawe-Taylor & Cristianini, 2004). Consider the training data set {x_k, y_k}, where x_k ∈ ℝⁿ are the training examples (k = 1, 2, …, m) and y_k ∈ {−1, 1} are the class labels. The training examples are first mapped into another space, referred to as the feature space, which is possibly of a much higher dimension than n, via the mapping function Φ. Then, a decision function of the form f(x) = ⟨w, Φ(x)⟩ + b in the feature space is computed by maximizing the distance between the set of points Φ(x_k) and the hyperplane. The Gaussian kernel is defined as

K(x, x′) = exp(−‖x − x′‖² / (2σ²)),  (8)

where σ is a parameter that controls the flexibility of the kernel. In this study, Gaussian kernels are extensively used, and the parameter σ is estimated via cross-validation, as explained in Section 2.3.

Although they have been extended to regression problems since their early days (Müller et al., 1997), SVMs were originally designed as a classification algorithm (Cortes & Vapnik, 1995) and have been extensively exploited in a wide variety of classification contexts, e.g., hand-written digit recognition, genomic DNA classification (Furey et al., 2000), text classification (Joachims, 2002), and sentiment analysis (Pang, Lee, & Vaithyanathan, 2002). Surprisingly, however, SVMs have not been applied to the problem of model selection in the context of multiple forecasting models. This paper constitutes a novel contribution in this area.

2.3. Feature selection and extraction

Reliable results with SVM and other data-driven modeling techniques depend considerably on the quality of the data available for training. Apart from the correctness of the data itself, there are some other aspects regarding the dimensionality of the dataset. In fact, it is well known that as the number of variables increases, the amount of data required to provide a reliable analysis increases exponentially (Hira & Gillies, 2015). Many feature selection (removing variables that are irrelevant) and feature extraction (applying some transformations
to the existing variables to obtain a new one) techniques have been discussed to reduce the dimensionality of the data (Kira & Rendell, 1992), as well as some other approaches based on linear transformation and covariance analysis, such as principal component analysis (PCA) and linear discriminant analysis (LDA; Cao, Chua, Chong, Lee, & Gu, 2003; Duin & Loog, 2004).

For the experiments performed in this work, the dataset contained 14,885 records (65 origins and 229 products). A wide range of features estimated from the signals and models was initially considered so as not to miss any relevant information. The process resulted in 39 features, including information criteria of the models, estimation information, formal statistical tests on residuals, and forecasting results. The features considered include the following:

• AIC and SBC information criteria on forecasting models M1–M4 (8 features).
• […] features).
• P-values of normality tests for the residuals of each model (4 features; Jarque & Bera, 1987).
• P-values of heteroscedasticity tests for the residuals of each model (4 features). The tests are estimated as a variance ratio test of the first third of the sample on top of the last third of the sample.
• Estimated parameters (5 features). Two parameters/features for model M2, one for model M3, and two for model M4.
• Four last SKU values available at each point in time (4 features).
• Four predictions for the next week provided by each forecasting method (M1 to M4, 4 features).
• All possible relative distances among the predictions provided by all the forecasting methods (M1 to M4, 6 features).

After a long process of feature selection and extraction via cross-validation, the number of variables was reduced to 19, resulting in a matrix W of dimension 14,885 × 19. The final features selected are the last four items in the previous list, namely, the estimated parameters for all models, the four last SKU values, the predictions of all models and the distances or differences among predictions.

Fig. 2. Example of some SKUs from the dataset (four panels of weekly series; x-axis: Week, y-axis: sales; each panel split into in-sample and out-of-sample spans).

The rolling-origin experiment advances the in-sample span one week at a time. All forecasting models are fitted in the in-sample partition using the available data up to time T, with 101 ⩽ T ⩽ 169, and tested in the out-of-sample partition using observations T + 1, …, T + 4. The out-of-sample forecasting errors are therefore calculated for the four forecasting horizons on each of the 229 products. We measure the forecast error using the scaled error (sE_l) and scaled squared error (sSE_l) of the lead-time forecast according to the following formula:

sE_l = (Σ_{j=1}^{l} z_{T+j} − Σ_{j=1}^{l} ẑ_{T+j}) / ((1/T) Σ_{i=1}^{T} z_i).  (9)
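The selection step of Sections 2.2 and 2.3 can be sketched with scikit-learn (an assumption; the paper does not name an implementation). Note that scikit-learn parametrizes the Gaussian kernel of Eq. (8) through gamma = 1/(2σ²), so a grid over candidate σ values becomes a grid over gamma; the feature matrix and labels below are synthetic stand-ins for the 14,885 × 19 matrix W and the best-model classes:

```python
# Sketch: train an SVM classifier to pick the best model (M1-M4) from the
# 19 selected features, tuning the Gaussian-kernel width by cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
W = rng.normal(size=(400, 19))     # stand-in for the 14,885 x 19 feature matrix
y = rng.integers(0, 4, size=400)   # stand-in labels: index of the best model

sigmas = np.array([0.5, 1.0, 2.0, 4.0])            # candidate kernel widths
search = GridSearchCV(
    SVC(kernel="rbf"),                             # multiclass handled internally
    param_grid={"gamma": list(1.0 / (2.0 * sigmas**2)), "C": [1.0, 10.0]},
    cv=5,
)
search.fit(W, y)
pick = search.best_estimator_.predict(W[:1])       # model chosen for one record
```

At deployment, one such prediction would be made per SKU at each forecasting origin, and the indicated model's forecast would be issued.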
Table 1
Percentage of SKUs for which each model is the best according to SBC on different data partitions and the out-of-sample forecast performance (sample sizes in parentheses).

Model   SBC (101)   AIC (101)   SBC (173)   AIC (173)   Out-of-sample
M1      55.46%      51.05%      39.30%      36.18%      17.03%
M2      14.41%      13.99%       9.61%       9.33%      16.16%
M3      13.97%      13.67%      29.26%      28.63%      34.93%
M4      16.16%      21.29%      21.83%      25.86%      31.88%

Table 2
SBC and AIC for all models in two different data partitions, and out-of-sample sMSE, for the SKU in Fig. 4.

Model   SBC (101)   AIC (101)   SBC (173)   AIC (173)   sMSE
M1      1.52        1.40        3.31        3.05        0.053
M2      1.61        1.56        2.89        2.80        0.038
M3      1.59        1.56        2.26        2.21        0.013
M4      1.64        1.58        2.19        2.12        0.012

Fig. 3. Proportion of time origins at which the best model is actually the best, for all SKUs (x-axis: SKU).

Fig. 4. Example of an SKU (x-axis: Weeks; y-axis: Sales).

We assume that there is not necessarily a single stochastic process that underlies the observed data. In other words, several stochastic processes may be required to describe a single SKU.

This is particularly true for this case study, where there is little correlation structure in the data. Consequently, one might expect that the best model in terms of forecasting accuracy will change for different forecast origins and/or horizons.

Some evidence emerges from the in-sample properties of the models. For example, computing the SBC for all the SKUs with 101 observations and with the full sample, we observe a different model selection in 37% of the time series (Table 1). This proportion remains the same when AIC is considered instead. Detailed information about model selection is shown in the first four columns of Table 1, where the sample size is shown in parentheses. The SBC tends to select models with a higher […] In the full sample, M1 is the best model according to SBC in 39.30% of the cases (36.18% according to AIC), followed by the exponential smoothing with 29.26% (28.63% for AIC) of the cases.

The fifth column of Table 1 shows the best models according to their out-of-sample accuracy (samples between 101 and 173), measured by sSE_4. The disagreements with the SBC values are 69% and 44% for the small and full sample sizes, respectively. Taken altogether, the information in Table 1 evidences the little correlation structure observed in the data, which tends to become more important with longer time series.

Fig. 3 shows, for each SKU, the proportion of times out of the 69 forecasting origins in the rolling experiment that the best model is actually the best according to the forecasting errors. For example, 50% for a single SKU in that figure means that the best model was the best in 35 of the forecasting origins. Only in 4 SKUs was the winning model the best in more than 60% of the forecasting origins, and only in 28 SKUs were the proportions greater than 50%. This means that even when a model is the best at minimizing the forecasting error for a single SKU, it is rarely the best at more than 50% of the forecasting origins.

To obtain deeper insights into the complexity of the problem, Fig. 4 shows a single SKU, where it can be observed that, taken up to observation 101, the series may be considered stationary, and therefore, either M1 or M2 may be appropriate candidates. However, the fact that a trend subsequently appears implies that such a model might no longer be optimal.

Such intuitions are supported by Table 2, which shows the SBC and AIC for all models with samples up to 101 and with the full sample, in addition to the sMSEs. The preferred model according to SBC and AIC in the small sample is M1, with a Q(8) statistic of 3.72, which indicates that there is no correlation remaining in the residuals. This model is the worst for the full sample. Additionally, the best model for the full sample switches to M4 (Q(8) is 9.54), while the forecasting criterion also suggests that M4 is the best, with a slight margin over M3. Interestingly, the model that is the best considering all forecasting origins is model M4, but only 53% of the time.

This evidence is complemented by Fig. 5, which shows the best model according to its forecasting performance for each forecast origin in the out-of-sample span for the same SKU. At the very beginning, the best models tended to be M2 or M3. Subsequently, however, as the trend becomes more prominent, the best model switches to M4 most of the time, although not always.

The previous evidence shows that there is no best model that outperforms the rest for all SKUs, all forecasting origins and all forecasting horizons. Moreover, even for a single SKU, there is not a consistent best model over time. At this point, the SVM-based ensemble approach is introduced to test the hypothesis that there is some pattern
that would allow improving the forecast accuracy over all models and possible combinations of models. In this sense, the proposed approach might be considered a sophisticated combination method in itself.

Fig. 5. Best model for each out-of-sample forecast origin for the SKU in Fig. 4 (x-axis: Weeks, 100–170; y-axis: best model M#).

Table 3
Forecast accuracy for out-of-sample sets in sMSE (sMdSE).

Method      Out t + 1      Out t + 2      Out t + 3      Out t + 4
Naïve       0.184 (0.041)  0.558 (0.128)  1.114 (0.266)  1.856 (0.437)
M1          0.115 (0.032)  0.278 (0.075)  0.486 (0.134)  0.743 (0.195)
M2          0.109 (0.030)  0.255 (0.072)  0.447 (0.130)  0.689 (0.192)
M3          0.100 (0.026)  0.221 (0.054)  0.363 (0.087)  0.533 (0.123)
M4          0.102 (0.027)  0.230 (0.059)  0.380 (0.096)  0.555 (0.139)
Mean        0.101 (0.027)  0.226 (0.060)  0.374 (0.101)  0.549 (0.150)
Median      0.101 (0.027)  0.225 (0.059)  0.373 (0.101)  0.549 (0.150)
SVM-based   0.099 (0.026)  0.212 (0.052)  0.334 (0.081)  0.471 (0.110)
Baseline    0.071 (0.011)  0.149 (0.022)  0.234 (0.034)  0.327 (0.049)

Table 3 shows the scaled mean (median) squared errors for all forecasting models and methods, including a naïve model (each forecast is simply equal to the last observed value) that serves as a benchmark. The last row corresponds to the errors generated by selecting the best possible model for every forecasting step out of the model set considered. The minimum sMSE (sMdSE) for every forecasting horizon is highlighted in bold.

Several facts emerge from Table 3. First, taken as a whole, all models outperform the naïve model by a wide margin, implying that all models capture, at least in some part of the experiment, the correlation structure of the data. Second, for individual models M1 to M4, the best model is consistently the exponential smoothing M3 model, with an advantage that increases with the forecasting horizon. Third, combinations of methods (mean and median) do not manage to outperform the exponential smoothing, and both provide virtually the same results. Finally, and most importantly, the SVM-based approach is the overall best approach for all forecasting horizons, with errors that fall between the M3 model and the minimum possible forecasting error (that would result from always selecting the model with the minimum error). The advantages of the proposed method are once again appreciated more clearly for higher forecast horizons.

Table 4
Forecast bias multiplied by 10² for out-of-sample sets in sME (sMdE).

Method      Out t + 1      Out t + 2      Out t + 3      Out t + 4
[…]

Table 4 shows the bias considering the sE_t measurements from Eq. (9). The minimum sME (sMdE) for every forecasting horizon is highlighted in bold. All biases are small, considering that the highest bias in the table is 0.124 and the normalization imposed on the data implies a mean of 1. The conclusions about bias are quite different depending on whether we rely on sME or sMdE, but due to the robustness of the median, it is safer to use the sMdE values in parentheses. In essence, the bias replicates what was observed in the squared errors, i.e., the models with the smallest squared errors are simultaneously the models with the smallest bias. The best is the SVM-based method, followed by the exponential smoothing (M3), then model M4 and the combinations of models.

The SVM-based approach is allowed to select among the different forecasting models at each forecast origin, and therefore, it is more flexible in adapting to stochastic or structural changes in the SKUs. This fact explains why the SVM-based criterion outperforms all the considered alternatives in terms of forecast accuracy.

5. Conclusions

This study proposes a novel SVM-based approach for model selection. Since forecasting models shape business decisions at different levels within companies, this paper aims at enhancing the power of forecasting techniques by using AI techniques, SVM in particular, working on a wide feature space in the context of supply chain forecasting.

The procedure consists of selecting the best forecast model available from a pool of alternatives at each point in time using an SVM trained in a feature space that embeds the most recent information, the forecasts, the relative performance and the fitted parameters of the models involved. To the authors' best knowledge, this work is the first in which an SVM is used in this context in this particular way.

The approach is empirically applied to a leading household and personal care manufacturer in the UK with 229 weekly SKUs to forecast, with a horizon of 1 to 4 weeks ahead. The findings suggest that (i) exponential smoothing techniques are very good in this context, both in terms of forecast accuracy and minimization of bias; (ii) simple combinations of forecasts (such as mean and median) do not help much in this regard; and (iii) SVM-based model selection certainly manages to improve the forecasting results in terms of both errors and bias.
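To make the evaluation concrete, the lead-time scaled error of Eq. (9) and the "Baseline" row of Table 3 (always picking, ex post, the model with the smallest error at each origin) can be sketched as follows; every number below is synthetic, not the paper's data:

```python
# Sketch of Eq. (9) and of the "Baseline" lower bound reported in Table 3.
import numpy as np

def scaled_error(z, z_hat, T, l):
    """sE_l: cumulative actual minus cumulative forecast over lead time l,
    scaled by the mean of the first T in-sample observations (Eq. (9))."""
    return (z[T:T + l].sum() - z_hat[:l].sum()) / z[:T].mean()

z = np.arange(1.0, 11.0)                              # toy series 1..10
e = scaled_error(z, np.array([9.5, 9.5]), T=8, l=2)   # exact totals -> sE_2 = 0.0

rng = np.random.default_rng(2)
sq_err = rng.random((69, 4))       # synthetic squared errors: 69 origins x 4 models
baseline = sq_err.min(axis=1).mean()     # ex-post best model at every origin
best_fixed = sq_err.mean(axis=0).min()   # best single model held over all origins
# The baseline can never be worse than the best fixed model, which is why the
# Baseline row sits below every individual method in Table 3.
```

The gap between the best fixed model and this baseline is the headroom that the SVM-based selector attempts to exploit.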