Volatility Forecasting With Machine Learning and I (1)
https://doi.org/10.1093/jjfinec/nbad005
Article
Address correspondence to Chao Zhang, Department of Statistics, University of Oxford, Oxford, UK, or
e-mail: [email protected].
The first two authors contributed equally to this work.
Received February 7, 2022; revised February 15, 2023; editorial decision February 18, 2023; accepted February 24, 2023
Abstract
We apply machine learning models to forecast intraday realized volatility (RV), by
exploiting commonality in intraday volatility via pooling stock data together, and by
incorporating a proxy for the market volatility. Neural networks dominate linear
regressions and tree-based models in terms of performance, due to their ability to
uncover and model complex latent interactions among variables. Our findings
remain robust when we apply trained models to new stocks that have not been
included in the training set, thus providing new empirical evidence for a universal
volatility mechanism among stocks. Finally, we propose a new approach to
forecasting 1-day-ahead RVs using past intraday RVs as predictors, and highlight interesting
time-of-day effects that aid the forecasting mechanism. The results demonstrate
that the proposed methodology yields superior out-of-sample forecasts over a
strong set of traditional baselines that only rely on past daily RVs.
Key words: commonality, intraday volatility forecasting, neural networks, realized volatility
JEL classification: C45, C53, G17
*We would like to thank two anonymous referees, the associate editor and the editor, Dacheng Xiu, for
their valuable comments. We are grateful to Rama Cont, Alvaro Cartea, Blanka Horvath, and
participants at the 11th Bachelier World Congress 2022 and the 2022 Asian Finance Association Annual
Conference for helpful comments. We also thank the Oxford Suzhou Centre for Advanced Research for
providing the computational facility. The first author acknowledges the support from Clarendon Fund.
The second author acknowledges the support from EPSRC Centre for Doctoral Training in
Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1).
Forecasting and modeling stock return volatility has been of interest to both academics and
practitioners. Recent advances in high-frequency trading (HFT) highlight the need for
robust and accurate intraday volatility forecasts. For example, Deutsche Börse, one of the
world’s leading data and technology service providers, launched the “Intraday Volatility
1 Related Literature
Our study builds on several research streams developed by various authors in
recent years. The first stream is related to the research on the commonality in financial mar-
with traditional economic models. Gu, Kelly, and Xiu (2020) have pointed out the superior
performance of ML models for empirical asset pricing. Recently, Xiong, Nichols, and Shen
(2015) have applied LSTMs to forecast S&P 500 volatility, with Google domestic trends as
predictors, and Bucci (2020) has demonstrated that recurrent NNs (RNNs) are able to out-
2 Data and RV
2.1. Data
We use the Nasdaq ITCH data from LOBSTER1 to compute intraday returns via mid-
prices. We select the top 100 components of S&P 500 index, for the period between July 1,
2011 and June 30, 2021. After filtering out the stocks for which the dataset does not span
the entire sample period, we are left with 93 stocks. Table 1 presents the number of stocks
in each sector, according to the Global Industry Classification Standard (GICS) sector
division.2
where $\mu_i$ is the drift, $\sigma_{i,t}$ is the instantaneous volatility, and $W_t$ is the standard Brownian
motion. The theoretical integrated variance (IV) of stock $i$ during $(t-h, t]$ is estimated as
$$IV_{i,t}^{(h)} = \int_{t-h}^{t} \sigma_{i,s}^2 \, ds, \tag{2}$$
Information Technology 20 AAPL ACN ADBE ADP AVGO CRM CSCO FIS FISV IBM
To reduce the impact of extreme values, we consider the logarithm, in line with Andersen
et al. (2003), Bucci (2020) and Herskovic et al. (2016). Specifically, during a period
$(t-h, t]$, the RV is defined as follows3:
$$RV_{i,t}^{(h)} := \log\left(\sum_{s=t-h+1}^{t} r_{i,s}^2\right). \tag{4}$$
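As an illustration of Equation (4), the following is a minimal sketch that computes log RVs from a list of intraday returns. The function names and the sample returns are hypothetical and are not taken from the article's implementation; a window of all-zero returns would make the logarithm undefined and is not handled here.

```python
import math


def log_realized_volatility(returns):
    """Log of the sum of squared returns over one window, as in Equation (4)."""
    # Note: math.log raises ValueError if every return in the window is zero.
    return math.log(sum(r * r for r in returns))


def bucket_rvs(returns, bucket_size):
    """Log RV for each non-overlapping bucket of `bucket_size` returns."""
    if len(returns) % bucket_size != 0:
        raise ValueError("number of returns must be a multiple of bucket_size")
    return [
        log_realized_volatility(returns[k:k + bucket_size])
        for k in range(0, len(returns), bucket_size)
    ]


# Hypothetical 1-min returns, for illustration only.
one_min_returns = [0.001, -0.002, 0.0015, 0.0005, -0.001, 0.002]
rv_3min = bucket_rvs(one_min_returns, 3)               # two 3-min log RVs
rv_full = log_realized_volatility(one_min_returns)     # one 6-min log RV
```

Because the RV is defined in logs, shorter-horizon RVs aggregate to a longer horizon through `log(sum(exp(rv)))` rather than through a plain sum.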
As pointed out by Pascalau and Poirier (2021), there are no conclusive methods to
incorporate the overnight session’s information content into the daily volatility. In line with
Engle and Sokalska (2012), overnight information is excluded from our empirical analysis
of daily volatility. For simplicity, we refer to this daily scenario (excluding the overnight) as
the “1-day” scenario, throughout the rest of this article.
3 Liu, Patton, and Sheppard (2015) demonstrate that no sub-sampling frequency significantly
outperforms a 5-min interval in terms of forecasting daily RVs, making it a widely accepted time interval
in the literature. In this article, we use 1-min returns since our main focus is intraday RVs, such as
10-min RVs.
Zhang et al. j Volatility Forecasting with Machine Learning 7
Notes: The panels (a)–(d) are based on observations in the frequency of 10-min, 30-min, 65-min, and
1-day, respectively. The dashed vertical lines represent the average correlation values of RVs and
returns.
Notes: The blue curve represents cross-sectional average of daily RV across stocks, with the inner
area covering the 25th percentile to the 75th percentile, and the outer area covering the 5th percentile
to the 95th percentile.
during the periods of higher volatility, such as, stock market crashes in August 2011 (European
sovereign debt crisis), between June 2015 and June 2016 (Chinese stock market turbulence and
Brexit), in March 2018 (China–U.S. trade war), in March 2020 (COVID-19). Figure 3 shows
8 Journal of Financial Econometrics
Notes: The blue curve represents cross-sectional average of 30-min RV across stocks and days, with
the inner area covering the 25th percentile to the 75th percentile and the outer area covering the 5th
percentile to the 95th percentile.
that the diurnal volatility forms a so-called reverse-J-shape, namely larger fluctuations near the
open and close (see Harris 1986; Engle and Sokalska 2012).
3 Commonality Estimation
Inspired by prior studies (e.g., Chordia, Roll, and Subrahmanyam 2000; Morck, Yeung,
and Yu 2000; Karolyi, Lee, and Van Dijk 2012; Dang, Moshirian, and Zhang 2015), we
follow an analogous procedure to estimate the commonality in volatility. Specifically, we
use the average adjusted $R^2$ value from the following regressions across stocks, as a measure
of commonality in volatility (denoted as $R^2_{(h)}$)4
$$RV_{i,t}^{(h)} = \alpha_i + \beta_i RV_{M,t}^{(h)} + \epsilon_{i,t}, \tag{5}$$
where $RV_{M,t}^{(h)}$ (see Bollerslev et al. 2018) is the contemporaneous market volatility during
$(t-h, t]$ for stock $i$, which is calculated as the equally weighted average5 of all individual
stock volatilities during $(t-h, t]$, that is,
$$RV_{M,t}^{(h)} = \frac{1}{N}\sum_{i=1}^{N} RV_{i,t}^{(h)}. \tag{6}$$
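Equations (5) and (6) can be sketched in a few lines. The code below is an illustrative sketch only: it builds the equally weighted market RV, regresses each stock's RV on it with a one-regressor OLS, and averages the adjusted $R^2$ values. The function names and the toy data are hypothetical.

```python
def adjusted_r2(y, x):
    """Adjusted R^2 of the simple regression y = alpha + beta * x + eps."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r2 = 1.0 - ss_res / ss_tot
    # One regressor, so the residual degrees of freedom are n - 2.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2)


def commonality(rv_by_stock):
    """Average adjusted R^2 of Equation (5) across stocks.

    rv_by_stock[i][t] is the RV of stock i in period t.
    """
    n_stocks = len(rv_by_stock)
    n_obs = len(rv_by_stock[0])
    # Equation (6): equally weighted market RV.
    market = [sum(s[t] for s in rv_by_stock) / n_stocks for t in range(n_obs)]
    return sum(adjusted_r2(s, market) for s in rv_by_stock) / n_stocks


# Toy data: two stocks that move in lockstep, so commonality is close to 1.
toy_rv = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
toy_commonality = commonality(toy_rv)
```

In this sketch each stock is included in its own market average, which follows the equally weighted definition in Equation (6) literally.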
Figure 4 presents the commonality in RV, averaged across stocks for each month. To
create this figure, we use the observations in each month to obtain the $R^2$ value from
Equation (5). We notice that commonality effects in intraday scenarios (especially 30-min
and 65-min) are substantially larger than the daily ones. For example, as reported in
Table 2, the average commonality in 65-min data is around 74.3%, while only 35.5% in
daily data. Moreover, $R^2_{(h)}$ is much more turbulent at the daily frequency. The last column
4 We also perform another regression in which, in addition to the contemporaneous market volatility, the
lag-one (thus $t-1$ in Equation (5)) and lead-one (thus $t+1$ in Equation (5), hence not computable in
real time due to the forward-looking bias) market volatilities are also included, in order to account for
non-contemporaneous trading, in line with Chordia, Roll, and Subrahmanyam (2000); Karolyi,
Lee, and Van Dijk (2012); and Dang, Moshirian, and Zhang (2015). The $R^2$ values are similar to the
ones of Equation (5).
5 We also implemented the value-weighted market volatility and the results are similar to the equally
weighted market volatility.
Notes: The commonality is averaged across stocks for each month during the sample period of July
2011–June 2021.
Note: VIX represents the market volatility from the Chicago Board Options Exchange.
in Table 2 also reports the results of the relation between the average commonality and the
market volatility. As the horizon extends, the average commonality has a higher correlation
with the market volatility.6
Figure 5 reports the averaged values and standard deviations (black vertical lines) of
commonality for each half-hour in the trading session. To create this figure, we use the
observations in a given interval, such as [09:30, 10:00], to fit Equation (5). We observe a
gradual increase in commonality throughout the trading session as we get closer to market
close, in sharp contrast to the diurnal volatility pattern in Figure 3.
4 Methodology
In this section, we leverage the commonality for the task of predicting cross-asset volatility.
We construct the prediction model as follows:
$$RV_{i,t+h}^{(h)} = F_i(u; \theta) + \epsilon_{i,t+h} = F_i\left(RV_{i,t}^{(h)}, \ldots, RV_{i,t-(p-1)h}^{(h)}, RV_{M,t}^{(h)}, \ldots, RV_{M,t-(p-1)h}^{(h)}; \theta\right) + \epsilon_{i,t+h}, \tag{7}$$
where $RV_{i,t+h}^{(h)}$ is the volatility of asset $i$ during $(t, t+h]$. $u$ represents the input features,
which can be further separated into two categories: (i) a multi-dimensional vector of
Notes: The commonality is averaged across stocks for each half-hour during the sample period of July
2011–June 2021.
predictor variables for a specific stock $i$ available up to time $t$, denoted as individual
features, such as $\left(RV_{i,t}^{(h)}, \ldots, RV_{i,t-(p-1)h}^{(h)}\right)'$ and (ii) a vector of features for all stocks in the
studied universe up to $t$, denoted as market features, such as $\left(RV_{M,t}^{(h)}, \ldots, RV_{M,t-(p-1)h}^{(h)}\right)'$. $\theta$
refers to the parameters that need to be estimated. Whenever it is clear from the context and
no ambiguity arises, we also use $\theta$ to denote the forecasting model. We aim to
find a function of variables that minimizes the out-of-sample errors for future RV.
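To make the input vector $u$ in Equation (7) concrete, the sketch below assembles (features, target) pairs from an individual RV series and a market RV series sampled every $h$ minutes. The function name, the `augmented` flag, and the toy series are hypothetical conveniences, not the article's code.

```python
def build_samples(stock_rv, market_rv, p, augmented=True):
    """Build (features, target) pairs following Equation (7).

    stock_rv[t] and market_rv[t] are RVs sampled every h minutes. The
    features are the p most recent individual RVs, followed (when
    augmented=True) by the p most recent market RVs. The target is the
    stock's RV one step ahead.
    """
    samples = []
    for t in range(p - 1, len(stock_rv) - 1):
        individual = [stock_rv[t - k] for k in range(p)]
        market = [market_rv[t - k] for k in range(p)] if augmented else []
        samples.append((individual + market, stock_rv[t + 1]))
    return samples


# Toy series, for illustration only.
toy_stock = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_market = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_samples = build_samples(toy_stock, toy_market, p=2)
```

Pooling the samples of several stocks into a single training set would correspond to training one model for all stocks instead of one model per stock.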
4.1. Models
This section summarizes the collection of ML models employed in our numerical experiments.
performance on daily data (Patton and Sheppard 2015; Izzeldin et al. 2019). For day t, the
forecast of HAR is based on
$$RV_{i,t+1}^{(d)} = \alpha_i + \beta_i^{(d)} RV_{i,t}^{(d)} + \beta_i^{(w)} RV_{i,t}^{(w)} + \beta_i^{(m)} RV_{i,t}^{(m)} + \epsilon_{i,t+1}, \tag{9}$$
where $D_{i,\tau_{t+h}}$ denotes the average diurnal RV in the bucket-of-the-day $\tau_{t+h}$ computed from
the last 21 days. For example, when $t$ = 10:30 and $h$ = 30 min, then $\tau_{t+h}$ corresponds to
the bucket 10:30–11:00. $RV_{i,t}^{(h)}$ represents the lag = 1 intraday RV. $RV_{i,t}^{(d)}$ ($RV_{i,t}^{(w)}$, $RV_{i,t}^{(m)}$)
denotes the aggregated daily (weekly, monthly) RV. When we consider the daily scenarios,
Equation (10) becomes the standard HAR model (Equation 9), by removing the diurnal
term and the intraday component.
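The HAR regressors in Equation (9) can be sketched as follows. The window lengths (5 days for the weekly term, 22 days for the monthly term) and the aggregation by simple averaging of the stored daily values are assumptions made for illustration; the excerpt above does not state the exact aggregation used in the article, and the function names are hypothetical.

```python
def har_features(daily_rv, t, week=5, month=22):
    """Daily, weekly, and monthly regressors of Equation (9) at day t.

    Assumption: weekly and monthly terms are simple averages of the
    stored daily values over the last `week` and `month` days.
    """
    if t + 1 < month:
        raise ValueError("not enough history for the monthly term")
    daily = daily_rv[t]
    weekly = sum(daily_rv[t - week + 1:t + 1]) / week
    monthly = sum(daily_rv[t - month + 1:t + 1]) / month
    return daily, weekly, monthly


def har_forecast(daily_rv, t, alpha, beta_d, beta_w, beta_m):
    """One-day-ahead HAR forecast for given (already estimated) coefficients."""
    daily, weekly, monthly = har_features(daily_rv, t)
    return alpha + beta_d * daily + beta_w * weekly + beta_m * monthly


# Toy daily series 0.0, 1.0, ..., 29.0, for illustration only.
toy_daily = [float(v) for v in range(30)]
toy_regressors = har_features(toy_daily, 29)
```

The coefficients would normally be estimated by OLS on a rolling or expanding window; that step is omitted here to keep the sketch short.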
7 Since we use the log-version realized volatility, the multiplication of daily, diurnal, and stochastic
intraday components in Engle and Sokalska (2012) translates to the addition in our model (10).
(2009), LASSO performs both variable selection and regularization, and therefore enhances the
prediction accuracy and interpretability of regression models. The objective function of
LASSO is the sum of squared residuals and an additional $\ell_1$ constraint on the regression
coefficients, as shown in Equation (12). Here, the hyperparameter $\lambda$ controls the penalty
4.1.5 XGBoost
Linear models are unable to capture possible non-linear relations between the dependent
variable and the predictors, or the interactions among predictors. As pointed out by Bucci
(2020), RVs are subject to structural breaks and regime-switching, hence the need to consider
non-linear models. One way to add non-linearity and interactions is the decision tree;
see Hastie, Tibshirani, and Friedman (2009) for more details.
XGBoost is a decision-tree-based ensemble algorithm, implemented under a distributed
gradient boosting framework by Chen and Guestrin (2016). There is abundant empirical
evidence showing the success of XGBoost, such as in a large number of Kaggle competitions.
In this work, we only review the essential idea behind XGBoost, namely the tree boosting
model. For more details about other important features of XGBoost, such as the scalability
in various scenarios, parallelization, distributed computing, feature importance to enhance
interpretability, etc., the reader may refer to Chen and Guestrin (2016). Let $u$ represent the
vector of input features,
$$F_i(u) = \sum_{l=1}^{B} f_l(u), \quad f_l \in \mathcal{F}, \tag{13}$$
where F is the space of regression trees. An example of the tree ensemble model is depicted
in Figure 6. The tree ensemble model in Equation (13) is trained sequentially. Boosting (see
Friedman 2001) means that new models are added to minimize the errors made by existing
models, until no further improvements are achieved.
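The sequential idea behind Equation (13) can be illustrated with a deliberately simplified sketch: plain gradient boosting under squared loss, where each "tree" is a single-split stump on one feature fitted to the current residuals. This is not XGBoost itself, which adds regularization, second-order information, and multi-feature trees; all names and the toy data below are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def boost(x, y, n_trees, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    trees = []
    pred = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        tree = (threshold, learning_rate * left_value, learning_rate * right_value)
        trees.append(tree)
        pred = [pi + (tree[1] if xi <= threshold else tree[2])
                for xi, pi in zip(x, pred)]
    return trees


def predict(trees, xi):
    """Equation (13): the ensemble prediction is the sum over all trees."""
    return sum(lv if xi <= thr else rv for thr, lv, rv in trees)


# Toy data (needs at least two distinct feature values).
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [1.0, 1.0, 3.0, 3.0]
toy_trees = boost(toy_x, toy_y, n_trees=10)
```

Each added stump shrinks the remaining residuals, so the training error falls as the number of trees grows.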
Notes: B represents the number of trees. The final prediction of a tree ensemble model is the sum of
predictions from each tree, as shown in Equation (13).
output space (Bucci 2020). The parameters in MLPs can be updated via stochastic gradient
descent. In this work, we use Adam (see Kingma and Ba 2014), which is based on adaptive
estimates of lower-order moments. Let $u \in \mathbb{R}^p$ represent the input variables
$$F_i(u; \theta) = W_L \, \sigma\left(W_{L-1} \ldots \sigma(W_1 u + b_1) \ldots + b_{L-1}\right) + b_L, \tag{14}$$
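The nested form of Equation (14) corresponds to a simple loop over layers. The sketch below is a forward pass only, with ReLU chosen as the activation $\sigma$ purely as an assumption (the excerpt does not state which activation is used); the weights are made-up numbers, not fitted parameters.

```python
def relu(v):
    """Element-wise ReLU, used here as an assumed choice for sigma."""
    return [max(0.0, a) for a in v]


def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]


def mlp_forward(u, layers):
    """Forward pass of Equation (14).

    `layers` is a list of (W, b) pairs. The activation is applied after
    every layer except the last, which is purely affine.
    """
    hidden = u
    for W, b in layers[:-1]:
        hidden = relu(affine(W, b, hidden))
    W_last, b_last = layers[-1]
    return affine(W_last, b_last, hidden)


# Hypothetical two-layer network with two inputs and one output.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.1]),
]
toy_output = mlp_forward([2.0, 1.0], toy_layers)
```

Training would update every `W` and `b` by a gradient-based optimizer such as Adam; only the prediction step is shown here.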
$$\begin{aligned}
f_t &= \sigma_g(W_f u_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i u_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o u_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_c u_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h(c_t),
\end{aligned} \tag{15}$$
where $u_t$ is the input vector, $f_t$ is the forget gate’s activation vector, $i_t$ is the update gate’s
activation vector, $o_t$ is the output gate’s activation vector, $\tilde{c}_t$ is the cell input activation vector,
$c_t$ is the cell state vector, and $h_t$ is the hidden state vector, that is, the output vector of the
LSTM unit. $\odot$ is the Hadamard product. $\sigma_g$ is the sigmoid function, and $\sigma_c$, $\sigma_h$ are
hyperbolic tangent functions. $W_{f(i,o,c)}$, $b_{f(i,o,c)}$ refer to weight matrices and bias vectors that
need to be estimated.
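A single time step of Equation (15) can be written out directly. The following is a minimal sketch in plain Python for illustration; the parameter layout (a dictionary keyed by gate name holding `(W, U, b)` triples) is a hypothetical convenience, and a real application would rely on a deep learning library.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate(W, U, b, u, h_prev, activation):
    """activation(W u + U h_prev + b), computed row by row."""
    return [
        activation(sum(w * x for w, x in zip(W_row, u))
                   + sum(v * x for v, x in zip(U_row, h_prev))
                   + b_k)
        for W_row, U_row, b_k in zip(W, U, b)
    ]


def lstm_step(params, u, h_prev, c_prev):
    """One step of Equation (15); params[g] = (W_g, U_g, b_g) for g in f, i, o, c."""
    f = gate(*params["f"], u, h_prev, sigmoid)            # forget gate
    i = gate(*params["i"], u, h_prev, sigmoid)            # update (input) gate
    o = gate(*params["o"], u, h_prev, sigmoid)            # output gate
    c_tilde = gate(*params["c"], u, h_prev, math.tanh)    # cell input
    # Hadamard products are element-wise multiplications.
    c = [f_k * c_k + i_k * ct_k for f_k, c_k, i_k, ct_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Hypothetical one-unit cell with all parameters set to zero.
zero_params = {g: ([[0.0]], [[0.0]], [0.0]) for g in "fioc"}
toy_h, toy_c = lstm_step(zero_params, [1.0], [0.0], [2.0])
```

With all parameters at zero every gate equals 0.5 and the cell input is 0, so the cell state is simply halved; this makes the roles of the gates easy to check by hand.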
To summarize, we first consider a traditional time-series model ARIMA, then include
three linear regression models, that is, HAR(-D), OLS, and LASSO. To account for the non-
• Single denotes that we train customized models Fi for each stock i, as in Bucci (2020)
and Hansen and Lunde (2005). We use only a stock’s own past RVs as predictor features,
namely
$$u = \left(RV_{i,t}^{(h)}, \ldots, RV_{i,t-(p-1)h}^{(h)}\right)'$$
Sokalska 2012; Bollerslev et al. 2018; Bucci 2020; Rahimikia and Poon 2020; Pascalau and
Poirier 2021). Both functions measure losses, so lower values are preferred. Patton and
Sheppard (2009) demonstrate that QLIKE has the highest power in the Diebold–Mariano
(DM) test. Consequently, we focus more on the QLIKE rather than the MSE:
where $\widehat{RV}_{i,t}^{(h)}$ represents the predicted value of $RV_{i,t}^{(h)}$, the RV for stock $i$ during $(t-h, t]$. $N$
is the number of stocks in our universe, $T_{test}$ is the testing period, and $\#T_{test}$ is the length of
the testing period.
where $L_{ij,t}$ is the loss difference between models $i$ and $j$ at day $t$ in terms of a specific loss
function $L$, such as MSE and QLIKE. The model confidence set (MCS) procedure makes it
possible to make statements about statistical significance from multiple pairwise comparisons.
For additional details, we refer to Hansen, Lunde, and Nason (2011).
$$E_t\left(u(W_{t+1})\right) := U(x_t) = W_t\left(x_t E_t(r^e_{t+1}) - \frac{\gamma}{2} x_t^2 E_t\left(\exp(RV_{t+1})\right)\right), \tag{18}$$
The optimal portfolio that maximizes this utility is obtained by investing the following
fraction of wealth in the risky asset:
$$x_t = \frac{SR/\gamma}{\sqrt{E_t\left(\exp(RV_{t+1})\right)}}. \tag{20}$$
To determine the utility gains based on different risk models, the expectation based on
model $\theta$ is denoted by $E_t^{\theta}(\cdot)$. Assuming that the investor uses model $\theta$, then the position
$x_t^{\theta} = \frac{SR/\gamma}{\sqrt{E_t^{\theta}\left(\exp(RV_{t+1})\right)}}$ is chosen. By plugging $x_t^{\theta}$ into Equation (19) and replacing
$E_t(\exp(RV_{t+1}))$ with the RV $\exp(RV_{t+1})$, the expected utility per unit of the wealth (called
realized utility, or in short RU) is given by
$$RU_t = \frac{SR^2}{\gamma}\,\frac{\sqrt{\exp(RV_{t+1})}}{\sqrt{E_t^{\theta}\left(\exp(RV_{t+1})\right)}} - \frac{SR^2}{2\gamma}\,\frac{\exp(RV_{t+1})}{E_t^{\theta}\left(\exp(RV_{t+1})\right)}. \tag{21}$$
If a risk model is ideal, that is, it predicts perfectly the realized volatilities,
$E_t^{\theta}(\exp(RV_{t+1})) = \exp(RV_{t+1})$, then its realized utility is $\frac{SR^2}{2\gamma}$. Alternatively, the investor is
willing to give up $\frac{SR^2}{2\gamma}$ of the wealth in order to utilize the perfect risk model instead of
investing only in the risk-free asset. In this article, the same Sharpe ratio (SR = 0.4) and the same
coefficient of risk aversion ($\gamma$ = 2) are applied as in Bollerslev et al. (2018) and Li and Tang
(2020).
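Equation (21) is straightforward to evaluate. The sketch below is illustrative only and uses the article's SR = 0.4 and $\gamma$ = 2 as defaults; the function and argument names are hypothetical. It takes the realized log RV and the model's expectation of $\exp(RV_{t+1})$.

```python
import math


def realized_utility(realized_log_rv, expected_variance, sharpe=0.4, gamma=2.0):
    """Realized utility per unit of wealth, following Equation (21).

    realized_log_rv:   RV_{t+1}, the realized log RV.
    expected_variance: the model's expectation of exp(RV_{t+1}).
    """
    ratio = math.exp(realized_log_rv) / expected_variance
    return (sharpe ** 2 / gamma) * math.sqrt(ratio) - (sharpe ** 2 / (2.0 * gamma)) * ratio


# A perfect forecast attains the upper bound SR^2 / (2 * gamma) = 0.04.
perfect = realized_utility(0.0, 1.0)
# Underestimating the variance by a factor of four wipes out the gain.
too_low = realized_utility(math.log(4.0), 1.0)
```

With these defaults the perfect-forecast utility is 4% of wealth, and the utility decreases whenever the forecast deviates from the realized value in either direction.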
The previous comparisons are based on a frictionless setting, ignoring the trading cost.
The case of incorporating the effect of transaction costs is also considered. Following
Bollerslev et al. (2018) and Li and Tang (2020), we assume that transaction costs are linear
in the absolute magnitude of the change in the positions, and use the full median bid–ask
spread for each of the assets over the last 90 trading days. The realized utility with trading
costs deducted, denoted as RU-TC, is simply the realized utility after subtracting the simu-
lated costs. We evaluate this realized utility (with and without trading cost) empirically by
averaging the corresponding realized expressions over stocks and the same rolling out-of-
sample forecasts.
5 Experiments
5.1. Implementation
For each data set, we divide the observations into three non-overlapping periods and
maintain their chronological order: training, validation, and testing. For a given trading day $t$,
the training data, including the samples in the first period [July 1, 2011, $t-251$], are used
to estimate models subject to a given architecture. Validation data, including the recent
1-year samples $[t-250, t]$, are deployed to tune the hyperparameters of the models.
Finally, testing data are samples in the next year $[t+1, t+251]$; they are out-of-sample in
8 To more formally assess the statistical significance of the differences in out-of-sample volatility
forecasts, Table C.1 in Appendix C also reports the results of all DM tests in terms of QLIKE.
9 Note that (S)ARIMA is for Single time series.
Statistical performance MSE QLIKE MSE QLIKE MSE QLIKE MSE QLIKE
SARIMA Single 2.095 1.473 3.041 2.624 3.314 2.861 3.550 3.515
HAR-D Single 2.690 2.065 3.457 3.040 3.542 3.095 3.548 3.515
Universal 2.574 1.975 3.429 3.016 3.541 3.095 3.547 3.514
Augmented 2.790 2.280 3.428 3.022 3.552 3.107 3.571 3.536
OLS Single 2.660 1.901 3.506 3.027 3.579 3.118 3.543 3.504
Universal 2.601 2.039 3.432 2.984 3.580 3.127 3.546 3.513
Augmented 2.845 2.271 3.485 3.036 3.587 3.130 3.576 3.536
LASSO Single 2.631 1.893 3.526 3.061 3.567 3.108 3.545 3.501
Universal 2.593 2.044 3.432 2.989 3.578 3.126 3.543 3.512
Augmented 2.852 2.292 3.487 3.046 3.586 3.132 3.575 3.537
XGBoost Single 2.492 1.552 3.408 2.888 3.520 3.039 3.508 3.449
Universal 2.890 2.200 3.532 3.067 3.592 3.116 3.546 3.505
Augmented 2.864 2.212 3.545 3.083 3.581 3.109 3.571 3.524
MLP Single – – – – – – – –
Universal 2.952 2.380 3.564 3.119 3.607 3.139 3.543 3.506
Augmented 2.993 2.442 3.569 3.126 3.609 3.145 3.571 3.534
LSTM Single – – – – – – – –
Universal 2.975 2.455 3.575 3.144 3.610 3.149 3.552 3.514
Augmented 3.028 2.532 3.595 3.170 3.614 3.166 3.567 3.533
Notes: The table reports the out-of-sample results for predicting future RV over multiple horizons using differ-
ent models under three training schemes. For each horizon, the model with the best (second best) out-of-sample
performance in QLIKE (in Panel A)/RU (in Panel B) is highlighted in red (blue), respectively. An asterisk (*)
indicates models that are included in the MCS at the 5% significance level.
Notes: $\Delta$QLIKE is averaged across stocks in each month during the testing period July 2015–June
2021. The dashed horizontal lines represent the average reductions in QLIKE.
are reduced from 0.453 (respectively, 0.227, 0.186) with the best HAR-D model (i.e., under
Augmented) to 0.430 (respectively, 0.204, 0.171) with the best OLS model (i.e., under
Augmented), across the three horizons (i.e., 10-, 30-, and 65-min). Within the
OLS models, the conclusions are similar to those for the HAR-D models, that is, no benefits from
Universal but significant benefits from Augmented. We also observe similar findings for
LASSO as for OLS, suggesting that regularization does not further aid performance. On the
other hand, MLPs and LSTMs achieve state-of-the-art accuracy across all measures and
intraday horizons (i.e., 10-, 30-, and 65-min), suggesting the presence of complex interactions between
predictors. Further analysis is provided in Section 5.3.
Interestingly, linear models slightly outperform MLPs and LSTMs at the 1-day horizon.
This is perhaps expected, and might be due to the small amount of data available
at the 1-day horizon, which causes the NNs to underperform for lack of training data.
Echoing the findings from Panel A, OLS based on the 21-day rolling daily RVs delivers
higher utility than the HAR-type models, consistent with Bollerslev et al. (2018). NNs
still perform the best, with the highest realized utility achieved by LSTMs.
Let us now consider the OLS model as an illustrative example for understanding the
relative reduction in error. We compare its QLIKEs under these three schemes, at a monthly
level, as shown in Figure 7. For better readability, we report the reduction in error of
Universal relative to Single (denoted as Univ–Single), the reduction of Augmented relative
to Universal (denoted as Aug–Univ), and the reduction of Augmented relative to Single
(denoted as Aug–Single). Note that Aug–Single = (Aug–Univ) + (Univ–Single). Negative
values of $\Delta$QLIKE indicate an improvement on out-of-sample data and positive values
indicate degradation. To arrive at this figure, we average the $\Delta$QLIKE values in each month,
across stocks. Figure 7 reveals that the improvement of Universal compared with Single is
relatively small but consistent. In terms of the benefits of Augmented, it is typically the case
that incorporating the market volatility as an additional feature helps improve the
forecasting performance, especially for turmoil periods.
Notes: Q1, respectively Q5, denotes the subset of stocks with the lowest, respectively, highest, 20%
values for the commonality.
Notes: For ease of readability, we only report the sensitivity values for the most recent 30 lagged RVs
(i.e., in the last five days for 65-min horizon).
$$\text{Sensitivity}_k = \sum_{i=1}^{N} \sum_{t \in T_{train}} \left| \left. \frac{\partial F}{\partial u_k} \right|_{u = u_{i,t}} \right| \tag{22}$$
Here, F is the fitted model under the Augmented scheme, u represents the vector of pre-
dictors, and uk is the k-th element in u. ui;t represents the input features of stock i at time t.
We normalize the sensitivity of all variables such that they sum up to one. In a special case
of linear regression, the sensitivity measure is the normalized absolute slope coefficient.
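The sensitivity measure in Equation (22) can be approximated numerically for any fitted model. The sketch below uses central finite differences in place of analytic gradients, which is an assumption made for illustration (the excerpt does not say how the derivatives are obtained); the function name and the toy linear model are hypothetical.

```python
def sensitivities(model, inputs, eps=1e-6):
    """Normalized sensitivities following Equation (22).

    model:  callable mapping a feature list u to a scalar prediction.
    inputs: list of feature vectors u_{i,t} from the training set.
    Derivatives are approximated by central finite differences.
    """
    n_features = len(inputs[0])
    totals = [0.0] * n_features
    for u in inputs:
        for k in range(n_features):
            up = list(u)
            down = list(u)
            up[k] += eps
            down[k] -= eps
            totals[k] += abs((model(up) - model(down)) / (2.0 * eps))
    scale = sum(totals)
    # Normalize so that the sensitivities sum to one.
    return [value / scale for value in totals]


def toy_linear(u):
    """Hypothetical linear model with slopes 3 and -1."""
    return 3.0 * u[0] - 1.0 * u[1]


toy_sens = sensitivities(toy_linear, [[0.0, 1.0], [2.0, -1.0]])
```

For the linear toy model the result matches the normalized absolute slope coefficients, 0.75 and 0.25, which is the special case described in the text.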
Considering the 65-min scenario as an example, Figure 9 reveals that for both OLS and
MLP, the sensitivity of the lagged features tends to decline as the lag increases.
Additionally, we observe that the sensitivity values rise to a high point at
every six lags, corresponding to one day. A distinct difference between the sensitivity values
implied by OLS and the ones implied by MLP is that the latter places more weight on the
lag = 1 individual RV (Sensitivity = 0.90) and less on the lag = 1 market RV
(Sensitivity = 0.059). On the other hand, for OLS, the sensitivities of lag = 1 individual
(respectively, market) RV are 0.081 (respectively, 0.069).
10 Recall that the variables are normalized by removing the mean and scaling to unit variance.
Notes: The figure plots the pattern of predicted RV (y-axis) as a function of the lag = 1 individual RV
(x-axis) conditioned on various lag = 1 market RV quantile values (keeping all other variables at their
mean values).
interaction effects between $RV_{i,t}^{(h)}$ and $RV_{M,t}^{(h)}$. As can be observed from the rightmost
region of Figure 10(b), the distances between the curves become relatively smaller, conveying
the message that, when an individual stock is very volatile, the market effect on it weakens.
11 The set of unseen stocks includes the following 16 tickers: AMAT, APD, BIIB, COF, DE, EQIX, EW,
GPN, HUM, ICE, ILMN, ITW, NOC, NSC, PLD, and SLB.
OLS Unseen 0.664 0.372 0.329 0.219 0.287 0.205 0.348 0.254
OLS Universal 0.678 0.410 0.328 0.223 0.286 0.206 0.343 0.260
Augmented 0.639 0.359 0.317 0.222 0.278 0.208 0.327* 0.249*
LASSO Universal 0.683 0.419 0.330 0.225 0.286 0.207 0.344 0.261
Augmented 0.639 0.359 0.317 0.222 0.278 0.208 0.327* 0.249*
XGBoost Universal 0.655 0.476 0.314 0.206 0.278 0.201 0.353 0.266
Augmented 0.654 0.509 0.320 0.221 0.282 0.206 0.364 0.255
MLP Universal 0.623 0.328 0.306 0.203* 0.266 0.193* 0.342 0.266
Augmented 0.623 0.332 0.301* 0.203* 0.263* 0.194* 0.329 0.252
LSTM Universal 0.637 0.348 0.311 0.211 0.267 0.195 0.339 0.265
Augmented 0.622* 0.326* 0.303 0.205 0.263* 0.194 0.332 0.255
OLS Unseen 3.107 1.996 3.475 2.672 3.503 2.715 3.385 3.320
OLS Universal 2.988 2.280 3.461 2.700 3.498 2.712 3.363 3.298
Augmented 3.138 2.355 3.459 2.710 3.487 2.712 3.389 3.311
LASSO Universal 2.959 2.270 3.457 2.704 3.496 2.712 3.359 3.296
Augmented 3.137 2.376 3.458 2.720 3.485 2.716 3.389 3.315
XGBoost Universal 2.688 1.640 3.510 2.711 3.511 2.701 3.349 3.269
Augmented 2.563 1.578 3.464 2.688 3.495 2.680 3.388 3.302
MLP Universal 3.233 2.396 3.515 2.736 3.529 2.730 3.340 3.266
Augmented 3.221 2.444 3.514 2.749 3.522 2.735 3.378 3.302
LSTM Universal 3.167 2.415 3.493 2.769 3.523 2.737 3.345 3.271
Augmented 3.238 2.533 3.507 2.787 3.524 2.762 3.371 3.302
Notes: The table reports the out-of-sample results for predicting future RV of unseen stocks over multiple
horizons using different models under three training schemes. The row OLS Unseen represents the baseline results
based on OLS models estimated for each unseen stock. Other rows represent the results of models estimated on
raw stocks under the Universal and Augmented settings. For each horizon, the model with the best (second
best) out-of-sample performance in terms of QLIKE (in Panel A)/RU (in Panel B) is highlighted in red (blue),
respectively. An asterisk (*) indicates models that are included in the MCS at the 5% significance level.
intraday information, daily RV is a superior proxy for the unobserved daily volatility,
when compared with the parametric volatility measures generated from the GARCH and
SV models of daily returns (see Barndorff-Nielsen and Shephard 2002; Andersen et al.
2003; Izzeldin et al. 2019). It is worth noting that in these traditional forecasting daily RV
Notes: In each box, dots in the top line represent the intraday returns. The traditional approaches
employ the aggregated daily (or weekly, or monthly) RVs (the remaining left segments) as predictors,
while the Intraday2Daily approach employs intraday RVs (the short segments marked with h at day t).
h represents the horizon of intraday RVs. In this example, h = 130 min.
market volatilities into models. Figure 11 illustrates the comparison between the traditional
approach and our Intraday2Daily approach.
The advantages of the Intraday2Daily approach over traditional approaches can be
summarized as follows. First, the Intraday2Daily approach significantly enriches the
information content of daily volatility. Second, it contributes to the literature on the modeling of
daily volatility by examining the coefficients of intraday RVs. Third, the essential idea
underlying the Intraday2Daily approach could be applied to estimate other daily risk
measures, such as value-at-risk (VaR). For example, one may use half-hour VaRs to
forecast the 1-day-ahead VaR. Finally, practitioners can better adjust their portfolios with
more accurate forecasts from the Intraday2Daily approach rather than traditional
approaches. To the best of our knowledge, this is the first study to explicitly investigate the
predictive power of intraday RVs on daily volatility and to demonstrate the additional
accuracy improvements it brings to the forecasting task.
6.3. Experiments
The forecasting performance of traditional approaches with daily variables is already
summarized in the column “1-day” of Table 3. Table 5 reports the results of models combined with
the Intraday2Daily approach.12 In other words, models in Table 5 use sub-sampled intraday
12 We observe similar findings when applying the Intraday2Daily approach to forecast the raw vola-
tilities (not in logs).
RVs rather than the lag-one total RV in the column “1-day” of Table 3. For example, the
lag-one total RV in HAR (Equation 9) is replaced by non-overlapping intraday RVs.
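The replacement of the lag-one daily RV by intraday RVs can be sketched as a feature-construction step. In the sketch below, which is an illustration rather than the article's implementation, each day is represented by its list of non-overlapping intraday log RVs; the daily target is recovered from them using the log definition of RV. All names and the toy data are hypothetical.

```python
import math


def daily_from_intraday(intraday_log_rvs):
    """Daily log RV implied by non-overlapping intraday log RVs of the same day."""
    return math.log(sum(math.exp(v) for v in intraday_log_rvs))


def intraday2daily_samples(intraday_by_day):
    """Pairs (previous day's intraday RVs, next day's daily RV).

    intraday_by_day[d] is the list of intraday log RVs on day d.
    """
    samples = []
    for d in range(len(intraday_by_day) - 1):
        features = list(intraday_by_day[d])
        target = daily_from_intraday(intraday_by_day[d + 1])
        samples.append((features, target))
    return samples


# Toy data: three days with two intraday buckets each.
toy_days = [
    [math.log(1.0), math.log(3.0)],
    [math.log(2.0), math.log(2.0)],
    [math.log(5.0), math.log(1.0)],
]
toy_pairs = intraday2daily_samples(toy_days)
```

A traditional daily model would compress each feature list into a single number before fitting, whereas this construction lets the model assign a separate coefficient to each time-of-day bucket.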
By comparing the column “1-day” of Table 3 with Table 5, we establish that the
Intraday2Daily approach generally helps improve the out-of-sample performance of
6.4.1 Semi-variance-HAR
Patton and Sheppard (2015) proposed the semi-variance-HAR (SHAR) model as an exten-
sion of the standard HAR model (see further details in Section 4.1.2), in order to exploit
the well-documented leverage effect by decomposing the total RV of the first lag via signed
intraday returns, as shown in Equation (25) (see Barndorff-Nielsen, Kinnebrock, and
Shephard 2008). In other words, the lag-one RV in SHAR (Equation 26) is split into the
sum of squared positive returns and the sum of squared negative returns, as follows:
$$RV_{i,t}^{(d)+} = \sum_{l=0}^{M-1} r_{i,t-l\Delta}^2 \, I_{\{r_{i,t-l\Delta} > 0\}},$$
In the above, $\Delta$ denotes the interval for computing the intraday returns.
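The decomposition by signed returns can be sketched as below. The code is a minimal illustration that works with raw sums of squared returns, as in the displayed formula, without taking logs; the function name and sample returns are hypothetical.

```python
def signed_semivariances(returns):
    """Split the sum of squared returns into positive and negative parts."""
    positive = sum(r * r for r in returns if r > 0)
    negative = sum(r * r for r in returns if r < 0)
    return positive, negative


# Hypothetical intraday returns, for illustration only.
toy_returns = [0.01, -0.02, 0.03]
toy_pos, toy_neg = signed_semivariances(toy_returns)
```

The two parts add up to the total sum of squared returns, so a model that uses both nests the model that uses only their sum.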
6.4.2 HARQ
Bollerslev, Patton, and Quaedvlieg (2016) pointed out that the beta coefficients in the HAR
model may be affected by measurement errors in the realized volatilities. By exploiting the
asymptotic theory for high-frequency RV estimation, the authors propose an easy-to-
$$RQ_{i,t}^{(d)} = \frac{M}{3} \sum_{l=0}^{M-1} r_{i,t-l\Delta}^4 \tag{27}$$
$$RV_{i,t+1}^{(d)} = \alpha_i + \left(\beta_i^{(d)} + \beta_i^{(d)Q} \sqrt{RQ_{i,t}^{(d)}}\right) RV_{i,t}^{(d)} + \beta_i^{(w)} RV_{i,t}^{(w)} + \beta_i^{(m)} RV_{i,t}^{(m)} + \epsilon_{i,t+1}. \tag{28}$$
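Equations (27) and (28) can be illustrated with a short sketch. It computes the realized quarticity from intraday returns and the resulting time-varying coefficient on the lag-one daily RV. The function names and the coefficient values are hypothetical and chosen only to show the mechanics.

```python
import math


def realized_quarticity(returns):
    """Equation (27): (M / 3) times the sum of fourth powers of the M returns."""
    m = len(returns)
    return m / 3.0 * sum(r ** 4 for r in returns)


def harq_daily_coefficient(beta_d, beta_dq, rq):
    """Time-varying coefficient on the lag-one daily RV in Equation (28)."""
    return beta_d + beta_dq * math.sqrt(rq)


# Made-up values for illustration only.
toy_rq = realized_quarticity([1.0, 1.0, 1.0])
toy_coefficient = harq_daily_coefficient(0.5, -0.1, 4.0)
```

With a negative `beta_dq`, the weight on the lag-one RV shrinks on days when the RV is measured less precisely (high quarticity), which is the mechanism the HARQ model is designed to capture.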
We compute the corresponding intraday variables of semi-RVs and RQs and then
include them as new predictors in the Intraday2Daily approach. From Table 6, we first
observe that the SHAR model generally performs as well as the standard HAR model (in
Table 3), in line with Bollerslev, Patton, and Quaedvlieg (2016). HARQ outperforms HAR
and SHAR, when applied to individual stocks studied in the present paper. Comparing the
“Traditional” column with others, we conclude that in general, replacing the daily RVs
with intraday RVs as predictors helps improve the out-of-sample performance of
benchmark models.
13 We attain similar results for models using intraday RVs based on other frequencies.
14 The 30-min that can make or break the trading day. https://www.wsj.com/articles/the-30-minutes-
that-can-make-or-break-the-trading-day-11583886131 (accessed on February 28, 2022).
Panel A
SHAR   Single      0.277   0.191   0.257   0.178   0.253   0.176   0.261   0.183
       Universal   0.285   0.198   0.263   0.183   0.255   0.178   0.261   0.182
       Augmented   0.261   0.181   0.253   0.175   0.250   0.174   0.254   0.178
HARQ   Single      0.264   0.204   0.254   0.178   0.253   0.176   0.256   0.179
       Universal   0.253   0.176   0.253   0.176   0.254   0.176   0.257   0.179
       Augmented   0.251   0.174   0.248*  0.172*  0.250   0.174   0.253   0.176
Panel B
SHAR   Single      3.528   3.497   3.559   3.525   3.563   3.529   3.548   3.515
       Universal   3.510   3.499   3.548   3.525   3.560   3.529   3.550   3.516
       Augmented   3.563   3.533   3.576   3.545   3.578   3.545   3.571   3.537
HARQ   Single      3.467   3.425   3.556   3.520   3.564   3.528   3.557   3.525
       Universal   3.564   3.530   3.564   3.530   3.564   3.530   3.558   3.525
       Augmented   3.580   3.544   3.583   3.546   3.578   3.541   3.575   3.538
Notes: The table reports the out-of-sample results of SHAR and HARQ for predicting future daily RV under
three training schemes. Columns “10-min,” “30-min,” and “65-min” represent the Intraday2Daily approach
with different frequencies of predictors, while the column “Traditional” uses lagged daily RVs as
predictors. The dependent variable in this table always corresponds to future daily volatility. The
model with the best (second best) out-of-sample performance in QLIKE (in Panel A)/RU (in Panel B) is
highlighted in red (blue), respectively. An asterisk (*) indicates models that are included in the MCS
at the 5% significance level.
Notes: The Intraday2Daily OLS model uses lagged individual 30-min RVs to forecast the next day’s
volatility. The x-axis represents the time of day. The y-axis represents the coefficients of lagged RVs.
7 Conclusion
In this article, the commonality in intraday volatility over multiple horizons across the U.S.
equity market is studied. By leveraging the information content of commonality, we have
Appendix
A: What May Drive Commonality in Volatility?
Previous studies, especially in the behavioral finance field, have shown that investor
sentiments could affect stock prices (e.g., Kogan et al. 2006; Baker and Wurgler 2007; Hameed,
Kang, and Viswanathan 2010; Da, Engelberg, and Gao 2011, 2015; Karolyi, Lee, and Van
Dijk 2012; Bollerslev et al. 2018). Keynes (2018) argued that animal spirits affect consumer
confidence, thereby moving prices in times of high levels of uncertainty. De Long et al.
(1990), Shleifer and Summers (1990), and Kogan et al. (2006) found that investor
sentiments induce excess volatility. Karolyi, Lee, and Van Dijk (2012) considered the investor
sentiment index as an important source of commonality in liquidity. Bollerslev et al. (2018)
found a monotonic relationship between volatility and sentiment, possibly driven by
correlated trading. In this section, we are interested in the relation between investor sentiments
and commonality in volatility.
Traditionally, there are two approaches to measuring investor sentiments (see Da,
Engelberg, and Gao 2015), that is, market-based measures and survey-based indices.
Following Baker and Wurgler (2007), we consider the daily market volatility index (VIX)
from Chicago Board Options Exchange to be the market sentiment measure. We use the
Consumer Sentiment Index (CSI)15 by the University of Michigan’s Survey Research Center
as a proxy for survey-based indices (see Carroll, Fuhrer, and Wilcox 1994; Lemmon and
Portniaguina 2006). Generally speaking, CSI is a consumer confidence index, calculated by
subtracting the percentage of unfavorable consumer replies from the percentage of favor-
Table A.1 reports the estimation results. First, we notice that a large proportion of the
variance of the commonality is explained by these three sentiment factors. For example,
the $R^2$ for the 1-day scenario is 51.6%. In terms of intraday scenarios, the $R^2$ values
for the 30-min and 65-min horizons are slightly smaller, at 48.6% and 48.1%, respectively. The
results on 10-min data are somewhat surprising, with the $R^2$ reaching 55.6%. One possible
reason is that economic policy uncertainty is significant in the 10-min scenario. In another
unreported robustness test, we estimate the regressions without the EPU factor. The
adjusted $R^2$ value in the regression of 10-min data declines by 2.5%, while for the other
regressions the changes in adjusted $R^2$ are small.
Notes: The table reports the results of time-series regressions of average commonality in volatility ($R^{2}_{(h),m}$)
over different horizons against three sentiment measures, VIX, CSI, and EPU. Superscript * denotes significance
at the 5% level. To compare the effects of various investor sentiments, we normalize each explanatory
variable by removing its mean and scaling it to unit variance.
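A regression of this kind, with each explanatory variable normalized to zero mean and unit variance and the fit summarized by adjusted R-squared, can be sketched as follows. The sketch assumes NumPy is available. The data are synthetic stand-ins for the three sentiment measures, and the function name and sample size are hypothetical.

```python
import numpy as np


def standardized_ols(y, factors):
    """OLS of y on z-scored regressors; returns coefficients and adjusted R^2."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(factors, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # zero mean, unit variance
    design = np.column_stack([np.ones(len(y)), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n, k = z.shape
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return coef, adj_r2


# Synthetic stand-ins for VIX, CSI, and EPU; not real data.
rng = np.random.default_rng(1)
sentiment = rng.normal(size=(120, 3))
commonality = (0.4 + sentiment @ np.array([0.05, -0.03, 0.01])
               + rng.normal(scale=0.02, size=120))
coef, adj_r2 = standardized_ols(commonality, sentiment)

# Robustness check in the spirit of the text: drop the third factor.
adj_r2_without_third = standardized_ols(commonality, sentiment[:, :2])[1]
```

Because the regressors share a common scale after normalization, the magnitudes of the slope coefficients can be compared directly across factors.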
Besides the market volatility (VIX), we also find a significant effect of consumer senti-
ment (CSI) on the commonality of volatility over every studied horizon. The level of com-
B: Hyperparameter Tuning
There is no hyperparameter to tune in HAR-D and OLS. For LASSO, we use the standard
five-fold cross-validation method to determine $\lambda_1$. Hyperparameters for other models in the
main analysis are summarized as follows.
To assess the robustness of NNs to different architectures, we repeat the main analysis
using one, two, and three hidden layers.17 The results reported in Table B.2 are generally
consistent with those reported in Table 3.
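For the LASSO penalty mentioned at the start of this appendix, five-fold cross-validation can be sketched as below. This is an illustrative NumPy implementation with a plain coordinate-descent solver, a hypothetical candidate grid, randomly shuffled folds, and synthetic data. The paper does not state which solver, grid, or fold construction was used, so all of these are assumptions.

```python
import numpy as np


def lasso_cd(x, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 on centered data."""
    n, p = x.shape
    beta = np.zeros(p)
    col_sq = (x ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - x @ beta + x[:, j] * beta[j]
            rho = x[:, j] @ partial_resid / n
            # Soft-thresholding update for coordinate j.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta


def select_lambda_cv(x, y, lambdas, n_folds=5, seed=0):
    """Pick the penalty with the lowest average validation MSE over the folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    cv_mse = []
    for lam in lambdas:
        errors = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
            x_mean, y_mean = x[train].mean(axis=0), y[train].mean()
            beta = lasso_cd(x[train] - x_mean, y[train] - y_mean, lam)
            pred = y_mean + (x[val] - x_mean) @ beta
            errors.append(np.mean((y[val] - pred) ** 2))
        cv_mse.append(float(np.mean(errors)))
    return lambdas[int(np.argmin(cv_mse))], cv_mse


# Synthetic example: only the first predictor matters.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = 2.0 * x[:, 0] + rng.normal(scale=0.1, size=200)
lambdas = [0.001, 0.01, 0.1, 1.0, 5.0]
best_lam, cv_mse = select_lambda_cv(x, y, lambdas)
```

For time-series data, contiguous or forward-chaining folds are a common alternative to shuffled folds, since shuffling lets validation observations sit between training observations in time.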
17 The number of neurons is chosen based on the geometric pyramid rule, following Gu, Kelly, and
Xiu (2020).
Notes: MLP1 has a single hidden layer with 128 neurons. MLP2 has two hidden layers of 128 and 64 neurons,
respectively. MLP3 has three hidden layers of 128, 64, and 32 neurons, respectively. The LSTM variants are
defined analogously.
C: DM Test
The DM test is used to assess whether differences in forecasting accuracy between
time-series models are statistically significant (e.g., Diebold and Mariano 1995; Diebold 2015).
Denote the loss associated with forecast error $e_t$ by $L(e_t)$, for example, $L(e_t) = e_t^2$. Then
the loss difference between the forecasts of models a and b is given by
$d_t^{(ab)} = L(e_t^{(a)}) - L(e_t^{(b)})$, where $e_t^{(a)}$ ($e_t^{(b)}$) represents the forecast error
from model a (b), respectively. The DM test makes one assumption, namely that $d_t^{(ab)}$ is
covariance stationary. The null hypothesis is that $E(d_t^{(ab)}) = 0$. Under the covariance
stationarity assumption, we have the test statistic
$$DM_{12} = \frac{\bar{d}^{(ab)}}{\hat{\sigma}_{\bar{d}^{(ab)}}} \rightarrow N(0, 1), \tag{30}$$
where $\bar{d}^{(ab)} = \frac{1}{T} \sum_{t=1}^{T} d_t^{(ab)}$ is the sample mean of $d_t^{(ab)}$ and
$\hat{\sigma}_{\bar{d}^{(ab)}}$ is a consistent estimate of the standard deviation of $\bar{d}^{(ab)}$.
Following Gu, Kelly, and Xiu (2020), we apply a modified DM test to make pairwise
comparisons of models’ performance when forecasting multi-asset volatility. Specifically,
the modified DM test compares the cross-sectional average of prediction errors from each
model, rather than comparing errors for each individual asset, that is,
$$d_t^{(ab)} = \frac{1}{N} \sum_{i=1}^{N} \left( L(e_{i,t}^{(a)}) - L(e_{i,t}^{(b)}) \right), \tag{31}$$
where $e_{i,t}^{(a)}$ ($e_{i,t}^{(b)}$) refers to the forecast error for stock i at time t from model a (b),
respectively.
To assess the statistical significance of the differences in out-of-sample volatility forecasts
as shown in Table 3, we report the results of all DM tests in terms of QLIKE for each
horizon.
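The DM statistic and its cross-sectional modification can be sketched in plain Python as follows. The standard error here assumes serially uncorrelated loss differences, which is a simplifying assumption. A HAC estimate such as Newey-West is a common alternative, and the paper does not specify which estimator it uses. All function names and numbers are illustrative.

```python
import math


def dm_statistic(loss_a, loss_b):
    """DM statistic for two aligned series of forecast losses.

    Assumes the loss differences are serially uncorrelated when
    estimating the standard deviation of their sample mean.
    """
    d = [la - lb for la, lb in zip(loss_a, loss_b)]
    t = len(d)
    d_bar = sum(d) / t
    var_d = sum((x - d_bar) ** 2 for x in d) / (t - 1)
    return d_bar / math.sqrt(var_d / t)


def cross_sectional_loss_diff(losses_a, losses_b):
    """Average the loss difference over the N stocks at each time t.

    `losses_a[t][i]` is the loss of model a for stock i at time t.
    """
    return [sum(la - lb for la, lb in zip(row_a, row_b)) / len(row_a)
            for row_a, row_b in zip(losses_a, losses_b)]


def modified_dm_statistic(losses_a, losses_b):
    """DM statistic applied to the cross-sectionally averaged differences."""
    d = cross_sectional_loss_diff(losses_a, losses_b)
    return dm_statistic(d, [0.0] * len(d))


def two_sided_p_value(stat):
    """Two-sided p-value under the N(0, 1) limiting distribution."""
    return math.erfc(abs(stat) / math.sqrt(2.0))


# Hypothetical losses for two models, two stocks, four periods.
losses_a = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 6.0]]
losses_b = [[0.5, 1.5], [1.0, 3.0], [2.0, 4.0], [2.5, 4.5]]
stat = modified_dm_statistic(losses_a, losses_b)
p_value = two_sided_p_value(stat)
```

A positive statistic indicates that model a has the larger average loss, so model b forecasts better under the chosen loss function.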
[Table of DM test results in terms of QLIKE. Panel A: 10-min. Panel B: 30-min. Panel C: 65-min. Panel D: 1-day.]
Weekly    Single      1.013   0.483   0.332   0.221   0.270   0.190   0.267   0.188
          Universal   1.021   0.517   0.333   0.230   0.270   0.190   0.268   0.189
          Augmented   0.995   0.453   0.323   0.228   0.262   0.185   0.256   0.180
Monthly   Single      1.013   0.483   0.332   0.222   0.270   0.190   0.267   0.189
          Universal   1.021   0.517   0.333   0.230   0.270   0.191   0.268   0.190
          Augmented   0.995   0.453   0.323   0.227   0.262   0.185   0.256   0.180
Yearly    Single      1.013   0.484   0.332   0.222   0.270   0.190   0.269   0.190
          Universal   1.021   0.518   0.333   0.230   0.270   0.191   0.269   0.190
          Augmented   0.995   0.453   0.323   0.227   0.262   0.186   0.257   0.180

Weekly    Single      2.694   2.069   3.459   3.042   3.543   3.096   3.551   3.518
          Universal   2.575   1.972   3.427   3.014   3.541   3.095   3.548   3.516
          Augmented   2.790   2.280   3.427   3.020   3.553   3.108   3.571   3.536
Monthly   Single      2.693   2.068   3.458   3.042   3.542   3.096   3.549   3.516
          Universal   2.574   1.972   3.427   3.014   3.541   3.095   3.547   3.514
          Augmented   2.789   2.279   3.426   3.020   3.553   3.107   3.571   3.536
Yearly    Single      2.690   2.065   3.457   3.040   3.542   3.095   3.548   3.515
          Universal   2.574   1.975   3.429   3.016   3.541   3.095   3.547   3.514
          Augmented   2.790   2.280   3.428   3.022   3.552   3.107   3.571   3.536
References
Andersen, Torben G., and Tim Bollerslev. 1997. Intraday Periodicity and Volatility Persistence in
Financial Markets. Journal of Empirical Finance 4: 115–158.
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Heiko Ebens. 2001. The
Distribution of Realized Stock Return Volatility. Journal of Financial Economics 61: 43–76.
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. 2003. Modeling and
Forecasting Realized Volatility. Econometrica 71: 579–625.
Andersen, Torben G., Tim Bollerslev, Peter F. Christoffersen, and Francis X. Diebold. 2006.
“Volatility and Correlation Forecasting.” In G. Elliott, C. Granger, and A. Timmermann (eds.),
Handbook of Economic Forecasting. Elsevier, 1 edition, Vol. 1, pp. 777–878.
Baker, Malcolm, and Jeffrey Wurgler. 2007. Investor Sentiment in the Stock Market. Journal of
Diebold, Francis X. 2015. Comparing Predictive Accuracy, Twenty Years Later: A Personal
Perspective on the Use and Abuse of Diebold–Mariano Tests. Journal of Business & Economic
Statistics 33: 1–1.
Diebold, Francis X., and Roberto S. Mariano. 1995. Comparing Predictive Accuracy. Journal of
Keynes, John Maynard. 2018. The General Theory of Employment, Interest, and Money.
Springer.
Kingma, Diederik P., and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.”
Working paper.