Autoencoder Asset Pricing Models

S. Gu, B. Kelly, D. Xiu
Journal of Econometrics 222 (2021) 429–450

Abstract

We propose a new latent factor conditional asset pricing model. Like Kelly, Pruitt, and Su (KPS, 2019), our model allows for latent factors and factor exposures that depend on covariates such as asset characteristics. But, unlike the linearity assumption of KPS, we model factor exposures as a flexible nonlinear function of covariates. Our model retrofits the workhorse unsupervised dimension reduction device from the machine learning literature – autoencoder neural networks – to incorporate information from covariates along with returns themselves. This delivers estimates of nonlinear conditional exposures and the associated latent factors. Furthermore, our machine learning framework imposes the economic restriction of no-arbitrage. Our autoencoder asset pricing model delivers out-of-sample pricing errors that are far smaller (and generally insignificant) compared to other leading factor models.

Keywords: Stock returns; Conditional asset pricing model; Nonlinear factor model; Machine learning; Autoencoder; Neural networks; Big data
1. Introduction
A recent asset pricing literature has emerged challenging the ‘‘anomaly’’ view of characteristic-based asset return
prediction. The anomaly view suggests that certain asset attributes have the power to forecast returns above and beyond
the expected return variation warranted as compensation for aggregate risk exposures. Kelly, Pruitt, and Su (KPS, 2019)
provide empirical evidence that these so-called anomaly asset characteristics in fact proxy for unobservable and time-
varying exposures to risk factors, and show that characteristics contain little (if any) anomalous return predictability once
their explanatory power for factor exposures has been accounted for. In other words, characteristics appear to predict
returns because they help pinpoint compensated aggregate risk exposures.
The asset pricing model proposed by KPS assumes that individual excess returns r_{i,t} possess a K-factor structure,

r_{i,t} = β'_{i,t−1} f_t + u_{i,t},  (1)

in which the K × 1 vector of conditional factor exposures β_{i,t−1} is a linear function of lagged, asset-specific characteristics z_{i,t−1}:

β_{i,t−1} = Γ z_{i,t−1}.  (2)
There are, nonetheless, no obvious theoretical or intuitive justifications for this convenient linearity assumption.
To the contrary, there are many reasons to expect that this assumption is violated. Essentially all leading theoretical
asset pricing models predict nonlinearities in return dynamics as a function of state variables; Campbell and Cochrane
(1999), Bansal and Yaron (2004), and He and Krishnamurthy (2013) are prominent examples. Theory also predicts complex
dynamics in factor risk exposures, as shown for example in the general equilibrium model of Santos and Veronesi (2004).
Moreover, Pohl et al. (2018) show that linear approximations to nonlinear models can lead to considerable errors in the
model predictions for the magnitude of the equity premium or return predictability.
We generalize the factor model in (1) using models from the autoencoder family. Autoencoders are workhorse
dimension reduction models in the field of machine learning. They can be thought of as a nonlinear, neural network
counterpart to principal components analysis (PCA). Both autoencoders and PCA are unsupervised methods—they attempt
to model the full panel of asset returns using only the returns themselves as inputs. The statistical content of both methods
is a bottleneck that enforces a parsimonious representation of the return data set. The PCA bottleneck uses a linear
mapping from N individual returns into K ≪ N factors, while autoencoders allow for a nonlinear mapping through
neural networks.
Neither method, in their standard form, uses information in covariates to guide dimension reduction. KPS propose
‘‘instrumented’’ PCA (IPCA), which allows the information in covariates to guide the reduction via Eq. (2) but remains
reliant on the linear model formulation.
In this paper, we introduce a new conditional autoencoder model for individual stock returns which, like IPCA, allows
covariates to help guide dimension reduction. Our autoencoder uses a neural network-based compression of returns into a
low-dimensional set of factors, allowing stock characteristic covariates to have nonlinear and interactive effects on factor
exposures. At the same time, we make economically guided choices in structuring the autoencoders, imposing that the
factors are interpretable as portfolios, i.e., that they are linear combinations of individual equity returns. Ultimately, ours is
a nonlinear conditional asset pricing model, where the nonlinearities manifest through a flexible neural network mapping
of covariates into betas.
Our empirical analysis of a 60-year history of individual equity returns in the US shows that our autoencoder model
dominates observable factor models in the tradition of Fama and French (1993) that use static factor betas, as well as the
more sophisticated models such as the linear conditional beta specification of KPS. We follow KPS and compare models
based on two statistical criteria. The first is a model’s ‘‘total R2 ’’, which describes the fraction of variation in the full panel
of returns explained by contemporaneous factor realizations. This measures the model’s ability to describe the riskiness,
or joint return covariation, in the set of assets. The second criterion is a model’s predictive R2 , which describes the fraction
of return variation described by lagged conditional betas. This measures a model’s ability to describe differences in risk
compensation across assets.
We conduct all of our comparisons on a purely out-of-sample basis. Focusing, for example, on three-factor specifica-
tions, we find that monthly total R2 from our preferred autoencoder, from IPCA, and from the Fama–French three-factor
model are 12.6%, 13.3%, and 3.4%, respectively, while the predictive R2 is 0.50% and 0.23% for the autoencoder and IPCA,
and is negative for the Fama–French model.
We also compare the relative performance of models in economic terms. In particular, we form long–short decile spread
portfolios directly sorted on out-of-sample stock return predictions from each model. Portfolios based on the three-factor
autoencoder, IPCA, and Fama–French models earn annualized Sharpe Ratios of 2.16, 1.26, and −0.40, respectively, when
portfolios are equal weighted. For value weighted portfolios, the respective Sharpe ratios are 0.92, 0.59, and -0.69. In
summary, our conditional autoencoder model outperforms its leading competitors by a wide margin.
We contribute to a burgeoning literature using high dimensional statistical methods, including machine learning
techniques, to analyze the cross section of risk and return in financial markets. The leading example in this vein is (Gu
et al., 2019), who conduct a comparison of machine learning methods for predicting the panel of individual US stock
returns. That paper outlines a new research agenda marrying machine learning with the study of asset risk premia.1 (Gu
et al., 2019) focus on supervised prediction models but take no stand on the risk-return tradeoff. In other words, their
approaches do not constitute asset pricing models. In this paper, we develop asset pricing models using unsupervised
and semi-supervised learning methods that model the risk-return tradeoff explicitly. Autoencoders are critical tools in
the machine learning suite that have enjoyed success in a wide range of practical applications.2 Our contribution links the
methodology literature on autoencoders with the large finance literature on factor pricing models.
KPS, who focus specifically on conditional factor pricing models, is the closest predecessor to our analysis.3 They
unify the literature on linear latent factor APT models (starting from Ross, 1976) with the literature on characteristic-
based ‘‘anomaly’’ return prediction. One of our contributions is to extend their work by allowing for a general nonlinear
specification of the return factor structure. To do so, we augment the traditional autoencoder by embedding a neural
1 Other related asset pricing papers include Feng et al. (2019b), Freyberger et al. (2017), Kozak et al. (2017), Feng et al. (2019a), Giglio and Xiu
(2018), and Gagliardini et al. (2016), among others.
2 See, for example, Gallinari et al. (1987), Bourlard and Kamp (1988), Hinton and Zemel (1994), and Goodfellow et al. (2016).
3 A related contemporaneous paper that builds on the instrumented beta approach of KPS using kernel methods is (Kozak, 2019).
network in the specification of conditional betas, which allows characteristics to determine risk exposures through flexible
nonlinearities and interactions, generalizing the linear ‘‘instrumented’’ beta specification of KPS.4
Another related predecessor is (Kozak et al., 2018), who propose an approach to factor analysis for asset pricing using
principal components from ‘‘anomaly’’-sorted portfolios (reviving an earlier literature on this approach exemplified by
Chamberlain and Rothschild, 1983; Connor and Korajczyk, 1986). This approach, however, fails to explain the risk-return
tradeoff when the test assets are individual stocks. Our conditional autoencoder model does not suffer this shortcoming: it accurately describes risk and expected return for individual stocks as well as for anomaly and other stock portfolios. More broadly, we show that our conditional autoencoder formulation is a valid asset pricing model. It is
equivalent to a nonparametric model for a stochastic discount factor, and imposes the economic restriction of no-arbitrage
pricing. One illustration of the conditional autoencoder’s empirical success as a pricing model is that it correctly prices so-
called anomaly portfolios with small and insignificant pricing errors. The pricing errors in our model, which are measured
on a purely out-of-sample basis, are a fraction of the magnitude of those from traditional Fama–French factor models.
The rest of the paper is organized as follows. In Section 2, we set up the model and present our methodology. Section 3
presents our empirical studies. Section 4 provides Monte Carlo simulations that demonstrate the performance of our
procedures. Section 5 concludes. The appendix contains mathematical proofs and detailed algorithms.
2. Methodology
In this section we first describe standard autoencoders, then we introduce our new conditional autoencoder that
leverages covariates as conditioning information. We highlight parallels between autoencoder models and widely studied
factor pricing models, and illustrate that linear factor models (and their associated PCA and IPCA estimators) are a special
case of autoencoders. Finally, we describe our autoencoder estimation approach.
An autoencoder is a special neural network in which the outputs attempt to approximate the input variables. The input
variables pass through a small number of neurons in the hidden layer(s), forging a compressed representation of the input
(encoding), which is then unpacked and mapped to the output layer (decoding). Because no other variables are used in
this model besides the inputs, an autoencoder is an unsupervised learning device.
Neural network models (including autoencoders) with L hidden layers can be written using the following recursive formula. Let K^(l) denote the number of neurons in each layer l = 1, . . . , L. Define the output of neuron k in layer l as r_k^(l), and the vector of all outputs for this layer as r^(l) = (r_1^(l), . . . , r_{K^(l)}^(l))'. In each hidden layer, inputs from the previous layer are transformed according to a nonlinear activation function g(·) before being passed to the next layer. To initialize the network, the input layer uses the cross section of returns as raw predictors, r^(0) = r = (r_1, . . . , r_N)'. The recursive output formula for the neural network in layer l > 0 is then

r^(l) = g(b^(l−1) + W^(l−1) r^(l−1)),  (3)

where W^(l−1) is a K^(l) × K^(l−1) matrix of weight parameters, and b^(l−1) is a K^(l) × 1 vector of so-called bias parameters. Throughout this paper we use as our nonlinear activation function the rectified linear unit (ReLU), g(y) = max(y, 0).5 The number of parameters in each hidden layer l is K^(l)(1 + K^(l−1)). The final output of the autoencoder,

G(r; b, W) = b^(L) + W^(L) r^(L),  (4)
which shares the same dimension as the input, is employed to approximate r itself. Fig. 1 illustrates the architecture of
a simple linear autoencoder with a single hidden layer.
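To make the architecture concrete, the sketch below implements a single-hidden-layer autoencoder along the lines of Eqs. (3)–(4) in PyTorch. It is a minimal illustration, not the authors' implementation; the layer sizes, optimizer, and toy data are arbitrary choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Standard autoencoder: compress N returns into K neurons, then reconstruct."""
    def __init__(self, n_assets, n_factors):
        super().__init__()
        # Encoding, Eq. (3): one hidden layer of K neurons with ReLU activation.
        self.encoder = nn.Sequential(nn.Linear(n_assets, n_factors), nn.ReLU())
        # Decoding, Eq. (4): map the bottleneck back to the N outputs.
        self.decoder = nn.Linear(n_factors, n_assets)

    def forward(self, r):
        return self.decoder(self.encoder(r))

# Minimal training loop on a toy T x N panel of returns.
T, N, K = 240, 100, 5
returns = torch.randn(T, N)
model = Autoencoder(N, K)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(returns), returns)  # outputs approximate inputs
    loss.backward()
    opt.step()
```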
4 Our approach to autoencoder learning for factor models is also related to work relying on kernel or sieve approximations, e.g., Connor et al. (2012)
and Fan et al. (2016). Recent work also considers machine learning tools for factor model estimation, such as nuclear norm penalization (Bai and Ng,
2017; Moon and Weidner, 2018) and partial least squares (Kelly and Pruitt, 2015). Our autoencoder approach allows for flexible (and conditional)
model structures, and uses stochastic gradient descent to manage the computational complexity in the resulting big data and high parameter setting.
That said, a thorough theoretical analysis on the statistical properties of this alternative learning method is left for future research.
5 Without ambiguity, we regard g(y) as an entry-wise vector-valued function whose length is equal to that of its input vector y.
6 For example, Connor and Korajczyk (1986) and Kozak et al. (2017).
Fig. 1. Standard autoencoder model. Note: This figure describes a standard autoencoder with one hidden layer. The output and input layers are identical, while the hidden layer is a low dimensional compression of input variables into latent factors, which can be expressed as weighted linear combinations of input variables.
The benchmark static linear factor model posits that the N × 1 vector of excess returns obeys

r_t = β f_t + u_t,  (5)

where r_t is a vector of returns in excess of the risk-free rate, f_t is a K × 1 vector of factor returns, u_t is an N × 1 vector of idiosyncratic errors (uncorrelated with f_t), and β is an N × K matrix of factor loadings. This resembles the standard factor model setting whose econometric properties are studied in Bai and Ng (2002), Bai (2003), and Giglio and Xiu (2018),
among others.
Stacking the time series vectors, the matrix form of the factor model is
R = βF + U.
Following Stock and Watson (2002) and Bai and Ng (2002), a factor model can be estimated with PCA on the covariance
matrix of returns.7 Equivalently, a singular value decomposition (SVD) of R̄, the demeaned returns of R, yields estimates
of factors and factor loadings directly.8 That is,
R̄ = P̂ ΛQ̂ + Û , (6)
where P̂ and Q̂ are, respectively, the N × K and K × T matrices of left and right singular vectors, and Û is a N × T matrix
of residuals.
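For illustration, the following NumPy sketch recovers loadings and factors from the SVD of the demeaned return panel as in Eq. (6); the simulated data and variable names are ours, not the paper's.

```python
import numpy as np

def pca_factors(R, K):
    """Static K-factor estimates from an N x T panel of excess returns.

    Returns (loadings, factors) with R_demeaned ≈ loadings @ factors.
    """
    R_bar = R - R.mean(axis=1, keepdims=True)          # demean each asset's series
    P, s, Qt = np.linalg.svd(R_bar, full_matrices=False)
    loadings = P[:, :K]                                 # N x K left singular vectors
    factors = np.diag(s[:K]) @ Qt[:K, :]                # K x T, i.e. Lambda @ Q-hat
    return loadings, factors

# Example with simulated data: 100 assets, 240 months, 3 factors.
rng = np.random.default_rng(0)
R = rng.standard_normal((100, 240))
beta_hat, f_hat = pca_factors(R, K=3)
```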
The machine learning literature has long recognized the close connection between autoencoders and PCA (e.g., Baldi
and Hornik, 1989) and, by extension, between autoencoders and latent factor asset pricing models. When the autoencoder
has one hidden layer and a linear activation function, it is equivalent to the PCA estimator for linear factor models
described above.
More specifically, we can write the one-layer, linear autoencoder with K neurons as

r_t = b^(1) + W^(1)(b^(0) + W^(0) r_t) + u_t,  (7)

where W^(0), W^(1), b^(1), and b^(0) are K × N, N × K, N × 1, and K × 1 matrices of parameters, respectively. The model can
be estimated by solving the following optimization problem:
min_{b,W} ∑_{t=1}^{T} ∥ r_t − (b^(1) + W^(1)(b^(0) + W^(0) r_t)) ∥²
  = min_{b,W} ∥ R − (b^(1) ι' + W^(1)(b^(0) ι' + W^(0) R)) ∥_F²,  (8)
where the F subscript denotes the Frobenius norm, and ι is a T × 1 vector of ones. The next proposition establishes the link
between this simple autoencoder and the PCA estimator.
7 A subtle consideration in estimating asset pricing models with PCA is choosing whether to impose the zero intercept no-arbitrage restriction.
Imposing the restriction amounts to applying PCA to the uncentered second moment matrix of excess returns, rather than to the (centered) covariance
matrix.
8 This approach is applicable with any (fixed) number of factors K . An extensive literature studies methods for choosing K for a large N and
large T panel, including Bai and Ng (2002), Hallin and Liška (2007), Amengual and Watson (2007), Alessi et al. (2010), Kapetanios (2010), Onatski
(2010), Ahn and Horenstein (2013), and Aït-Sahalia and Xiu (2017). Throughout the paper, we will treat K as a tuning parameter, which in principle
can be estimated via cross validation.
Fig. 2. Conditional autoencoder model. Note: This figure presents the diagram of an autoencoder augmented to incorporate covariates in the factor
loading specification. The left-hand side describes how factor loadings βt −1 at time t − 1 (in green) depend on firm characteristics Zt −1 (in yellow)
of the input layer 1 through an activation function g on neurons of the hidden layer. Each row of yellow neurons represents the P × 1 vector of
characteristics of one ticker. The right-hand side describes the corresponding factors at time t. ft nodes (in purple) are weighted combinations of
neurons of the input layer 2, which can either be P characteristic-managed portfolios xt (in pink) or N individual asset returns rt (in red). In the
latter case, the input layer 2 is exactly what the output layer aims to approximate, which is the same as a standard autoencoder. (For interpretation
of the references to color in this figure legend, the reader is referred to the web version of this article.)
Proposition 1. The solutions to the optimization problem (8) satisfy

Ŵ^(1) = P̂A,  Ŵ^(0) = (Ŵ^(1)' Ŵ^(1))^{−1} Ŵ^(1)',  b̂^(1) = r̄ − Ŵ^(1) b̂^(0) − Ŵ^(1) Ŵ^(0) r̄,  b̂^(0) = a,

where A is any K × K non-singular matrix, a is an arbitrary K × 1 constant vector, r̄ is the sample average of r_t, and P̂ is from Eq. (6).9
Proposition 1 shows that the linear autoencoder with a hidden layer of K neurons is a linear factor model with K
latent factors. The estimated factor loadings are Ŵ (1) , and the estimated factors are Ŵ (0) R. These span the same spaces
as P̂ and Q̂ in Eq. (6). Needless to say, autoencoder models are more general than the linear factor model as they allow
for dimension reduction via layers of nonlinear transformations of rt . This additional flexibility has proven valuable in a
variety of applications outside of finance. For example, Hinton and Salakhutdinov (2006) show that deep autoencoders
handily outperform shallow or linear autoencoders for image recognition.
The static linear factor model has been an extremely productive tool for studying asset returns, but recent research
highlights a number of its limitations. The distribution of asset returns is well known to be highly time-varying, and
static factor models abstract from a wealth of relevant conditioning information. For example, KPS demonstrate large
empirical gains from incorporating asset-specific covariates in the specification of factor loadings. Not only do these
covariates improve the estimation of loadings, they also indirectly improve estimates of the latent factors themselves.
Their model formulation amounts to a combination of two linear models: One linear specification for the latent factors,
shown in Eq. (1), and another linear specification for conditional betas, shown in (2).
While the standard autoencoder in (5) is a powerful tool for dimension reduction, it shares the same limitation
as PCA that it does not leverage conditioning variables to identify the factor structure, and instead relies only on
returns themselves. To overcome this limitation, we design a new neural network structure by augmenting the standard
autoencoder model to incorporate covariates.
Fig. 2 illustrates the basic structure of our conditional autoencoder. The left side of the network models factor loadings
as a nonlinear function of covariates (e.g., asset characteristics), while the right side network models factors as portfolios
of individual stock returns. At the highest level, the mathematical representation of the model is identical to Eq. (1):

r_{i,t} = β'_{i,t−1} f_t + u_{i,t},  (9)

where the conditional betas β_{i,t−1} and the latent factors f_t are now the outputs of the two networks shown in Fig. 2.
9 We provide a proof for the equivalence between the standard (centered) PCA and the linear autoencoder (with biases). The equivalence in the
uncentered case is similar.
The first key difference between our model and IPCA is in the formulation of conditional betas. We specify the K × 1
vector βi,t −1 as a neural network model of lagged firm characteristics, zi,t −1 . The recursive formulation for the nonlinear
beta function is:
z_{i,t−1}^(0) = z_{i,t−1},  (10)
z_{i,t−1}^(l) = g(b^(l−1) + W^(l−1) z_{i,t−1}^(l−1)),  l = 1, . . . , L_β,  (11)
β_{i,t−1} = b^(L_β) + W^(L_β) z_{i,t−1}^(L_β).  (12)
Eq. (10) initializes the network as a function of the baseline characteristic data, zi,t −1 . The equations in (11) describe the
nonlinear (and interactive) transformation of characteristics as they propagate through hidden layer neurons.10 Eq. (12)
describes how a set of K -dimensional factor betas emerge from the terminal output layer. This formalizes the left side of
Fig. 2.
On the right side of Fig. 2, we see an otherwise standard autoencoder for the factor specification. The recursive
mathematical formulation of the factors is:
r_t^(0) = r_t,  (13)
r_t^(l) = g̃(b̃^(l−1) + W̃^(l−1) r_t^(l−1)),  l = 1, . . . , L_f,  (14)
f_t = b̃^(L_f) + W̃^(L_f) r_t^(L_f).  (15)
Eq. (13) initializes the network with the vector of individual asset returns, rt . Equations in (14) transform and compress
the dimensionality of returns as they propagate through hidden layers. Eq. (15) describes the final set of K factors at the
output layer. Throughout our empirical analysis, we assume a single linear layer in the factor network, that is, L_f = 1, because this structure maintains the economic interpretation of the factors: they are themselves portfolios (linear combinations of underlying asset returns).
Finally, the ‘‘dotted operation’’ multiplies the N × K matrix output of the beta network with the K × 1 output of the factor network to produce the final model fit for each individual asset return.
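The following PyTorch sketch wires these pieces together for a single month of data: a beta network maps each stock's characteristics into K loadings (Eqs. (10)–(12)), a one-layer linear factor network maps returns into K factors (Eqs. (13)–(15)), and their product gives the fitted returns. Layer sizes and data are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConditionalAutoencoder(nn.Module):
    """Beta network (characteristics -> loadings) combined with a factor network."""
    def __init__(self, n_chars, n_assets, n_factors, hidden=32):
        super().__init__()
        # Left side of Fig. 2: nonlinear beta network, Eqs. (10)-(12).
        self.beta_net = nn.Sequential(
            nn.Linear(n_chars, hidden), nn.ReLU(),
            nn.Linear(hidden, n_factors),
        )
        # Right side of Fig. 2: one linear layer (L_f = 1), so factors stay
        # linear combinations of the inputs, Eqs. (13)-(15).
        self.factor_net = nn.Linear(n_assets, n_factors)

    def forward(self, Z_lag, r):
        beta = self.beta_net(Z_lag)   # N x K conditional loadings beta(z_{i,t-1})
        f = self.factor_net(r)        # K latent factors f_t
        return beta @ f               # N fitted returns: the "dotted operation"

# One month of toy data: N stocks, P characteristics, K factors.
N, P, K = 500, 94, 3
Z_lag = torch.rand(N, P) * 2 - 1     # characteristics mapped into (-1, 1)
r = torch.randn(N)                   # excess returns at month t
model = ConditionalAutoencoder(n_chars=P, n_assets=N, n_factors=K)
loss = nn.functional.mse_loss(model(Z_lag, r), r)
```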
In practice, using the full cross section of individual stock returns in the factor network faces two daunting obstacles.
The first is that the number of individual firms in our sample is roughly 30,000, which means that the number of weight
parameters in the factor network can be astronomical, while the number of time series observations in our data set is a
mere 720. Second, the panel is extremely unbalanced—in any given month, we have on average around 6,000 non-missing
stocks, thus most of the stock-level weight parameters would require estimation from very few time series observations.
We therefore make one key modification to the factor side of the model that massively reduces the model’s
computational cost up front. Instead of initializing the network with the full cross section of stock returns in Eq. (13),
we instead initialize it with a set of portfolios, defined as

x_t = (Z'_{t−1} Z_{t−1})^{−1} Z'_{t−1} r_t.  (16)
This P × 1 vector is a set of portfolios that are dynamically re-weighted (or ‘‘managed’’) on the basis of stock-level
characteristics. The jth element of x_t is akin to a return on a long–short portfolio constructed by sorting stocks based on the jth characteristic. Initializing the model with r_t^(0) = x_t accomplishes three things at once. First, it performs a
preliminary reduction of the data that eliminates tens of thousands of parameters from the model. This preprocessing
step in (16) can be viewed as adding a new initial layer to the factor neural network that dynamically (as a function
of Zt −1 ) collapses the N returns, rt , down to P neurons, xt , before proceeding with the rest of the network propagation.
Second, it sidesteps issues of panel incompleteness, since portfolios are formed from the subset of stocks that are non-
missing at each point in time. Third, it connects the factor autoencoder to the finance literature on characteristic-managed
portfolios, including KPS, Feng et al. (2019a), Kozak et al. (2017), and Giglio and Xiu (2018). KPS and Giglio and Xiu (2018),
in particular, show that conditional linear factor models for individual stocks can be recast as static factor analysis on
characteristic-managed portfolios.
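A minimal NumPy sketch of the construction in Eq. (16) is shown below; it builds x_t month by month from whichever stocks are non-missing at that date. Function and variable names are illustrative.

```python
import numpy as np

def managed_portfolios(Z_lag, r):
    """x_t = (Z_{t-1}' Z_{t-1})^{-1} Z_{t-1}' r_t for one month.

    Z_lag: N x P matrix of lagged characteristics; r: length-N vector of excess returns.
    Rows with any missing value are dropped, so the panel need not be balanced.
    """
    keep = ~(np.isnan(Z_lag).any(axis=1) | np.isnan(r))
    Zk, rk = Z_lag[keep], r[keep]
    return np.linalg.solve(Zk.T @ Zk, Zk.T @ rk)   # P x 1 vector of portfolio returns

# Example: roughly 6,000 stocks and 94 characteristics in a given month.
rng = np.random.default_rng(1)
Z_lag = rng.uniform(-1, 1, size=(6000, 94))
r = rng.standard_normal(6000)
x_t = managed_portfolios(Z_lag, r)
```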
10 Eqs. (11) and (14) assume the convention that z_{i,t−1}^(l) and r_t^(l) are vectors that stack the output from all neurons in the lth layer.
The IPCA estimator of KPS obtains factors and loadings by solving

min_{Γ,F} ∑_{t=1}^{T} ∥ r_t − Z_{t−1} Γ' f_t ∥²,  (17)

where F = (f_1, f_2, . . . , f_T) and Z_t = (z'_{1,t}, z'_{2,t}, . . . , z'_{N,t})', subject to restrictions that identify a unique rotation of factors.11
For comparison, consider a particularly simple version of the conditional autoencoder that uses a linear activation function and one layer of K neurons on both the beta side and the factor side of the network. In this case, β_{i,t−1} = W_0 z_{i,t−1} and f_t = W_1 x_t, so the estimation objective of the conditional autoencoder is:12

min_{W_0,W_1} ∑_{t=1}^{T} ∥ r_t − Z_{t−1} W'_0 W_1 x_t ∥².  (18)
The next proposition formalizes the equivalence between IPCA and this linear conditional autoencoder.
Proposition 2. The solution to (18) is equivalent to the solution of (17) if Zt′ Zt = Σ for a constant matrix Σ .
In the general case where Zt′ Zt is non-constant, the two estimators are similar but no longer equivalent (as we can see
from the proof). We find that the empirical performance of (17) and (18) is similar in our data.
Autoencoders, like neural networks more broadly, have many advantages relative to traditional linear factor models. In
particular, the high capacity of a neural network model enhances its flexibility to construct the most informative features
from data. With enhanced flexibility, however, comes a higher propensity to overfit. Following Gu et al. (2019), we take
a variety of measures to alleviate overfitting, including a careful design of the empirical strategy and the extensive use of
regularization.
We estimate the model by minimizing a penalized version of the least squares objective,

min_θ L(θ) + φ(θ; λ),  (19)

where L(θ) denotes the sum of squared model errors and φ(θ; λ) is a penalty function that regularizes the model. There are many choices for the penalty function φ(·); we use LASSO, or ‘‘l1’’, penalization, which takes the form

φ(θ; λ) = λ ∑_j |θ_j|.
The fortunate geometry of the LASSO penalty sets coefficients on a subset of covariates to exactly zero. In this sense, the
LASSO imposes sparsity on weight parameters, encouraging insignificant weights to vanish. The LASSO penalty involves
a non-negative hyperparameter, λ, which is determined in the validation sample.
In addition to l1 -penalization, we employ a second machine learning regularization tool known as ‘‘early stopping’’. It
begins from an initial parameter guess that imposes parsimonious parameterization (for example, setting all θ values close
to zero). In each step of the optimization algorithm, the parameter guesses are gradually updated to reduce fitting errors
in the training sample. At each new guess, estimates are also constructed for the validation sample, and the optimization
is terminated when the validation sample errors begin to increase. This typically occurs before the fitting errors are
minimized in the training sample, hence its name (see Algorithm 1). By ending the parameter search early, parameters
are shrunken toward the initial guess, and this is how early stopping regularizes against overfitting. It is a popular substitute for ‘‘l2’’-penalization of θ parameters because it achieves regularization at a much lower computational cost.
As a third regularization technique, we adopt an ensemble approach in training our neural networks. In particular, we
use multiple random seeds, say, 10, to initialize neural network estimation and construct model predictions by averaging
estimates from all networks. This enhances the stability of the results because the stochastic nature of the optimization
can cause different seeds to settle at different optima.
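The sketch below (PyTorch, with illustrative hyperparameters) combines the three devices in one training routine: an l1 penalty added to the training loss, early stopping on a validation sample, and an ensemble average over random seeds. It assumes a `model_factory()` returning a conditional autoencoder like the earlier sketch and data tuples `(Z, r, x)`; both are hypothetical names, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

def l1_penalty(model, lam):
    # LASSO penalty phi(theta; lambda) = lambda * sum_j |theta_j|
    return lam * sum(p.abs().sum() for p in model.parameters())

def train_one(model, data_train, data_valid, lam=1e-4, patience=5, max_epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_loss, best_state, strikes = float("inf"), None, 0
    for epoch in range(max_epochs):
        Z, r, x = data_train
        model.train()
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Z, x), r) + l1_penalty(model, lam)
        loss.backward()
        opt.step()
        # Early stopping: monitor the un-penalized error on the validation sample.
        Zv, rv, xv = data_valid
        model.eval()
        with torch.no_grad():
            valid_loss = nn.functional.mse_loss(model(Zv, xv), rv).item()
        if valid_loss < best_loss:
            best_loss, best_state, strikes = valid_loss, copy.deepcopy(model.state_dict()), 0
        else:
            strikes += 1
            if strikes >= patience:
                break                      # stop before training error is minimized
    model.load_state_dict(best_state)
    return model

def ensemble_predict(model_factory, data_train, data_valid, Z_test, x_test, n_seeds=10):
    preds = []
    for seed in range(n_seeds):
        torch.manual_seed(seed)            # a different random initialization per seed
        m = train_one(model_factory(), data_train, data_valid)
        with torch.no_grad():
            preds.append(m(Z_test, x_test))
    return torch.stack(preds).mean(dim=0)  # average predictions across the ensemble
```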
3. An empirical study of US equities

3.1. Data
We analyze the same dataset studied in Gu et al. (2019), which contains monthly individual stock returns from the
Center for Research in Securities Prices (CRSP) for all firms listed in the three major exchanges: NYSE, AMEX, and NASDAQ.
We use the Treasury bill rate to proxy for the risk-free rate from which we calculate individual excess returns. Our sample
begins in March 1957 (the start date of the S&P 500) and ends in December 2016, totaling 60 years.
In addition, we build a large collection of stock-level predictive characteristics based on the cross section of stock
returns literature. These include 94 characteristics (61 updated annually, 13 updated quarterly, and 20 updated monthly); see Gu et al. (2019) for a full list. Most of these characteristics are released to the public with a delay.
To avoid a forward-looking bias, we assume that monthly characteristics are delayed by at most one month, quarterly
releases are delayed with a four month lag, and annual releases with a six month lag. Thus, we match realized returns
at month t with the most recent monthly characteristics at the end of month t − 1, the most recent quarterly data as of
t − 4, and most recent annual data as of t − 6. Observations are occasionally missing some characteristics. We replace a
missing characteristic with the cross-sectional median of that characteristic during that month.
Distributions of some characteristics are highly skewed and leptokurtic. Following a common tack in the literature
that avoids undue influence of outlying observations, we rank-normalize all characteristics into the interval (−1, 1) for
each month t. We then form 94 managed portfolios using (16). We also include one equal-weighted market portfolio that
corresponds to a constant regressor in Zt −1 .
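A small pandas sketch of this monthly preprocessing (column names hypothetical): missing characteristics are replaced by the cross-sectional median and each characteristic is rank-mapped into (−1, 1).

```python
import pandas as pd

def preprocess_month(df, char_cols):
    """Cross-sectional cleaning of stock-level characteristics for one month."""
    out = df.copy()
    for col in char_cols:
        out[col] = out[col].fillna(out[col].median())   # median-impute missing values
        ranks = out[col].rank(method="average")         # ranks 1..n
        n = len(out[col])
        out[col] = 2.0 * ranks / (n + 1) - 1.0          # map ranks into (-1, 1)
    return out

# Toy example for one month with two characteristics.
month = pd.DataFrame({"permno": [1, 2, 3, 4],
                      "mvel1": [8.1, None, 6.5, 7.2],
                      "mom12m": [0.2, -0.1, None, 0.4]})
clean = preprocess_month(month, ["mvel1", "mom12m"])
```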
Unlike the existing literature, we do not impose any filters based on stock prices or share codes, or rule out financial
firms. Past literature has imposed these filters in large part because they find it difficult to reconcile the return behavior
of low price stocks, uncommon share codes, and financial sector stocks with the rest of the sample. We have no such
difficulty thanks to the superior capacity of our model and the rich feature sets we allow for in our framework. The total
number of stocks in our sample is nearly 30,000, with the average number of stocks per month exceeding 6,200.
We compare a range of latent factor models in our empirical analysis. The first model, which we refer to as ‘‘PCA’’,
corresponds to specification (5). It assumes a linear functional form, constant betas, no conditioning information, and is
estimated via PCA.13 The second model we refer to as ‘‘IPCA’’ and follows the KPS specification of (1) and (2). In particular,
it assumes a linear factor structure and conditional betas that are likewise linear in covariates.
We then consider a range of conditional autoencoder (CA) architectures with varying degrees of complexity. The
simplest, which we denote CA0 , uses a single linear layer in both the beta and factor networks as described in (18),
making it similar (but not identical) to IPCA. Next, CA1 adds a hidden layer with 32 neurons in the beta network. Finally,
CA2 and CA3 add a second and third hidden layer, with 16 and 8 neurons respectively, to the beta side.
CA0 through CA3 all maintain a one-layer linear specification on the factor side of the model. In these cases, the only
variation in factor specification is in the number of neurons, which we allow to range from 1 to 6, and which corresponds
to the number of factors in the model.
We also compare the autoencoder specifications against benchmark models with observable factors. We refer to these
models collectively as ‘‘FF’’, which possess 1 to 6 factors. The first observable factor is the excess market return, then we
add SMB, HML, and UMD, sequentially. The five-factor model is the market, SMB, HML, CMA, and RMW, and the six-factor
model again appends UMD.14
We divide the 60 years of data into 18 years of training sample (1957–1974), 12 years of validation sample (1975–
1986), and the remaining 30 years (1987–2016) for out-of-sample testing. Because machine learning algorithms are
computationally intensive, we avoid recursively refitting models each month. Instead, we refit once every year as most
of our signals are updated once per year. Each time we refit, we increase the training sample by one year. We maintain
the same size of the validation sample, but roll it forward to include the most recent twelve months.
We evaluate out-of-sample model performance using the total and predictive R2 s defined by KPS. These pool errors
across firms and over time into grand panel-level assessments of each model. The total R2 quantifies the explanatory
power of contemporaneous factor realizations, and thus assesses the model’s description of individual stock riskiness:
R²_total = 1 − [ ∑_{(i,t)∈OOS} (r_{i,t} − β̂'_{i,t−1} f̂_t)² ] / [ ∑_{(i,t)∈OOS} r²_{i,t} ].  (20)
The OOS set indicates that fits are only assessed on the testing subsample, whose data never enter into model estimation
or tuning.
The predictive R2 assesses the accuracy of model-based predictions of future individual excess stock returns. This
quantifies a model’s ability to explain panel variation in risk compensation. It is defined as
R²_pred = 1 − [ ∑_{(i,t)∈OOS} (r_{i,t} − β̂'_{i,t−1} λ̂_{t−1})² ] / [ ∑_{(i,t)∈OOS} r²_{i,t} ],  (21)

where λ̂_{t−1} is the prevailing sample average of the estimated factors, f̂, up through month t − 1.
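As a concrete illustration, both criteria can be computed from out-of-sample fits as in the NumPy sketch below; `beta_hat`, `f_hat`, and `lam_hat` stand for the estimated loadings, factor realizations, and prevailing factor means, and all names and the toy data are ours.

```python
import numpy as np

def total_r2(r, beta_hat, f_hat):
    """Eq. (20): panel R2 of contemporaneous fits beta'_{i,t-1} f_t."""
    fitted = np.einsum("itk,tk->it", beta_hat, f_hat)      # N x T fitted returns
    return 1.0 - np.nansum((r - fitted) ** 2) / np.nansum(r ** 2)

def predictive_r2(r, beta_hat, lam_hat):
    """Eq. (21): panel R2 of forecasts beta'_{i,t-1} lambda_{t-1}."""
    pred = np.einsum("itk,tk->it", beta_hat, lam_hat)
    return 1.0 - np.nansum((r - pred) ** 2) / np.nansum(r ** 2)

# Toy example: N=50 stocks, T=12 months, K=3 factors.
rng = np.random.default_rng(2)
N, T, K = 50, 12, 3
beta_hat = rng.standard_normal((N, T, K))
f_hat = rng.standard_normal((T, K))
lam_hat = np.cumsum(f_hat, axis=0) / np.arange(1, T + 1)[:, None]  # running mean of factors
r = np.einsum("itk,tk->it", beta_hat, f_hat) + 0.5 * rng.standard_normal((N, T))
print(total_r2(r, beta_hat, f_hat), predictive_r2(r, beta_hat, lam_hat))
```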
13 The universe of individual stocks changes over time, so our PCA estimation must cope with unbalanced panels. We use an EM algorithm for
PCA (Stock and Watson, 2002). The IPCA algorithm devised by KPS is robust to missing data. Likewise, the SGD algorithm for autoencoder models is
not affected by missing data, because individual stock returns across periods are collected in a single pool from which random batches of observations
are drawn.
14 Market, SMB, HML, CMA, RMW, and UMD factor returns are from Ken French’s website.
Table 1
Out-of-sample R2total (%) comparison.
Model   Test assets   K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
FF rt 4.8 4.6 3.4 0.1 −2.3 −6.1
xt 49.4 64.8 69.5 70.4 71.1 72.2
PCA rt 7.3 3.3 5.0 5.3 4.2 3.9
xt 66.6 31.7 46.2 47.1 39.8 34.8
IPCA rt 11.2 12.4 13.3 13.7 14.3 14.5
xt 78.0 86.8 92.1 93.8 96.0 96.7
CA0 rt 10.9 11.8 12.3 12.2 12.5 12.4
xt 73.7 80.4 86.2 87.1 87.7 85.9
CA1 rt 10.4 11.5 12.2 12.9 13.4 14.3
xt 71.4 78.3 82.2 85.2 87.2 92.2
CA2 rt 10.7 11.8 12.6 13.2 13.6 13.8
xt 72.7 79.5 84.0 86.4 88.2 89.3
CA3 rt 10.7 11.8 12.5 13.3 13.7 13.8
xt 73.1 79.9 83.6 87.1 88.8 89.0
Note: In this table, we report the out-of-sample total R2 (%) for individual stocks rt and managed portfolios xt using observable factor models
(FF), PCA, IPCA, and conditional autoencoders CA0 through CA3 . In all cases, the number of factors K varies from 1 to 6.
Table 1 also reports the out-of-sample total R2 at the level of managed portfolios, x_t. Given our predictions at the individual equity level, the prediction at the portfolio level is immediately available. Model refitting is not needed because the portfolio weights are known ex ante (see Gu et al. (2019) for more discussion of this ‘‘bottom-up’’ approach to prediction).
These portfolios are large and diversified collections of individual stocks, thus much of the idiosyncratic risk in the data
is averaged out. As a result, total R2 at the portfolio-level tends to be far higher. It is still the case that IPCA provides the
best fit, followed by CA1 . The comparative performance of observable factor FF models is much improved at the portfolio
level,15 but nonetheless dominated by conditional latent factor models.
Next, Table 2 compares models in terms of predictive R2. Whereas IPCA dominated in terms of total R2, its predictive R2 of 0.3% per month is nearly doubled by the predictive power of (deep) conditional autoencoders. CA1, CA2, and CA3 generate a predictive R2 of 0.53%, 0.58%, and 0.57%, respectively. All of the conditional models, including IPCA and CA0, dramatically outperform the static FF and PCA models, which generally fail to produce any out-of-sample predictability whatsoever.
15 Intuitively, dynamically reweighting portfolios to maintain roughly constant characteristic values reduces time variation in portfolio betas and
thus gives static factor models (including PCA) a better chance to fit the data.
Table 2
Out-of-sample R2pred (%) comparison.
Model   Test assets   K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
FF rt 0.08 0.08 <0 <0 <0 <0
xt 0.65 0.69 0.93 0.62 0.73 0.45
PCA rt <0 <0 <0 <0 <0 <0
xt <0 <0 <0 <0 <0 <0
IPCA rt 0.10 0.10 0.23 0.31 0.31 0.30
xt 0.49 0.53 0.88 0.76 0.76 0.68
CA0 rt 0.11 0.11 0.23 0.25 0.27 0.27
xt 0.63 0.66 0.89 0.79 0.84 0.77
CA1 rt 0.13 0.17 0.45 0.52 0.56 0.53
xt 0.60 0.70 0.80 0.85 1.17 0.83
CA2 rt 0.15 0.17 0.50 0.57 0.57 0.58
xt 0.70 0.66 0.95 1.20 1.06 1.17
CA3 rt 0.14 0.17 0.52 0.55 0.54 0.57
xt 0.69 0.63 1.10 0.97 0.85 1.12
Note: In this table, we report the out-of-sample predictive R2 (%) for individual stocks rt and managed portfolios xt using observable factor
models (FF), PCA, IPCA, and conditional autoencoders CA0 through CA3 . In all cases, the number of factors K varies from 1 to 6.
It is difficult to infer the economic contribution of a model from R2 alone. To assess model performance in economic
terms, we evaluate how return predictions from each model translate into Sharpe ratios for portfolios formed based on
those predictions.
For each model, we sort stocks into deciles based on the model’s out-of-sample return forecasts. We construct a zero-
net-investment portfolio that buys the highest expected return stocks (decile 10) and sells the lowest (decile 1). We
rebalance portfolios each month, and consider both equal-weighted and value-weighted portfolios.
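A minimal pandas sketch of these sorts is below (column names are hypothetical): each month, stocks are cut into deciles on the model's forecast, and the spread return is decile 10 minus decile 1, either equal- or value-weighted.

```python
import numpy as np
import pandas as pd

def decile_spread(df, weight_col=None):
    """Monthly 10-1 spread from columns 'month', 'forecast', 'ret' (plus optional weights)."""
    def one_month(g):
        g = g.assign(decile=pd.qcut(g["forecast"], 10, labels=False, duplicates="drop"))
        w = g[weight_col] if weight_col else pd.Series(1.0, index=g.index)
        top, bot = g[g["decile"] == g["decile"].max()], g[g["decile"] == 0]
        long = np.average(top["ret"], weights=w.loc[top.index])
        short = np.average(bot["ret"], weights=w.loc[bot.index])
        return long - short
    return df.groupby("month").apply(one_month)

def annualized_sharpe(monthly_returns):
    return np.sqrt(12) * monthly_returns.mean() / monthly_returns.std()
```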
Table 3 reports the annualized Sharpe ratios of these 10–1 spread portfolios over our 30-year out-of-sample period.
The results essentially recast the model comparison of predictive R2 in terms of economic magnitudes. The overall best
performing portfolio is that based on the conditional autoencoder with two hidden beta layers, CA2 . This model achieves a
Sharpe ratio of 2.63 for the equal-weighted portfolio, and 1.53 with value weights. The performance of CA1 and CA3 is only
slightly lower. Following the nonlinear conditional autoencoders, the best model is IPCA, which delivers Sharpe ratios of
2.25 and 0.96 with equal and value weights, respectively (and which outperforms the linear conditional autoencoder CA0 ).
Finally, corroborating the R2 results above, static linear models FF and PCA broadly exhibit poor out-of-sample portfolio
performance.
To evaluate the multi-factor mean–variance efficiency of each model, we report the ex ante unconditional tangency
portfolio Sharpe ratio among factor portfolios. We calculate out-of-sample factor returns following the same re-estimation
approach described earlier. The tangency portfolio return for a set of factors is constructed on a purely out-of-sample basis
by using the mean and covariance matrix of estimated factors through t and tracking the post-formation t + 1 return.
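A sketch of this construction in NumPy is below: for each month, tangency weights are computed from factor means and covariances through month t, then applied to the month t + 1 factor returns. The burn-in length and simulated data are arbitrary.

```python
import numpy as np

def oos_tangency_returns(F, burn_in=60):
    """F: T x K matrix of estimated factor returns.

    For each t >= burn_in, form tangency weights from months 0..t-1 and
    record the post-formation return at month t.
    """
    out = []
    for t in range(burn_in, F.shape[0]):
        mu = F[:t].mean(axis=0)
        Sigma = np.cov(F[:t], rowvar=False)
        w = np.linalg.solve(Sigma, mu)       # tangency weights (up to scale)
        out.append(F[t] @ w)
    return np.array(out)

def annualized_sharpe(monthly):
    return np.sqrt(12) * monthly.mean() / monthly.std()

# Example with simulated factor returns (360 months, 5 factors).
rng = np.random.default_rng(3)
F = 0.01 + 0.04 * rng.standard_normal((360, 5))
print(annualized_sharpe(oos_tangency_returns(F)))
```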
Table 3
Out-of-sample Sharpe ratios of long–short portfolios.
Equal-weight   K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
FF −0.66 −0.85 −0.40 −0.30 0.36 −0.21
PCA 0.28 0.09 0.13 −0.08 −0.12 0.15
IPCA 0.20 0.19 1.26 2.16 2.31 2.25
CA0 0.23 0.32 1.34 1.87 2.10 2.18
CA1 0.30 0.39 2.12 2.63 2.67 2.60
CA2 0.30 0.38 2.16 2.64 2.68 2.63
CA3 0.31 0.38 2.19 2.57 2.57 2.59
Value-weight   K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
FF −0.82 −1.13 −0.69 −0.60 0.18 −0.53
PCA 0.12 −0.18 0.05 −0.10 −0.30 −0.08
IPCA −0.15 −0.07 0.59 0.81 1.05 0.96
CA0 −0.11 −0.03 0.41 0.81 0.83 0.88
CA1 −0.03 0.11 0.91 1.30 1.48 1.40
CA2 −0.03 0.08 0.92 1.39 1.45 1.53
CA3 −0.02 0.08 1.09 1.41 1.34 1.51
Note: In this table, we report annualized out-of-sample Sharpe ratios for long–short portfolios using
Fama–French models (FF), a vanilla factor model (5), and a variety of conditional autoencoders, CA0 ,
CA1 , CA2 , CA3 , based on (9), respectively, where the number of factors in (5) or the number of neurons
in the hidden layer on the right-hand side of (9), K , varies from 1 to 6.
We report results in Table 4. All conditional factor specifications (IPCA and CA0 through CA3 ) produce high uncondi-
tional Sharpe ratio statistics, consistent with the findings of KPS. The most dominant overall model on this dimension is
CA3 with five factors, though performance is broadly similar for CA1 through CA3 . Static factor models perform markedly
worse than conditional models. Results in Table 4 reflect the fact that conditional models capture extensive comovement
among assets while, at the same time, reconciling their differences in average returns with their factor loadings. These
results should not be viewed as performance of implementable trading strategies. The factor tangency portfolios describe
the mean–variance efficiency of models without considering practical frictions such as trading costs. Instead, they should
be viewed as providing a non-implementable but nonetheless helpful quantitative comparison of models’ mean–variance
efficiency in economic terms.
An important implication emerges from a comparison of Table 2 versus the return prediction analysis of Gu et al.
(2019). In their paper, the best performing machine learning model forecasts monthly individual stock returns (in the
exact same data set as ours) with an R2 of 0.40%. Yet theirs are pure prediction models – there is no factor structure
or risk-return tradeoff – and thus they make no distinction between predictability coming through compensation for
risk exposure, and compensation from mispricing (i.e., alpha). In contrast, the nonlinear factor models in this paper
force all the characteristic-based predictability to come solely through factor risk exposures. That is, the conditional
autoencoder models are all specified without an intercept, thus they impose the economic restriction of no-arbitrage. Despite
this restriction, the conditional autoencoder model achieves nearly identical predictive power for monthly stock returns,
0.58% for the CA2 specification. This is a significant result—it suggests that stock characteristics predict returns not because
they capture ‘‘anomalous’’ compensation without risk, but rather because the characteristics proxy for (and help identify)
compensated factor risk exposures.
In this section, we directly test whether the zero-intercept no-arbitrage restriction is satisfied in the data. If it is, the
time series average of model residuals for each asset—that is, the pricing errors in our model—should be statistically
indistinguishable from zero. We focus this analysis on unconditional pricing errors, defined as:
α_i := E(u_{i,t}) = E(r_{i,t}) − E(β'_{i,t−1} f_t).
Alphas for the managed portfolios xt are defined analogously.
We focus our pricing error tests on xt , whose comparatively low dimensionality avoids inferential difficulties that arise
with rt .16 To construct estimates of the out-of-sample pricing error, we calculate the average difference between xt and
its out-of-sample model fit. These pricing errors can be interpreted as the average gain of a hedging portfolio that has a
zero-exposure on any systematic factors. In a no-arbitrage model, zero-exposure assets should earn zero excess return.
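The test can be sketched as follows (NumPy): average the out-of-sample residual of each managed portfolio and form a simple t-statistic (serial-correlation adjustments such as Newey–West are omitted here for brevity). Array names are illustrative.

```python
import numpy as np

def oos_alphas(x, x_fit):
    """x, x_fit: T x P arrays of managed-portfolio returns and their OOS model fits.

    Returns (alpha, t_stat), one entry per portfolio.
    """
    resid = x - x_fit                        # out-of-sample residuals
    T = resid.shape[0]
    alpha = resid.mean(axis=0)               # average pricing error per portfolio
    se = resid.std(axis=0, ddof=1) / np.sqrt(T)
    return alpha, alpha / se

# Toy example: count portfolios with |t| > 3.
rng = np.random.default_rng(4)
x = 0.002 + 0.02 * rng.standard_normal((360, 95))
x_fit = x - 0.001 - 0.002 * rng.standard_normal((360, 95))
alpha, tstat = oos_alphas(x, x_fit)
print((np.abs(tstat) > 3.0).sum(), "of", len(alpha), "alphas significant")
```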
Fig. 3 scatters the estimated out-of-sample pricing errors for each model against the average returns of xt . The figure
also reports the number of alphas whose t-statistics exceed 3.0. The overall magnitude of alphas shrinks as we move from
16 For example, stock-level idiosyncratic risk is so large that stock-level alpha estimates tend to be extremely noisy.
Table 4
Out-of-sample factor tangency portfolio Sharpe ratios.
K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
FF 0.51 0.41 0.53 0.71 0.71 0.82
PCA 0.35 0.23 0.25 0.38 0.48 0.55
IPCA 0.39 0.44 1.81 3.14 3.71 3.72
CA0 0.42 0.48 1.47 1.76 1.94 1.97
CA1 0.56 0.91 3.18 3.82 3.63 4.58
CA2 0.54 0.75 3.56 4.26 4.72 2.77
CA3 0.54 0.77 3.94 4.75 4.94 4.37
Fig. 3. Out-of-sample pricing errors across models. Note: The figure reports out-of-sample pricing errors (alphas) for 95 characteristic-managed
portfolios xt , relative to the Fama–French five-factor model (FF5), the static linear latent five-factor model (PCA), and conditional autoencoders (CA0
through CA3 ). Alphas with t-statistics in excess of 3.0 are shown in red dots, while insignificant alphas are shown in hollow squares.
Fig. 4. Top twenty characteristics by variable importance. Note: This figure compares variable importance for the top twenty most influential variables
in each model, based on an average over all training samples. The variable importances within each model are normalized to sum to one. All models
fix K = 5.
static linear models to nonlinear conditional autoencoders. For the five-factor Fama–French model, 37 of the 95 managed
portfolios have alpha t-statistics in excess of 3.0. For CA2 , that number drops to 8 out of 95. Furthermore, those that
remain significant are economically small (below 7 basis points per month) compared to alphas from the Fama–French
model.
Following Gu et al. (2019), we identify influential covariates by ranking them according to a notion of variable
importance, defined as the reduction in total R2 resulting from setting all values of a given characteristic to zero while
holding the remaining model estimates fixed. For this analysis, we focus on the five-factor specification of each model.
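The measure can be sketched as follows; `total_r2_of_model` is a hypothetical helper that rebuilds the managed portfolios from the (possibly zeroed-out) characteristics and evaluates Eq. (20) with the already-estimated model held fixed.

```python
def variable_importance(model, Z, r, char_names, total_r2_of_model):
    """Reduction in total R2 from zeroing out each characteristic, one at a time."""
    base = total_r2_of_model(model, Z, r)
    importance = {}
    for j, name in enumerate(char_names):
        Z_zero = Z.copy()
        Z_zero[..., j] = 0.0                 # set characteristic j to zero everywhere
        importance[name] = base - total_r2_of_model(model, Z_zero, r)
    total = sum(importance.values())
    return {k: v / total for k, v in importance.items()}   # normalize to sum to one
```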
Fig. 4 illustrates characteristic importance for each conditional autoencoder specification. It focuses on the top 20
characteristics for each model. Beyond these, variable importance hovers near zero (we show importance for the full list
of characteristics in Fig. 5). The total contribution by the top twenty characteristics is around 80% for CA0 , and 90% for
CA1 through CA3 .
Three categories of characteristics stand out as the most influential. The first is a price trend category, which
includes short-term reversal (mom1m), stock momentum (mom12m), momentum change (chmom), industry momentum
(indmom), recent maximum return (maxret), and long-term reversal (mom36m). The second category includes liquidity
variables, such as turnover and turnover volatility (turn, std_turn), log market equity (mvel1), dollar volume (dolvol),
Amihud illiquidity (ill), number of zero trading days (zerotrade), and bid–ask spread (baspread). Risk measures constitute
the third influential group, including total and idiosyncratic return volatility (retvol, idiovol), market beta (beta), and beta-
squared (betasq). Interestingly, all variants of the autoencoder model agree on the importance of these three categories.
Moreover, these results closely coincide with the findings of Gu et al. (2019), who track variable importance using
R2pred (there is no notion of R2total in their analysis because they focus solely on prediction and therefore do not consider
contemporaneous factor associations). This consistency of variable importance across different objectives is an indication
of robustness in our list of key variables, and an indication that these variables matter for understanding variation in
both expected returns and realized returns.
Fig. 5. Overall importance rankings of all characteristics. Note: This figure ranks 94 stock-level characteristics in terms of overall model contribution.
Characteristics are ordered based on the sum of their ranks over all models, with the most influential characteristics on top and least influential
on bottom. Columns correspond to individual models, and color gradients within each column indicate the most influential (dark blue) to least
influential (white) variables. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)
Table 5
Using subsamples of stocks split by odd or even permnos.
                   Total R2 (%)                        Predictive R2 (%)
Testing sample     Training: Odd     Training: Even    Training: Odd     Training: Even
Odd 13.7 13.6 0.48 0.49
Even 13.6 13.5 0.52 0.54
Testing sample Equal-weight SR Value-weight SR
Odd Even Odd Even
Odd 2.42 2.38 1.28 1.26
Even 2.52 2.53 1.29 1.19
We further look into the importance of characteristics for the beta and factor networks separately, in Fig. 6. To calculate the characteristic importance for the beta (resp. factor) network, we again set all values of a given characteristic in the beta (resp. factor) network to zero, without altering the values of this characteristic in the factor (resp. beta) network, and then measure the reduction in total R2. Interestingly, the relative importance of characteristics is consistent across the two networks.
Last but not least, we demonstrate the robustness with respect to the choice of assets in the training and testing
samples. In particular, we re-train the CA2 model using subsamples of stocks comprised of odd or even permnos,
respectively. We report the out-of-sample total R2 (%), predictive R2 (%), equal-weight and value-weight Sharpe ratios for
the subsamples in Table 5. Throughout, the CA2 model performs almost equally well, even when the assets used in the training and testing samples are completely non-overlapping.
4. Monte Carlo simulations

To demonstrate the finite sample performance of our autoencoder learning method, we simulate a conditional 3-factor model
model for excess returns rt , for t = 1, 2, . . . , T :
Fig. 6. Separate importance rankings of all characteristics. Note: These two plots rank 94 stock-level characteristics in terms of model contribution
to β (zi,t −1 ) and ft , respectively. Characteristics are ordered according to the same order in Fig. 5. Columns correspond to individual models, and color
gradients within each column indicate the most influential (dark blue) to least influential (white) variables. (For interpretation of the references to
color in this figure legend, the reader is referred to the web version of this article.)
Table 6
Comparison of total R2 (%)s and predictive R2 (%)s in simulations.
Model (a)      K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
Total R2
PCA 3.5 4.7 5.5 6.3 7.1 7.8
IPCA 18.6 32.2 40.7 41.0 41.4 41.7
CA0 15.6 26.7 33.7 33.5 33.4 33.2
CA1 17.6 30.3 38.1 37.7 37.3 37.1
CA2 17.7 29.2 36.8 36.5 36.3 35.9
CA3 17.6 25.6 30.0 29.5 26.3 23.4
Pred. R2
PCA 0.17 0.10 0.04 0.01 −0.01 −0.03
IPCA 2.20 2.93 3.33 3.32 3.32 3.32
CA0 2.04 2.84 3.17 3.14 3.12 3.13
CA1 2.11 2.93 3.27 3.29 3.26 3.26
CA2 2.10 2.85 3.22 3.22 3.23 3.22
CA3 2.06 2.57 2.89 2.86 2.58 2.39
Model (b)      K = 1   K = 2   K = 3   K = 4   K = 5   K = 6
Total R2
PCA 3.4 5.1 6.0 6.6 7.3 7.9
IPCA 11.0 11.4 11.9 12.3 12.7 13.1
CA0 8.5 8.2 7.9 7.6 7.4 7.2
CA1 15.0 24.6 31.8 32.0 31.9 31.8
CA2 15.7 23.5 30.9 31.8 30.2 28.2
CA3 15.9 15.6 14.6 14.0 11.2 9.2
Pred. R2
PCA 0.15 0.19 0.15 0.12 0.10 0.09
IPCA 0.84 0.82 0.81 0.80 0.79 0.79
CA0 0.80 0.76 0.77 0.76 0.72 0.70
CA1 1.83 2.31 2.70 2.70 2.71 2.73
CA2 1.95 2.24 2.73 2.80 2.69 2.53
CA3 1.77 1.43 1.32 1.26 1.06 0.86
Note: In this table, we report the average out-of-sample (OOS) Total R2 (%)s and Predictive
R2 (%)s for models (a) and (b) using PCA, IPCA, CA0 , CA1 , CA2 and CA3 , respectively. We
fix N = 200, T = 180, and Pc = Px = 50. The number of Monte Carlo repetitions is 100.
Throughout, we fix N = 200, T = 180, and Pc = Px = 50. In both cases, g⋆(·) depends on only 3 covariates, so there are only 3 non-zero entries in θ, denoted θ0. Case (a) is a simple, sparse, linear model. Case (b) involves a nonlinear covariate c²_{i1,t}, an interaction term (c_{i1,t} × c_{i2,t}), and a dummy variable sgn(c_{i3,t}). We calibrate the values of θ0 such that the total R2 is around 40%, and the predictive R2 is 5%.
For each Monte Carlo sample, we divide the whole time series into 3 consecutive subsamples of equal length for
training, validation, and testing, respectively. For Vanilla PCA and IPCA, we combine training and validation samples
together because they do not have any tuning parameter. For CA0 , CA1 , CA2 and CA3 we estimate them in the training
sample, then choose tuning parameters for each method in the validation sample, and calculate the prediction errors in
the testing sample.
We report the average OOS total and predictive R2 s for each method over 100 Monte Carlo repetitions in Table 6.
For model (a), IPCA delivers the best OOS total and predictive R2 s. This is not surprising given that the true model is
sparse and linear in the input covariates. More advanced methods such as CA1 , CA2 and CA3 tend to overfit, so their
performance is slightly worse. By contrast, for model (b), these methods clearly beat IPCA, because the latter cannot
capture the nonlinearity in the model. The vanilla PCA method always overfits the training sample, so it cannot achieve good OOS R2s. The comparison among autoencoder models demonstrates a stark trade-off between model flexibility and implementation difficulty. As shown in the table, shallower conditional autoencoders tend to outperform in our simulation setting, which is consistent with our findings in the empirical analysis.
Overall, the simulation results suggest that the conditional autoencoder methods are successful in learning the factor
structure in both linear and nonlinear situations. This is not surprising, as these methods are implemented to improve
fitting and prediction by allowing more complex functional forms of conditional factor loadings.
5. Conclusion
We propose a new approach to latent factor modeling for asset pricing that draws on autoencoder methods from the
machine learning literature. We adapt the standard autoencoder to allow latent factors and factor exposures to depend
on asset characteristic conditioning variables. The result is a nonlinear conditional asset pricing model that embeds the
economic restriction of no-arbitrage within a broader neural network framework.
In the empirical context of monthly US stock returns, our conditional autoencoder model dominates competing asset
pricing models, including Fama–French models, PCA methods, and linear conditioning methods such as IPCA. A long–short decile spread portfolio sorted on stock return predictions from our preferred autoencoder produces an annualized value-weighted Sharpe ratio of 1.53, beating the next closest competitor (IPCA, with Sharpe ratio 0.96) by a wide margin, and on a purely out-of-sample basis. Finally, the pricing errors in our model (likewise measured on an out-of-sample basis)
on a purely out-of-sample basis. Finally, the pricing errors in our model (likewise measured on an out-of-sample basis)
are a fraction of the magnitude of those from traditional Fama–French factor models.
A.1. Notation
We use (A : B) to denote the column concatenation of two matrices A and B. The vector e_i has a value of 1 in the ith position and 0 elsewhere (and its dimension implicitly conforms to the context). Likewise, the vector ι denotes a conformable vector with all entries equal to 1. For any time series of vectors {a_t}, t = 1, . . . , T, we denote ā = (1/T) ∑_{t=1}^{T} a_t. In addition, we write ā_t = a_t − ā. A denotes the matrix (a_1 : a_2 : . . . : a_T), and Ā = A − āι' correspondingly. We use λ_j(A) to denote the jth largest eigenvalue of A, and σ_j(A) the jth largest singular value. We use ∥A∥ and ∥A∥_F to denote the operator norm (or L2 norm) and the Frobenius norm of a matrix A = (a_ij), that is, √(λ_1(A'A)) and √(Tr(A'A)), respectively.
A.2. Proofs
Proof of Proposition 1. First, we set the partial derivative with respect to b^(1) to zero:

∂/∂b^(1) ∥ R − (b^(1) ι' + W^(1)(b^(0) ι' + W^(0) R)) ∥_F² = 0
  ⟹ b̂^(1) = (1/T)(Rι − T W^(1) b^(0) − W^(1) W^(0) Rι).
Then we insert the solution into (8):

min_{b,W} ∥ R − (b^(1) ι' + W^(1)(b^(0) ι' + W^(0) R)) ∥_F²
  = min_W ∥ (R − (1/T) R ιι') − W^(1) W^(0) (R − (1/T) R ιι') ∥_F²
  = min_W ∥ R̄ − W^(1) W^(0) R̄ ∥_F²,
where R̄ is a matrix of demeaned returns. Thus, the problem becomes independent of the bias terms. We focus on the
weights W (0) and W (1) .
Next, we set the partial derivative with respect to $W^{(0)}$ to zero, assuming that $\bar R\bar R'$ and $W^{(1)\prime}W^{(1)}$ both have full rank:
$$\frac{\partial}{\partial W^{(0)}}\left\|\bar R - W^{(1)}W^{(0)}\bar R\right\|_F^2 = 0
\quad\Longrightarrow\quad
\hat W^{(0)} = \big(W^{(1)\prime}W^{(1)}\big)^{-1}W^{(1)\prime},$$
so that the objective reduces to $\min_{W^{(1)}}\big\|\bar R - W^{(1)}\big(W^{(1)\prime}W^{(1)}\big)^{-1}W^{(1)\prime}\bar R\big\|_F^2$. The Eckart–Young–Mirsky theorem for the Frobenius norm states that the best rank-$K$ approximation of $\bar R$ is $\sum_{i=1}^K s_i\hat p_i\hat q_i' = \hat P\Lambda\hat Q'$, where $\hat P$ and $\hat Q$ collect the first $K$ left and right singular vectors of $\bar R$ and $\Lambda$ the corresponding singular values. Therefore, $W^{(1)}$ is a solution if and only if it satisfies
$$W^{(1)}\big(W^{(1)\prime}W^{(1)}\big)^{-1}W^{(1)\prime}\bar R = \hat P\Lambda\hat Q'.$$
It is obvious that $W^{(1)} = \hat P$ is one solution to the above equation, because $\hat P'\hat P = I_K$ and hence $\hat P\hat P'\bar R = \hat P\Lambda\hat Q'$. □
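As a numerical illustration of this result (our own sketch, not part of the original proof), one can verify that the linear-autoencoder projection implied by $W^{(1)} = \hat P$ attains the Eckart–Young–Mirsky bound, while a generic rank-$K$ weight matrix cannot do better:

```python
# Numerical illustration (our own sketch): with linear activations, the optimal
# autoencoder projection equals the PCA projection onto the top-K left singular
# vectors of the demeaned return matrix R_bar.
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 20, 200, 3
R = rng.standard_normal((N, K)) @ rng.standard_normal((K, T)) \
    + 0.1 * rng.standard_normal((N, T))
R_bar = R - R.mean(axis=1, keepdims=True)             # demeaned returns

# Best rank-K approximation via SVD (Eckart-Young-Mirsky).
U, s, Vt = np.linalg.svd(R_bar, full_matrices=False)
P_hat = U[:, :K]
pca_resid = np.sqrt((s[K:] ** 2).sum())               # optimal Frobenius residual

# Autoencoder with W1 = P_hat and W0 = (W1'W1)^{-1} W1' = P_hat' attains the bound ...
W1, W0 = P_hat, P_hat.T
ae_resid = np.linalg.norm(R_bar - W1 @ W0 @ R_bar, "fro")
print(np.isclose(ae_resid, pca_resid))                # True

# ... while a generic rank-K weight matrix cannot do better.
W1_rand = rng.standard_normal((N, K))
proj_rand = W1_rand @ np.linalg.solve(W1_rand.T @ W1_rand, W1_rand.T)
print(np.linalg.norm(R_bar - proj_rand @ R_bar, "fro") >= pca_resid)  # True
```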
Proof of Proposition 2. First, we consider the objective function of IPCA when $Z_{t-1}'Z_{t-1} = \Sigma$. The first-order condition for $F$ is given by
$$\hat f_t = \big(\Gamma\Sigma\Gamma'\big)^{-1}\Gamma Z_{t-1}'r_t.$$
We then plug $\hat f_t$ into the objective function of IPCA (17):
$$\min_{\Gamma,F}\sum_{t=1}^T\big\|r_t - Z_{t-1}\Gamma'f_t\big\|^2
= \min_{\Gamma}\sum_{t=1}^T\big\|r_t - Z_{t-1}\Gamma'\big(\Gamma\Sigma\Gamma'\big)^{-1}\Gamma Z_{t-1}'r_t\big\|^2. \tag{A.2}$$
For the two-sided Autoencoder model, we use the managed portfolios $x_t = \big(Z_{t-1}'Z_{t-1}\big)^{-1}Z_{t-1}'r_t$ as the inputs on the right-hand side. We can rewrite the objective function (18) as
$$\min_{W_0,W_1}\sum_{t=1}^T\big\|r_t - Z_{t-1}W_0'W_1x_t\big\|^2
= \min_{W_0,W_1}\sum_{t=1}^T\big\|r_t - \big(x_t'\otimes Z_{t-1}W_0'\big)\operatorname{vec}(W_1)\big\|^2. \tag{A.3}$$
Plugging in $Z_{t-1}'Z_{t-1} = \Sigma$, the minimizer over $W_1$ satisfies
$$\begin{aligned}
\operatorname{vec}(\hat W_1) &= \left(\sum_{t=1}^T x_tx_t'\otimes W_0\Sigma W_0'\right)^{-1}\left(\sum_{t=1}^T x_t\otimes W_0Z_{t-1}'r_t\right)\\
&= \left(\Big(\sum_{t=1}^T x_tx_t'\Big)^{-1}\otimes\big(W_0\Sigma W_0'\big)^{-1}\right)\left(\sum_{t=1}^T x_t\otimes W_0Z_{t-1}'r_t\right)\\
&= \sum_{t=1}^T\left(\Big(\sum_{s=1}^T x_sx_s'\Big)^{-1}x_t\otimes\big(W_0\Sigma W_0'\big)^{-1}W_0Z_{t-1}'r_t\right).
\end{aligned}$$
Since $Z_{t-1}'r_t = \Sigma x_t$ by the definition of $x_t$, this simplifies to
$$\hat W_1 = \big(W_0\Sigma W_0'\big)^{-1}W_0\Sigma.$$
We can now plug the solution for $W_1$ into the objective function of the two-sided Autoencoder (18):
$$\min_{W_0,W_1}\sum_{t=1}^T\big\|r_t - Z_{t-1}W_0'W_1x_t\big\|^2
= \min_{W_0}\sum_{t=1}^T\big\|r_t - Z_{t-1}W_0'\big(W_0\Sigma W_0'\big)^{-1}W_0\Sigma x_t\big\|^2. \tag{A.4}$$
Since $\Sigma x_t = Z_{t-1}'r_t$, we see that IPCA and the two-sided Autoencoder have the same objective functions, (A.2) and (A.4). The Autoencoder solution and the IPCA solution are identical with $\Gamma = W_0$. They give identical factor estimates $\hat f_t$ and factor loading estimates $\hat\beta_{i,t-1}$. □
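A small numerical check of the closed form for $\hat W_1$ (our own illustration, using a synthetic panel constructed so that $Z_{t-1}'Z_{t-1}=\Sigma$ holds exactly in every period) is sketched below:

```python
# Numerical check (our own sketch): for fixed W0 and Z_{t-1}'Z_{t-1} = Sigma,
# the least-squares minimizer over W1 equals (W0 Sigma W0')^{-1} W0 Sigma.
import numpy as np

rng = np.random.default_rng(1)
N, P, K, T = 30, 6, 2, 50
Sigma = np.eye(P) + 0.3 * np.ones((P, P))
L = np.linalg.cholesky(Sigma)
W0 = rng.standard_normal((K, P))

# Panels with Z_{t-1}'Z_{t-1} = Sigma for every t: Z_{t-1} = Q_t L' with Q_t'Q_t = I.
Z = [np.linalg.qr(rng.standard_normal((N, P)))[0] @ L.T for _ in range(T)]
r = [rng.standard_normal(N) for _ in range(T)]
x = [np.linalg.solve(Sigma, Zt.T @ rt) for Zt, rt in zip(Z, r)]   # managed portfolios

# Closed form derived above.
W1_closed = np.linalg.solve(W0 @ Sigma @ W0.T, W0 @ Sigma)

# Numerical minimizer: stack r_t on (x_t' kron Z_{t-1} W0') and solve for vec(W1).
X = np.vstack([np.kron(xt[None, :], Zt @ W0.T) for Zt, xt in zip(Z, x)])
y = np.concatenate(r)
vec_w1, *_ = np.linalg.lstsq(X, y, rcond=None)
W1_lstsq = vec_w1.reshape(P, K).T        # undo column-stacking vec()

print(np.allclose(W1_closed, W1_lstsq))  # True
```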
Appendix B. Algorithms
References
Ahn, Seung C., Horenstein, Alex R., 2013. Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
Aït-Sahalia, Yacine, Xiu, Dacheng, 2017. Using principal component analysis to estimate a high dimensional factor model with high-frequency data.
J. Econometrics 201, 388–399.
Alessi, Lucia, Barigozzi, Matteo, Capasso, Marco, 2010. Improved penalization for determining the number of factors in approximate factor models.
Statist. Probab. Lett. 80, 1806–1813.
Amengual, Dante, Watson, Mark W., 2007. Consistent estimation of the number of dynamic factors in a large N and T panel. J. Bus. Econom. Statist.
25, 91–96.
Bai, Jushan, 2003. Inferential theory for factor models of large dimensions. Econometrica 71 (1), 135–171.
Bai, Jushan, Ng, Serena, 2002. Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Bai, Jushan, Ng, Serena, 2017. Principal Components and Regularized Estimation of Factor Models. Technical Report, Columbia University.
Baldi, Pierre, Hornik, Kurt, 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Netw. 2
(1), 53–58.
Bansal, Ravi, Yaron, Amir, 2004. Risks for the long run: A potential resolution of asset pricing puzzles. J. Finance 59 (4), 1481–1509.
Bourlard, Hervé, Kamp, Yves, 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59 (4–5), 291–294.
Campbell, John Y., Cochrane, John H., 1999. By force of habit: A consumption-based explanation of aggregate stock market behavior. J. Polit. Econ.
107 (2), 205–251.
Chamberlain, Gary, Rothschild, Michael, 1983. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica 51,
1281–1304.
Connor, Gregory, Hagmann, Matthias, Linton, Oliver, 2012. Efficient semiparametric estimation of the Fama–French model and extensions.
Econometrica 80 (2), 713–754.
Connor, Gregory, Korajczyk, Robert A., 1986. Performance measurement with the arbitrage pricing theory: A new framework for analysis. J. Financ.
Econ. 15 (3), 373–394.
Fama, Eugene F., French, Kenneth R., 1993. Common risk factors in the returns on stocks and bonds. J. Financ. Econ. 33 (1), 3–56.
Fan, Jianqing, Liao, Yuan, Wang, Weichen, 2016. Projected principal component analysis in factor models. Ann. Stat. 44 (1), 219.
Feng, Guanhao, Giglio, Stefano, Xiu, Dacheng, 2019a. Taming the factor zoo: A test of new factors. J. Finance (forthcoming).
Feng, Guanhao, Polson, Nicholas G., Xu, Jianeng, 2019b. Deep Learning in Asset Pricing. Technical Report, University of Chicago.
Freyberger, Joachim, Neuhierl, Andreas, Weber, Michael, 2017. Dissecting Characteristics Nonparametrically. Technical Report, University of
Wisconsin-Madison.
Gagliardini, Patrick, Ossola, Elisa, Scaillet, Olivier, 2016. Time-varying risk premium in large cross-sectional equity datasets. Econometrica 84 (3),
985–1046.
Gallinari, Patrick, LeCun, Yann, Thiria, Sylvie, Fogelman-Soulie, Francoise, 1987. Memoires associatives distribuees. Proc. COGNITIVA 87, 93.
Giglio, Stefano W., Xiu, Dacheng, 2018. Asset Pricing with Omitted Factors. Technical Report, University of Chicago.
Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, 2016. Deep Learning. MIT Press, http://www.deeplearningbook.org.
Gu, Shihao, Kelly, Bryan, Xiu, Dacheng, 2019. Empirical asset pricing via machine learning. Rev. Financ. Stud. 33 (5), 2223–2273.
Hallin, Marc, Liška, Roman, 2007. Determining the number of factors in the general dynamic factor model. J. Amer. Statist. Assoc. 102 (478), 603–617.
He, Zhiguo, Krishnamurthy, Arvind, 2013. Intermediary asset pricing. Amer. Econ. Rev. 103 (2), 732–770.
Hinton, Geoffrey E., Salakhutdinov, Ruslan R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504–507.
Hinton, Geoffrey E., Zemel, Richard S., 1994. Autoencoders, minimum description length and Helmholtz free energy. In: Advances in Neural Information
Processing Systems. pp. 3–10.
Ioffe, Sergey, Szegedy, Christian, 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Int. Conf. Mach.
Learn. 448–456.
Kapetanios, George, 2010. A testing procedure for determining the number of factors in approximate factor models. J. Bus. Econom. Statist. 28,
397–409.
Kelly, Bryan, Pruitt, Seth, 2015. The three-pass regression filter: A new approach to forecasting using many predictors. J. Econometrics 186 (2),
294–316.
Kelly, Bryan, Pruitt, Seth, Su, Yinan, 2019. Characteristics are covariances: A unified model of risk and return. J. Financ. Econ. (forthcoming).
Kingma, Diederik, Ba, Jimmy, 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kozak, Serhiy, 2019. Kernel trick for the cross section. SSRN Working Paper.
Kozak, Serhiy, Nagel, Stefan, Santosh, Shrihari, 2017. Shrinking the Cross Section. Technical Report, University of Michigan.
Kozak, Serhiy, Nagel, Stefan, Santosh, Shrihari, 2018. Interpreting factor models. J. Finance 73 (3), 1183–1223.
Moon, Hyungsik Roger, Weidner, Martin, 2018. Nuclear Norm Regularized Estimation of Panel Regression Models. Technical Report, University of
Southern California.
Onatski, Alexei, 2010. Determining the number of factors from empirical distribution of eigenvalues. Rev. Econ. Stat. 92, 1004–1016.
Pohl, Walter, Schmedders, Karl, Wilms, Ole, 2018. Higher order effects in asset pricing models with long-run risks. J. Finance 73 (3), 1061–1111.
Ross, Stephen A., 1976. The arbitrage theory of capital asset pricing. J. Econ. Theory 13 (3), 341–360.
Santos, Tano, Veronesi, Pietro, 2004. Conditional Betas. Technical Report, National Bureau of Economic Research.
Stock, James H., Watson, Mark W., 2002. Macroeconomic forecasting using diffusion indexes. J. Bus. Econom. Statist. 20 (2), 147–162.