Factor Models, Machine Learning, and Asset Pricing

1 Yale School of Management, Yale University, New Haven, Connecticut, USA; email: [email protected], [email protected]
2 AQR Capital Management, Greenwich, Connecticut, USA
3 Booth School of Business, University of Chicago, Chicago, Illinois, USA; email: [email protected]
1. INTRODUCTION
Factor models are natural workhorses for modeling equity returns because they offer a parsimo-
nious statistical description of the returns’ cross-sectional dependence structure. Return factor
models evolved from early asset pricing theories, most notably the capital asset pricing model
(CAPM) of Sharpe (1964) and the intertemporal CAPM (ICAPM) of Merton (1973). These and
other seminal factor models used observable financial and macroeconomic variables as risk factors
motivated by economic theory.
The arbitrage pricing theory (APT) of Ross (1976) later provided a rigorous economic link
between the factor structure in returns and risk premia through no-arbitrage conditions. One
important innovation in the APT was the ability to speak directly to foundational economic con-
cepts, such as risk exposures and risk premia, without requiring a specific identity or economic
interpretation for the factors. The APT’s focus on a common factor structure (that could be rep-
resented by any type of factors, whether observable or unobservable, traded or nontraded) spurred
a line of inquiry lending itself to primarily statistically oriented models of returns. In light of this,
factor models have become the single most widely adopted empirical research paradigm for aca-
demics and practitioners alike. In particular, the APT opens the door to latent factor models for
returns. Being intimately tied to unsupervised and semisupervised machine learning, the APT and
latent factor models can be viewed as catalysts for the revolution of machine learning methods in
empirical asset pricing.
Linking risk premia and the (observable or latent) factor structure requires first providing a
measurement of those quantities. Measurement issues are notoriously difficult for expected re-
turns because market efficiency forces return variation to be dominated by unforecastable news.
In addition, the sample size of equity returns is small relative to the predictor count. Structural
breaks, regime switches, and nonstationarity in general further diminish the effective sample size.
Furthermore, the collection of candidate conditioning variables is large, and such variables are
often close cousins and highly correlated. Further still, complicating the problem is ambiguity
regarding functional forms through which the high-dimensional predictor set enters into the ex-
pected returns. All these issues result in a low signal-to-noise ratio environment that affects the
measurement of risk premia and is in stark contrast to prediction problems in computer science
and other domains.
Certain aspects of the machine learning paradigm, such as variable selection and dimension
reduction, have been part of empirical asset pricing since the very beginning of this research field.
In the early days, economic theories and parsimonious model specifications were adopted to
regularize learning problems in financial markets. Indeed, we have become accustomed to sorting
stocks by their characteristics, forming equal or value-weighted portfolios, and selecting a small
number of portfolios as factor proxies. These choices have been made, either explicitly or implic-
itly, to cope with nonlinearity, low signal-to-noise ratios, and the curse of dimensionality, which
are difficult realities when studying asset returns.
Recent decades have seen the rapid growth of exploratory and predictive techniques proposed
by the statistics and machine learning communities. These tools complement economic theory to
provide a data-driven solution to the empirical challenges of asset pricing. Embracing these tools
enables economists to make rigorous, robust, and powerful empirical discoveries for which
economic theory alone may not be a sufficient guide. Conversely, these new discoveries can offer
new insights from data that in turn lead to improved economic theories.
Our objectives in this article are twofold. First, we survey recent methodological contributions
in empirical asset pricing. We categorize these methodologies based on their primary purposes,
which range from estimating expected returns, factors and assets’ factor exposures, risk premia, and
stochastic discount factors (SDFs) to comparing asset pricing models and testing alphas. Second,
Throughout, we use $\lambda_i(A)$ to denote the $i$th largest eigenvalue of a matrix $A$ and $\sigma_i(A)$ to denote its $i$th singular value. We use $\|A\|_1$, $\|A\|_\infty$, $\|A\|$, and $\|A\|_F$ to denote the $L_1$ norm, the $L_\infty$ norm, the operator norm (or $L_2$ norm), and the Frobenius norm of a matrix $A = (a_{ij})$, that is, $\max_j \sum_i |a_{ij}|$, $\max_i \sum_j |a_{ij}|$, $\sqrt{\lambda_{\max}(A'A)}$, and $\sqrt{\operatorname{Tr}(A'A)}$, respectively. We also use $\|A\|_{\mathrm{MAX}} = \max_{i,j}|a_{ij}|$ to denote the $L_\infty$ norm of $A$ on the vector space. Finally, we use $\operatorname{Diag}(A)$ to denote the diagonal matrix of $A$ and $A_{[I]}$ a submatrix of $A$ whose rows are indexed in $I$.
2. MODEL SPECIFICATIONS
We start by introducing a static factor model, which serves as a benchmark throughout the article.
Yet it is important that models are capable of describing the behavior of individual
assets, not just sorted portfolios, to more thoroughly understand the full range of heterogeneity
in asset markets. Once we begin considering individual assets, conditional model formulations
become critical.1 For example, risk exposures of individual stocks very likely change over time
as firms evolve. In addition, assets with fixed maturities and nonlinear payoff structures (e.g., bonds
and options) experience mechanical variation in their risk exposures as their maturity rolls down
or the value of the underlying asset changes (Kelly, Palhares & Pruitt 2021; Büchner & Kelly
2022). In this case, a factor model should accommodate time-varying conditional risk exposures.
In its general form, the conditional factor model is
$$\tilde r_t = \alpha_{t-1} + \beta_{t-1}\gamma_{t-1} + \beta_{t-1} v_t + \tilde u_t, \qquad (4)$$
where $\tilde r_t$ and $\tilde u_t$ are $M \times 1$ vectors of excess returns and idiosyncratic errors of individual assets, respectively. In this equation, $\beta_{t-1}\gamma_{t-1}$ is the conditional risk premium earned through exposure to the common risk factor $v_t$, as assets earn conditional compensation of $\gamma_{t-1}$ per unit of conditional beta on factors $v_t$. The term $\alpha_{t-1}$ includes any excess compensation an asset earns that is not associated with factor exposure.
Obviously, the right-hand side of Equation 4 contains too many degrees of freedom, and the
model cannot be identified without additional restrictions. One example of additional restric-
tions is provided by the model of Rosenberg (1974), which imposes that $\beta_{t-1} = b_{t-1}\beta$, where $b_{t-1}$ is an $M \times N$ matrix of observable characteristics, and $\beta$ is an $N \times K$ matrix of parameters. In this
case, the general form of Equation 4 becomes
$$\tilde r_t = b_{t-1}\tilde f_t + \tilde\varepsilon_t, \qquad (5)$$
where $\tilde f_t := \beta(\gamma_{t-1} + v_t)$ is a new $N \times 1$ vector of latent factors, and $\tilde\varepsilon_t := \alpha_{t-1} + \tilde u_t$.2 This is
the MSCI Barra model prototype that has been embraced by practitioners for its simplicity and
versatility in modeling individual equity returns.
Barra’s model includes several dozen characteristics and industry variables in $b_{t-1}$. Their ad
hoc selection procedure is opaque, and evidence suggests it is heavily overparameterized. When
1 Even for static models, Ang, Liu & Schwarz (2020) discuss the benefits of using individual stocks rather than
portfolios when analyzing factor models. In early papers, researchers wrestled with the technical challenges
of dealing with large cross sections of test assets. Sections 3 and 4 discuss how modern methodologies exploit
large cross sections to develop tractable factor model estimators with attractive statistical properties.
2 This model also allows for additional approximation error, if any, of $\beta_{t-1}$ using $b_{t-1}\beta$ because such error can be absorbed into $\tilde\varepsilon_t$ as well.
such as $\beta$, $N$, and $K$, for both models). Moreover, $(b_{t-1}'b_{t-1})^{-1}b_{t-1}'$ can be interpreted as portfolio weights for characteristic-sorted portfolio returns, $(b_{t-1}'b_{t-1})^{-1}b_{t-1}'\tilde r_t$. This derivation is consistent
with the convention of estimating (static) asset pricing models using characteristic-sorted port-
folios as test assets: The intuition is that to the extent that characteristics drive risk exposures,
sorting by those characteristics removes the time variation in exposure for the sorted portfolios.
Therefore, the static portfolio representation of Equation 1 can be applied directly to portfo-
lios appropriately sorted by relevant characteristics; alternatively, individual stocks can be used
as test assets, using IPCA to explicitly account for the time-varying risk loadings related to their
characteristics.
In general, the risk premia associated with factors, $\gamma_{t-1} := E_{t-1}(f_t)$, could also be time varying, but the time series path of risk premia $\{\gamma_{t-1}\}$ is not identifiable without additional restrictions. Only $E(\gamma_{t-1})$ can be identified. To recover the path of risk premia, Gagliardini, Ossola & Scaillet (2016) employ a parametric model of risk premia, as suggested by Harvey & Ferson (1999), $\gamma_{t-1} = z_{t-1}\theta$, where $z$ includes macro time series such as the term spread and $\theta$ is an unknown
parameter. Combined with the assumption that factor loadings are linear functions of observed
characteristics and macro time series, one can rewrite the dynamics of individual stock returns as
$$\tilde r_{i,t} = x_{i,t}\tilde\beta_i + \tilde\varepsilon_{i,t}, \qquad (8)$$
where $\{x_{i,t}\}$ are multidimensional regressors that depend on observable factors, macro variables, and firm characteristics, and $\{\tilde\beta_i\}$ contain (functions of) unknown parameters.
In essence, IPCA and related models employ a linear approximation for risk exposures based on
observable characteristics data. But there are no obvious theoretical or intuitive justifications for
the linearity assumption beyond tractability. To the contrary, there are many reasons to expect that
this assumption is violated. Essentially all leading theoretical asset pricing models predict nonlin-
earities in return dynamics as a function of state variables; Campbell & Cochrane (1999), Bansal
& Yaron (2004), Santos & Veronesi (2004), and He & Krishnamurthy (2013) provide prominent
examples.
To overcome this limitation, Connor, Hagmann & Linton (2012) and Fan, Liao & Wang (2016)
replace the assumption that factor betas are linear in characteristics with an assumption that factor
betas are nonparametric functions of characteristics (although these characteristics are assumed
to not vary over time for theoretical tractability). Kim, Korajczyk & Neuhierl (2021) adopt this
framework to construct arbitrage portfolios.
Gu, Kelly & Xiu (2021) extend the Barra and IPCA models to a nonlinear setting using a condi-
tional autoencoder model, augmented with additional explanatory variables. This model replaces
ling such complexities, though details are beyond the scope of this review. We refer interested
readers to Aït-Sahalia, Jacod & Xiu (2021).
3. METHODOLOGIES
The conventional methodologies for statistical inference of asset pricing models are designed for
low-dimensional settings, e.g., 25 test assets with a handful of factors over tens of years. Recently,
the set of explanatory variables (potentially) associated with equity returns has expanded rapidly
(e.g., Harvey, Liu & Zhu 2016), and researchers have begun using individual securities as test
assets (e.g., Kelly, Pruitt & Su 2019). With the transition to large-scale sets of factors and test
assets, high-dimensional statistical methods are increasingly relevant for empirical asset pricing
analysis. Our review covers classical methods but places particular emphasis on statistical method-
ologies designed to cope with a high-dimensional setting. We begin in Section 3.1 by discussing
machine learning methods to measure conditional expected returns without imposing restrictions
of factor pricing models. In Sections 3.2–3.5, we discuss various facets of factor model specifica-
tion, estimation, and evaluation. Then in Section 3.6 we focus on the divergence between expected
returns and factor exposures to discuss alpha tests.
degrees of freedom and condensing redundant variation among predictors. A first wave of high-
dimensional models used linear methods such as partial least squares (e.g., Kelly & Pruitt 2013,
Rapach et al. 2013) and lasso (Chinco, Clark-Joseph & Ye 2019; Freyberger, Neuhierl & Weber
2020).
More recently, Gu, Kelly & Xiu (2020) conduct a wide-ranging analysis of machine learning
methods for return prediction, considering not only regularized linear methods but also more
cutting-edge nonlinear methods including random forest, boosted regression trees, and deep
learning. Their research illustrates the substantial gains of incorporating machine learning when
estimating expected returns. This translates into improvements in out-of-sample predictive R2
as well as large gains for investment strategies that leverage machine learning predictions. The
empirical analysis also identifies the most informative predictor variables, which helps facilitate
deeper investigation into economic mechanisms of asset pricing.
Machine learning also makes it possible to improve expected return estimates using predic-
tive information in complex and unstructured data sets. For example, Ke, Kelly & Xiu (2019)
propose a new supervised topic model for constructing return predictions from raw news text
and demonstrate its prowess for out-of-sample forecasting. Jiang, Kelly & Xiu (2021) and Obaid
& Pukthuanthong (2022) demonstrate how to tap return predictive information in image data
using machine learning models from the computer vision literature. Both text and image data
confer particularly strong return forecasting gains at short horizons of days and weeks and are
likely underpinned by comparatively fast-moving market sentiments, rather than fundamental in-
formation that arguably plays a dominant role at forecast horizons of quarters or years. Indeed,
sentiment and related behavioral economic driving forces are becoming a core aspect of finan-
cial markets research. These are subtle phenomena with circuitous transmission and feedback
effects. As such, they are fertile ground for machine learning methods, which offer an ability
to capture approximate complex nonlinear associations by exploiting rich and unwieldy data
sets.
In general, the return prediction literature delves little into understanding the economic
mechanisms (such as risk-return trade-offs, market frictions, or behavioral biases) that may be
responsible for observed predictability. Distinguishing, for example, between risk premia and
mispricing requires a more structured modeling approach, and factor models are the dominant
tool researchers have used in this pursuit.
3 In addition to least squares regression, the literature often sorts assets into portfolios on the basis of characteristics and studies portfolio averages—a form of nonparametric regression.
If asset returns are assumed to be time constant functions of static, asset-level characteristics, as
in Gagliardini, Ossola & Scaillet (2016) and Equation 8, then asset-by-asset TSRs yield estimates
$$\hat F_{\mathrm{CSR}} := (\beta'\beta)^{-1}\beta' R. \qquad (10)$$
This approach is most commonly used for individual stocks, whose loadings can be proxied by firm characteristics. The CSR conveniently accommodates time-varying characteristics, as in Equation 5, in which case we can rewrite Equation 10 accordingly, for each $t$, as
$$\tilde f_t = (b_{t-1}'b_{t-1})^{-1} b_{t-1}'\,\tilde r_t. \qquad (11)$$
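A minimal numerical sketch of Equation 11 (the function and variable names are ours, purely illustrative):

```python
import numpy as np

def csr_factor(returns_t, chars_tm1):
    """Cross-sectional regression factor estimate (Equation 11): regress
    period-t excess returns on lagged characteristics b_{t-1}.

    returns_t: (M,) vector of excess returns; chars_tm1: (M, N) matrix of
    lagged characteristics. Returns the (N,) vector of factor realizations.
    """
    # lstsq computes (b'b)^{-1} b'r without forming the inverse explicitly
    f_t, *_ = np.linalg.lstsq(chars_tm1, returns_t, rcond=None)
    return f_t
```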
The limitation of TSRs and CSRs is their reliance on the strong assumption that either factors
or factor exposures are fully observable to the econometrician. Though theory offers some guid-
ance on the nature of common risk factors, and though firm attributes are likely to correlate with
their factor exposures, the necessary observability assumptions for the success of TSRs or CSRs
are unlikely to be satisfied in the data.
3.2.2. Principal components analysis. If neither factors nor loadings are known, we can re-
sort to PCA to extract latent factors and their loadings. The use of PCA in asset pricing dates
back to as early as Chamberlain & Rothschild (1983) and Connor & Korajczyk (1986) and has
become increasingly popular (see, e.g., Kozak, Nagel & Santosh 2018; Kelly, Pruitt & Su 2019;
Pukthuanthong, Roll & Subrahmanyam 2019; Giglio & Xiu 2021). For a static factor model
(Equation 1) PCA can identify factors and their loadings up to some unknown linear trans-
formation. It is more convenient to implement this via a singular value decomposition (SVD)
of $\bar R$,
$$\bar R = \sum_{j=1}^{K}\hat\sigma_j\,\hat\varsigma_j\,\hat\xi_j' + \hat U, \qquad (12)$$
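As a concrete illustration, the following sketch extracts K principal component factors via the SVD in Equation 12; the normalization (loadings scaled by singular values) is one common convention, and all names are our own assumptions:

```python
import numpy as np

def pca_factors(R, K):
    """PCA factor extraction via SVD of the demeaned return panel (Equation 12).

    R: (T, N) excess returns. Returns loadings (N, K) and factors (K, T),
    identified only up to an invertible linear transformation.
    """
    T = R.shape[0]
    Rbar = (R - R.mean(axis=0)).T                # (N, T) demeaned panel
    U, s, Vt = np.linalg.svd(Rbar, full_matrices=False)
    beta = U[:, :K] * (s[:K] / np.sqrt(T))       # loadings absorb singular values
    factors = np.sqrt(T) * Vt[:K]                # factor realizations
    return beta, factors
```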
3.2.3. Risk-premia principal components analysis. One potential shortcoming of PCA is that
it extracts information about latent factors solely from realized return covariances. To see this, the
SVD in Equation 12 is applied to R̄, which eliminates the average return from each column of R.
In fact, if we assume α = 0 in Equation 2, the expected return is also spanned by beta, so that the
information in average returns (r̄) can be exploited for more efficient recovery of factors.
Lettau & Pelger (2020b) exploit this intuition and propose a so-called risk-premia PCA estima-
tor for factors. Instead of using $T^{-1}\bar R\bar R' = T^{-1}RR' - \bar r\bar r'$, they conduct PCA on $T^{-1}RR' + \lambda\bar r\bar r'$,
where λ is a tuning parameter. Risk-premia PCA generalizes the proposal of Connor & Korajczyk
(1986), which corresponds to the special case of λ = 0. Lettau & Pelger (2020a) further prove
that the risk-premia PCA could achieve a smaller asymptotic variance for factor loadings than the
standard PCA if all factors are pervasive; it outperforms PCA empirically when factors are weak.
We defer a more detailed discussion of weak factors to Section 3.3.4.
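A sketch of the risk-premia PCA idea (our own implementation, not the authors' code): apply an eigendecomposition to the second-moment matrix with the mean term overweighted by λ.

```python
import numpy as np

def rp_pca_loadings(R, lam, K):
    """Risk-premia PCA sketch: loadings are the top-K eigenvectors of
    T^{-1} R R' + lam * rbar rbar'. Setting lam = 0 recovers the
    Connor-Korajczyk special case.

    R: (N, T) panel of excess returns; lam: tuning parameter.
    """
    N, T = R.shape
    rbar = R.mean(axis=1)
    M = R @ R.T / T + lam * np.outer(rbar, rbar)
    vals, vecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    return vecs[:, ::-1][:, :K]                  # top-K eigenvectors as loadings
```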
3.2.4. Instrumented principal components analysis. A limitation of PCA is that it only ap-
plies to static factor models. It also lacks the flexibility to incorporate other data beyond returns. To
address both issues, Kelly, Pruitt & Su (2019) estimate the conditional factor model (Equation 6) by solving the optimization problem $\min_{\beta,\{f_t\}} \sum_{t=2}^{T}\|\tilde r_t - b_{t-1}\beta f_t\|^2$. The estimates satisfy the first-order conditions:
$$\hat f_t = \big(\hat\beta' b_{t-1}' b_{t-1}\hat\beta\big)^{-1}\hat\beta' b_{t-1}'\,\tilde r_t, \qquad (14)$$
$$\operatorname{vec}(\hat\beta') = \left(\sum_{t=2}^{T} b_{t-1}'b_{t-1}\otimes \hat f_t\hat f_t'\right)^{-1}\left(\sum_{t=2}^{T} \big(b_{t-1}\otimes \hat f_t'\big)'\,\tilde r_t\right). \qquad (15)$$
Consistent with the discussion in Section 3.2.1, Equation 14 shows that, given conditional be-
tas, factors are estimated from CSRs of returns on betas. Equation 14 resembles Equation 11,
but the former accommodates a potentially larger number of characteristics because of the built-
in dimension reduction assumption. Equation 15 shows that conditional betas can be recovered
from panel regressions of returns onto characteristics interacted with factors. The authors recom-
mend an alternating least squares algorithm to iteratively update β and ft until convergence. Kelly,
Pruitt & Su (2020) develop the accompanying asymptotic inference for the extracted factors and
loadings.
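The alternating least squares recursion of Equations 14 and 15 can be sketched as follows (a stylized implementation under our own conventions; identification normalizations, standard errors, and missing-data handling are omitted):

```python
import numpy as np

def ipca_als(Rts, Bts, K, n_iter=100):
    """Alternating least squares for IPCA (Equations 14-15, a sketch).

    Rts: list of (M_t,) excess-return vectors; Bts: list of (M_t, N) lagged
    characteristic matrices; K: number of factors.
    Returns beta (N, K) and the factor path as a (T-1, K) array.
    """
    N = Bts[0].shape[1]
    beta = np.random.default_rng(0).standard_normal((N, K))
    for _ in range(n_iter):
        # Equation 14: given beta, period-by-period CSRs recover the factors
        F = [np.linalg.lstsq(B @ beta, r, rcond=None)[0]
             for r, B in zip(Rts, Bts)]
        # Equation 15: given factors, a pooled regression updates vec(beta')
        A = sum(np.kron(B.T @ B, np.outer(f, f)) for B, f in zip(Bts, F))
        c = sum(np.kron(B, f[None, :]).T @ r for r, B, f in zip(Rts, Bts, F))
        beta = np.linalg.solve(A, c).reshape(N, K)
    return beta, np.array(F)
```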
Kelly, Moskowitz & Pruitt (2021) apply this IPCA framework to explain momentum and long-
term reversal phenomena in equity returns. The general framework of IPCA also extends beyond
equity into other asset classes, such as corporate bonds (Kelly, Palhares & Pruitt 2021) and options
(Büchner & Kelly 2022).
aging additional nonreturn conditioning information (in the spirit of IPCA). In the Gu, Kelly &
Xiu (2021) conditional autoencoder, stock characteristics are mapped into betas through a feed-
forward neural network, thus replacing IPCA betas with a more realistic nonlinear specification.
Figure 1 illustrates the model’s basic structure. At a high level, the mathematical representation
of the model is identical to Equation 4.
Figure 1
A diagram of a conditional encoder model in which an encoder is augmented to incorporate covariates in the factor loading
specification. (Left) A diagram of how factor loadings $\beta_{t-1}$ at time $t-1$ (green) depend on firm characteristics $b_{t-1}$ (yellow) of input layer 1 through an activation function $g$ on neurons of the hidden layer. Each row of yellow neurons represents the $N \times 1$ vector of characteristics of one ticker. (Right) A diagram showing the corresponding factors at time $t$. The $f_t$ nodes (purple) are weighted combinations
of neurons of input layer 2, which can be either N characteristic-managed portfolios rt (pink) or M individual asset returns r̃t (red). In
the former case, the dashed arrows indicate that the characteristic-managed portfolios rely on individual assets through predetermined
weights (not to be estimated). In either case, the effective input can be regarded as individual asset returns, exactly what the output layer
(red) aims to approximate; thus, this model shares the same spirit as a standard encoder.
On the left side of the network, factor loadings are a nonlinear function of the input characteristics:
$$b_{i,t-1}^{(0)} = b_{i,t-1}, \qquad (16)$$
$$b_{i,t-1}^{(l)} = g\big(b^{(l-1)} + W^{(l-1)} b_{i,t-1}^{(l-1)}\big), \quad l = 1, \dots, L_\beta, \qquad (17)$$
$$\beta_{i,t-1} = b^{(L_\beta)} + W^{(L_\beta)} b_{i,t-1}^{(L_\beta)}. \qquad (18)$$
Equation 16 initializes the network as a function of the baseline characteristic data, bi, t−1 .
Equation 17 describes the nonlinear (and interactive) transformation of characteristics as they
propagate through hidden layer neurons. Equation 18 describes how a set of K-dimensional factor
betas emerge from the terminal output layer.
On the right side of Figure 1, we see an otherwise standard autoencoder for the factor specification. The recursive mathematical formulation of the factors is
$$r_t^{(0)} = r_t, \qquad (19)$$
$$r_t^{(l)} = g\big(b^{(l-1)} + W^{(l-1)} r_t^{(l-1)}\big), \quad l = 1, \dots, L_f, \qquad (20)$$
$$f_t = b^{(L_f)} + W^{(L_f)} r_t^{(L_f)}. \qquad (21)$$
Equation 19 initializes the network with characteristic-sorted portfolios of individual asset re-
turns, as defined by Equation 7. This sidesteps the incompleteness issue of the panel of individual
stock returns and at the same time performs a preliminary reduction of the data. The expressions in
Equation 20 transform and compress the dimensionality of returns as they propagate through
hidden layers. Equation 21 describes the final set of K factors at the output layer. If a single linear
layer is included on the factor network, that is, if Lf = 1, this structure maintains the economic
interpretation of factors: They are themselves portfolios (linear combination of returns).
Finally, the dot product operation multiplies the $M \times K$ matrix output from the beta network with the $K \times 1$ output from the factor network to produce the final model fit for each individual asset return.
When the autoencoder has one hidden layer and a linear activation function, it is equivalent
to the PCA estimator for linear factor models described in Section 3.2.2. Just as the autoencoder
model nests the static linear factor model, the augmented autoencoder nests the IPCA factor
model as a special case. The high capacity of a neural network model enhances its flexibility to
construct the most informative features from data. With enhanced flexibility, however, comes a
higher propensity to overfit. Next we discuss several generic algorithms applicable to most deep learning models.
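A compact PyTorch sketch of the two-sided architecture (our own simplified rendering of Figure 1, with hypothetical layer sizes; the original model includes multiple hidden layers and the regularization devices discussed below):

```python
import torch
import torch.nn as nn

class ConditionalAutoencoder(nn.Module):
    """Sketch of a conditional autoencoder: characteristics -> betas (left),
    portfolio returns -> factors (right), combined by a dot product."""

    def __init__(self, n_chars, n_factors, n_ports, hidden=32):
        super().__init__()
        # Beta network (Equations 16-18): characteristics to K loadings
        self.beta_net = nn.Sequential(
            nn.Linear(n_chars, hidden), nn.ReLU(),
            nn.Linear(hidden, n_factors),
        )
        # Factor network (Equations 19-21): a single linear layer keeps
        # the factors interpretable as portfolios of the inputs
        self.factor_net = nn.Linear(n_ports, n_factors)

    def forward(self, chars_t, ports_t):
        beta = self.beta_net(chars_t)        # (M, K) conditional loadings
        factors = self.factor_net(ports_t)   # (K,) factor realizations
        return beta @ factors                # (M,) fitted excess returns
```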
3.2.5.1. Training, validation, and testing. To curb overfitting, the entire sample is typically
divided into three disjoint subsamples that maintain the temporal ordering of the data. The first, or
training, subsample is used to estimate the model subject to a specific set of tuning hyperparameter
values.
The second, or validation, subsample is used for tuning the hyperparameters. Fitted values are constructed for data points in the validation sample based on the estimated model from the training sample.
3.2.5.2. Regularization techniques. The most common machine learning device for guarding
against overfitting is to append a penalty to the objective function in order to favor more parsimo-
nious specifications. This regularization approach mechanically deteriorates a model’s in-sample
performance in the hope of improving its stability out of sample. This is the case when penalization
manages to reduce the model’s fit of noise while preserving its fit of the signal.
Gu, Kelly & Xiu (2021) define the estimation objective to be a penalized least squares criterion,
$$\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\big(\tilde r_{i,t} - \beta_{i,t-1}'f_t\big)^2 + \phi(\theta; \lambda),$$
where $\theta$ summarizes the weight parameters in the loading and factor networks of Equations 16–21, and $\phi(\theta; \lambda)$ is a penalty function, such as lasso (or $l_1$) penalization, which takes the form $\phi(\theta; \lambda) = \lambda\sum_j |\theta_j|$.
In addition to l1 penalization, Gu, Kelly & Xiu (2021) employ a second machine learning reg-
ularization tool known as early stopping. By ending the parameter search early (as soon as the
validation sample error begins to increase), parameters are shrunken toward the initial guess, for
which parsimonious parameterization is often imposed. It is a popular substitute for $l_2$ penalization of the $\theta$ parameters because of its convenience in implementation and effectiveness in combating overfitting.
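A generic early-stopping loop, shown schematically (the two callables are placeholders for user-supplied training and validation routines):

```python
def train_with_early_stopping(fit_epoch, valid_error, max_epochs=200, patience=5):
    """Run fit_epoch() repeatedly; stop once the validation error returned by
    valid_error(params) has not improved for `patience` consecutive epochs,
    and return the best parameters seen. A sketch, not a full trainer."""
    best_err, best_params, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        params = fit_epoch()                 # one pass over the training sample
        err = valid_error(params)            # error on the validation sample
        if err < best_err:
            best_err, best_params, wait = err, params, 0
        else:
            wait += 1
            if wait >= patience:             # validation error keeps rising
                break
    return best_params
```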
As a third regularization technique, Gu, Kelly & Xiu (2021) adopt an ensemble approach in
training neural networks. In particular, they use multiple random seeds to initialize neural net-
work estimation and construct model predictions by averaging estimates from all networks. This
enhances the stability of the results because the stochastic nature of the optimization can cause
different seeds to settle at different optima.
3.2.5.3. Optimization algorithms. The high degree of nonlinearity and nonconvexity in neu-
ral networks, together with their rich parameterization, makes brute force optimization highly
computationally intensive (often to the point of infeasibility). Gu, Kelly & Xiu (2021) adopt the
adaptive moment estimation algorithm (Adam), an efficient version of stochastic gradient de-
scent introduced by Kingma & Ba (2014), which computes adaptive learning rates for individual
parameters using estimates of first and second moments of the gradients.
Gu, Kelly & Xiu (2021) also adopt batch normalization (Ioffe & Szegedy 2015) to control the
variability of predictors across different regions of the network and across different data sets. This
method is motivated by the phenomenon of internal covariate shift in which inputs of hidden
layers follow different distributions than their counterparts in the validation sample.
3.2.6. Matrix completion. It is not uncommon in finance applications to deal with unbalanced
panels. Giglio, Liao & Xiu (2021) adopt a matrix completion algorithm to handle missing data
when extracting factors and loadings of a factor model.
The matrix completion approach relies on the assumption that the full matrix can be writ-
ten as a noisy low-rank matrix. This assumption is naturally justified for Equation 1 (assuming
α = 0), which, in matrix form, can be rewritten as $R = \beta(V + \gamma\iota_T') + U$ and thus clearly satisfies
the assumption.
where $\|X\|_n$ denotes the matrix nuclear norm,4 and $\lambda_{NT} > 0$ is a tuning parameter. By penalizing the singular values of $X$, the algorithm achieves a low-rank matrix as the output. The latent factors and betas can then be estimated via the corresponding singular vectors of $\hat X$.
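One standard algorithm in this family is iterative SVD soft-thresholding (soft-impute); the sketch below illustrates the mechanics and is not the exact procedure of Giglio, Liao & Xiu (2021):

```python
import numpy as np

def soft_impute(R, lam, n_iter=100):
    """Nuclear-norm-regularized matrix completion (soft-impute style sketch).

    R: (N, T) returns with np.nan marking missing entries; lam: penalty on
    singular values. Returns the fitted low-rank matrix.
    """
    mask = ~np.isnan(R)
    X = np.where(mask, R, 0.0)                # start with zeros in the holes
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - lam, 0.0)          # soft-threshold singular values
        low_rank = (U * s) @ Vt
        X = np.where(mask, R, low_rank)       # keep observed, refill missing
    return low_rank
```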
The risk premium of a factor is informative about the equilibrium compensation investors demand
to hold risk associated with that factor. One of the central predictions of asset pricing models
is that some risk factors—for example, consumption growth, intermediary capital, or aggregate
liquidity—should command a risk premium: Investors should be compensated for their exposure
to those factors, holding constant their exposure to all other sources of risk.
For tradable factors—such as the market portfolio in the CAPM—estimating risk premia re-
duces to calculating the sample average excess return of the factor. This estimate is simple and
robust and requires minimal modeling assumptions.
However, many theoretical models are formulated with regard to nontradable factors—factors
that are not themselves portfolios—such as consumption, inflation, liquidity, and so on. To estimate
the risk premium of any of these factors, it is necessary to construct its tradable incarnation. Such
a tradable factor is a hedging portfolio that isolates the risk of the nontradable factor while holding
all other risks constant. There are two standard approaches to constructing tradable counterparts
of a nontradable factor: two-pass regressions and factor-mimicking portfolios.
3.3.1. Classical two-pass regressions. The classical two-pass (or Fama-MacBeth) regression
requires a model like Equation 1 with all factors observable. The first time series pass yields esti-
mates of beta using regressions in Equation 9. Then the second cross-sectional pass estimates risk
premia via an ordinary least squares (OLS) regression of average returns on the estimated beta:
$$\hat\gamma = (\hat\beta'\hat\beta)^{-1}\hat\beta'\,\bar r. \qquad (24)$$
The generalized least squares (GLS) version of Equation 24 replaces the OLS in the
cross-sectional pass with
$$\hat\gamma_{\mathrm{GLS}} = \big(\hat\beta'\hat\Sigma_u^{-1}\hat\beta\big)^{-1}\hat\beta'\hat\Sigma_u^{-1}\,\bar r, \qquad (25)$$
4 The nuclear norm is $\|X\|_n := \sum_{i=1}^{\min\{N,T\}}\psi_i(X)$, where $\psi_1(X) \ge \psi_2(X) \ge \cdots$ are the sorted singular values of $X$.
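A bare-bones rendering of the two-pass procedure of Section 3.3.1 (Equations 9 and 24), with our own variable names and OLS in both passes:

```python
import numpy as np

def two_pass_risk_premia(R, F):
    """Two-pass regression sketch. R: (T, N) test asset excess returns;
    F: (T, K) observable factors. Returns the OLS risk-premia estimates."""
    T = R.shape[0]
    X = np.column_stack([np.ones(T), F])
    # First pass: asset-by-asset time series regressions yield the betas
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)
    beta = coef[1:].T                         # (N, K) estimated exposures
    # Second pass: cross-sectional OLS of average returns on betas (Eq. 24)
    gamma, *_ = np.linalg.lstsq(beta, R.mean(axis=0), rcond=None)
    return gamma
```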
3.3.2. Factor-mimicking portfolios. In contrast to Equation 24, Fama & MacBeth (1973)
propose an inference procedure that regresses realized returns at each time $t$ onto $\hat\beta$:
$$\hat\gamma_t = (\hat\beta'\hat\beta)^{-1}\hat\beta'\,r_t. \qquad (26)$$
Note that the estimated slope of the Fama-MacBeth regression at each time $t$, $\hat\gamma_t$, is itself a portfolio return, corresponding to the portfolio weights $(\hat\beta'\hat\beta)^{-1}\hat\beta'$. This highlights an important point:
The classical two-pass regression discussed above or Fama-MacBeth (both of which yield the
same point estimates of the risk premium) obtains the risk premium of a nontradable factor by
first building a factor-mimicking portfolio for it and then estimating the corresponding risk pre-
mium as the average excess return of this portfolio, $\hat\gamma_t$ (Fama & MacBeth 1973). Regularity
conditions in Giglio & Xiu (2021) imply that
$$\hat\gamma_t = (\hat\beta'\hat\beta)^{-1}\hat\beta'\,[\beta(\gamma + v_t) + u_t] \approx \gamma + v_t,$$
and thus the Fama-MacBeth procedure is an effective approach to estimating risk premia γ .
Now, suppose we are interested in estimating the risk premium of a measured nontradable risk
factor gt , say, a climate risk measure that is not tradable, that satisfies
$$g_t = \xi + \eta v_t + z_t. \qquad (27)$$
This is a general representation where the measured risk factor gt is related through the vector η to
the fundamental risk factors v t (it could itself be part of v t , but it could also just be correlated with
it and therefore still command a risk premium). The representation also allows for measurement
error zt (e.g., measurement error in consumption growth).
Obviously, the risk premium of $g_t$ is $\gamma_g = \eta\gamma$. This can be readily estimated using the Fama-MacBeth procedure by first building the mimicking portfolios for $v_t$ ($\hat\gamma_t$) and then obtaining the mimicking portfolio for $g_t$ as $\hat\eta\,\hat\gamma_t$, where $\hat\eta$ is simply the vector of coefficients of a TSR (Equation 27). This yields the risk premium estimate $\hat\eta\,\hat\gamma$.
Another standard approach to tracking a nontradable factor is the maximal-correlation factor-
mimicking portfolio approach (e.g., Huberman, Kandel & Stambaugh 1987; Lamont 2001). This
directly projects $g_t$ onto a set of basis asset returns, $y_t$, which yields the weights $w_g$ of the mimicking portfolio, whose returns and expected returns are given by $w_g' y_t$ and $w_g' E(y_t)$, respectively.
How do we reconcile these two approaches? Is $\gamma_g$ the same as $w_g' E(y_t)$ for some choice of $y_t$? How do we select such $y_t$? Under data-generating processes given by Equations 1, 2, and 27, if we select $y_t = A f_t$ (recall that $f_t = \mu + v_t$) for any invertible matrix $A$, then $w_g = (A')^{-1}\eta$, and $w_g' E(y) = \eta\gamma$. In this scenario, both approaches are equivalent, suggesting that we should use the
same factors we use in Fama-MacBeth regressions to build a mimicking portfolio for gt . Obviously,
this is only possible if $f_t$ is a vector of tradable portfolios. If not, we can use mimicking portfolios of $f_t$, $\hat\gamma_t$, but this implies that the mimicking portfolio needs hedging portfolios already built by the Fama-MacBeth approach described above, limiting its usefulness.
There is, however, a more interesting choice for yt that obviates the need to build hedging port-
folios from Fama-MacBeth regressions in the first place. The idea is to use all returns $y_t = r_t$ as basis assets when building a mimicking portfolio for $g_t$. In this case, we find that $w_g = \Sigma^{-1}\beta\Sigma_v\eta'$, and
3.3.3. Three-pass regressions and the omitted factor bias. Giglio & Xiu (2021) suggest
using principal component regression (PCR) when building the factor-mimicking portfolios for gt
using all returns rt as basis assets. The PCR approach is a natural choice among high-dimensional
regressions in that rt follows a factor model according to Equation 1. The three-pass method
proceeds as follows:
1. The first pass is an SVD of $\bar R$ to obtain $\hat\beta$ and $\hat V$, as in Equation 13.
2. The second pass runs a cross-sectional OLS regression (Equation 24) to obtain risk premia of $\hat V$.
3. Finally, the third pass projects $g_t$ onto $\hat V$: $\hat\eta = \bar G\,\hat V'(\hat V\hat V')^{-1}$.
As we discuss in Section 4.2, this estimator has asymptotic guarantees in the large N, large
T setting, but only if all factors are pervasive.
Because this estimator does not rely on any prespecified asset pricing model for risk-premia
estimation, it is especially useful for cases in which a researcher is interested in estimating the risk
premium of a nontradable factor predicted by theory (e.g., consumption growth, liquidity, etc.)
but does not want to take a stand on what the other factors are in the model. In contrast, stan-
dard Fama-MacBeth regressions would be biased if some true factors are omitted in this model.
Relatedly, Gagliardini, Ossola & Scaillet (2019) propose a diagnostic criterion for detecting the
number of omitted factors.
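The three passes can be strung together in a few lines; this is a simplified sketch under one PCA normalization, not the authors' full estimator (which also covers inference and the choice of K):

```python
import numpy as np

def three_pass_premium(R, g, K):
    """Three-pass risk-premium sketch in the spirit of Giglio & Xiu (2021).

    R: (T, N) excess returns; g: (T,) nontradable factor; K: number of
    latent factors. Returns the estimated risk premium of g.
    """
    T = R.shape[0]
    Rbar = (R - R.mean(axis=0)).T                  # (N, T) demeaned panel
    # Pass 1: PCA via SVD recovers latent loadings and factors
    U, s, Vt = np.linalg.svd(Rbar, full_matrices=False)
    beta = U[:, :K] * (s[:K] / np.sqrt(T))
    V = np.sqrt(T) * Vt[:K]                        # (K, T) factor realizations
    # Pass 2: cross-sectional OLS of average returns on loadings (Eq. 24)
    gamma, *_ = np.linalg.lstsq(beta, R.mean(axis=0), rcond=None)
    # Pass 3: time series projection of g on the latent factors
    eta, *_ = np.linalg.lstsq(V.T, g - g.mean(), rcond=None)
    return eta @ gamma
```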
3.3.4. Weak factors. Besides the omitted factor bias, another severe issue that plagues the clas-
sical two-pass regression is weak identification. Kan & Zhang (1999) first noted that the inference
on risk premia from two-pass regressions becomes distorted when a useless factor—a factor to
which test assets have zero exposure—is included in the model. Kleibergen (2009) further points
out that standard inference fails if betas are relatively small. This issue is quite relevant in prac-
tice because many test assets are not very sensitive to macroeconomic shocks. Moreover, the same
rank-deficiency problem arises when betas are collinear (even if the factors are individually strong);
rect the error-in-variables bias. In the same spirit and more rigorously, Anatolyev & Mikusheva
(2022) propose a four-split approach that addresses the issues of weak factors and omitted factors.
They assume that part of $v_t$ in Equation 1, call it $v_{1t}$, is observable, though potentially weak. They also assume that its beta, namely $\beta_1$, fully spans the space of expected returns. The other part of $v_t$, call it $v_{2t}$, is latent (hence omitted by econometricians) and unpriced. The four-split estimator aims for valid inference on the risk premia of $v_{1t}$. Note that omitted factors in their setup must be
unpriced to achieve valid inference. In practice, however, it is the omitted priced factors that are
most concerning.
Giglio, Xiu & Zhang (2021) argue that the weak factor problem is fundamentally an issue of test
asset selection. They argue that factor strength is not an inherent property of a factor but instead is
dictated by the selection of test assets. Weaker factors may still be priced, so just eliminating them
is an undesirable solution. Instead, Giglio, Xiu & Zhang (2021) suggest actively selecting test assets
to guarantee that the selected assets have sufficient exposure to the factors of interest; in other
words, a factor can be made stronger by appropriate asset selection—by selecting assets highly
exposed to it. To simultaneously address the weak and omitted factor problems, they propose
an iterative supervised PCA procedure that integrates correlation screening with the three-pass
estimator of Giglio & Xiu (2021). This estimator is robust to both omitted variable bias and the
weak factor problem, as well as to measurement error in observed factors.
Test assets are an important component of empirical asset pricing, yet little work has been
dedicated to rigorously and systematically investigating how they should be selected. When a
model is composed of tradable factors, many important asset pricing analyses are independent of
the test assets. For example, risk premia of tradable factors are best calculated as simple averages
of factor returns, the maximum Sharpe ratio portfolio in the model economy can be inferred from
the tradable factors alone, and model comparison can be conducted without test assets (Barillas &
Shanken 2017). In contrast, test assets are central to the study of nontradable factors because they
are used to construct the necessary factor-mimicking portfolios that in turn are inputs to most
asset pricing analyses.
The choice of test assets in the literature has mainly followed one of three approaches. The first
approach, adopted by the vast majority of the literature, uses a standard set of portfolios sorted on
a few characteristics, such as size and value, following the seminal work of Fama & French (1993).
Lewellen, Nagel & Shanken (2010) argue that this approach sets a rather low hurdle for a factor
pricing model. They suggest augmenting the set of test assets with industry portfolios. Giglio, Xiu
& Zhang (2021) argue that using the standard cross section likely creates a weak factor problem
because these assets may not have exposure to the factor of interest. Alternatively, Ahn, Conrad &
folios is expected to be particularly informative about the factor of interest, but it is affected by
the omitted factor problem because it tends to focus only on univariate exposures.
The approach of Giglio, Xiu & Zhang (2021) builds on these approaches by starting from a
large universe of test assets but then selecting only informative assets for estimation.
where $b = \Sigma_v^{-1}\gamma$, and $\Sigma_v$ is the covariance matrix of factor innovations. The SDF is central to the
field of asset pricing because, in the absence of arbitrage, covariances with the SDF alone explain
cross-sectional differences in expected returns.
As shown in Equation 29, the vector of SDF loadings, b, is related to mean-variance optimal
portfolio weights. SDF loadings b and risk premia γ are directly related through the covariance
matrix of the factors, but they differ substantially in their interpretations. The SDF loading of a
factor tells us whether that factor is useful in pricing the cross section of returns. For example, a
factor could command a nonzero risk premium without appearing in the SDF simply because it is
correlated with the true factors driving the SDF. It is therefore not surprising to see many factors
with significant risk premia. For this reason, it makes more sense to tame the factor zoo by testing
if a new factor has a nonzero SDF loading (or has a nonzero weight in the mean-variance efficient
portfolio) rather than testing if it has a significant risk premium.
3.4.1. Generalized method of moments. The classical approach to estimating SDF loadings
is the generalized method of moments (GMM). In light of Equations 3 and 29 and the definition
of the SDF, we can formulate a set of moment conditions,
Because there are in total K + N moments with 2K parameters (μ and b), we need N ≥ K to ensure
the system is identified.
The GMM estimator is thereby defined as the solution to the optimization problem,
$$\min_{b,\mu}\; g_T(b,\mu)'\,W\,g_T(b,\mu), \qquad (30)$$
The inference procedure follows the usual GMM formulation (Hansen 1982). For efficiency rea-
sons, it is customary to choose the optimal weighting matrix $W_{\mathrm{opt}} = \hat\Omega^{-1}$, where $\hat\Omega$ is a consistent estimator of the asymptotic covariance matrix of the moments, as in Section 4.1. As an alternative, there is a special class of weighting matrices
for which a closed-form solution to Equation 30 is available,
$$\hat b = (C'W_{11}C)^{-1}(C'W_{11}\,\bar r), \qquad \hat\mu = \bar f, \qquad (31)$$
where $W_{11}$ is the top $N \times N$ submatrix of some $W$, and $C$ is the $N \times K$ sample covariance matrix between $r_t$ and $v_t$. Recall that $b = \Sigma_v^{-1}\gamma$, and note also that $\beta\Sigma_v = C$. It follows that $\hat b = \hat\Sigma_v^{-1}\hat\gamma$, where $\hat\gamma$ is given by Equation 24 ($W_{11} = I_N$) or 25 ($W_{11} = \hat\Sigma_u^{-1}$). In other words, Equation 31 amounts to running two-pass CSRs with (univariate) covariances in place of $\beta$. This is not surprising because according to Equation 2 we have
3.4.2. Principal components analysis–based methods. Kozak, Nagel & Santosh (2018) argue
that the absence of near-arbitrage opportunities forces expected returns to (approximately) align
with common factor covariances, even in a world in which belief distortions can affect asset prices.
The strong covariation among asset returns suggests that the SDF can be represented as a function
of a few dominant sources of return variation. PCA of asset returns recovers the common compo-
nents that dominate return variation. Specifically, the first two passes of the three-pass procedure
in Section 3.3.3 yield an SDF estimator without relying on knowledge of factor identities:
$$\hat m_t = 1 - \hat\gamma'\,\hat v_t, \qquad (33)$$
where $\hat v_t$ is the $t$th column of $\hat V$.
3.4.3. Penalized regressions. In the PCA approach, the SDF is essentially parameterized as a
small number of linear combinations of factors, as shown in Equation 29. Kozak, Nagel & Santosh
(2020) consider an SDF represented in terms of a set of tradable test asset returns,
$$m_t = 1 - b'\,[r_t - E(r_t)], \qquad (34)$$
where $b$ satisfies $E(r_t) = \Sigma b$, and $\Sigma$ is the covariance matrix of $r_t$. Giglio, Xiu & Zhang (2021)
show that the relationship between the two SDFs (Equations 29 and 34) depends on the degree
of completeness of markets. Assuming that rt follows Equation 1 and some regularity conditions
hold, these two forms of SDF are asymptotically equivalent as N → ∞ in the sense that
$$\frac{1}{T}\sum_{t=1}^{T}\big|m_t - \check m_t\big|^2 \;\lesssim\; \frac{1}{\lambda_{\min}(\beta'\beta)},$$
where $\check m_t$ denotes the SDF representation of Equation 29.
Because the right-hand side diminishes as N → ∞ even for relatively weak factors, there is
generally no theoretical difference between estimands.
To estimate the SDF (Equation 34), Kozak, Nagel & Santosh (2020) suggest solving an optimization problem, which amounts to a regression of $\bar r$ onto $\hat\Sigma$,
$$\hat b = \arg\min_b\;\Big\{(\bar r - \hat\Sigma b)'\,\hat\Sigma^{-1}\,(\bar r - \hat\Sigma b) + p_\lambda(b)\Big\}, \qquad (35)$$
where $p_\lambda(b)$ is a penalty function. With a ridge ($l_2$) penalty, for example, the solution has the closed form $\hat b = (\hat\Sigma + \lambda I_N)^{-1}\bar r$, which avoids calculating $\hat\Sigma^{-1}$.
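For the ridge case, the closed form above is immediate to compute; a small sketch (names are ours):

```python
import numpy as np

def sdf_loadings_ridge(R, lam):
    """Ridge-penalized SDF loadings: the l2 version of Equation 35 has the
    closed form b = (Sigma + lam*I)^{-1} rbar, which sidesteps inverting
    Sigma itself. R: (T, N) test asset excess returns."""
    rbar = R.mean(axis=0)
    Sigma = np.cov(R, rowvar=False)               # (N, N) sample covariance
    return np.linalg.solve(Sigma + lam * np.eye(Sigma.shape[0]), rbar)
```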
3.4.4. Double machine learning. A fundamental task facing the asset pricing field today is
how to bring more discipline to the proliferation of factors. In particular, a question that re-
mains open is how to judge whether a new factor adds explanatory power for asset pricing, relative
to the hundreds of factors the literature has so far produced. Feng, Giglio & Xiu (2020) attempt to
address this question by systematically evaluating the contribution of individual factors relative to
existing factors as well as for conducting appropriate statistical inference in this high-dimensional
setting. While machine learning methods discussed in the previous section perform well by em-
ploying regularization to trade off bias with variance, both regularization and overfitting cause
a bias that distorts inference. Chernozhukov et al. (2018) introduce a general double machine
learning (DML) framework to mitigate bias and restore valid inference on a low-dimensional pa-
rameter of interest in the presence of high-dimensional nuisance parameters. Feng, Giglio & Xiu
(2020) make use of this framework to test the SDF loading of a newly proposed factor.
Suppose that $g_t$ is the factor of interest and $h_t$ a vector of potentially confounding factors such that $v_t = (g_t : h_t')'$. To test if $g_t$ (e.g., a newly proposed factor) contributes to expected returns beyond the variables in $h_t$ (e.g., factors that have already been discovered by previous literature), we should conduct inference on $b_g$ while controlling for $b_h$, where $b = (b_g : b_h')'$ satisfies $E(r_t) = Cb = C_g b_g + C_h b_h$, and $C = \beta\Sigma_v$ is the covariance between $r_t$ and $v_t$. If the number of factors in $v_t$, $K$,
is finite, then the GMM approach introduced in Section 3.4.1 is adequate. When it comes to a
large K setting, however, the classical inference procedure is no longer valid. This is certainly a
relevant case in practice, as T is typically in the hundreds, roughly of the same scale as the number
of factors studied.
In the spirit of DML, Feng, Giglio & Xiu (2020) select controls from $\{C_h\}$ via two separate lasso regressions: $\bar r$ onto $\hat C_h$ and $\hat C_g$ onto $\hat C_h$. The selected controls are the union of the controls selected by each of the two lasso regressions and are denoted by $\hat C_{h[I]}$. Then, $\hat C_{h[I]}$ and $\hat C_g$ serve as regressors in another CSR of $\bar r$. The resulting estimator of $b_g$,
$$\hat b_g = \big(\hat C_g'\,M_{\hat C_{h[I]}}\,\hat C_g\big)^{-1}\big(\hat C_g'\,M_{\hat C_{h[I]}}\,\bar r\big), \quad \text{where } M_A := I_N - A(A'A)^{-1}A',$$
is a desirable candidate for inference because the regularization biases in lasso diminish at a faster rate than $\sqrt{T}$ after partialing out the effect of $C_h$ from $C_g$.
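The double-selection logic can be sketched with off-the-shelf lasso routines (an illustrative implementation, not the authors' code; valid inference would additionally require the standard errors derived in their paper):

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def double_selection_bg(rbar, Cg, Ch):
    """Double-selection sketch for the SDF loading of a new factor.

    rbar: (N,) average returns; Cg: (N,) covariances of returns with the new
    factor; Ch: (N, H) covariances with the candidate control factors.
    """
    # Two lasso regressions select relevant controls from Ch
    sel1 = np.flatnonzero(LassoCV(cv=5).fit(Ch, rbar).coef_)
    sel2 = np.flatnonzero(LassoCV(cv=5).fit(Ch, Cg).coef_)
    keep = np.union1d(sel1, sel2)
    # Post-selection OLS of rbar on Cg and the union of selected controls
    X = np.column_stack([Cg, Ch[:, keep]])
    return LinearRegression().fit(X, rbar).coef_[0]   # estimate of b_g
```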
3.4.5. Parametric portfolios and deep learning stochastic discount factors. Because the
SDF (when projected onto tradable assets) is spanned by optimal portfolio returns, estimating the
SDF is effectively a problem of optimal portfolio formation. A fundamental obstacle to the con-
ventional mean-variance analysis is the low signal-to-noise ratio: Expected returns and covariances
$$\max_\theta\;\frac{1}{T}\sum_{t=2}^{T} U\!\left(\sum_{i=1}^{N_t} w(\theta, b_{i,t-1})\,\tilde r_{i,t}\right),$$
where w(θ , bi, t−1 ) is a parametric function of stock characteristics, and U(·) is some prespecified
utility function. DeMiguel et al. (2020) show that this approach, when restricted to the special
case of a linear parametric weight function and mean-variance utility, is equivalent to the usual
optimize the Sharpe ratio of the portfolio (SDF) via reinforcement learning, with more than
50 features plus their lagged values. Chen, Pelger & Zhu (2019) parameterize the SDF loadings
and weights of test asset portfolios as two separate neural networks and adopt an adversarial min-
imax approach to estimate the SDF. Both adopt long-short-term memory models to incorporate
lagged time series information from macro variables, firm characteristics, or past returns.
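For the parametric portfolio objective displayed above, a stylized implementation with linear weights and CRRA utility (both are our illustrative choices, not a prescription from the cited papers) looks as follows:

```python
import numpy as np
from scipy.optimize import minimize

def parametric_portfolio(theta0, B, Rex, gamma=5.0):
    """Parametric portfolio sketch: weights deviate from equal weighting in
    proportion to characteristics, w_it = 1/N + theta'b_it, and theta is
    chosen to maximize average CRRA utility (gamma != 1 assumed).

    B: (T, N, P) standardized characteristics; Rex: (T, N) excess returns.
    """
    T, N, P = B.shape

    def neg_utility(theta):
        w = 1.0 / N + B @ theta              # (T, N) portfolio weights
        rp = (w * Rex).sum(axis=1)           # portfolio return each period
        u = (1.0 + rp) ** (1 - gamma) / (1 - gamma)
        return -u.mean()

    return minimize(neg_utility, theta0, method="Nelder-Mead").x
```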
3.5.1. GRS test and extensions. Assessments of factor pricing models can be formalized as statistical hypothesis testing problems. Such tests most commonly focus on the zero-
alpha condition: If the factor model reflects the true SDF, then it should price all test assets with
zero alpha (up to sampling variation). A standard formulation for the null hypothesis is
$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_N = 0. \qquad (38)$$
In a simple setting in which all factors are observable and tradable, the model given by Equations 1
and 2 can be written as $r_t = \alpha + \beta f_t + u_t$, so that alphas can be estimated via asset-wise TSRs:
$$\hat\alpha_{\mathrm{TS}} = (R\,M_F\,\iota_T)\,(\iota_T'\,M_F\,\iota_T)^{-1}. \qquad (39)$$
Gibbons, Ross & Shanken (1989) constructed a quadratic test statistic,
$$F = \frac{T - N - K}{N}\cdot\frac{\hat\alpha_{\mathrm{TS}}'\,\hat\Sigma_u^{-1}\,\hat\alpha_{\mathrm{TS}}}{1 + \bar f'\,\hat\Sigma_v^{-1}\,\bar f}, \qquad (40)$$
and developed its exact finite sample distribution, a noncentral F-distribution, under the assumption of Gaussian errors.
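The GRS statistic in Equation 40 is simple to compute when T > N + K; a sketch:

```python
import numpy as np
from scipy import stats

def grs_test(R, F):
    """GRS test sketch (Equation 40). R: (T, N) test asset excess returns;
    F: (T, K) tradable factor excess returns. Requires T > N + K."""
    T, N = R.shape
    K = F.shape[1]
    X = np.column_stack([np.ones(T), F])
    coef, *_ = np.linalg.lstsq(X, R, rcond=None)
    alpha = coef[0]                                  # (N,) estimated alphas
    resid = R - X @ coef
    Sigma_u = resid.T @ resid / (T - K - 1)          # residual covariance
    fbar = F.mean(axis=0)
    Sigma_v = np.cov(F, rowvar=False)
    quad = alpha @ np.linalg.solve(Sigma_u, alpha)
    stat = (T - N - K) / N * quad / (1.0 + fbar @ np.linalg.solve(Sigma_v, fbar))
    pval = 1.0 - stats.f.cdf(stat, N, T - N - K)     # central F under the null
    return stat, pval
```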
An important limitation of this result is that it requires that $T > N + K$. In practice, $N$ can be much larger than $T$. Even in the case of $N < T$, the power of the Gibbons-Ross-Shanken (GRS) test may be compromised because it employs an unrestricted sample covariance matrix, $\hat\Sigma_u$, that is poorly estimated when $N$ is large relative to $T$. A standardized version of the statistic addresses this:
$$J_2 = \frac{T\,\hat\alpha_{\mathrm{TS}}'\,\hat\Sigma_u^{-1}\,\hat\alpha_{\mathrm{TS}}\,\big(1 + \bar f'\,\hat\Sigma_v^{-1}\,\bar f\big)^{-1} - N}{\sqrt{2N}}.$$
They propose other enhancements to improve the power of their test against sparse alternatives.
These extensions remedy some of the drawbacks of the GRS test and have asymptotic guaran-
tees as N, T → ∞, which represent an important step forward for tests of asset pricing models. Tests
in this section all rely on models entirely composed of tradable factors, but in light of Equation 46
below, the same test statistics and asymptotic inference should be directly applicable to models
with nontradable and latent factors via Equation 44.
3.5.2. Model comparison tests. Testing models is perhaps less informative than comparing
models. After all, all models are wrong, but some are more useful than others. As Gibbons, Ross
& Shanken (1989) emphasize, the factor model given in Equation 1 directly implies the following
equality for the GRS test statistic:
$$\alpha'\Sigma_u^{-1}\alpha \;\equiv\; \begin{pmatrix}\alpha + \beta\gamma\\ \gamma\end{pmatrix}'\begin{pmatrix}\beta\Sigma_v\beta' + \Sigma_u & \beta\Sigma_v\\ \Sigma_v\beta' & \Sigma_v\end{pmatrix}^{-1}\begin{pmatrix}\alpha + \beta\gamma\\ \gamma\end{pmatrix} - \gamma'\Sigma_v^{-1}\gamma. \qquad (41)$$
Going one step further, we obtain $\alpha'\Sigma_u^{-1}\alpha = \mathrm{SR}^2(\{r_t, v_t + \gamma\}) - \mathrm{SR}^2(\{v_t + \gamma\})$, where $\mathrm{SR}(\{a_t\})$ denotes the optimal Sharpe ratio of a portfolio using assets $a_t$. In other words, the classical GRS
test statistic can be interpreted as a test of whether the factors achieve the maximal Sharpe ratio,
or whether one can improve on that Sharpe ratio by trading the test assets in addition to the
factors. Intuitively, if {v t + γ } already span the optimal portfolio (i.e., the asset pricing model is
correctly specified), the Sharpe ratio gains from augmenting this portfolio with additional test
assets rt should be zero.
Indeed, we can compare models using the left-hand side of Equation 41 as a criterion. Specif-
ically, consider two models with tradable factor sets $\{f_t^{(1)}\}$ and $\{f_t^{(2)}\}$, respectively. Barillas &
Shanken (2017) advocate comparing these models on their ability to price all returns, both test
assets and traded factors. With this perspective comes an insight that test assets tell us nothing
about model comparison beyond what we learn from each model’s ability to price factors of the
other models! This observation is verified from Equation 41, since
5 We omit finite sample adjustment terms from the original construction of their test statistic for simplicity
and clarity.
3.5.3. Bayesian approach. As the set of candidate models expands, model comparison via
pairwise asymptotic tests becomes a daunting task. And pairwise model comparison may not un-
ambiguously isolate the best-performing model. Moreover, multiple testing issues can arise. To
find the best factor pricing model, Barillas & Shanken (2018) develop a Bayesian procedure that
computes model probabilities for a collection of asset pricing models with tradable factors. They
adopt an off-the-shelf Jeffreys prior on betas and residual covariances, following the earlier work
Under the null hypothesis of no alpha, the prior on alpha is a Dirac delta function concentrated at 0. Under the alternative, alpha is distributed as
$$P(\alpha|\beta, \Sigma_u) = \mathcal N(0, k\Sigma_u), \quad \text{for some } k > 0.$$
The benefit of this prior is its convenience and economic sensibility: It imposes that the expected squared Sharpe ratio of the arbitrage portfolio, $\alpha'\Sigma_u^{-1}\alpha$, is $kN$, which does not take implausibly large values.
Having an otherwise diffuse prior on α would force the Bayes factor to favor the null (Kass &
Raftery 1995). Barillas & Shanken (2018) provide closed-form expressions of the Bayes factor for
testing zero alpha and, more importantly, of the marginal likelihood of each model. In light of
Barillas & Shanken (2017), the model comparison of Barillas & Shanken (2018) is based on an
aggregation of evidence from all possible multivariate regressions of excluded factors on factor
subsets—i.e., it takes test assets out of the picture. Chib, Zeng & Zhao (2020) show that the use
of the standard Jeffreys priors on model-specific nuisance parameters is unsound for Bayes factors
and propose a new class of improper priors for nuisance parameters based on invertible maps,
which leads to valid marginal likelihoods and model comparisons.
Bryzgalova, Huang & Julliard (2022) further extend the Bayesian framework for model selec-
tion in the presence of potentially weak and nontradable factors. They reparameterize the expected
returns using Equation 32 and propose a spike-and-slab prior on b to encourage model selection
and ensure the validity of Bayes factors (because a flat prior would otherwise inflate Bayes factors
for models that contain weak factors).
More specifically, they introduce a vector of binary latent variables $\delta = (\delta_1, \delta_2, \dots, \delta_K)'$, where $\delta_j \in \{0, 1\}$, so that $\delta$ indexes $2^K$ possible models. The $j$th variable, $b_j$ (with associated loadings $C_j$), is included if and only if $\delta_j = 1$. Their prior on $b$ has the following spike-and-slab form:
$$P(b|\delta, \sigma^2) = \prod_{j=1}^{K}\Big[(1 - \delta_j)\,\mathrm{Dirac}(b_j) + \delta_j P(b_j|\sigma^2)\Big], \qquad P(b_j|\sigma^2) \sim \mathcal N(0, \sigma^2\psi_j);$$
$$P(\delta|w) = \prod_{j=1}^{K} w^{\delta_j}(1 - w)^{1 - \delta_j}, \qquad w \sim P(w); \qquad P(\sigma^2) \propto \sigma^{-2}.$$
The Gaussian prior is used to model the nonnegligible entries (the slab), and the Dirac mass
at zero is used to model the negligible entries (the spike), which could be replaced by a con-
tinuous density heavily concentrated around zero. This prior, originally proposed by Mitchell &
Alphas are the portion of expected returns that cannot be explained by risk exposures. Thus, a
portfolio with a significant alpha relative to a status quo model (e.g., the CAPM or the Fama-
French three-factor model) is dubbed an anomaly. Harvey, Liu & Zhu (2016) investigate more
than 300 anomalies proposed in the literature and argue that many of these anomalies are statistical
artifacts due to data snooping or multiple testing (MT).
The literature in asset pricing has long been aware of data-snooping concerns and MT issues
in alpha tests and has taken various approaches to address them over the years. Leading examples
include Lo & MacKinlay (1990) and Sullivan, Timmermann & White (1999), among many others.
Early proposals suggest replacing a multitude of null hypotheses with one single null hypothesis: $H_0: \max_i \alpha_i \le 0$ or $H_0: E(\alpha_i) = 0$ (see, e.g., White 2000, Kosowski et al. 2006, Fama & French
2010). While these are interesting null hypotheses for testing, more relevant and informative
hypotheses for alpha testing are perhaps
$$H_0^i: \alpha_i = 0, \quad i = 1, \dots, N. \qquad (43)$$
This collection of hypotheses is fundamentally different from the single null hypothesis of
Gibbons, Ross & Shanken (1989) in Equation 38. Suppose $t_i$ is a test statistic for the null $H_0^i$ (often taken as the $t$-statistic) and that a corresponding test rejects the null whenever $t_i > c_i$ for some prespecified cutoff $c_i$. Let $\mathcal H_0 \subset \{1, \dots, N\}$ denote the set of indices for which the corre-
sponding null hypotheses are true. In addition, let R be the total number of rejections in a sample,
and let F be the number of false rejections in that sample:
$$F = \sum_{i=1}^{N} 1\{t_i > c_i \text{ and } i \in \mathcal H_0\}, \qquad R = \sum_{i=1}^{N} 1\{t_i > c_i\}.$$
Both F and R are random variables. Note that, in a specific sample, we can obviously observe R,
but we cannot observe F. Nonetheless, we can design procedures to effectively limit F relative to
R in expectation.
More formally, the MT literature often works with the false discovery proportion (defined as $\mathrm{FDP} = F/\max\{R, 1\}$) and seeks procedures to control its expectation, known as the false discovery rate (FDR).
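A standard procedure for FDR control is the Benjamini-Hochberg step-up rule; the sketch below applies it to a vector of alpha-test p-values (BH is one common choice, not necessarily the specific procedure used in the papers discussed here):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values, where
    k is the largest index with p_(k) <= q*k/N. Controls FDR at level q
    under independence. Returns a boolean rejection mask."""
    p = np.asarray(pvals)
    N = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, N + 1) / N       # BH critical values
    below = p[order] <= thresh
    reject = np.zeros(N, dtype=bool)
    if below.any():
        kmax = np.nonzero(below)[0].max()      # largest k passing the cutoff
        reject[order[: kmax + 1]] = True
    return reject
```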
6 In related work, Gospodinov, Kan & Robotti (2014) take a frequentist approach to this problem.
More recently, to obtain valid p-values and t-statistics for alphas in this context, Giglio, Liao
& Xiu (2021) develop a rigorous framework with asymptotic guarantees to conduct inference
on alphas in linear factor models, accounting for the high dimensionality of test assets, missing
data, and potentially omitted factors. Factor model presentations up to this point have imposed
that alphas are zero, which makes risk premia identifiable. Giglio, Liao & Xiu (2021) relax the
zero-alpha assumption and impose an assumption that alpha is cross-sectionally independent of
beta (and accompany this with a large N asymptotic scheme). Their alpha estimator is given by
$$\hat{\alpha} = \bar{r} - \hat{\beta}\hat{\gamma}, \qquad \hat{\gamma} = \left(\hat{\beta}' M_{\iota_N} \hat{\beta}\right)^{-1}\left(\hat{\beta}' M_{\iota_N} \bar{r}\right), \tag{44}$$
where β̂ is given by Equation 9 if all factors are observable or by Equation 13 if factors are latent. Including an intercept term in the CSR in Equation 44 allows for a possibly nonzero cross-sectional mean for alpha. Then p-values for α̂ can be constructed using Equation 46 below, and these serve as inputs for FDR control.
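A small simulation sketch of Equation 44 follows. To keep it short, the true betas are treated as known (in practice β̂ comes from Equation 9 or 13), and all dimensions and parameter values are hypothetical. By the Frisch-Waugh-Lovell theorem, the slope coefficients from an intercept-included regression of r̄ on β coincide with (β′M_ι β)^{-1}β′M_ι r̄.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, K = 200, 600, 3

beta = rng.normal(1.0, 0.5, size=(N, K))        # risk exposures (treated as known here)
gamma = np.array([0.4, 0.2, 0.1])               # true risk premia (hypothetical)
alpha = 0.1 * rng.standard_normal(N)            # alphas, cross-sectionally independent of beta
f = gamma + 0.5 * rng.standard_normal((T, K))   # factor realizations, E[f_t] = gamma
r = alpha + f @ beta.T + rng.standard_normal((T, N))

r_bar = r.mean(axis=0)

# CSR of mean returns on [1, beta]; the intercept absorbs the cross-sectional
# mean of alpha, and the slopes equal (beta' M_iota beta)^{-1} beta' M_iota r_bar
X = np.column_stack([np.ones(N), beta])
coef, *_ = np.linalg.lstsq(X, r_bar, rcond=None)
gamma_hat = coef[1:]

alpha_hat = r_bar - beta @ gamma_hat            # Equation 44
print("gamma_hat:", np.round(gamma_hat, 3))
print("corr(alpha_hat, alpha):", np.round(np.corrcoef(alpha_hat, alpha)[0, 1], 3))
```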
The aforementioned frequentist MT corrections tend to be very conservative to limit false dis-
coveries. Generally speaking, they widen confidence intervals and raise p-values but do not alter
the underlying point estimate. Jensen, Kelly & Pedersen (2021) take an empirical Bayes approach
to understanding alphas in the high-dimensional context of the factor zoo, including addressing
concerns about false anomaly discoveries. They propose a Bayesian hierarchical model to accom-
plish their MT correction, which leverages two key model attributes. First is a zero-alpha prior,
which imposes statistical conservatism in analogy to frequentist MT methods. It anchors alpha
estimates to a sensible null in case the data are insufficiently informative about the parameters
of interest. Bayesian false discovery control comes from shrinking estimates toward this prior. A
benefit of the Bayesian approach, however, is that the degree of FDR control decreases as data ac-
cumulate. Eventually, with enough data, the prior gets zero weight and there is no MT correction.
This is justified: In the large data limit, there are no false discoveries! In other words, Bayesian
modeling flexibly decides on the severity of MT correction based on how much information there
is in the data.
Second, the hierarchical structure in the Jensen, Kelly & Pedersen (2021) model leverages the
joint behavior of factors, allowing factors’ alpha estimates to borrow strength from one another.
As a result, alphas for different factors are shrunk not only toward zero but also toward each other.
The frequentist corrections typically treat factors in isolation, making those corrections even more conservative.
7 In addition, Harvey & Liu (2020) propose an innovative double-bootstrap method to control FDR while also accounting for missed discoveries.
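The shrinkage mechanics can be seen in a deliberately simplified conjugate sketch. This is not the actual Jensen, Kelly & Pedersen specification (which is hierarchical across factors); it is the textbook normal-normal case with a zero-mean prior α ~ N(0, τ²) and an estimate α̂ | α ~ N(α, s²/T), where τ² and s² are hypothetical values.

```python
import numpy as np

def posterior_alpha(alpha_hat, s2, T, tau2):
    """Posterior mean and variance of alpha under a N(0, tau2) prior and
    alpha_hat | alpha ~ N(alpha, s2 / T). The data weight tends to 1 as T grows."""
    data_prec = T / s2               # precision of the sample estimate
    prior_prec = 1.0 / tau2          # precision of the zero-alpha prior
    weight = data_prec / (data_prec + prior_prec)
    return weight * alpha_hat, 1.0 / (data_prec + prior_prec)

# The prior dominates in short samples; with enough data it gets ~zero weight
for T in (60, 600, 6000):
    m, v = posterior_alpha(alpha_hat=0.5, s2=4.0, T=T, tau2=0.25)
    print(f"T={T:5d}: posterior mean {m:.3f}, posterior sd {v**0.5:.3f}")
```

Joint shrinkage across correlated factors works analogously, with the scalar precisions replaced by precision matrices; that cross-factor structure is what pulls alpha estimates toward each other as well as toward zero.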
4. ASYMPTOTIC THEORY
Three main asymptotic schemes have emerged in the literature for characterizing the statistical
properties of factor models, risk premia, and alphas. Classical inference relies on the usual large T,
fixed N asymptotics. This remains the most common setup in asset pricing. The second scheme
allows both N and T to increase to ∞ (with some rate restrictions). The third scheme adopts a
large N, fixed T design. There are pros and cons with each scheme that should be considered when
conducting inference. We illustrate this point with several examples here.
Consider, for example, the large N, large T scheme. There, it is straightforward to show that the asymptotic variances of both OLS and (infeasible) GLS share the form
$$\mathrm{Avar}(\hat{\gamma}) = T^{-1}\Sigma_v + O(N^{-1}T^{-1}). \tag{45}$$
Heuristically, we see that when N is large, there is no need to worry about estimating a large covariance matrix Σ_u or making a Shanken adjustment. Moreover, both OLS and infeasible GLS are asymptotically equivalent to the sample mean estimator f̄, regardless of whether f is tradable or not. All these estimators achieve the same asymptotic variance, Σ_v/T. In this regard, adopting the large N, large T scheme greatly simplifies the inference on γ!
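A quick Monte Carlo check of this equivalence, under hypothetical parameter values and with betas treated as known for brevity: for large N, the cross-sectional OLS estimator of γ should be essentially as variable as the factor sample mean, with variance close to Σ_v/T.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, n_sim = 500, 240, 500
gamma, sigma_v, sigma_u = 0.5, 2.0, 1.0
beta = rng.normal(1.0, 0.3, size=N)          # known betas (an assumption, for brevity)

g_csr = np.empty(n_sim)
g_bar = np.empty(n_sim)
for s in range(n_sim):
    f = gamma + sigma_v * rng.standard_normal(T)           # tradable factor, E[f] = gamma
    r = np.outer(f, beta) + sigma_u * rng.standard_normal((T, N))
    g_csr[s] = beta @ r.mean(axis=0) / (beta @ beta)       # cross-sectional OLS
    g_bar[s] = f.mean()                                    # sample-mean estimator

print("Var(gamma_csr):", round(g_csr.var(), 5))
print("Var(f_bar):   ", round(g_bar.var(), 5))
print("Sigma_v / T:  ", round(sigma_v**2 / T, 5))
```

The cross-sectional estimator's extra variance term is of order σ_u²/(T β′β) = O(N^{-1}T^{-1}), which is negligible here, exactly as Equation 45 indicates.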
Similarly, in light of the aforementioned relationship between γ (Equation 24) and b (Equation 31, so that b = Σ_v^{-1}γ), as well as Equation 45, we can heuristically derive the asymptotic variance of b̂ for both OLS and (infeasible) GLS in the large N, large T setting by applying the delta method.
When factors are latent and estimated by PCA, the true factors and their risk premia are identified only up to an invertible rotation H, and the risk premia estimator satisfies
$$\hat{\gamma} - H\gamma = H\bar{v} + O_P(N^{-1} + T^{-1}).$$
Even though these estimated factors cannot be interpreted, which is a major drawback of any latent factor model, Giglio & Xiu (2021) show that these factors serve as controls that facilitate the inference on γ_g = ηγ, which can be identified and hence interpreted for any factor of interest, g_t.
With respect to alphas, Giglio, Liao & Xiu (2021) show that the alpha estimates satisfy
$$\sigma_{i,NT}^{-1}(\hat{\alpha}_i - \alpha_i) \xrightarrow{d} \mathcal{N}(0, 1), \qquad \sigma_{i,NT}^2 = \frac{1}{T}\,\mathrm{Var}\!\left(u_{it}\left(1 - v_t'\Sigma_v^{-1}\gamma\right)\right) + \frac{1}{N}\,\mathrm{Var}(\alpha_i)\,\beta_i' S_\beta^{-1}\beta_i, \tag{46}$$
for each i ≤ N as N, T → ∞, where S_β = N^{-1}β′M_{ι_N}β. The second term is O_P(N^{-1}), suggesting that α̂_i is inconsistent if N is finite. This formula holds whether factors are observable or
latent. If T log N = o(N), the second term diminishes sufficiently fast that one only needs the first
term in Equation 46 to construct p-values for each individual alpha.
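Under that rate condition, a feasible per-asset p-value therefore uses only the first term of Equation 46. The sketch below simulates its inputs; in an actual application, the residuals u_it, the factor innovations v_t, and the estimates γ̂ and α̂ would all come from the estimation step in Equation 44.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
N, T, K = 300, 500, 2
gamma = np.array([0.4, 0.2])                   # risk premia (estimates, in practice)
v = 0.8 * rng.standard_normal((T, K))          # factor innovations (demeaned factors)
u = rng.standard_normal((T, N))                # idiosyncratic residuals
alpha_hat = 0.05 * rng.standard_normal(N)      # stand-in alpha estimates

Sigma_v_inv = np.linalg.inv(np.cov(v, rowvar=False))
w = 1.0 - v @ Sigma_v_inv @ gamma              # weights 1 - v_t' Sigma_v^{-1} gamma
sigma2 = (u * w[:, None]).var(axis=0) / T      # first term of Equation 46, asset by asset
pvals = 2 * (1 - norm.cdf(np.abs(alpha_hat / np.sqrt(sigma2))))

print("five smallest p-values:", np.sort(pvals)[:5].round(4))
```

These per-asset p-values are exactly the inputs on which FDR procedures such as Benjamini-Hochberg (sketched earlier) operate.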
A critical assumption behind the above analysis is that all factors are pervasive. While this
assumption is widely adopted in modern factor analysis (e.g., Bai 2003) due to its simplicity and
convenience, it is often in conflict with empirical evidence. If this assumption is violated, factors
and their risk exposures may not be discovered by PCA.
There is a growing strand of econometrics literature on weak factor models. Bai & Ng (2008) argue that the properties of idiosyncratic errors should be considered when constructing principal components. When factors are sufficiently weak, they may not even be statistically distinguishable from idiosyncratic noise [see theoretical results by Onatski (2009, 2012) in similar weak factor models]. In that case, no estimator can be consistent for either risk premia or the SDF.
weak factor models]. In that case, no estimator can be consistent for either risk premia or the SDF.
Lettau & Pelger (2020a) show that risk-premia PCA does not consistently recover the SDF, but its SDF estimate is more highly correlated with the true SDF than the one obtained from standard PCA. Rather than
focusing on this extreme case of weak factors, Giglio, Xiu & Zhang (2021) develop asymptotic
theory covering a whole range of factor weaknesses, which permits consistent estimation of factors,
risk premia, and the SDF. Formally, they allow for the case in which the minimum eigenvalues of
the factor component in the covariance matrix of returns diverge, whereas the largest eigenvalue
due to the idiosyncratic errors is bounded. In this general setup, a weak factor problem arises if
and only if N/[λ_min(β′β)T] does not vanish asymptotically, in which case the three-pass estimator of Giglio & Xiu (2021), ridge or partial least squares estimators, and the risk-premia PCA estimator of Lettau & Pelger (2020a) all give biased risk premium estimates, whereas the supervised PCA estimator of Giglio, Xiu & Zhang (2021) still works.
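The detection problem can be visualized through the covariance spectrum. In the stylized sketch below, a pervasive factor has loadings on all N assets, so β′β grows like N, while a weak factor loads on only about √N of them (one simple way to model weakness); its eigenvalue then sits far below the strong factor's and much closer to the idiosyncratic bulk.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 1000, 500

beta_strong = rng.normal(1.0, 0.5, size=N)      # pervasive: beta'beta grows like N
beta_weak = np.zeros(N)
n_exposed = int(np.sqrt(N))                     # weak: only ~sqrt(N) assets exposed
beta_weak[:n_exposed] = rng.normal(1.0, 0.5, size=n_exposed)

f = rng.standard_normal((T, 2))                 # two uncorrelated factors
r = f @ np.vstack([beta_strong, beta_weak]) + rng.standard_normal((T, N))

eigvals = np.linalg.eigvalsh(np.cov(r, rowvar=False))[::-1]
print("top 5 eigenvalues:", eigvals[:5].round(1))          # strong, weak, then the bulk
print("beta'beta: strong", round(beta_strong @ beta_strong, 1),
      "| weak", round(beta_weak @ beta_weak, 1))
```

Sparser exposure (e.g., N^{1/4} assets) pushes the weak factor's eigenvalue into the noise bulk, which is the regime in which no estimator can separate it from idiosyncratic variation.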
5. CONCLUSION
Factor models have historically been the workhorse framework for empirical analysis in asset
pricing. In this review, we survey the next generation of factor models, with an emphasis on methods that fuse the economic structure of factor pricing with modern statistical machine learning.
Machine learning factor models are one such example of this fusion. Almost all leading theoretical asset pricing models predict a low-dimensional factor structure in asset prices. Where these
models differ is in their predictions regarding the identity of the common factors. Much of the
frontier work in empirical asset pricing can be viewed as using the (widely agreed upon) factor
structure skeleton as a theory-based construct within which various machine learning schemes
are injected to conduct an open-minded investigation into the economic nature of the common
factors.
Our survey is inevitably selective and disproportionately influenced by our own research on
these topics. We have mainly focused on methodological contributions, leaving a detailed review of
empirical discoveries via these methodologies for future work. A frequently discussed dichotomy
in the literature is observable factor versus latent factor models. While some of the methods we
discuss apply to observable factor settings (or hybrid settings), we have also skewed our coverage
in favor of latent factor methods given the growing emphasis on them in the literature. In addi-
tion, we have focused on statistical frameworks as opposed to theoretical economic underpinnings
or specifications implied by structural models (which mirrors the emphasis in the literature as a
whole).
This area of research is evolving quickly, and there is a myriad of opportunities for improve-
ments and new directions. The first CAPM-based return factor models were analyzed to test
specific predictions of theoretical models. In the time since, the research pendulum has swung
far in the opposite direction toward purely statistical model formulations with little connection
to theory. Perhaps the most important direction for future research is to reestablish the link
between asset pricing theory and empirical models of returns. Machine learning, through its abil-
ity to cast a wide net for detecting the underlying determinants of return behavior, can be a critical
tool for this endeavor. To do so, it will need to focus more squarely on integrating the behavior
of returns with data on fundamental microeconomic and macroeconomic activity and cash flows
as well as emphasizing the economic interpretability of the associations it finds. Along these lines,
new theories in behavioral finance present opportunities to marry returns with more readily avail-
able nonprice data such as survey responses and textual narratives. Machine learning methods can
be a key ingredient in deriving the empirical map between prices and the beliefs of economic
agents encoded in these nonstandard data sources. Another important research direction is to
take seriously structural change in financial markets and asset returns. How should our return
models accommodate structural evolution in the economy, regulatory and political regime shifts,
and financial technological progress? How can we capture the subtle return dynamics of more
gradual economic feedback mechanisms, for example, alpha decay emerging from learning and
competition effects in markets?
LITERATURE CITED
Ahn DH, Conrad J, Dittmar RF. 2009. Basis assets. Rev. Financ. Stud. 22(12):5133–74
Aït-Sahalia Y, Jacod J, Xiu D. 2021. Inference on risk premia in continuous-time asset pricing models. Work. Pap.,
Univ. Chicago, Chicago, IL. https://dachxiu.chicagobooth.edu/download/RPContTime.pdf
Anatolyev S, Mikusheva A. 2022. Factor models with many assets: strong factors, weak factors, and the two-pass
procedure. J. Econom. 229(1):103–26
Ang A, Hodrick R, Xing Y, Zhang X. 2006. The cross-section of volatility and expected returns. J. Finance
61:259–99
Ang A, Liu J, Schwarz K. 2020. Using individual stocks or portfolios in tests of factor models. J. Financ. Quant.
Anal. 55:709–50
Bai J. 2003. Inferential theory for factor models of large dimensions. Econometrica 71(1):135–71
Bai J, Ng S. 2002. Determining the number of factors in approximate factor models. Econometrica 70:191–221
Bai J, Ng S. 2008. Forecasting economic time series using targeted predictors. J. Econom. 146(2):304–17
Bailey N, Kapetanios G, Pesaran MH. 2021. Measurement of factor strength: theory and practice. J. Appl.
Econom. 36(5):587–613
Bajgrowicz P, Scaillet O. 2012. Technical trading revisited: false discoveries, persistence tests, and transaction
costs. J. Financ. Econ. 106(3):473–91
Baldi P, Hornik K. 1989. Neural networks and principal component analysis: learning from examples without
local minima. Neural Netw. 2(1):53–58
Bansal R, Yaron A. 2004. Risks for the long run: a potential resolution of asset pricing puzzles. J. Finance
59(4):1481–509
Barillas F, Kan R, Robotti C, Shanken J. 2020. Model comparison with Sharpe ratios. J. Financ. Quant. Anal.
55(6):1840–74
Barillas F, Shanken J. 2017. Which alpha? Rev. Financ. Stud. 30(4):1316–38
Barillas F, Shanken J. 2018. Comparing asset pricing models. J. Finance 73(2):715–54
Barras L, Scaillet O, Wermers R. 2010. False discoveries in mutual fund performance: measuring luck in
estimated alphas. J. Finance 65(1):179–216
Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to
multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57(1):289–300
Benjamini Y, Yekutieli D. 2001. The control of the false discovery rate in multiple testing under dependency.
Ann. Stat. 28(4):1165–88
Brandt MW, Santa-Clara P, Valkanov R. 2009. Parametric portfolio policies: exploiting characteristics in the
cross-section of equity returns. Rev. Financ. Stud. 22(9):3411–47
Bryzgalova S. 2015. Spurious factors in linear asset pricing models. Work. Pap., Stanford Univ., Stanford, CA
Bryzgalova S, Huang J, Julliard C. 2022. Bayesian solutions for the factor zoo: We just ran two quadrillion models.
Work. Pap., Lond. Sch. Econ. Political Sci., London, UK. https://personal.lse.ac.uk/julliard/papers/
BSftFT.pdf
Bryzgalova S, Pelger M, Zhu J. 2020. Forest through the trees: building cross-sections of asset returns. SSRN Work.
Pap. https://dx.doi.org/10.2139/ssrn.3493458
Büchner M, Kelly BT. 2022. A factor model for option returns. J. Financ. Econ. 143(3):1140–61
Campbell JY, Cochrane JH. 1999. By force of habit: a consumption-based explanation of aggregate stock
market behavior. J. Political Econ. 107(2):205–51
Chamberlain G, Rothschild M. 1983. Arbitrage, factor structure, and mean-variance analysis on large asset
markets. Econometrica 51:1281–304
Chen AY. 2021. The limits of p-hacking: some thought experiments. J. Finance 76(5):2447–80
Chen AY, Zimmermann T. 2022. Open source cross-sectional asset pricing. Crit. Finance Rev. 11(2):207–64
Chen L, Pelger M, Zhu J. 2019. Deep learning in asset pricing. Work. Pap., Stanford Univ., Stanford, CA
Jegadeesh N, Noh J, Pukthuanthong K, Roll R, Wang J. 2019. Empirical tests of asset pricing models with individual assets: resolving the errors-in-variables bias in risk premium estimation. J. Financ. Econ. 133(2):273–98
Jensen TI, Kelly B, Pedersen LH. 2021. Is there a replication crisis in finance? J. Finance. In press
Jiang J, Kelly B, Xiu D. 2021. (Re-)Imag(in)ing price trends. SSRN Work. Pap. https://dx.doi.org/10.2139/
ssrn.3756587
Kan R, Robotti C. 2009. Model comparison using the Hansen-Jagannathan distance. Rev. Financ. Stud.
22(9):3449–90
Kan R, Robotti C, Shanken J. 2013. Pricing model performance and the two-pass cross-sectional regression
methodology. J. Finance 68(6):2617–49
Kan R, Zhang C. 1999. Two-pass tests of asset pricing models with useless factors. J. Finance 54(1):203–35
Kass RE, Raftery AE. 1995. Bayes factors. J. Am. Stat. Assoc. 90(430):773–95
Ke T, Kelly B, Xiu D. 2019. Predicting returns with text data. Work. Pap. 2019-69, Univ. Chicago, Chicago, IL.
https://bfi.uchicago.edu/wp-content/uploads/BFI_WP_201969.pdf
Kelly B, Moskowitz T, Pruitt S. 2021. Understanding momentum and reversal. J. Financ. Econ. 140(3):726–43
Kelly B, Palhares D, Pruitt S. 2021. Modeling corporate bond returns. J. Finance. In press
Kelly B, Pruitt S. 2013. Market expectations in the cross-section of present values. J. Finance 68(5):1721–56
Kelly B, Pruitt S, Su Y. 2019. Characteristics are covariances: a unified model of risk and return. J. Financ. Econ.
134(3):501–24
Kelly B, Pruitt S, Su Y. 2020. Instrumented principal component analysis. SSRN Work. Pap. https://dx.doi.org/
10.2139/ssrn.2983919
Kim S, Korajczyk RA, Neuhierl A. 2021. Arbitrage portfolios. Rev. Financ. Stud. 34(6):2813–56
Kingma D, Ba J. 2014. Adam: a method for stochastic optimization. arXiv:1412.6980 [cs.LG]
Kleibergen F. 2009. Tests of risk premia in linear factor models. J. Econom. 149(2):149–73
Koijen R, Van Nieuwerburgh S. 2011. Predictability of returns and cash flows. Annu. Rev. Financ. Econ. 3:467–91
Korsaye SA, Quaini A, Trojani F. 2019. Smart SDFs. Work. Pap., Univ. Geneva, Geneva, Switz.
Kosowski R, Timmermann A, Wermers R, White H. 2006. Can mutual fund “stars” really pick stocks? New
evidence from a bootstrap analysis. J. Finance 61(6):2551–95
Kozak S, Nagel S, Santosh S. 2018. Interpreting factor models. J. Finance 73(3):1183–223
Kozak S, Nagel S, Santosh S. 2020. Shrinking the cross section. J. Financ. Econ. 135(2):271–92
Lamont OA. 2001. Economic tracking portfolios. J. Econom. 105(1):161–84
Lettau M, Pelger M. 2020a. Estimating latent asset-pricing factors. J. Econom. 218(1):1–31
Lettau M, Pelger M. 2020b. Factors that fit the time series and cross-section of stock returns. Rev. Financ. Stud.
33(5):2274–325
Lewellen J. 2015. The cross-section of expected stock returns. Crit. Finance Rev. 4(1):1–44
Lewellen J, Nagel S, Shanken J. 2010. A skeptical appraisal of asset pricing tests. J. Financ. Econ. 96(2):175–94
Lo AW, MacKinlay AC. 1990. Data-snooping biases in tests of financial asset pricing models. Rev. Financ. Stud.
3(3):431–67
Pukthuanthong K, Roll R, Subrahmanyam A. 2019. A protocol for factor identification. Rev. Financ. Stud. 32(4):1573–607
Rapach D, Strauss JK, Zhou G. 2013. International stock return predictability: What is the role of the United States? J. Finance 68(4):1633–62