Lecture 05
https://www.lhpedersen.com/big-data-asset-pricing
Exercises
1. Beta-dollar neutral portfolios
2. Construct value factors
3. Factor replication analysis
4. High-dimensional return prediction
5. Research proposal
Overview of this Lecture
             Terminology                          Notation
             Statistics              ML           Stat/ML     Finance
Inputs       Independent variables   Features     $x_i$       $s_t^i$
Outputs      Dependent variable      Response     $y_i$       $r_{t+1}^i$
I Two types of supervised learning: classification vs. regression
I Ridge regression
I Objective: minimize $\frac{1}{T}(y - x\beta)'(y - x\beta) + \lambda \beta'\beta$
I Solution: $\hat{\beta}^{\text{ridge}} = \left(\frac{1}{T}x'x + \lambda I\right)^{-1}\frac{1}{T}x'y$
I But, how do we choose (or “tune”) λ?
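As a concrete illustration of the formula above, here is a minimal numpy sketch of the closed-form ridge solution; the function name and the simulated data are placeholders, not part of the lecture material.

```python
import numpy as np

def ridge_closed_form(x, y, lam):
    """Closed-form ridge estimate: (x'x/T + lam*I)^{-1} (x'y/T)."""
    T, K = x.shape
    return np.linalg.solve(x.T @ x / T + lam * np.eye(K), x.T @ y / T)

# Toy data, purely illustrative
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 10))
y = x[:, :3] @ np.full(3, 0.5) + rng.standard_normal(500)
print(ridge_closed_form(x, y, lam=0.1))
```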
Parameters vs. Hyper-Parameters (or Tuning Parameters)
Source: Hastie et al. (2009). The light blue curves show the training error; the light red curves show the conditional test error.
I So, an intermediate λ is optimal, but how to choose it?
Training, Validation, and Testing
For details, see the excellent (and freely available) books by Hastie et al. (2009) and Efron and Hastie (2021)
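In the spirit of this slide, one simple way to tune λ is to hold out a validation sample: fit the ridge estimator on the training data for each candidate λ and keep the λ with the lowest validation error, leaving the test data untouched until the very end. The sketch below is self-contained, with simulated data and illustrative candidate values.

```python
import numpy as np

def ridge_closed_form(x, y, lam):
    """Closed-form ridge estimate, as on the earlier slide."""
    T, K = x.shape
    return np.linalg.solve(x.T @ x / T + lam * np.eye(K), x.T @ y / T)

rng = np.random.default_rng(0)
x = rng.standard_normal((600, 10))
y = x[:, :3] @ np.full(3, 0.5) + rng.standard_normal(600)

split = int(0.7 * len(y))                        # chronological train/validation split
x_tr, y_tr, x_va, y_va = x[:split], y[:split], x[split:], y[split:]

lambdas = 10.0 ** np.arange(-4, 2)               # candidate hyper-parameters
val_mse = [np.mean((y_va - x_va @ ridge_closed_form(x_tr, y_tr, lam)) ** 2)
           for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_mse))]      # chosen on validation, not on the test set
print(best_lam)
```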
I Alternative objective:
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{T}(y - x\beta)'(y - x\beta) \quad \text{subject to} \quad \sum_j |\beta_j| \le \bar{\lambda}$$
I Cf. best subset: use OLS for a subset of the x variables (less used)
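A hedged scikit-learn sketch of fitting the lasso. Note that scikit-learn solves the penalized (Lagrangian) form, $\frac{1}{2T}\|y - x\beta\|^2 + \alpha \sum_j |\beta_j|$, which corresponds to the constrained form above for some value of $\bar{\lambda}$; the data and the choice $\alpha = 0.05$ are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 10))
y = x[:, :3] @ np.full(3, 0.5) + rng.standard_normal(500)

lasso = Lasso(alpha=0.05).fit(x, y)   # penalized form, equivalent to the constrained form
print(lasso.coef_)                    # several coefficients are set exactly to zero
```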
Penalized Regressions: Comparison
Source: Hastie et al. (2009). Lasso regression (left) and ridge regression (right). Red lines show contours of the
prediction errors, $(y - x\beta)'(y - x\beta)$. The solid blue areas show, respectively, the constraint regions
$\sum_j |\beta_j| \le \bar{\lambda}$ and $\sum_j \beta_j^2 \le \bar{\lambda}$
I Differences
I Ridge shrinks all parameters, but generically all parameters are non-zero
I Lasso sets some parameters to zero, here β1 = 0, and shrinks the rest
Source: Hastie et al. (2009). The x-axis has the OLS estimate, $\hat{\beta}$, and the y-axis has the respective penalized
regression estimates.
I Differences
I Best subset: once a variable is included, there is no shrinkage in its beta
I Ridge: proportional shrinkage, with one parameter:
$$\hat{\beta}^{\text{ridge}} = \left(\tfrac{1}{T}x'x + \lambda I\right)^{-1}\tfrac{1}{T}x'y = \frac{\hat{\beta}^{\text{OLS}}}{\tilde{\lambda}}, \quad \text{where } \tilde{\lambda} = 1 + \lambda\Big/\left(\tfrac{1}{T}x'x\right)$$
I Lasso: once beta is non-zero, it is shrunk by a constant, e.g. if β > 0
$$\hat{\beta}^{\text{lasso}} = \left(\tfrac{1}{T}x'x\right)^{-1}\left(\tfrac{1}{T}x'y - \tfrac{1}{2}\lambda\right) = \hat{\beta}^{\text{OLS}} - \tilde{\lambda}, \quad \text{where } \tilde{\lambda} = \tfrac{1}{2}\lambda\Big/\left(\tfrac{1}{T}x'x\right)$$
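A small numpy check of the two single-regressor formulas above, proportional shrinkage for ridge and shrinkage by a constant for lasso, using simulated data; the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1_000
x = rng.standard_normal(T)
y = 0.8 * x + rng.standard_normal(T)

sxx = x @ x / T                  # (1/T) x'x
sxy = x @ y / T                  # (1/T) x'y
beta_ols = sxy / sxx

lam = 0.2
beta_ridge = beta_ols / (1 + lam / sxx)            # proportional shrinkage
beta_lasso = max(beta_ols - 0.5 * lam / sxx, 0.0)  # shrink by a constant (zero if it crosses zero)
print(beta_ols, beta_ridge, beta_lasso)
```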
Penalized Regressions: Elastic-Net
I Elastic-net regression:
$$\hat{\beta}^{\text{elastic-net}} = \arg\min_{\beta} \; \frac{1}{T}(y - x\beta)'(y - x\beta) + \lambda \sum_j \left( a\beta_j^2 + (1 - a)|\beta_j| \right)$$
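A hedged scikit-learn sketch. Its ElasticNet uses an (alpha, l1_ratio) parametrization in which l1_ratio weights the $|\beta_j|$ term and the squared term carries an extra factor of 1/2, so l1_ratio plays roughly the role of $1 - a$ in the slide's notation; data and tuning values are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 10))
y = x[:, 0] - 0.5 * x[:, 1] + rng.standard_normal(500)

# l1_ratio weights the L1 part; (1 - l1_ratio) weights the (halved) squared part
enet = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(x, y)
print(enet.coef_)
```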
Source: Hastie et al. (2009). Top right panel shows a partition of a two-dimensional feature space by recursive
binary splitting, as used in CART. Top left panel shows a general partition that cannot be obtained from
recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right
panel, and a perspective plot of the prediction surface appears in the bottom right panel.
I Greedy algorithm
I solve the problem in stages, making the locally optimal choice at each stage
I does not necessarily produce a globally optimal solution, but is fast
I Here: always split each region (or branch) into two (or leave as is)
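A minimal numpy sketch of one greedy step of recursive binary splitting: for each feature and candidate threshold, split the sample into two regions, predict each region by its mean, and keep the split with the smallest total squared error. The function name and simulated data are illustrative.

```python
import numpy as np

def best_split(x, y):
    """One greedy CART step: the (feature, threshold) split minimizing total SSE."""
    best = (None, None, np.inf)
    for j in range(x.shape[1]):
        for thr in np.unique(x[:, j])[:-1]:        # candidate thresholds for feature j
            left, right = y[x[:, j] <= thr], y[x[:, j] > thr]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, thr, sse)
    return best  # locally optimal; a tree then recurses on each of the two regions

rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 3))
y = np.where(x[:, 0] > 0.5, 1.0, 0.0) + 0.1 * rng.standard_normal(200)
print(best_split(x, y))
```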
Regression Trees: Empirical Example from Asset Pricing
Use regression tree to predict excess returns
I Sample: Non-microcap stocks in the US
I Features: Beta, book-to-market, market equity, profitability, and asset
growth, all standardized to cross-sectional percentiles
I Hyper-parameters: Max depth=3 and no split if leaf has <10% of obs
I Each circle contains that node’s
I average excess return
I percent of observations
[Tree diagram: the root node (avg. excess return 0.0072, 100% of obs.) splits on Asset_growth >= 0.86; the "yes" branch (0.0027, 14% of obs.) splits next on Profitability < 0.17, and the "no" branch (0.0079, 86% of obs.) splits next on Size >= 0.75.]
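A hedged sketch of how such a tree could be fit with scikit-learn. The panel below is simulated as a stand-in for the actual stock-level data, the column names are illustrative, and min_samples_leaf=0.10 approximates the 10%-of-observations rule on the slide.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Simulated stand-in for a panel of non-microcap US stocks: features are
# cross-sectional percentiles in [0, 1]; the target is the excess return.
rng = np.random.default_rng(0)
n = 5_000
features = ["beta", "book_to_market", "market_equity", "profitability", "asset_growth"]
x = pd.DataFrame(rng.uniform(size=(n, 5)), columns=features)
y = 0.01 * x["book_to_market"] - 0.01 * x["asset_growth"] + 0.02 * rng.standard_normal(n)

# Max depth 3 and no leaf with fewer than 10% of the observations, as on the slide
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=0.10).fit(x, y)
print(export_text(tree, feature_names=features))
```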
I Advantages
I Relatively simple and easy to interpret
I Insensitive to monotone transformations of predictors
I Can capture interactions of predictors (up to depth minus 1)
I Can handle missing feature values
I Disadvantages
I Not smooth
I Possible instability and sensitivity to a few observations
I I.e., high variance
I Bagging = bootstrap aggregating
I ML method to improve the stability and accuracy of a fitted model
I Model averaging of models fitted to bootstrap samples
I Random forest = bagging applied to regression trees
I How to make a random forest:
I Draw b = 1, ..., B different bootstrap samples of the data
I Fit a separate regression tree to each bootstrap b
I To make these trees less correlated, each split is based on a
random subset of features
I Average the forecasts coming from each tree
$$\hat{f}^{\text{random forest}}(x) = \frac{1}{B}\sum_b \hat{f}_{\text{tree},b}(x)$$
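The averaging formula above is essentially what scikit-learn's RandomForestRegressor implements: B trees fit to bootstrap samples, a random subset of features considered at each split (max_features), and the tree forecasts averaged. A minimal sketch on simulated data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.uniform(size=(2_000, 5))
y = np.where(x[:, 0] > 0.5, 1.0, 0.0) + 0.5 * rng.standard_normal(2_000)

forest = RandomForestRegressor(
    n_estimators=500,   # B bootstrap samples / trees
    max_features=2,     # random subset of features considered at each split
    random_state=0,
).fit(x, y)
print(forest.predict(x[:5]))   # each forecast is the average over the 500 trees
```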
$$-\frac{\partial L}{\partial \hat{y}_i} = y_i - \hat{y}_i =: \text{res}_i$$
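The formula above states that, for a squared-error loss, the negative gradient is simply the residual. Under that reading, a minimal gradient-boosting sketch repeatedly fits a small tree to the current residuals and adds a damped version of its forecast; the learning rate, tree depth, and simulated data below are illustrative choices, not taken from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(size=(1_000, 3))
y = np.sin(4 * x[:, 0]) + 0.3 * rng.standard_normal(1_000)

pred = np.zeros_like(y)
learning_rate, n_rounds = 0.1, 100
for _ in range(n_rounds):
    res = y - pred                                   # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(x, res)
    pred += learning_rate * stump.predict(x)         # add a damped correction

print(np.mean((y - pred) ** 2))                      # in-sample fit after boosting
```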
$$\hat{f}(x_i) = \theta_0^{(H)} + \sum_i \theta_i^{(H)} a_i^{(H)}$$
I OLS/Ridge/Lasso regression
I When features are on the same scale (i.e., standardized):
I Importance of feature j is |β j |
I otherwise
I Importance of feature j is σ(x j )|β j |
I Problematic with high multi-collinearity (especially OLS)
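A small numpy sketch of the two importance measures above; the function name and data are illustrative.

```python
import numpy as np

def linear_importance(beta, x, standardized=True):
    """|beta_j| if features are standardized, otherwise sigma(x_j) * |beta_j|."""
    beta = np.abs(np.asarray(beta))
    return beta if standardized else x.std(axis=0) * beta

rng = np.random.default_rng(0)
x = rng.standard_normal((300, 4)) * np.array([1.0, 2.0, 0.5, 1.0])  # unequal scales
beta = np.array([0.3, -0.1, 0.8, 0.0])
print(linear_importance(beta, x, standardized=False))
```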
Surrogate model
I Start with prediction, fˆ(x), from a hard-to-interpret model such as a
neural network
I Fit interpretable surrogate model, fˆsurrogate , to prediction
$\hat{f}(x_i) = \hat{f}^{\text{surrogate}}(x_i) + \varepsilon_i$
I E.g., fit regression tree or OLS regression:
$\hat{f}(x_i) = x_i\beta + \varepsilon_i$
I Use standard metrics such as $R^2$ to ensure appropriate fit
I Interpret surrogate model
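A hedged sketch of the surrogate idea, with a random forest standing in for the hard-to-interpret model and a shallow tree as the surrogate fitted to its predictions, checked with $R^2$; all names and data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.uniform(size=(2_000, 5))
y = x[:, 0] * x[:, 1] + 0.2 * rng.standard_normal(2_000)

# Hard-to-interpret model (stand-in for, e.g., a neural network)
black_box = RandomForestRegressor(n_estimators=200, random_state=0).fit(x, y)
f_hat = black_box.predict(x)

# Interpretable surrogate fitted to the black-box predictions, not to y itself
surrogate = DecisionTreeRegressor(max_depth=2).fit(x, f_hat)
print(r2_score(f_hat, surrogate.predict(x)))   # check that the surrogate fits f_hat well
```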
I Notation:
I $SV_j$ is the Shapley value (importance) of feature $j$
I $C$ is a coalition of features
I $C \cup \{j\}$ is the coalition with $j$ added
I $|C|$ is the number of features in the coalition
I $K$ is the number of features, $\mathcal{K} = \{1, ..., K\}$, and $\mathcal{K} \setminus \{j\}$ is the set excluding $j$
I $V(C)$ is the value function that determines the "worth" of coalition $C$,
e.g., the $R^2$ of a model that uses the coalition's features
I The Shapley value for feature j is essentially the avg. marginal contribution of j
when added to a coalition
$$SV_j = \frac{1}{K}\sum_{C \subseteq \mathcal{K}\setminus\{j\}} \binom{K-1}{|C|}^{-1}\left( V(C \cup \{j\}) - V(C) \right)$$
I SV calculation is computationally demanding
I Feasible with OLS with a limited set of features
I But even with OLS the computational burden of SV could be too large
I Recent advances in ML make calculating approximate SVs feasible for
complex models with many features (Lundberg and Lee, 2017)
I For tree-based models, efficient methods exist (Lundberg et al., 2020)
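A brute-force implementation of the Shapley formula above, using the in-sample $R^2$ of an OLS regression on the coalition's features as the value function $V(C)$. As the slide notes, this is only practical for a handful of features; names and simulated data are illustrative.

```python
import numpy as np
from itertools import combinations
from math import comb

def r2_value(x, y, coalition):
    """V(C): in-sample R^2 of an OLS regression of y on the coalition's features."""
    if not coalition:
        return 0.0
    xc = np.column_stack([np.ones(len(y)), x[:, sorted(coalition)]])
    beta, *_ = np.linalg.lstsq(xc, y, rcond=None)
    resid = y - xc @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def shapley_values(x, y):
    K = x.shape[1]
    sv = np.zeros(K)
    for j in range(K):
        others = [k for k in range(K) if k != j]
        for size in range(K):                      # coalition sizes 0, ..., K-1
            for C in combinations(others, size):
                weight = 1.0 / (K * comb(K - 1, size))
                sv[j] += weight * (r2_value(x, y, set(C) | {j}) - r2_value(x, y, C))
    return sv

rng = np.random.default_rng(0)
x = rng.standard_normal((500, 4))
y = x @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(500)
print(shapley_values(x, y))   # brute force: 2^(K-1) coalitions per feature
```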
Applications to Asset Pricing
(WFR portfolios) The maximum OOS cross-sectional $R^2$ attained by a model with n factors (on the x-axis)
across all possible values of the prior root expected $SR^2$ ($\kappa$) for models based on original characteristics
portfolios (solid) and PCs (dashed). Dotted lines depict −1 s.e. bounds of the CV estimator. The sample is
daily from September 1964 to December 2017.
Other Topics in ML
Efron, B. and T. Hastie (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and
Data Science, Volume 6. Cambridge University Press.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. The Review of Financial
Studies 33 (5), 2223–2273.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer series in statistics. Springer.
Jensen, T. I., B. T. Kelly, S. Malamud, and L. H. Pedersen (2022). Machine learning about optimal portfolios.
Working paper, Copenhagen Business School.
Kozak, S., S. Nagel, and S. Santosh (2020). Shrinking the cross-section. Journal of Financial Economics 135 (2),
271–292.
Lundberg, S. M., G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and
S.-I. Lee (2020). From local explanations to global understanding with explainable AI for trees. Nature
Machine Intelligence 2 (1), 2522–5839.
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions. In 31st Conference
on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1–10.