
Big Data Asset Pricing

Lecture 5: Machine Learning in Asset Pricing

Lasse Heje Pedersen


AQR, Copenhagen Business School, CEPR

https://www.lhpedersen.com/big-data-asset-pricing

The views expressed are those of the author and not necessarily those of AQR
Overview of the Course: Big Data Asset Pricing

Lectures
- Quickly getting to the research frontier
  1. A primer on asset pricing
  2. A primer on empirical asset pricing
  3. Working with big asset pricing data (videos)
- Twenty-first-century topics
  4. The factor zoo and replication
  5. Machine learning in asset pricing
  6. Asset pricing with frictions

Exercises
1. Beta-dollar-neutral portfolios
2. Construct value factors
3. Factor replication analysis
4. High-dimensional return prediction
5. Research proposal
Overview of this Lecture

- Overview of machine learning
  - Supervised: regression vs. classification
  - Unsupervised
- How to apply ML
  - Bias-variance trade-off
  - Hyper-parameters
  - Train, validation, test
- Specific ML models in more detail
  - Penalized regressions
  - Regression trees
  - Neural networks
- Feature importance
- Applications to asset pricing
  - Predicting stock returns
  - What factors generate the pricing kernel?
Overview of Machine Learning

Supervised vs. Unsupervised Learning

- Supervised learning: start with inputs and outputs
- Terminology and notation:

                Statistics              ML          Notation (Stat/ML)   Notation (Finance)
    Inputs      Independent variables   Features    x_i                  s_t^i
    Outputs     Dependent variable      Response    y_i                  r_{t+1}^i

- Two types of supervised learning: classification vs. regression

                      Type of outputs   # of outputs   Example
    Classification    Qualitative       Finite         Default / no default
    Regression        Quantitative      Any value      Predict return

- Unsupervised learning: start with just inputs (no outputs)
  - Example: clustering of factors into groups
Classic ML Problem: Supervised Learning via Regression

- Start with data on inputs, x^i ∈ R^K, and outputs, y^i ∈ R
- Training set of observations, i = 1, ..., T
- Want to learn f such that

    y^i = f(x^i) + ε^i

- Think of f as f(x^i) = E(y^i | x^i), a projection, or just a way to guess the output
- Ways to learn about f:

    Assumption                           Statistics                Finance                    ML
    f(x^i) = ave(y^j) for x^j near x^i   Kernel regression         Portfolio sort             Regression trees, random
                                                                                              forest, boosting
    f(x^i) linear                        OLS, GLS                  Fama-MacBeth, regression-  Penalized regression,
                                                                   based mkt timing, signal-  e.g., LASSO, ridge
                                                                   weighted portfolio
    f(x^i) non-linear                    Non-linear least squares,                            Neural networks, deep
                                         GMM, MLE                                             learning
How to Apply ML

Leading Example: Ridge Regression

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- OLS: β̂ = (x'x)^{-1} x'y
- Properties: β̂ = (x'x)^{-1} x'y = β + ((1/T) x'x)^{-1} (1/T) x'ε → β for T → ∞
  - What does T → ∞ mean in practice? That T is large relative to K
  - But K is large when we do ML
  - Estimating many regression coefficients creates a lot of noise
- Gauss-Markov theorem: OLS has the smallest mean squared error of all linear estimators with no bias
  - But we can do better with bias
  - By reducing estimation noise – more on this later
- Ridge regression
  - Objective: minimize (1/T)(y − xβ)'(y − xβ) + λ β'β
  - Solution: β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y
  - But how do we choose (or "tune") λ?
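To make the closed-form ridge estimator above concrete, here is a minimal Python/NumPy sketch; the simulated data and the choice λ = 1 are purely illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate: ((1/T) X'X + lam*I)^(-1) (1/T) X'y."""
    T, K = X.shape
    A = X.T @ X / T + lam * np.eye(K)
    b = X.T @ y / T
    return np.linalg.solve(A, b)

# Simulated example: K regressors, small true coefficients, noisy outcomes
rng = np.random.default_rng(0)
T, K = 500, 50
X = rng.standard_normal((T, K))
beta_true = rng.standard_normal(K) * 0.1
y = X @ beta_true + rng.standard_normal(T)

beta_ols = ridge_closed_form(X, y, lam=0.0)    # lam = 0 recovers OLS
beta_ridge = ridge_closed_form(X, y, lam=1.0)  # shrinks coefficients toward zero
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```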
Parameters vs. Hyper-Parameters (or Tuning Parameters)

- Want to learn f such that (leading example)

    y^i = f(x^i) + ε^i = x^i β + ε^i

- Leading example, ridge regression: β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y
- Parameters and hyper-parameters:
  - Parameters tell us what f is: f̂(x^i) = f(x^i; β̂) = x^i β̂_ridge
  - Hyper-parameters tell us how to find the parameters, e.g., λ
  - Terminology: hyper-parameters = tuning parameters
- Parameters and hyper-parameters: other examples

    Method                 Parameters                    Hyper-parameters
    Penalized regression   Regression coefficients, β    Penalty λ for large β
    Regression tree        Tree splits                   Tree depth, etc.
    Neural network         Coefficients                  Layers, etc.
Bias-Variance Trade-Off

[Figure. Source: Hastie et al. (2009). Light blue curves: training error. Light red curves: conditional test error.]

- Example: ridge regression
  - Smaller λ = greater model complexity = more degrees of freedom
  - More complexity always leads to better fit in the training sample
  - More complexity leads to lower bias, but higher variance
  - Best performance in the test period for intermediate λ
Bias-Variance Trade-Off, continued

- Prediction error:
  - Prediction: ŷ^i = f̂(x^i)
  - Outcome, out-of-sample: y^i = f(x^i) + ε^i
  - Prediction error: f̂(x^i) − y^i
- Expected "loss," L (squared prediction error, or absolute prediction error):

    E[L] = E[prediction error^2] = E[(f̂(x^i) − y^i)^2]
         = E[(f̂(x^i) − f(x^i) − ε^i)^2]
         = σ_ε^2 + E[(f̂(x^i) − f(x^i))^2]
         = σ_ε^2 + (E[f̂(x^i)] − f(x^i))^2 + E[(f̂(x^i) − E[f̂(x^i)])^2]
         = noise + bias^2 + variance

- Noise: from ε, unavoidable, the same regardless of prediction method
- Bias: tendency of the prediction to miss its target even with lots of data
- Variance: Var(f̂(x^i)) comes from noise in parameter estimation
- Bias-variance trade-off:
  - Lower complexity, e.g., shrinking β̂ more toward zero
    - increases bias
    - lowers variance
- So an intermediate λ is optimal, but how do we choose it?
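A small Monte Carlo sketch of the decomposition above, under an assumed linear data-generating process: for each λ, it approximates the bias² and variance of the ridge prediction at a fixed test point. The numbers are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, n_sims = 100, 40, 500
beta = rng.standard_normal(K) * 0.1
x0 = rng.standard_normal(K)                 # fixed test point; target is f(x0) = x0 @ beta

for lam in [0.0, 0.1, 1.0, 10.0]:
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.standard_normal((T, K))
        y = X @ beta + rng.standard_normal(T)
        b_hat = np.linalg.solve(X.T @ X / T + lam * np.eye(K), X.T @ y / T)
        preds[s] = x0 @ b_hat               # f_hat(x0) in simulation s
    bias2 = (preds.mean() - x0 @ beta) ** 2
    var = preds.var()
    print(f"lambda={lam:5.1f}  bias^2={bias2:.4f}  variance={var:.4f}  sum={bias2 + var:.4f}")
```

As λ grows, the bias² term rises while the variance falls; their sum is typically smallest at an intermediate λ.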
Training, Validation, and Testing

- Old-school finance and statistics: 1 sample (everything in-sample)
- Finance and statistics with model assessment: 2 sub-samples
  1. In-sample: estimate parameters
  2. Out-of-sample: assess performance
- ML: 3 sub-samples
  1. Train: estimate parameters for each hyper-parameter
  2. Validation: pick the best-performing hyper-parameter
  3. Test: assess performance
- ML: example of ridge regression
  1. Train: estimate β̂_ridge(λ) = (x'x + λI)^{-1} x'y for each λ
  2. Validation: pick λ̂ with small loss, L[y^i − x^i β̂_ridge(λ̂)]
  3. Test: assess performance of β̂_ridge(λ̂)
Cross-Validation

- k-fold cross-validation
  - Data are scarce – we want to use the full train+validation sample for both
  - Split the train+validation sample into k folds (subsets), but note:
    - k is not the same as the number of features K
    - In finance, ensure that each whole cross-section is in the same fold,
      but standard cross-validation functions distribute data randomly
- Example: 5-fold cross-validation

    Split 1:   Validate   Train      Train      Train      Train
    Split 2:   Train      Validate   Train      Train      Train
    Split 3:   Train      Train      Validate   Train      Train
    Split 4:   Train      Train      Train      Validate   Train
    Split 5:   Train      Train      Train      Train      Validate

- Method:
  1. Train k times: for each split j and each hyper-parameter,
     - estimate parameters using the training data in all folds except j
     - compute the prediction error in the j'th fold, i.e., the j'th validation period
  2. Cross-validate: choose the hyper-parameters
     - that minimize prediction errors across all validation periods
  3. Retrain: estimate parameters using the full train+validation data
  4. Test: assess performance (as always)
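A minimal sketch of tuning the ridge penalty by grouped k-fold cross-validation with scikit-learn, where GroupKFold keeps each whole cross-section (here, a month) inside a single fold; the simulated panel and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, GridSearchCV

# Illustrative panel: one row per (month, stock), features already standardized
rng = np.random.default_rng(0)
n_months, n_stocks, n_feat = 60, 100, 10
df = pd.DataFrame(rng.standard_normal((n_months * n_stocks, n_feat)),
                  columns=[f"x{k}" for k in range(n_feat)])
df["month"] = np.repeat(np.arange(n_months), n_stocks)
df["excess_ret"] = df["x0"] * 0.02 + rng.standard_normal(len(df)) * 0.1

X = df[[f"x{k}" for k in range(n_feat)]]
y, groups = df["excess_ret"], df["month"]

# GroupKFold keeps each whole cross-section (month) inside a single fold
cv = GroupKFold(n_splits=5)
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y, groups=groups)
print("chosen penalty:", search.best_params_)
```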
Another Validation Method that Reuses the Data

- When the data have a time-series dimension,
  - finance papers usually keep the temporal order such that the validation set comes after the training set
  - e.g., in asset pricing, people often use rolling estimation
  - it is not necessarily wrong to reverse the order
- First do the standard ML split into 3 sub-samples
  1. Train
  2. Validation
  3. Test
- Then proceed in these steps
  1. Train: estimate β̂ for each λ in the training sample
  2. Validation: pick the best-performing λ̂
  3. Possibly re-train:
     - Estimate β̂ for λ̂ in the joint sample: training + validation
     - NB: re-training requires that λ̂ is "immune" to the sample length;
       e.g., the T matters in min_β (1/T)(y − xβ)'(y − xβ) + λ̂ β'β
  4. Test: assess performance
Expanding Training Set

In finance, people often use a rolling or expanding training set:
- Expanding: always start from the beginning of the sample
- Rolling: use a fixed number of years (discarding the oldest data)

[Figure: illustration of an expanding training set]
Specific ML Models: Penalized Regressions

For details, see the excellent and free books Hastie et al. (2009) and Efron and Hastie (2021)
Penalized Regressions: Lasso

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- Ridge regression: we already saw this
- Lasso regression:
  - Objective:

      β̂_lasso = argmin_β { (1/T)(y − xβ)'(y − xβ) + λ Σ_j |β_j| }

  - Alternative objective:

      β̂_lasso = argmin_β (1/T)(y − xβ)'(y − xβ)   subject to   Σ_j |β_j| ≤ λ̄

- Cf. best subset: use OLS for a subset of the x variables (less used)
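A minimal scikit-learn sketch of the lasso's variable selection; note that sklearn's Lasso minimizes (1/(2T))·(y − xβ)'(y − xβ) + α Σ_j |β_j|, which matches the slide's objective up to the scaling of the penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated data where only a few features truly matter
rng = np.random.default_rng(0)
T, K = 400, 30
X = rng.standard_normal((T, K))
beta_true = np.zeros(K)
beta_true[:3] = [0.5, -0.3, 0.2]          # sparse truth
y = X @ beta_true + rng.standard_normal(T)

# alpha plays the role of the penalty lambda (up to scaling conventions)
fit = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
```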
Penalized Regressions: Comparison

[Figure. Source: Hastie et al. (2009). Lasso regression (left) and ridge regression (right). Red lines show contours
of the prediction errors, (y − xβ)'(y − xβ). The solid blue areas show the constraint sets Σ_j |β_j| ≤ λ̄ and
Σ_j β_j^2 ≤ λ̄, respectively.]

- Differences
  - Ridge shrinks all parameters, but generically all parameters are non-zero
  - Lasso sets some parameters to zero (here β_1 = 0) and shrinks the rest
Penalized Regressions: Comparison, continued

[Figure. Source: Hastie et al. (2009). The x-axis shows the OLS estimate, β̂_OLS, and the y-axis shows the
respective penalized regression estimates.]

- Differences
  - Best subset: once a variable is included, there is no shrinkage in its beta
  - Ridge: proportional shrinkage, with one parameter:

      β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y = β̂_OLS / λ̃,   where λ̃ = 1 + λ/((1/T) x'x)

  - Lasso: once beta is non-zero, it is shrunk by a constant, e.g., if β > 0:

      β̂_lasso = ((1/T) x'x)^{-1} ((1/T) x'y − λ/2) = β̂_OLS − λ̃,   where λ̃ = λ / (2 (1/T) x'x)
Penalized Regressions: Elastic-Net

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- Elastic-net regression:

    β̂_elastic-net = argmin_β { (1/T)(y − xβ)'(y − xβ) + λ Σ_j [ a β_j^2 + (1 − a)|β_j| ] }

- Two tuning parameters:
  - λ ≥ 0 controls the amount of regularization
  - a ∈ [0, 1] controls the mix of ridge vs. lasso
- Some betas are set to zero, and all are shrunk
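For completeness, the same kind of sketch with scikit-learn's ElasticNet; in its parameterization, alpha plays the role of the overall penalty λ and l1_ratio the weight on the |β_j| term, i.e., roughly (1 − a) in the slide's notation (up to scaling conventions).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
T, K = 400, 30
X = rng.standard_normal((T, K))
beta_true = np.r_[0.5, -0.3, 0.2, np.zeros(K - 3)]
y = X @ beta_true + rng.standard_normal(T)

# alpha ~ lambda (overall penalty); l1_ratio ~ weight on the |beta_j| term,
# i.e., roughly (1 - a) in the slide's notation
fit = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print("coefficients set to zero:", int(np.sum(fit.coef_ == 0)))
```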
Specific ML Models: Regression Trees and Random Forests
Regression Trees

What is a regression tree?
- Partition the space of features into J regions, R_1, ..., R_J
- Estimate the response function as

    f̂_tree(x) = Σ_j 1{x ∈ R_j} · ave(y_i | x_i ∈ R_j)

"Greedy" algorithm to make a tree of maximum depth D:
- Start with a tree of depth d = 0 (all data in one region)
- For d = 1, ..., D, make a tree of depth d from the prior tree of depth d − 1 by splitting each region as follows:
  - If a region is smaller than a minimum size, leave it as is
  - Otherwise, consider splitting the region into two:
    1. For each feature k = 1, ..., K, find the split point s(k) that leads to the smallest prediction error
    2. Pick the feature k* such that the split [k*, s(k*)] leads to the smallest prediction error
    3. Split the region into two daughter regions based on [k*, s(k*)],
       unless one of the daughter regions has too few elements or the improvement in
       prediction error is too small (alternatively, prune afterward)

Hyper-parameters:
- Tree depth, minimum observations per region, minimum loss improvement, etc.
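A minimal sketch of fitting a depth-3 regression tree with a 10%-of-observations leaf constraint (mirroring the hyper-parameters used in the empirical example a couple of slides below); the simulated features and return process are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(size=(n, 2))          # two characteristics scaled to [0, 1] percentiles
y = 0.01 * (X[:, 0] < 0.8) + 0.005 * X[:, 1] + rng.standard_normal(n) * 0.05

# Hyper-parameters: maximum depth and a minimum leaf size (here 10% of observations)
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=int(0.10 * n))
tree.fit(X, y)
print(export_text(tree, feature_names=["asset_growth", "book_to_market"]))
```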
Regression Trees: Generic Example

[Figure. Source: Hastie et al. (2009). Top right panel: a partition of a two-dimensional feature space by recursive
binary splitting, as used in CART. Top left panel: a general partition that cannot be obtained from recursive
binary splitting. Bottom left panel: the tree corresponding to the partition in the top right panel. Bottom right
panel: a perspective plot of the prediction surface.]

- Greedy algorithm
  - solves the problem in stages, making the locally optimal choice at each stage
  - does not necessarily produce an optimal solution, but it is fast
  - here: always split each region (or branch) into two (or leave it as is)
Regression Trees: Empirical Example from Asset Pricing

Use a regression tree to predict excess returns
- Sample: non-microcap stocks in the US
- Features: beta, book-to-market, market equity, profitability, and asset growth,
  all standardized to cross-sectional percentiles
- Hyper-parameters: max depth = 3 and no split if a leaf has < 10% of observations
- Each node in the tree shows that node's
  - average excess return
  - percent of observations

[Tree diagram: the root node has average excess return 0.0072 (100% of observations); the first split is on
asset growth ≥ 0.86, with subsequent splits on profitability, size, beta, and book-to-market; leaf-level average
excess returns range from −0.0037 to 0.01.]
Regression Trees: Advantages and Disadvantages

- Advantages
  - Relatively simple and easy to interpret
  - Insensitive to monotone transformations of predictors
  - Can capture interactions of predictors (up to depth minus 1)
  - Can handle missing feature values
- Disadvantages
  - Not smooth
  - Possible instability and sensitivity to a few observations
  - I.e., high variance
Random Forests and Bagging

- Bagging = bootstrap aggregating
  - ML method to improve stability and accuracy
  - Model averaging of models fitted to bootstrap samples
- Random forest = bagging applied to regression trees
- How to make a random forest:
  - Draw b = 1, ..., B different bootstrap samples of the data
  - Fit a separate regression tree to each bootstrap sample b
    - To make these trees less correlated, each split is based on a random subset of features
  - Average the forecasts coming from each tree:

      f̂_random forest(x) = (1/B) Σ_b f̂_tree,b(x)
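A minimal sketch of a random forest with scikit-learn: B bootstrapped trees (n_estimators), each split restricted to a random subset of features (max_features), with forecasts averaged across trees. All settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, K = 5000, 5
X = rng.uniform(size=(n, K))
y = 0.01 * (X[:, 0] < 0.8) + 0.005 * X[:, 1] + rng.standard_normal(n) * 0.05

# B bootstrap samples, one tree each; max_features < K makes each split use a
# random subset of features, which decorrelates the trees
forest = RandomForestRegressor(n_estimators=500, max_depth=3,
                               max_features=2, bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))          # forecast = average over the 500 trees
```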
Gradient Tree Boosting

- Gradient boosting with quadratic loss, L(y^i − ŷ^i) = (1/2)(y^i − ŷ^i)^2
- How to reduce the loss with ŷ^i = f̂(x^i)? Answer:

    −∂L/∂ŷ^i = y^i − ŷ^i =: res^i

- So raise ŷ^i when res^i > 0 and lower ŷ^i when res^i < 0
  - I.e., always try to reduce the absolute value of res^i (obviously)
  - The loss can be improved more when the residual is bigger,
    so work on fixing the biggest residuals
- Gradient tree boosting, loosely explained:
  - Fit a tree, f̂_tree,1(x^i)
  - Look at the residuals (prediction errors), y^i − f̂_tree,1(x^i)
  - Fit a tree to these residuals
  - New f̂ = original tree + tree for the residuals
  - New f̂ has smaller residuals
  - Repeat
Gradient Tree Boosting, continued

- How to do gradient tree boosting in more detail:
  - Start with an initial tree ("tree number 0"), e.g., f̂_tree,0 ≡ 0
    - so the initial residuals are res_0^i = y^i − 0 = y^i
  - For each b = 1, ..., B
    - Fit a regression tree, f̂_tree,b, to (x, res_{b−1})
    - Shrink the tree toward zero, ν f̂_tree,b, using a shrinkage factor ν ∈ [0, 1]
    - Update the residuals: res_b^i = res_{b−1}^i − ν f̂_tree,b(x^i)
  - The final prediction is the sum of the resulting B shrunken trees:

      f̂_boosted tree(x) = f̂_tree,0(x) + Σ_{b>0} ν f̂_tree,b(x)

- Tuning: boosting steps B, shrinkage factor ν, and tree depth d
- Note: a random forest is an average of trees; boosting is a sum
  - In a random forest, we can let B → ∞ to reduce variance
  - With boosting, a large B leads to over-fitting, so B must be chosen well
- Popular implementation (more sophisticated): XGBoost
  - Open-source software library for R, Python, Julia, etc.
  - Algorithm behind some winning teams of machine learning competitions
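A minimal sketch of the boosting loop just described, using scikit-learn regression trees as base learners; B, ν, and the tree depth are illustrative choices rather than recommended values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=(n, 3))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.standard_normal(n) * 0.3

B, nu, depth = 200, 0.1, 2          # boosting steps, shrinkage, tree depth
trees, res = [], y.copy()           # tree 0 is f = 0, so the initial residuals are y

for b in range(B):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, res)
    res -= nu * tree.predict(X)     # update residuals with the shrunken tree
    trees.append(tree)

def f_boosted(X_new):
    """Prediction = sum of the B shrunken trees."""
    return nu * sum(t.predict(X_new) for t in trees)

print("in-sample MSE:", np.mean((y - f_boosted(X)) ** 2))
```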
Specific ML Models: Neural Networks
Neural Networks

- Want to learn f such that y^i = f(x^i) + ε^i
- Use functions of functions of functions: f̂ = k(g(...), ..., g(...))
- Elements of a neural network: overview
  - Input layer (= layer 0): each "neuron" k is just the k'th feature, a_k^(0) = x^i_k
  - Hidden layers h = 1, ..., H: each neuron a_j^(h) is a function of the prior layer's neurons
  - Output layer: the final function is linear in the final layer's neurons:

      f̂(x^i) = θ_0^(H) + Σ_j θ_j^(H) a_j^(H)
Neural Networks, continued

- Elements of a neural network: more details
  - Input layer (= layer 0): each "neuron" k is just the k'th feature, a_k^(0) = x^i_k
  - Hidden layers h = 1, ..., H: each neuron j is a function of the prior layer's neurons:

      a_j^(h) = g( θ_{j,0}^(h−1) + Σ_k θ_{j,k}^(h−1) a_k^(h−1) )

  - Output layer: f̂(x^i) = θ_0^(H) + Σ_j θ_j^(H) a_j^(H)
- Comments:
  - Each neuron: first compute a linear function of the previous layer's neurons
    - with weights given by the θ's
    - then transform with a non-linear function g
  - Note that g is just a scalar function R → R
    - Typical choices: g(z) = ReLU(z) = z·1{z>0} or the sigmoid g(z) = 1/(1 + e^{−z})
  - Lots of hyper-parameters and parameters
    - Hyper-parameters: "design" (number of layers/neurons), the g function
    - Parameters: the θ's
  - Typically requires a lot of data!
  - Can be difficult to choose a good design and estimate it well
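To make the layer formulas concrete, here is a minimal NumPy sketch of the forward pass with ReLU activations; the weights are random placeholders, whereas in practice the θ's would be estimated by minimizing a loss (typically with stochastic gradient descent in a library such as PyTorch or TensorFlow).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, theta_out):
    """Forward pass of a feed-forward network.

    layers:    list of (W, b) for hidden layers h = 1, ..., H,
               where a^(h) = relu(b + W @ a^(h-1))
    theta_out: (w, b0) for the linear output layer, f = b0 + w @ a^(H)
    """
    a = x                                  # a^(0): the input features
    for W, b in layers:
        a = relu(b + W @ a)                # a^(h) = g(theta_0 + sum_k theta_k * a_k)
    w, b0 = theta_out
    return b0 + w @ a                      # linear output layer

# Illustrative design: 5 features, two hidden layers with 8 and 4 neurons
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 5)), rng.standard_normal(8)),
          (rng.standard_normal((4, 8)), rng.standard_normal(4))]
theta_out = (rng.standard_normal(4), 0.0)

x = rng.standard_normal(5)                 # one observation's features
print(forward(x, layers, theta_out))       # the network's prediction f_hat(x)
```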
Feature Importance

Feature Importance

- The ML literature typically focuses on prediction
- But model interpretation is also crucial
  - I.e., what is the importance of each feature?
- Measures of feature importance
  - Model-specific methods: tailored to a particular ML method
  - Model-agnostic methods: can be used for any ML method
- Note: substitution effects matter
  - For example, book-to-market will be less important if we include 10 similar value characteristics
  - So it can make sense to aggregate feature importance for highly correlated features
- See also the book "Interpretable ML": https://christophm.github.io/interpretable-ml-book/
Feature Importance: Model-Specific Methods

- OLS/ridge/lasso regression
  - When features are on the same scale (i.e., standardized):
    - importance of feature j is |β_j|
  - Otherwise:
    - importance of feature j is σ(x_j)·|β_j|
  - Problematic with high multi-collinearity (especially for OLS)
- Regression tree, random forest, or boosting
  - Simple measures of feature importance:
    - number of times a feature is picked as the splitting variable
    - average drop in prediction error when splitting on the variable
  - A more sophisticated measure for random forests:
    - increase in prediction error on "out-of-bag" samples after permuting the feature
    - similar to "permutation feature importance," discussed next
Model-Agnostic Methods: Permutation Feature Importance

Permutation feature importance
- Pick a loss function, L(y, f̂(x)), e.g., squared prediction error or R^2
- Estimate the actual model error:

    l_orig = L(y, f̂(x))

- For each feature j ∈ {1, ..., K}:
  1. Generate x_perm,j by permuting (shuffling) feature j randomly
  2. Estimate the error:

       l_perm,j = L(y, f̂(x_perm,j))

  3. Feature importance:

       FI_j = l_perm,j − l_orig
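A minimal sketch of the permutation procedure above (for brevity the loss is computed in-sample; in practice one would use a validation or test sample). scikit-learn also ships a ready-made version as sklearn.inspection.permutation_importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, K = 3000, 5
X = rng.standard_normal((n, K))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)   # only features 0 and 1 matter
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

loss_orig = mean_squared_error(y, model.predict(X))           # l_orig
importance = np.zeros(K)
for j in range(K):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])              # shuffle feature j only
    loss_perm = mean_squared_error(y, model.predict(X_perm))  # l_perm,j
    importance[j] = loss_perm - loss_orig                     # FI_j = l_perm,j - l_orig

print(np.round(importance, 3))
```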
Model-Agnostic Methods: Surrogate Model

Surrogate model
- Start with the prediction, f̂(x), from a hard-to-interpret model such as a neural network
- Fit an interpretable surrogate model, f̂_surrogate, to the prediction:

    f̂(x^i) = f̂_surrogate(x^i) + ε^i

- E.g., fit a regression tree or an OLS regression:

    f̂(x^i) = x^i β + ε^i

- Use standard metrics such as R^2 to ensure an appropriate fit
- Interpret the surrogate model
Model-Agnostic Methods: Shapley Values

Shapley values
- Method from game theory to allocate a reward fairly between players
- In ML, the features are the players and the reward could be R^2
- Notation:
  - SV_j is the Shapley value (importance) of feature j
  - C is a coalition of features
  - C ∪ {j} is the coalition with j added
  - |C| is the number of features in the coalition
  - K is the number of features, K̃ = {1, ..., K}, and K̃ \ {j} is the set excluding j
  - V(C) is the value function that determines the "worth" of coalition C,
    e.g., the R^2 of a model that uses the coalition's features
- The Shapley value of feature j is essentially the average marginal contribution of j
  when added to a coalition:

    SV_j = (1/K) Σ_{C ⊆ K̃\{j}} (K−1 choose |C|)^{-1} (V(C ∪ {j}) − V(C))

- SV calculation is computationally demanding
  - Feasible with OLS and a limited set of features
  - But even with OLS, the computational burden can be too large
  - Recent advances in ML make calculating approximate SVs feasible for
    complex models with many features (Lundberg and Lee, 2017)
  - For tree-based models, efficient methods exist (Lundberg et al., 2020)
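A minimal brute-force sketch of the exact Shapley formula above, with V(C) taken to be the in-sample R² of an OLS model on the coalition's features; this enumeration requires 2^(K−1) model fits per feature, so it is only feasible for a handful of features (approximate SHAP methods handle larger models).

```python
import numpy as np
from itertools import combinations
from math import comb
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, K = 1000, 4
X = rng.standard_normal((n, K))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)

def value(coalition):
    """V(C): in-sample R^2 of an OLS model using only the coalition's features."""
    if not coalition:
        return 0.0
    Xc = X[:, list(coalition)]
    return LinearRegression().fit(Xc, y).score(Xc, y)

def shapley(j):
    others = [k for k in range(K) if k != j]
    sv = 0.0
    for size in range(K):                          # coalition sizes 0, ..., K-1
        for C in combinations(others, size):
            weight = 1.0 / (K * comb(K - 1, size)) # (1/K) * (K-1 choose |C|)^(-1)
            sv += weight * (value(C + (j,)) - value(C))
    return sv

print([round(shapley(j), 3) for j in range(K)])
```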
Applications to Asset Pricing

Predicting Stock Returns with ML

[Exhibit from Gu et al. (2020); not preserved in the text extraction]
Number of Variables in SDF

[Figure from Kozak et al. (2020), Fig. 6 (WFR portfolios): the maximum OOS cross-sectional R^2 attained by a
model with n factors (on the x-axis) across all possible values of the prior root expected SR^2 (κ), for models
based on the original characteristics portfolios (solid) and PCs (dashed). Dotted lines depict −1 s.e. bounds of
the CV estimator. The sample is daily from September 1964 to December 2017.]
Other Topics in ML

Other Topics in ML

- Directed acyclic graphs, double machine learning, and more
  - See the free new book and code: https://causalml-book.org/
- Standard errors for predictions
- Reinforcement learning
  - Designed to handle complex dynamic problems, sometimes called "approximate dynamic programming"
  - Super-human performance on several games such as chess and Go
  - Standard reference: Sutton and Barto (2018), freely available from
    http://www.incompleteideas.net/book/the-book-2nd.html
- Natural language processing (NLP) and textual analysis
  - Large language models (LLMs)
- How to make ML aware of trading costs
  - New paper by Jensen et al. (2022), covered in the last lecture
References Cited in Slides

Efron, B. and T. Hastie (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and
  Data Science, Volume 6. Cambridge University Press.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. The Review of Financial
  Studies 33(5), 2223–2273.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,
  and Prediction. Springer Series in Statistics. Springer.
Jensen, T. I., B. T. Kelly, S. Malamud, and L. H. Pedersen (2022). Machine learning about optimal portfolios.
  Working paper, Copenhagen Business School.
Kozak, S., S. Nagel, and S. Santosh (2020). Shrinking the cross-section. Journal of Financial Economics 135(2),
  271–292.
Lundberg, S. M., G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and
  S.-I. Lee (2020). From local explanations to global understanding with explainable AI for trees. Nature
  Machine Intelligence 2(1), 56–67.
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions. In 31st Conference
  on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1–10.