
Big Data Asset Pricing

Lecture 5: Machine Learning in Asset Pricing

Lasse Heje Pedersen


AQR, Copenhagen Business School, CEPR

https://www.lhpedersen.com/big-data-asset-pricing

The views expressed are those of the author and not necessarily those of AQR
Overview of the Course: Big Data Asset Pricing

Lectures
- Quickly getting to the research frontier
  1. A primer on asset pricing
  2. A primer on empirical asset pricing
  3. Working with big asset pricing data (videos)
- Twenty-first-century topics
  4. The factor zoo and replication
  5. Machine learning in asset pricing
  6. Asset pricing with frictions

Exercises
1. Beta-dollar-neutral portfolios
2. Construct value factors
3. Factor replication analysis
4. High-dimensional return prediction
5. Research proposal
Overview of this Lecture

- Overview of machine learning
  - Supervised: regression vs. classification
  - Unsupervised
- How to apply ML
  - Bias-variance trade-off
  - Hyper-parameters
  - Train, validation, test
- Specific ML models in more detail
  - Penalized regressions
  - Regression trees
  - Neural networks
- Feature importance
- Applications to asset pricing
  - Predicting stock returns
  - What factors generate the pricing kernel?
Overview of Machine Learning

Supervised vs. Unsupervised Learning

- Supervised learning: start with inputs and outputs
- Terminology and notation:

                Statistics              ML          Notation (Stat/ML)   Notation (Finance)
    Inputs      Independent variables   Features    x_i                  s_t^i
    Outputs     Dependent variable      Response    y_i                  r_{t+1}^i

- Two types of supervised learning: classification vs. regression

                      Type of outputs   # of outputs   Example
    Classification    Qualitative       Finite         Default / no default
    Regression        Quantitative      Any value      Predict return

- Unsupervised learning: start with just inputs (no outputs)
  - Example: clustering of factors into groups
Classic ML Problem: Supervised Learning via Regression

- Start with data on inputs, x^i ∈ R^K, and outputs, y^i ∈ R
- Training set of observations, i = 1, ..., T
- Want to learn f such that

    y^i = f(x^i) + ε^i

- Think of f as f(x^i) = E(y^i | x^i), a projection, or just a way to guess the output
- Ways to learn about f:

    Assumption                           Statistics                Finance                    ML
    f(x^i) = ave(y^j) for x^j near x^i   Kernel regression         Portfolio sort             Regression trees, random
                                                                                              forest, boosting
    f(x^i) linear                        OLS, GLS                  Fama-MacBeth, regression-  Penalized regression,
                                                                   based mkt timing, signal-  e.g., LASSO, ridge
                                                                   weighted portfolio
    f(x^i) non-linear                    Non-linear least squares,                            Neural networks, deep
                                         GMM, MLE                                             learning
How to Apply ML

Leading Example: Ridge Regression

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- OLS: β̂ = (x'x)^{-1} x'y
- Properties: β̂ = (x'x)^{-1} x'y = β + ((1/T) x'x)^{-1} (1/T) x'ε → β for T → ∞
  - What does T → ∞ mean in practice? That T is large relative to K
  - But K is large when we do ML
  - Estimating many regression coefficients creates a lot of noise
- Gauss-Markov theorem: OLS has the smallest mean squared error of all linear estimators with no bias
  - But we can do better with bias
  - By reducing estimation noise – more on this later
- Ridge regression
  - Objective: minimize (1/T)(y − xβ)'(y − xβ) + λ β'β
  - Solution: β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y
  - But how do we choose (or "tune") λ?
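To make the closed-form ridge estimator above concrete, here is a minimal Python/NumPy sketch; the simulated data and the choice λ = 1 are purely illustrative.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate: ((1/T) X'X + lam*I)^(-1) (1/T) X'y."""
    T, K = X.shape
    A = X.T @ X / T + lam * np.eye(K)
    b = X.T @ y / T
    return np.linalg.solve(A, b)

# Simulated example: K regressors, small true coefficients, noisy outcomes
rng = np.random.default_rng(0)
T, K = 500, 50
X = rng.standard_normal((T, K))
beta_true = rng.standard_normal(K) * 0.1
y = X @ beta_true + rng.standard_normal(T)

beta_ols = ridge_closed_form(X, y, lam=0.0)    # lam = 0 recovers OLS
beta_ridge = ridge_closed_form(X, y, lam=1.0)  # shrinks coefficients toward zero
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```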
Parameters vs. Hyper-Parameters (or Tuning Parameters)

- Want to learn f such that (leading example)

    y^i = f(x^i) + ε^i = x^i β + ε^i

- Leading example, ridge regression: β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y
- Parameters and hyper-parameters:
  - Parameters tell us what f is: f̂(x^i) = f(x^i; β̂) = x^i β̂_ridge
  - Hyper-parameters tell us how to find the parameters, e.g., λ
  - Terminology: hyper-parameters = tuning parameters
- Parameters and hyper-parameters: other examples

    Method                 Parameters                    Hyper-parameters
    Penalized regression   Regression coefficients, β    Penalty λ for large β
    Regression tree        Tree splits                   Tree depth, etc.
    Neural network         Coefficients                  Layers, etc.
Bias-Variance Trade-Off

[Figure. Source: Hastie et al. (2009). Light blue curves: training error. Light red curves: conditional test error.]

- Example: ridge regression
  - Smaller λ = greater model complexity = more degrees of freedom
  - More complexity always leads to better fit in the training sample
  - More complexity leads to lower bias, but higher variance
  - Best performance in the test period for intermediate λ
Bias-Variance Trade-Off, continued

- Prediction error:
  - Prediction: ŷ^i = f̂(x^i)
  - Outcome, out-of-sample: y^i = f(x^i) + ε^i
  - Prediction error: f̂(x^i) − y^i
- Expected "loss," L (squared prediction error, or absolute prediction error):

    E[L] = E[prediction error^2] = E[(f̂(x^i) − y^i)^2]
         = E[(f̂(x^i) − f(x^i) − ε^i)^2]
         = σ_ε^2 + E[(f̂(x^i) − f(x^i))^2]
         = σ_ε^2 + (E[f̂(x^i)] − f(x^i))^2 + E[(f̂(x^i) − E[f̂(x^i)])^2]
         = noise + bias^2 + variance

- Noise: from ε, unavoidable, the same regardless of prediction method
- Bias: tendency of the prediction to miss its target even with lots of data
- Variance: Var(f̂(x^i)) comes from noise in parameter estimation
- Bias-variance trade-off:
  - Lower complexity, e.g., shrinking β̂ more toward zero
    - increases bias
    - lowers variance
- So an intermediate λ is optimal, but how do we choose it?
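A small Monte Carlo sketch of the decomposition above, under an assumed linear data-generating process: for each λ, it approximates the bias² and variance of the ridge prediction at a fixed test point. The numbers are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, n_sims = 100, 40, 500
beta = rng.standard_normal(K) * 0.1
x0 = rng.standard_normal(K)                 # fixed test point; target is f(x0) = x0 @ beta

for lam in [0.0, 0.1, 1.0, 10.0]:
    preds = np.empty(n_sims)
    for s in range(n_sims):
        X = rng.standard_normal((T, K))
        y = X @ beta + rng.standard_normal(T)
        b_hat = np.linalg.solve(X.T @ X / T + lam * np.eye(K), X.T @ y / T)
        preds[s] = x0 @ b_hat               # f_hat(x0) in simulation s
    bias2 = (preds.mean() - x0 @ beta) ** 2
    var = preds.var()
    print(f"lambda={lam:5.1f}  bias^2={bias2:.4f}  variance={var:.4f}  sum={bias2 + var:.4f}")
```

As λ grows, the bias² term rises while the variance falls; their sum is typically smallest at an intermediate λ.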
Training, Validation, and Testing

- Old-school finance and statistics: 1 sample (everything in-sample)
- Finance and statistics with model assessment: 2 sub-samples
  1. In-sample: estimate parameters
  2. Out-of-sample: assess performance
- ML: 3 sub-samples
  1. Train: estimate parameters for each hyper-parameter
  2. Validation: pick the best-performing hyper-parameter
  3. Test: assess performance
- ML: example of ridge regression
  1. Train: estimate β̂_ridge(λ) = (x'x + λI)^{-1} x'y for each λ
  2. Validation: pick λ̂ with small loss, L[y^i − x^i β̂_ridge(λ̂)]
  3. Test: assess performance of β̂_ridge(λ̂)
Cross-Validation

- k-fold cross-validation
  - Data are scarce – we want to use the full train+validation sample for both
  - Split the train+validation sample into k folds (subsets), but note:
    - k is not the same as the number of features K
    - In finance, ensure that each whole cross-section is in the same fold,
      but standard cross-validation functions distribute data randomly
- Example: 5-fold cross-validation

    Split 1:   Validate   Train      Train      Train      Train
    Split 2:   Train      Validate   Train      Train      Train
    Split 3:   Train      Train      Validate   Train      Train
    Split 4:   Train      Train      Train      Validate   Train
    Split 5:   Train      Train      Train      Train      Validate

- Method:
  1. Train k times: for each split j and each hyper-parameter,
     - estimate parameters using the training data in all folds except j
     - compute the prediction error in the j'th fold, i.e., the j'th validation period
  2. Cross-validate: choose the hyper-parameters
     - that minimize prediction errors across all validation periods
  3. Retrain: estimate parameters using the full train+validation data
  4. Test: assess performance (as always)
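A minimal sketch of tuning the ridge penalty by grouped k-fold cross-validation with scikit-learn, where GroupKFold keeps each whole cross-section (here, a month) inside a single fold; the simulated panel and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, GridSearchCV

# Illustrative panel: one row per (month, stock), features already standardized
rng = np.random.default_rng(0)
n_months, n_stocks, n_feat = 60, 100, 10
df = pd.DataFrame(rng.standard_normal((n_months * n_stocks, n_feat)),
                  columns=[f"x{k}" for k in range(n_feat)])
df["month"] = np.repeat(np.arange(n_months), n_stocks)
df["excess_ret"] = df["x0"] * 0.02 + rng.standard_normal(len(df)) * 0.1

X = df[[f"x{k}" for k in range(n_feat)]]
y, groups = df["excess_ret"], df["month"]

# GroupKFold keeps each whole cross-section (month) inside a single fold
cv = GroupKFold(n_splits=5)
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y, groups=groups)
print("chosen penalty:", search.best_params_)
```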
Another Validation Method that Reuses the Data

- When the data have a time-series dimension,
  - finance papers usually keep the temporal order such that the validation set comes after the training set
  - e.g., in asset pricing, people often use rolling estimation
  - it is not necessarily wrong to reverse the order
- First do the standard ML split into 3 sub-samples
  1. Train
  2. Validation
  3. Test
- Then proceed in these steps
  1. Train: estimate β̂ for each λ in the training sample
  2. Validation: pick the best-performing λ̂
  3. Possibly re-train:
     - Estimate β̂ for λ̂ in the joint sample: training + validation
     - NB: re-training requires that λ̂ is "immune" to the sample length;
       e.g., the T matters in min_β (1/T)(y − xβ)'(y − xβ) + λ̂ β'β
  4. Test: assess performance
Expanding Training Set

In finance, people often use a rolling or expanding training set:
- Expanding: always start from the beginning of the sample
- Rolling: use a fixed number of years (discarding the oldest data)

[Figure: illustration of an expanding training set]
Specific ML Models: Penalized Regressions

For details, see the excellent and free books Hastie et al. (2009) and Efron and Hastie (2021)
Penalized Regressions: Lasso

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- Ridge regression: we already saw this
- Lasso regression:
  - Objective:

      β̂_lasso = argmin_β { (1/T)(y − xβ)'(y − xβ) + λ Σ_j |β_j| }

  - Alternative objective:

      β̂_lasso = argmin_β (1/T)(y − xβ)'(y − xβ)   subject to   Σ_j |β_j| ≤ λ̄

- Cf. best subset: use OLS for a subset of the x variables (less used)
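A minimal scikit-learn sketch of the lasso's variable selection; note that sklearn's Lasso minimizes (1/(2T))·(y − xβ)'(y − xβ) + α Σ_j |β_j|, which matches the slide's objective up to the scaling of the penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated data where only a few features truly matter
rng = np.random.default_rng(0)
T, K = 400, 30
X = rng.standard_normal((T, K))
beta_true = np.zeros(K)
beta_true[:3] = [0.5, -0.3, 0.2]          # sparse truth
y = X @ beta_true + rng.standard_normal(T)

# alpha plays the role of the penalty lambda (up to scaling conventions)
fit = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
```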
Penalized Regressions: Comparison

[Figure. Source: Hastie et al. (2009). Lasso regression (left) and ridge regression (right). Red lines show contours
of the prediction errors, (y − xβ)'(y − xβ). The solid blue areas show the constraint sets Σ_j |β_j| ≤ λ̄ and
Σ_j β_j^2 ≤ λ̄, respectively.]

- Differences
  - Ridge shrinks all parameters, but generically all parameters are non-zero
  - Lasso sets some parameters to zero (here β_1 = 0) and shrinks the rest
Penalized Regressions: Comparison, continued

[Figure. Source: Hastie et al. (2009). The x-axis shows the OLS estimate, β̂_OLS, and the y-axis shows the
respective penalized regression estimates.]

- Differences
  - Best subset: once a variable is included, there is no shrinkage in its beta
  - Ridge: proportional shrinkage, with one parameter:

      β̂_ridge = ((1/T) x'x + λI)^{-1} (1/T) x'y = β̂_OLS / λ̃,   where λ̃ = 1 + λ/((1/T) x'x)

  - Lasso: once beta is non-zero, it is shrunk by a constant, e.g., if β > 0:

      β̂_lasso = ((1/T) x'x)^{-1} ((1/T) x'y − λ/2) = β̂_OLS − λ̃,   where λ̃ = λ / (2 (1/T) x'x)
Penalized Regressions: Elastic-Net

- Want to learn f such that y^i = f(x^i) + ε^i
- Regression: y^i = x^i β + ε^i
- Vector form: y = xβ + ε
- Elastic-net regression:

    β̂_elastic-net = argmin_β { (1/T)(y − xβ)'(y − xβ) + λ Σ_j [ a β_j^2 + (1 − a)|β_j| ] }

- Two tuning parameters:
  - λ ≥ 0 controls the amount of regularization
  - a ∈ [0, 1] controls the mix of ridge vs. lasso
- Some betas are set to zero, and all are shrunk
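For completeness, the same kind of sketch with scikit-learn's ElasticNet; in its parameterization, alpha plays the role of the overall penalty λ and l1_ratio the weight on the |β_j| term, i.e., roughly (1 − a) in the slide's notation (up to scaling conventions).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
T, K = 400, 30
X = rng.standard_normal((T, K))
beta_true = np.r_[0.5, -0.3, 0.2, np.zeros(K - 3)]
y = X @ beta_true + rng.standard_normal(T)

# alpha ~ lambda (overall penalty); l1_ratio ~ weight on the |beta_j| term,
# i.e., roughly (1 - a) in the slide's notation
fit = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print("coefficients set to zero:", int(np.sum(fit.coef_ == 0)))
```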
Specific ML Models: Regression Trees and Random Forests
Regression Trees

What is a regression tree?
- Partition the space of features into J regions, R_1, ..., R_J
- Estimate the response function as

    f̂_tree(x) = Σ_j 1{x ∈ R_j} · ave(y_i | x_i ∈ R_j)

"Greedy" algorithm to make a tree of maximum depth D:
- Start with a tree of depth d = 0 (all data in one region)
- For d = 1, ..., D, make a tree of depth d from the prior tree of depth d − 1 by splitting each region as follows:
  - If a region is smaller than a minimum size, leave it as is
  - Otherwise, consider splitting the region into two:
    1. For each feature k = 1, ..., K, find the split point s(k) that leads to the smallest prediction error
    2. Pick the feature k* such that the split [k*, s(k*)] leads to the smallest prediction error
    3. Split the region into two daughter regions based on [k*, s(k*)],
       unless one of the daughter regions has too few elements or the improvement in
       prediction error is too small (alternatively, prune afterward)

Hyper-parameters:
- Tree depth, minimum observations per region, minimum loss improvement, etc.
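A minimal sketch of fitting a depth-3 regression tree with a 10%-of-observations leaf constraint (mirroring the hyper-parameters used in the empirical example a couple of slides below); the simulated features and return process are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(size=(n, 2))          # two characteristics scaled to [0, 1] percentiles
y = 0.01 * (X[:, 0] < 0.8) + 0.005 * X[:, 1] + rng.standard_normal(n) * 0.05

# Hyper-parameters: maximum depth and a minimum leaf size (here 10% of observations)
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=int(0.10 * n))
tree.fit(X, y)
print(export_text(tree, feature_names=["asset_growth", "book_to_market"]))
```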
Regression Trees: Generic Example

[Figure. Source: Hastie et al. (2009). Top right panel: a partition of a two-dimensional feature space by recursive
binary splitting, as used in CART. Top left panel: a general partition that cannot be obtained from recursive
binary splitting. Bottom left panel: the tree corresponding to the partition in the top right panel. Bottom right
panel: a perspective plot of the prediction surface.]

- Greedy algorithm
  - solves the problem in stages, making the locally optimal choice at each stage
  - does not necessarily produce an optimal solution, but it is fast
  - here: always split each region (or branch) into two (or leave it as is)
Regression Trees: Empirical Example from Asset Pricing

Use a regression tree to predict excess returns
- Sample: non-microcap stocks in the US
- Features: beta, book-to-market, market equity, profitability, and asset growth,
  all standardized to cross-sectional percentiles
- Hyper-parameters: max depth = 3 and no split if a leaf has < 10% of observations
- Each node in the tree shows that node's
  - average excess return
  - percent of observations

[Tree diagram: the root node has average excess return 0.0072 (100% of observations); the first split is on
asset growth ≥ 0.86, with subsequent splits on profitability, size, beta, and book-to-market; leaf-level average
excess returns range from −0.0037 to 0.01.]
Regression Trees: Advantages and Disadvantages

- Advantages
  - Relatively simple and easy to interpret
  - Insensitive to monotone transformations of predictors
  - Can capture interactions of predictors (up to depth minus 1)
  - Can handle missing feature values
- Disadvantages
  - Not smooth
  - Possible instability and sensitivity to a few observations
  - I.e., high variance
Random Forests and Bagging

- Bagging = bootstrap aggregating
  - ML method to improve stability and accuracy
  - Model averaging of models fitted to bootstrap samples
- Random forest = bagging applied to regression trees
- How to make a random forest:
  - Draw b = 1, ..., B different bootstrap samples of the data
  - Fit a separate regression tree to each bootstrap sample b
    - To make these trees less correlated, each split is based on a random subset of features
  - Average the forecasts coming from each tree:

      f̂_random forest(x) = (1/B) Σ_b f̂_tree,b(x)
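A minimal sketch of a random forest with scikit-learn: B bootstrapped trees (n_estimators), each split restricted to a random subset of features (max_features), with forecasts averaged across trees. All settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, K = 5000, 5
X = rng.uniform(size=(n, K))
y = 0.01 * (X[:, 0] < 0.8) + 0.005 * X[:, 1] + rng.standard_normal(n) * 0.05

# B bootstrap samples, one tree each; max_features < K makes each split use a
# random subset of features, which decorrelates the trees
forest = RandomForestRegressor(n_estimators=500, max_depth=3,
                               max_features=2, bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))          # forecast = average over the 500 trees
```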
Gradient Tree Boosting

- Gradient boosting with quadratic loss, L(y^i − ŷ^i) = (1/2)(y^i − ŷ^i)^2
- How to reduce the loss with ŷ^i = f̂(x^i)? Answer:

    −∂L/∂ŷ^i = y^i − ŷ^i =: res^i

- So raise ŷ^i when res^i > 0 and lower ŷ^i when res^i < 0
  - I.e., always try to reduce the absolute value of res^i (obviously)
  - The loss can be improved more when the residual is bigger,
    so work on fixing the biggest residuals
- Gradient tree boosting, loosely explained:
  - Fit a tree, f̂_tree,1(x^i)
  - Look at the residuals (prediction errors), y^i − f̂_tree,1(x^i)
  - Fit a tree to these residuals
  - New f̂ = original tree + tree for the residuals
  - New f̂ has smaller residuals
  - Repeat
Gradient Tree Boosting, continued

- How to do gradient tree boosting in more detail:
  - Start with an initial tree ("tree number 0"), e.g., f̂_tree,0 ≡ 0
    - so the initial residuals are res_0^i = y^i − 0 = y^i
  - For each b = 1, ..., B
    - Fit a regression tree, f̂_tree,b, to (x, res_{b−1})
    - Shrink the tree toward zero, ν f̂_tree,b, using a shrinkage factor ν ∈ [0, 1]
    - Update the residuals: res_b^i = res_{b−1}^i − ν f̂_tree,b(x^i)
  - The final prediction is the sum of the resulting B shrunken trees:

      f̂_boosted tree(x) = f̂_tree,0(x) + Σ_{b>0} ν f̂_tree,b(x)

- Tuning: boosting steps B, shrinkage factor ν, and tree depth d
- Note: a random forest is an average of trees; boosting is a sum
  - In a random forest, we can let B → ∞ to reduce variance
  - With boosting, a large B leads to over-fitting, so B must be chosen well
- Popular implementation (more sophisticated): XGBoost
  - Open-source software library for R, Python, Julia, etc.
  - Algorithm behind some winning teams of machine learning competitions
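A minimal sketch of the boosting loop just described, using scikit-learn regression trees as base learners; B, ν, and the tree depth are illustrative choices rather than recommended values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=(n, 3))
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.standard_normal(n) * 0.3

B, nu, depth = 200, 0.1, 2          # boosting steps, shrinkage, tree depth
trees, res = [], y.copy()           # tree 0 is f = 0, so the initial residuals are y

for b in range(B):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, res)
    res -= nu * tree.predict(X)     # update residuals with the shrunken tree
    trees.append(tree)

def f_boosted(X_new):
    """Prediction = sum of the B shrunken trees."""
    return nu * sum(t.predict(X_new) for t in trees)

print("in-sample MSE:", np.mean((y - f_boosted(X)) ** 2))
```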
Specific ML Models: Neural Networks
Neural Networks

- Want to learn f such that y^i = f(x^i) + ε^i
- Use functions of functions of functions: f̂ = k(g(...), ..., g(...))
- Elements of a neural network: overview
  - Input layer (= layer 0): each "neuron" k is just the k'th feature, a_k^(0) = x^i_k
  - Hidden layers h = 1, ..., H: each neuron a_j^(h) is a function of the prior layer's neurons
  - Output layer: the final function is linear in the final layer's neurons:

      f̂(x^i) = θ_0^(H) + Σ_j θ_j^(H) a_j^(H)
Neural Networks, continued

- Elements of a neural network: more details
  - Input layer (= layer 0): each "neuron" k is just the k'th feature, a_k^(0) = x^i_k
  - Hidden layers h = 1, ..., H: each neuron j is a function of the prior layer's neurons:

      a_j^(h) = g( θ_{j,0}^(h−1) + Σ_k θ_{j,k}^(h−1) a_k^(h−1) )

  - Output layer: f̂(x^i) = θ_0^(H) + Σ_j θ_j^(H) a_j^(H)
- Comments:
  - Each neuron: first compute a linear function of the previous layer's neurons
    - with weights given by the θ's
    - then transform with a non-linear function g
  - Note that g is just a scalar function R → R
    - Typical choices: g(z) = ReLU(z) = z·1{z>0} or the sigmoid g(z) = 1/(1 + e^{−z})
  - Lots of hyper-parameters and parameters
    - Hyper-parameters: "design" (number of layers/neurons), the g function
    - Parameters: the θ's
  - Typically requires a lot of data!
  - Can be difficult to choose a good design and estimate it well
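To make the layer formulas concrete, here is a minimal NumPy sketch of the forward pass with ReLU activations; the weights are random placeholders, whereas in practice the θ's would be estimated by minimizing a loss (typically with stochastic gradient descent in a library such as PyTorch or TensorFlow).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers, theta_out):
    """Forward pass of a feed-forward network.

    layers:    list of (W, b) for hidden layers h = 1, ..., H,
               where a^(h) = relu(b + W @ a^(h-1))
    theta_out: (w, b0) for the linear output layer, f = b0 + w @ a^(H)
    """
    a = x                                  # a^(0): the input features
    for W, b in layers:
        a = relu(b + W @ a)                # a^(h) = g(theta_0 + sum_k theta_k * a_k)
    w, b0 = theta_out
    return b0 + w @ a                      # linear output layer

# Illustrative design: 5 features, two hidden layers with 8 and 4 neurons
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 5)), rng.standard_normal(8)),
          (rng.standard_normal((4, 8)), rng.standard_normal(4))]
theta_out = (rng.standard_normal(4), 0.0)

x = rng.standard_normal(5)                 # one observation's features
print(forward(x, layers, theta_out))       # the network's prediction f_hat(x)
```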
Feature Importance

Feature Importance

- The ML literature typically focuses on prediction
- But model interpretation is also crucial
  - I.e., what is the importance of each feature?
- Measures of feature importance
  - Model-specific methods: tailored to a particular ML method
  - Model-agnostic methods: can be used for any ML method
- Note: substitution effects matter
  - For example, book-to-market will be less important if we include 10 similar value characteristics
  - So it can make sense to aggregate feature importance for highly correlated features
- See also the book "Interpretable ML": https://christophm.github.io/interpretable-ml-book/
Feature Importance: Model-Specific Methods

- OLS/ridge/lasso regression
  - When features are on the same scale (i.e., standardized):
    - importance of feature j is |β_j|
  - Otherwise:
    - importance of feature j is σ(x_j)·|β_j|
  - Problematic with high multi-collinearity (especially for OLS)
- Regression tree, random forest, or boosting
  - Simple measures of feature importance:
    - number of times a feature is picked as the splitting variable
    - average drop in prediction error when splitting on the variable
  - A more sophisticated measure for random forests:
    - increase in prediction error on "out-of-bag" samples after permuting the feature
    - similar to "permutation feature importance," discussed next
Model-Agnostic Methods: Permutation Feature Importance

Permutation feature importance
- Pick a loss function, L(y, f̂(x)), e.g., squared prediction error or R^2
- Estimate the actual model error:

    l_orig = L(y, f̂(x))

- For each feature j ∈ {1, ..., K}:
  1. Generate x_perm,j by permuting (shuffling) feature j randomly
  2. Estimate the error:

       l_perm,j = L(y, f̂(x_perm,j))

  3. Feature importance:

       FI_j = l_perm,j − l_orig
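A minimal sketch of the permutation procedure above (for brevity the loss is computed in-sample; in practice one would use a validation or test sample). scikit-learn also ships a ready-made version as sklearn.inspection.permutation_importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, K = 3000, 5
X = rng.standard_normal((n, K))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)   # only features 0 and 1 matter
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

loss_orig = mean_squared_error(y, model.predict(X))           # l_orig
importance = np.zeros(K)
for j in range(K):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])              # shuffle feature j only
    loss_perm = mean_squared_error(y, model.predict(X_perm))  # l_perm,j
    importance[j] = loss_perm - loss_orig                     # FI_j = l_perm,j - l_orig

print(np.round(importance, 3))
```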
Model-Agnostic Methods: Surrogate Model

Surrogate model
- Start with the prediction, f̂(x), from a hard-to-interpret model such as a neural network
- Fit an interpretable surrogate model, f̂_surrogate, to the prediction:

    f̂(x^i) = f̂_surrogate(x^i) + ε^i

- E.g., fit a regression tree or an OLS regression:

    f̂(x^i) = x^i β + ε^i

- Use standard metrics such as R^2 to ensure an appropriate fit
- Interpret the surrogate model
Model-Agnostic Methods: Shapley Values

Shapley values
- Method from game theory to allocate a reward fairly between players
- In ML, the features are the players and the reward could be R^2
- Notation:
  - SV_j is the Shapley value (importance) of feature j
  - C is a coalition of features
  - C ∪ {j} is the coalition with j added
  - |C| is the number of features in the coalition
  - K is the number of features, K̃ = {1, ..., K}, and K̃ \ {j} is the set excluding j
  - V(C) is the value function that determines the "worth" of coalition C,
    e.g., the R^2 of a model that uses the coalition's features
- The Shapley value of feature j is essentially the average marginal contribution of j
  when added to a coalition:

    SV_j = (1/K) Σ_{C ⊆ K̃\{j}} (K−1 choose |C|)^{-1} (V(C ∪ {j}) − V(C))

- SV calculation is computationally demanding
  - Feasible with OLS and a limited set of features
  - But even with OLS, the computational burden can be too large
  - Recent advances in ML make calculating approximate SVs feasible for
    complex models with many features (Lundberg and Lee, 2017)
  - For tree-based models, efficient methods exist (Lundberg et al., 2020)
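A minimal brute-force sketch of the exact Shapley formula above, with V(C) taken to be the in-sample R² of an OLS model on the coalition's features; this enumeration requires 2^(K−1) model fits per feature, so it is only feasible for a handful of features (approximate SHAP methods handle larger models).

```python
import numpy as np
from itertools import combinations
from math import comb
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, K = 1000, 4
X = rng.standard_normal((n, K))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)

def value(coalition):
    """V(C): in-sample R^2 of an OLS model using only the coalition's features."""
    if not coalition:
        return 0.0
    Xc = X[:, list(coalition)]
    return LinearRegression().fit(Xc, y).score(Xc, y)

def shapley(j):
    others = [k for k in range(K) if k != j]
    sv = 0.0
    for size in range(K):                          # coalition sizes 0, ..., K-1
        for C in combinations(others, size):
            weight = 1.0 / (K * comb(K - 1, size)) # (1/K) * (K-1 choose |C|)^(-1)
            sv += weight * (value(C + (j,)) - value(C))
    return sv

print([round(shapley(j), 3) for j in range(K)])
```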
Applications to Asset Pricing

Predicting Stock Returns with ML

[Exhibit from Gu et al. (2020); not preserved in the text extraction]
Number of Variables in SDF

[Figure from Kozak et al. (2020), Fig. 6 (WFR portfolios): the maximum OOS cross-sectional R^2 attained by a
model with n factors (on the x-axis) across all possible values of the prior root expected SR^2 (κ), for models
based on the original characteristics portfolios (solid) and PCs (dashed). Dotted lines depict −1 s.e. bounds of
the CV estimator. The sample is daily from September 1964 to December 2017.]
Other Topics in ML

Other Topics in ML

- Directed acyclic graphs, double machine learning, and more
  - See the free new book and code: https://causalml-book.org/
- Standard errors for predictions
- Reinforcement learning
  - Designed to handle complex dynamic problems, sometimes called "approximate dynamic programming"
  - Super-human performance on several games such as chess and Go
  - Standard reference: Sutton and Barto (2018), freely available from
    http://www.incompleteideas.net/book/the-book-2nd.html
- Natural language processing (NLP) and textual analysis
  - Large language models (LLMs)
- How to make ML aware of trading costs
  - New paper by Jensen et al. (2022), covered in the last lecture
References Cited in Slides

Efron, B. and T. Hastie (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and
  Data Science, Volume 6. Cambridge University Press.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. The Review of Financial
  Studies 33(5), 2223–2273.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,
  and Prediction. Springer Series in Statistics. Springer.
Jensen, T. I., B. T. Kelly, S. Malamud, and L. H. Pedersen (2022). Machine learning about optimal portfolios.
  Working paper, Copenhagen Business School.
Kozak, S., S. Nagel, and S. Santosh (2020). Shrinking the cross-section. Journal of Financial Economics 135(2),
  271–292.
Lundberg, S. M., G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and
  S.-I. Lee (2020). From local explanations to global understanding with explainable AI for trees. Nature
  Machine Intelligence 2(1), 56–67.
Lundberg, S. M. and S.-I. Lee (2017). A unified approach to interpreting model predictions. In 31st Conference
  on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1–10.