
Machine Learning for Econometrics

Handout of Lecture 5

Bernard Salanié

22 July 2024



Boosting

Another general idea, especially useful for gradient-boosted decision trees (GBT):

we work with weak learners: e.g. shallow trees, highly penalized Lasso, small neural networks;

we recursively improve their performance on the observations they mispredict, by boosting the weight of these observations.



Boosting Algorithm (general)
Suppose that $y_i$ takes values in $Y$.

We want to minimize a loss $\sum_{i=1}^n L(y_i, \hat{y}_i)$.

We start with the value $\hat{y}_i^0 \equiv \theta \in Y$ that minimizes $\sum_{i=1}^n L(y_i, \theta)$
(e.g. if $y$ is continuous and we minimize the MSE, it is $\hat{E}_n y$).

At iteration $t$, given predictors $\hat{y}_i^{t-1}$ and a class of learners $F$, we choose $f_t \in F$ and a step $s_t$ to solve
$$\min_{s \in \mathbb{R},\, f \in F} \sum_{i=1}^n L\left(y_i,\, P_Y(\hat{y}_i^{t-1}, s f(x_i))\right)$$
where $P_Y$ makes us stay within $Y$ (if needed);

and we update the predictions:
$$\hat{y}_i^t = P_Y(\hat{y}_i^{t-1}, s_t f_t(x_i)).$$



In practice
We need to choose $P_Y$; e.g. if $Y = \{0, 1\}$ we can take $P_Y(a, b) = 1(a + b > 0)$.

The $\min_{s \in \mathbb{R},\, f \in F}$ is often penalized (rewarding parsimony).

Since the minimization may be a difficult problem, there are shortcuts.

Gradient boosting (sketched in code below): if, say, $y$ is continuous,
1. we compute the variable
$$\tilde{y}_i^t = -\frac{\partial L}{\partial \hat{y}}(y_i, \hat{y}_i^{t-1}),$$
which indicates how badly off we are with observation $i$;
2. we choose the $f_t \in F$ that fits it best (a simple problem);
3. then the $s$ that minimizes the loss from using $\hat{y}_i^{t-1} + s f_t(x_i)$ (additive training);
4. perhaps only on a small random sample (stochastic gradient descent).
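A minimal sketch of this recursion for squared-error loss, with depth-1 regression trees from scikit-learn as the weak learners; the closed-form line search, the shrinkage factor, and all names are illustrative choices, not part of the lecture.

```python
# Minimal gradient-boosting sketch for squared-error loss (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, depth=1, learning_rate=0.1):
    # step 0: the constant prediction that minimizes the MSE is the sample mean
    y_hat = np.full_like(y, y.mean(), dtype=float)
    learners, steps = [], []
    for t in range(T):
        # 1. for squared loss, the negative gradient is proportional to the residuals
        pseudo_residuals = y - y_hat
        # 2. fit a shallow tree to the pseudo-residuals
        f_t = DecisionTreeRegressor(max_depth=depth).fit(X, pseudo_residuals)
        pred_t = f_t.predict(X)
        # 3. closed-form line search for the step s, shrunk by a learning rate
        s_t = learning_rate * np.dot(pseudo_residuals, pred_t) / np.dot(pred_t, pred_t)
        # 4. additive update of the predictions
        y_hat += s_t * pred_t
        learners.append(f_t); steps.append(s_t)
    return learners, steps, y_hat
```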
Gradient-boosted trees

Let $v(x, f)$ be the average value of (continuous) $y$ in the leaf that contains $x$ in a shallow tree with splits $f$.

1. choose the class of trees, often the depth $D$;
2. start with $\hat{y}_i^0 = \hat{E}_n y$ for each $i$;
3. for $t = 1, \ldots, T$:
we choose the best-fitting tree $f_t$ of depth $D$ to minus the gradient,
and the step $s_t$ to minimize
$$\sum_{i=1}^n L\left(y_i,\, \hat{y}_i^{t-1} + s\, v(x_i, f_t)\right) + \ldots$$
we update $\hat{y}_i^t = \hat{y}_i^{t-1} + s_t v(x_i, f_t)$.



Tuning

→ we add a complexity penalty $c(f)$ in the loss function:
e.g. $+\lambda d(f)^2$ if $d(f)$ is the depth of tree $f$.

That is still a lot of trees to work through. . .

XGBoost uses quadratic approximations to accelerate the process.
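For reference, a hedged usage sketch with XGBoost's scikit-learn wrapper; the data and hyperparameter values are arbitrary illustrations, and `reg_lambda` penalizes leaf weights rather than the $\lambda d(f)^2$ of the slide.

```python
# Illustrative GBT fit with XGBoost's scikit-learn wrapper (values are arbitrary).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

model = XGBRegressor(
    n_estimators=500,      # number of boosting rounds T
    max_depth=3,           # depth D of each tree
    learning_rate=0.05,    # shrinkage on the step s_t
    reg_lambda=1.0,        # L2 complexity penalty on leaf weights
    subsample=0.8,         # random subsample per round (stochastic gradient boosting)
)
model.fit(X, y)
print(model.predict(X[:5]))
```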



AdaBoost
Popular for classification problems, say $y = \pm 1$.

We start with observation weights $w_i^0 = 1/n$.

At step $t = 1, \ldots, T$, we choose the learner $f_t \in F$ that minimizes the total weight of misclassified observations
$$M_t = \sum_{i=1}^n w_i^{t-1}\, 1(y_i \neq f(x_i)).$$

We multiply the weight of misclassified observations by a factor $\exp(s_t) = \sqrt{(1 - M_t)/M_t}$;
we divide the weight of well-classified observations by the same factor;
and we use a scale factor $W_t$ to make sure that $\sum_i w_i^t = 1$.

Our updated predictor is $F_t(x) \equiv 2 \cdot 1\left(\sum_{\tau=1}^t s_\tau f_\tau(x) > 0\right) - 1$.
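A minimal sketch of these updates, with weighted decision stumps from scikit-learn standing in for the minimization over $F$ (a standard proxy, not the lecture's prescription):

```python
# Minimal AdaBoost sketch following the weight-update rules above (y takes values ±1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)              # w_i^0 = 1/n
    learners, steps = [], []
    for t in range(T):
        # weighted stump as an approximate minimizer of the weighted misclassification
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = f_t.predict(X) != y
        M_t = w[miss].sum()               # total weight of misclassified observations
        s_t = 0.5 * np.log((1 - M_t) / M_t)   # exp(s_t) = sqrt((1 - M_t)/M_t)
        # boost misclassified weights, shrink the others, then renormalize
        w = w * np.where(miss, np.exp(s_t), np.exp(-s_t))
        w /= w.sum()
        learners.append(f_t); steps.append(s_t)

    def predict(X_new):
        score = sum(s * f.predict(X_new) for s, f in zip(steps, learners))
        return np.where(score > 0, 1, -1)  # F_t(x) = 2*1(sum > 0) - 1

    return predict
```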



Properties of AdaBoost

Let $\gamma_t = 1/2 - M_t$ (one half the excess of well-classified over misclassified).

It is “easy” to prove that the average probability of misclassification decreases at least as fast as
$$\exp\left(-2 \sum_{\tau=1}^t \gamma_\tau^2\right).$$

So if our weak learners are still strong enough that $\gamma_t > \gamma > 0$, we converge to perfect classification exponentially fast;
but of course we want to stop before that to avoid overfitting, hence we cross-validate.



Sketch of proof

We want to show that $\frac{1}{n} \sum_i 1(y_i \neq F_t(x_i))$ goes to zero exponentially in $t$.

1. first, note that $1(y_i \neq F_t(x_i)) \leq \exp(-y_i F_t(x_i))$;
2. show that $w_i^t = \exp\left(-y_i \sum_{\tau=1}^t s_\tau f_\tau(x_i)\right) \Big/ \left(n \prod_{\tau=1}^t W_\tau\right)$;
3. deduce that the proportion of misclassified observations behaves like $\prod_{\tau=1}^t W_\tau$;
4. prove that $W_\tau = 2\sqrt{M_\tau(1 - M_\tau)} = \sqrt{1 - 4\gamma_\tau^2}$ and conclude.
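To make the final "conclude" explicit (a standard step, in my wording): since $1 - x \leq e^{-x}$,
$$\prod_{\tau=1}^t W_\tau = \prod_{\tau=1}^t \sqrt{1 - 4\gamma_\tau^2} \leq \prod_{\tau=1}^t \exp\left(-2\gamma_\tau^2\right) = \exp\left(-2 \sum_{\tau=1}^t \gamma_\tau^2\right),$$
which, combined with step 3, gives the bound of the previous slide.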



Switching to Neurons

A real neuron.



Simplify, then Exaggerate

Simple 1950s model of a neuron (Rosenblatt).



What the Neuron Does

1. it receives inputs $x_1, \ldots, x_p$ (think of $p$ covariates);
2. it combines them linearly into $w'x + w_0$;
3. and it transmits a signal $\sigma(w'x + w_0)$ (think of $y = 0, 1$).

≡ a Generalized Linear Model (or a restricted single-index model);
restricted because the activation function $\sigma$ has a specific form: originally $\sigma(t) = 1(t > 0)$.
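A one-line rendering of this original unit as code (a sketch; the function name and arguments are mine):

```python
import numpy as np

# Rosenblatt-style neuron: linear combination of inputs, then the original 0-1 activation.
def neuron(x, w, w0):
    return float(np.dot(w, x) + w0 > 0)   # sigma(t) = 1(t > 0)
```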



Modern Activation Functions

Smoothing 0-1 to get a probability gave the sigmoid
$$\sigma(t) = \frac{1}{1 + \exp(-t)}$$
(known otherwise as the cdf of the logistic).

With more categories we use the multinomial logit (softmax)
$$\sigma_j(t) = \frac{\exp(t_j)}{\sum_k \exp(t_k)}.$$

Current MVP for continuous $y$: the ReLU (Rectified Linear Unit) function
$$\sigma(t) = \max(t, 0).$$


A Single-Layer Perceptron

Input layer + one hidden layer + output layer (K outputs).



Why is a neuronal layer a good idea?

Start with a bad idea: if we take $\sigma$ to be linear, we get something boring:
a linear model for regression,
a multinomial logit for classification.

More interesting: with a nonlinear activation function, we get a sort of series/sieves flexible method,
with a difference: only one basis function, many linear indices.

It will get more expressive when we add hidden layers.



Combining Neurons

We have $M$ neurons in a layer
→ each neuron $l$ has weights $w_{1l}$, transmits $\sigma(w_{1l}' x + w_{10l})$.

Signals combine, using weights $w_{2l}$, into a prediction
$$\hat{y} = g\left(w_2'\, \sigma(w_1 x + w_{10}) + w_{20}\right) = g\left(\sum_l w_{2l}\, \sigma\!\left(\sum_m w_{1lm} x_m + w_{10l}\right) + w_{20}\right)$$
where $g$ is the activation function of the output “layer”.

Regression: we take $g(t) \equiv t$.

Classification with $K$ classes: we use the softmax function (cf multinomial logit).


Fitting an SLP
Given a loss function $\bar{L}(w) = L(y, \hat{y}(w))$,
we do an epoch: one forward pass, one backward pass; and we repeat.

Forward pass: for given $w$, we compute $\hat{y}(w)$.

Backward pass: as in boosting, we compute the components of the loss (the errors),
we take their gradients wrt $w$, and we do approximate Newton-Raphson iterates on $w$.
Basically: $w^{(s+1)} = w^{(s)} - \varepsilon_s \bar{L}'(w^{(s)})$.

Problem: the $w$'s consist of weights between various layers;
how can we update them correctly and efficiently?



Backpropagation, or the Chain Rule
Suppose a regression problem with loss $\bar{L}(w) = \sum_{i=1}^n (y_i - \hat{y}_i(w))^2$.

We have
$$\frac{\partial \bar{L}}{\partial w_{2l}}(w^{(s)}) = -2 \sum_{i=1}^n \hat{r}_i^{(s)}(w)\, z_{il}$$
where $\hat{r}_i^{(s)} = y_i - \hat{y}_i^{(s)}$ is the residual and $z_{il} = \sigma(w_{1l}' x_i + w_{10l})$ is the signal of neuron $l$;

we do one Newton iteration to update $w_2$ and we move on:
$$\frac{\partial \bar{L}}{\partial w_{1lm}}(w^{(s)}) = -2 \sum_{i=1}^n \hat{r}_i^{(s)}(w)\, w_{2l}\, x_{im}\, 1\!\left(w_{1l}' x_i > 0\right)$$
(since with ReLU, $\sigma'(t) = 1(t > 0)$);

and we update $w_1$.

Done (with automatic differentiation, in practice).
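A hand-coded version of these two gradients for one descent step, as a sketch (shapes and names are mine; biases are left untouched, as on the slide):

```python
import numpy as np

# Shapes: X is (n, p), y is (n,), W1 is (M, p), w10 is (M,), w2 is (M,); eps is the step size.
def gradient_step(X, y, W1, w10, w2, w20, eps=1e-3):
    Z = np.maximum(X @ W1.T + w10, 0.0)            # z_il = sigma(w_{1l}'x_i + w_{10l})
    r = y - (Z @ w2 + w20)                         # residuals r_i
    grad_w2 = -2.0 * Z.T @ r                       # -2 sum_i r_i z_il
    active = (X @ W1.T > 0.0).astype(float)        # 1(w_{1l}'x_i > 0)
    grad_W1 = -2.0 * (active * r[:, None] * w2[None, :]).T @ X   # -2 sum_i r_i w_2l x_im 1(.)
    return W1 - eps * grad_W1, w2 - eps * grad_w2
```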



Deep Learning
We add more hidden layers ($\geq 2$ is “deep”...)

Typically in metrics, they are fully connected:
every input node to every node in hidden layer 1,
every node in hidden layer $k$ to every node in hidden layer $k + 1$,
every node in the last hidden layer to every node in the output layer.



A Multilayer Perceptron

(a very small one! for regression)



Choosing Parameters

With $D$ hidden layers of $M$ neurons each, we have $pM + (D - 1)M^2 + KM$ parameters.

How should we choose $D$ and $M$?
And the activation function?
And other hyperparameters? (see later: dropout)
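For instance, with $p = 100$, $M = 50$, $D = 3$ and $K = 1$ (illustrative values, not from the lecture), the count is $100 \cdot 50 + 2 \cdot 50^2 + 50 = 10{,}050$ weights, before counting the bias terms.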



Deep Helps

Theory and experience show that deeper-cum-regularization is better than wider:
“too small” $D$ does not fit well;
“too large” $D$ is OK if many weights are small.

(Roughly speaking) if
the unknown $E(y|x)$ is very smooth,
the width $M$ goes to infinity faster than $n^{1/4}$,
the depth $D$ goes to infinity like $\log n$,
then the RMSE goes to zero slightly more slowly than $1/\sqrt{n}$:
good enough to use as a first-stage estimator.



Why do we need D → ∞?

Approximation theory says that when $\sigma$ is continuous but not a polynomial,
linear combinations of functions $x \to \sigma(a + bx)$ can approximate any continuous function arbitrarily closely (on compact sets).

So why not just one layer?

(Loose) answer: if the network is not deep, it takes a very large number of units to get a good approximation.



Avoiding Overfitting

As we add more epochs, the loss on the training sample can only go down.
We risk overfitting, as always.

→ we keep a validation/test sample in reserve and, after each epoch, we compute the loss on the test sample;
we stop when the loss on the validation sample stops decreasing (a sketch follows below).

We often also add a penalty term $\lambda \|w\|_q^q$ to $\bar{L}(w)$;
the norm $q$ can be quadratic (cf ridge) or $q = 1$ (cf Lasso);
$\lambda$ is the weight decay.
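A schematic early-stopping loop; `train_one_epoch` and `valid_loss` are placeholders for whatever network and loss one uses, and the `patience` tolerance is a common practical addition rather than part of the slide:

```python
# Early stopping: train for many epochs, track the validation loss, and keep the
# weights from the epoch where it was lowest.
import copy

def fit_with_early_stopping(model, train_data, valid_data,
                            train_one_epoch, valid_loss,
                            max_epochs=200, patience=10):
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # one forward + backward pass over the data
        loss = valid_loss(model, valid_data)     # loss on the held-out sample
        if loss < best_loss:
            best_loss, best_model, since_best = loss, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:           # validation loss has stopped decreasing
                break
    return best_model
```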



What we Want to See

Figure: Stop at the minimum of the validation loss

