
Machine Learning for Econometrics

Handout of Lecture 5

Bernard Salanié

22 July 2024



Boosting

Another general idea, especially useful for gradient-boosted decision trees (GBT):

we work with weak learners: e.g. shallow trees, highly penalized Lasso, small neural networks;

we recursively improve their performance on the observations they mispredict, by boosting the weight of these observations.



Boosting Algorithm (general)
Suppose that $y_i$ takes values in $Y$.

We want to minimize a loss $\sum_{i=1}^n L(y_i, \hat{y}_i)$.

We start with the value $\hat{y}_i^0 \equiv \theta \in Y$ that minimizes $\sum_{i=1}^n L(y_i, \theta)$
(e.g. if $y$ is continuous and we minimize the MSE, it is $\hat{E}_n y$).

At iteration $t$, given predictors $\hat{y}_i^{t-1}$ and a class of learners $F$, we choose $f_t \in F$ and a step $s_t$ to solve
$$\min_{s \in \mathbb{R},\, f \in F} \sum_{i=1}^n L\left(y_i,\, P_Y(\hat{y}_i^{t-1}, s f(x_i))\right)$$
where $P_Y$ makes us stay within $Y$ (if needed);

and we update the predictions:
$$\hat{y}_i^t = P_Y(\hat{y}_i^{t-1}, s_t f_t(x_i)).$$



In practice
We need to choose $P_Y$; e.g. if $Y = \{0, 1\}$ we can take $P_Y(a, b) = 1(a + b > 0)$.

The $\min_{s \in \mathbb{R},\, f \in F}$ is often penalized (rewarding parsimony).

Since the minimization may be a difficult problem, there are shortcuts.

Gradient boosting (sketched in code below): if, say, $y$ is continuous,
1. we compute the variable
$$\tilde{y}_i^t = -\frac{\partial L}{\partial \hat{y}}(y_i, \hat{y}_i^{t-1}),$$
which indicates how badly off we are with observation $i$;
2. we choose the $f_t \in F$ that fits it best (a simple problem);
3. then the $s$ that minimizes the loss from using $\hat{y}_i^{t-1} + s f_t(x_i)$ (additive training);
4. perhaps only on a small random sample (stochastic gradient descent).
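A minimal sketch of this recursion for squared-error loss, with depth-1 regression trees from scikit-learn as the weak learners; the closed-form line search, the shrinkage factor, and all names are illustrative choices, not part of the lecture.

```python
# Minimal gradient-boosting sketch for squared-error loss (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, depth=1, learning_rate=0.1):
    # step 0: the constant prediction that minimizes the MSE is the sample mean
    y_hat = np.full_like(y, y.mean(), dtype=float)
    learners, steps = [], []
    for t in range(T):
        # 1. for squared loss, the negative gradient is proportional to the residuals
        pseudo_residuals = y - y_hat
        # 2. fit a shallow tree to the pseudo-residuals
        f_t = DecisionTreeRegressor(max_depth=depth).fit(X, pseudo_residuals)
        pred_t = f_t.predict(X)
        # 3. closed-form line search for the step s, shrunk by a learning rate
        s_t = learning_rate * np.dot(pseudo_residuals, pred_t) / np.dot(pred_t, pred_t)
        # 4. additive update of the predictions
        y_hat += s_t * pred_t
        learners.append(f_t); steps.append(s_t)
    return learners, steps, y_hat
```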
Gradient-boosted trees

Let $v(x, f)$ be the average value of (continuous) $y$ in the leaf that contains $x$ in a shallow tree with splits $f$.

1. choose the class of trees, often the depth $D$;
2. start with $\hat{y}_i^0 = \hat{E}_n y$ for each $i$;
3. for $t = 1, \ldots, T$:
we choose the best-fitting tree $f_t$ of depth $D$ to minus the gradient,
and the step $s_t$ to minimize
$$\sum_{i=1}^n L\left(y_i,\, \hat{y}_i^{t-1} + s\, v(x_i, f_t)\right) + \ldots$$
we update $\hat{y}_i^t = \hat{y}_i^{t-1} + s_t v(x_i, f_t)$.



Tuning

→ we add a complexity penalty $c(f)$ in the loss function:
e.g. $+\lambda d(f)^2$ if $d(f)$ is the depth of tree $f$.

That is still a lot of trees to work through. . .

XGBoost uses quadratic approximations to accelerate the process.
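For reference, a hedged usage sketch with XGBoost's scikit-learn wrapper; the data and hyperparameter values are arbitrary illustrations, and `reg_lambda` penalizes leaf weights rather than the $\lambda d(f)^2$ of the slide.

```python
# Illustrative GBT fit with XGBoost's scikit-learn wrapper (values are arbitrary).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

model = XGBRegressor(
    n_estimators=500,      # number of boosting rounds T
    max_depth=3,           # depth D of each tree
    learning_rate=0.05,    # shrinkage on the step s_t
    reg_lambda=1.0,        # L2 complexity penalty on leaf weights
    subsample=0.8,         # random subsample per round (stochastic gradient boosting)
)
model.fit(X, y)
print(model.predict(X[:5]))
```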



AdaBoost
Popular for classification problems, say $y = \pm 1$.

We start with observation weights $w_i^0 = 1/n$.

At step $t = 1, \ldots, T$, we choose the learner $f_t \in F$ that minimizes the total weight of misclassified observations
$$M_t = \sum_{i=1}^n w_i^{t-1}\, 1(y_i \neq f(x_i)).$$

We multiply the weight of misclassified observations by a factor $\exp(s_t) = \sqrt{(1 - M_t)/M_t}$;
we divide the weight of well-classified observations by the same factor;
and we use a scale factor $W_t$ to make sure that $\sum_i w_i^t = 1$.

Our updated predictor is $F_t(x) \equiv 2 \cdot 1\left(\sum_{\tau=1}^t s_\tau f_\tau(x) > 0\right) - 1$.
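A minimal sketch of these updates, with weighted decision stumps from scikit-learn standing in for the minimization over $F$ (a standard proxy, not the lecture's prescription):

```python
# Minimal AdaBoost sketch following the weight-update rules above (y takes values ±1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)              # w_i^0 = 1/n
    learners, steps = [], []
    for t in range(T):
        # weighted stump as an approximate minimizer of the weighted misclassification
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = f_t.predict(X) != y
        M_t = w[miss].sum()               # total weight of misclassified observations
        s_t = 0.5 * np.log((1 - M_t) / M_t)   # exp(s_t) = sqrt((1 - M_t)/M_t)
        # boost misclassified weights, shrink the others, then renormalize
        w = w * np.where(miss, np.exp(s_t), np.exp(-s_t))
        w /= w.sum()
        learners.append(f_t); steps.append(s_t)

    def predict(X_new):
        score = sum(s * f.predict(X_new) for s, f in zip(steps, learners))
        return np.where(score > 0, 1, -1)  # F_t(x) = 2*1(sum > 0) - 1

    return predict
```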



Properties of AdaBoost

Let $\gamma_t = 1/2 - M_t$ (one half the excess of well-classified over misclassified).

It is “easy” to prove that the average probability of misclassification decreases at least as fast as
$$\exp\left(-2 \sum_{\tau=1}^t \gamma_\tau^2\right).$$

So if our weak learners are still strong enough that $\gamma_t > \gamma > 0$, we converge to perfect classification exponentially fast;
but of course we want to stop before that to avoid overfitting, hence we cross-validate.



Sketch of proof

We want to show that $\frac{1}{n} \sum_i 1(y_i \neq F_t(x_i))$ goes to zero exponentially in $t$.

1. first, note that $1(y_i \neq F_t(x_i)) \leq \exp(-y_i F_t(x_i))$;
2. show that $w_i^t = \exp\left(-y_i \sum_{\tau=1}^t s_\tau f_\tau(x_i)\right) \Big/ \left(n \prod_{\tau=1}^t W_\tau\right)$;
3. deduce that the proportion of misclassified observations behaves like $\prod_{\tau=1}^t W_\tau$;
4. prove that $W_\tau = 2\sqrt{M_\tau(1 - M_\tau)} = \sqrt{1 - 4\gamma_\tau^2}$ and conclude.
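To make the final "conclude" explicit (a standard step, in my wording): since $1 - x \leq e^{-x}$,
$$\prod_{\tau=1}^t W_\tau = \prod_{\tau=1}^t \sqrt{1 - 4\gamma_\tau^2} \leq \prod_{\tau=1}^t \exp\left(-2\gamma_\tau^2\right) = \exp\left(-2 \sum_{\tau=1}^t \gamma_\tau^2\right),$$
which, combined with step 3, gives the bound of the previous slide.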



Switching to Neurons

A real neuron.



Simplify, then Exaggerate

Simple 1950s model of a neuron (Rosenblatt).



What the Neuron Does

1. it receives inputs $x_1, \ldots, x_p$ (think of $p$ covariates);
2. it combines them linearly into $w'x + w_0$;
3. and it transmits a signal $\sigma(w'x + w_0)$ (think of $y = 0, 1$).

≡ a Generalized Linear Model (or a restricted single-index model);
restricted because the activation function $\sigma$ has a specific form: originally $\sigma(t) = 1(t > 0)$.
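A one-line rendering of this original unit as code (a sketch; the function name and arguments are mine):

```python
import numpy as np

# Rosenblatt-style neuron: linear combination of inputs, then the original 0-1 activation.
def neuron(x, w, w0):
    return float(np.dot(w, x) + w0 > 0)   # sigma(t) = 1(t > 0)
```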



Modern Activation Functions

Smoothing 0-1 to get a probability gave the sigmoid
$$\sigma(t) = \frac{1}{1 + \exp(-t)}$$
(known otherwise as the cdf of the logistic).

With more categories we use the multinomial logit (softmax)
$$\sigma_j(t) = \frac{\exp(t_j)}{\sum_k \exp(t_k)}.$$

Current MVP for continuous $y$: the ReLU (Rectified Linear Unit) function
$$\sigma(t) = \max(t, 0).$$


A Single-Layer Perceptron

Input layer + one hidden layer + output layer (K outputs).



Why is a neuronal layer a good idea?

Start with a bad idea: if we take $\sigma$ to be linear, we get something boring:
a linear model for regression,
a multinomial logit for classification.

More interesting: with a nonlinear activation function, we get a sort of series/sieves flexible method,
with a difference: only one basis function, many linear indices.

It will get more expressive when we add hidden layers.



Combining Neurons

We have $M$ neurons in a layer
→ each neuron $l$ has weights $w_{1l}$, transmits $\sigma(w_{1l}' x + w_{10l})$.

Signals combine, using weights $w_{2l}$, into a prediction
$$\hat{y} = g\left(w_2'\, \sigma(w_1 x + w_{10}) + w_{20}\right) = g\left(\sum_l w_{2l}\, \sigma\!\left(\sum_m w_{1lm} x_m + w_{10l}\right) + w_{20}\right)$$
where $g$ is the activation function of the output “layer”.

Regression: we take $g(t) \equiv t$.

Classification with $K$ classes: we use the softmax function (cf multinomial logit).


Fitting an SLP
Given a loss function $\bar{L}(w) = L(y, \hat{y}(w))$,
we do an epoch: one forward pass, one backward pass; and we repeat.

Forward pass: for given $w$, we compute $\hat{y}(w)$.

Backward pass: as in boosting, we compute the components of the loss (the errors),
we take their gradients wrt $w$, and we do approximate Newton-Raphson iterates on $w$.
Basically: $w^{(s+1)} = w^{(s)} - \varepsilon_s \bar{L}'(w^{(s)})$.

Problem: the $w$'s consist of weights between various layers;
how can we update them correctly and efficiently?



Backpropagation, or the Chain Rule
Suppose a regression problem with loss $\bar{L}(w) = \sum_{i=1}^n (y_i - \hat{y}_i(w))^2$.

We have
$$\frac{\partial \bar{L}}{\partial w_{2l}}(w^{(s)}) = -2 \sum_{i=1}^n \hat{r}_i^{(s)}(w)\, z_{il}$$
where $\hat{r}_i^{(s)} = y_i - \hat{y}_i^{(s)}$ is the residual and $z_{il} = \sigma(w_{1l}' x_i + w_{10l})$ is the signal of neuron $l$;

we do one Newton iteration to update $w_2$ and we move on:
$$\frac{\partial \bar{L}}{\partial w_{1lm}}(w^{(s)}) = -2 \sum_{i=1}^n \hat{r}_i^{(s)}(w)\, w_{2l}\, x_{im}\, 1\!\left(w_{1l}' x_i > 0\right)$$
(since with ReLU, $\sigma'(t) = 1(t > 0)$);

and we update $w_1$.

Done (with automatic differentiation, in practice).
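A hand-coded version of these two gradients for one descent step, as a sketch (shapes and names are mine; biases are left untouched, as on the slide):

```python
import numpy as np

# Shapes: X is (n, p), y is (n,), W1 is (M, p), w10 is (M,), w2 is (M,); eps is the step size.
def gradient_step(X, y, W1, w10, w2, w20, eps=1e-3):
    Z = np.maximum(X @ W1.T + w10, 0.0)            # z_il = sigma(w_{1l}'x_i + w_{10l})
    r = y - (Z @ w2 + w20)                         # residuals r_i
    grad_w2 = -2.0 * Z.T @ r                       # -2 sum_i r_i z_il
    active = (X @ W1.T > 0.0).astype(float)        # 1(w_{1l}'x_i > 0)
    grad_W1 = -2.0 * (active * r[:, None] * w2[None, :]).T @ X   # -2 sum_i r_i w_2l x_im 1(.)
    return W1 - eps * grad_W1, w2 - eps * grad_w2
```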



Deep Learning
We add more hidden layers ($\geq 2$ is “deep”...)

Typically in metrics, they are fully connected:
every input node to every node in hidden layer 1,
every node in hidden layer $k$ to every node in hidden layer $k + 1$,
every node in the last hidden layer to every node in the output layer.



A Multilayer Perceptron

(a very small one! for regression)



Choosing Parameters

With $D$ hidden layers of $M$ neurons each, we have $pM + (D - 1)M^2 + KM$ parameters.

How should we choose $D$ and $M$?
And the activation function?
And other hyperparameters? (see later: dropout)
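For instance, with $p = 100$, $M = 50$, $D = 3$ and $K = 1$ (illustrative values, not from the lecture), the count is $100 \cdot 50 + 2 \cdot 50^2 + 50 = 10{,}050$ weights, before counting the bias terms.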



Deep Helps

Theory and experience show that deeper-cum-regularization is better than wider:
“too small” $D$ does not fit well;
“too large” $D$ is OK if many weights are small.

(Roughly speaking) if
the unknown $E(y|x)$ is very smooth,
the width $M$ goes to infinity faster than $n^{1/4}$,
the depth $D$ goes to infinity like $\log n$,
then the RMSE goes to zero slightly more slowly than $1/\sqrt{n}$:
good enough to use as a first-stage estimator.



Why do we need D → ∞?

Approximation theory says that when $\sigma$ is continuous but not a polynomial,
linear combinations of functions $x \to \sigma(a + bx)$ can approximate any continuous function arbitrarily closely (on compact sets).

So why not just one layer?

(Loose) answer: if the network is not deep, it takes a very large number of units to get a good approximation.



Avoiding Overfitting

As we add more epochs, the loss on the training sample can only go down.
We risk overfitting, as always.

→ we keep a validation/test sample in reserve and, after each epoch, we compute the loss on the test sample;
we stop when the loss on the validation sample stops decreasing (a sketch follows below).

We often also add a penalty term $\lambda \|w\|_q^q$ to $\bar{L}(w)$;
the norm $q$ can be quadratic (cf ridge) or $q = 1$ (cf Lasso);
$\lambda$ is the weight decay.
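A schematic early-stopping loop; `train_one_epoch` and `valid_loss` are placeholders for whatever network and loss one uses, and the `patience` tolerance is a common practical addition rather than part of the slide:

```python
# Early stopping: train for many epochs, track the validation loss, and keep the
# weights from the epoch where it was lowest.
import copy

def fit_with_early_stopping(model, train_data, valid_data,
                            train_one_epoch, valid_loss,
                            max_epochs=200, patience=10):
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # one forward + backward pass over the data
        loss = valid_loss(model, valid_data)     # loss on the held-out sample
        if loss < best_loss:
            best_loss, best_model, since_best = loss, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:           # validation loss has stopped decreasing
                break
    return best_model
```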



What we Want to See

Figure: Stop at the minimum of the validation loss

