MY474 Applied Machine Learning for Social Science
Lecture 4: Gradient Descent, Bootstrap, Cross-Validation,
Hyperparameters
Blake Miller
26 January 2023
Agenda
1. Gradient Descent
2. The Bootstrap
3. Cross-Validation
4. Hyperparameter optimization
- Grid search
- Random search
- Bayesian search
Gradient Descent
The gradient, represented by the blue arrows, denotes the direction of greatest change of a scalar function. The values of the function are shown in greyscale, increasing from white (low) to dark (high). Source: Wikimedia Commons
Gradient Descent visualized with Simple Linear Regression
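Since the figure for this slide is not reproduced here, the following is a minimal sketch of the same idea: gradient descent applied to a simple linear regression, repeatedly stepping the intercept and slope against the gradient of the mean squared error. The data, learning rate, and iteration count are illustrative assumptions, not values from the lecture.

    import numpy as np

    # Illustrative data (assumed, not from the lecture)
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

    b0, b1 = 0.0, 0.0   # initial coefficients
    lr = 0.01           # learning rate (step size)

    for _ in range(5000):
        resid = y - (b0 + b1 * x)             # residuals under the current fit
        grad_b0 = -2.0 * resid.mean()         # d(MSE)/d(b0)
        grad_b1 = -2.0 * (resid * x).mean()   # d(MSE)/d(b1)
        b0 -= lr * grad_b0                    # step downhill along the gradient
        b1 -= lr * grad_b1

    print(b0, b1)  # approaches the least-squares estimates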
The phrase “pull oneself up by one’s bootstraps” is thought to come from “The Surprising Adventures of Baron Munchausen” by Rudolf Erich Raspe: the Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
[Figure: bootstrap schematic. From the training sample Z = (z1, z2, . . . , zN), draw B bootstrap samples Z*1, Z*2, . . . , Z*B and compute the statistic of interest on each, giving S(Z*1), S(Z*2), . . . , S(Z*B).]
The Bootstrap
[Figure: graphical illustration of the bootstrap on a small sample of n = 3 observations.]

Original Data (Z):

Obs   X     Y
1     4.3   2.4
2     2.1   1.1
3     5.3   2.8

Each bootstrap data set Z*1, Z*2, . . . , Z*B contains three observations drawn with replacement from the original data, so an observation may appear more than once or not at all. Each bootstrap data set yields its own estimate of α, giving α̂*1, α̂*2, . . . , α̂*B.
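A minimal sketch of the procedure in the figure. The statistic here (the sample mean of Y) is only a stand-in for the α̂ of the figure, which could be any estimator; the number of bootstrap samples B is likewise an illustrative choice.

    import numpy as np

    # The three observations from the schematic above
    X = np.array([4.3, 2.1, 5.3])
    Y = np.array([2.4, 1.1, 2.8])
    data = np.column_stack([X, Y])

    def statistic(z):
        # Stand-in for the alpha-hat in the figure: here simply the mean of Y
        return z[:, 1].mean()

    B = 1000
    rng = np.random.default_rng(0)
    boot_estimates = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, len(data), size=len(data))  # draw rows with replacement
        boot_estimates[b] = statistic(data[idx])

    # The spread of the B estimates approximates the standard error of the statistic
    print(boot_estimates.std(ddof=1))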
A typical split of data: 50% for training, and 25% each for validation and testing
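One way to produce such a 50/25/25 split, sketched here with scikit-learn's train_test_split; the library choice and the synthetic data are assumptions for illustration, not something prescribed by the lecture.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Synthetic placeholder data (any feature matrix X and outcome y would do)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

    # 50% for training, then split the remaining 50% evenly into validation and test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=1)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)
    print(len(X_train), len(X_val), len(X_test))  # 100 50 50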
A random splitting into two halves: the left part is the training set, the right part is the validation set.
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
%!!""!! #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!& !
Example: Polynomial Regression
- Goal: compare the performance of linear vs. higher-order polynomial terms in a linear regression
[Figure: validation-set Mean Squared Error against degree of polynomial (1-10). Left panel shows a single split; right panel shows multiple random splits.]
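A sketch of the comparison behind this figure under assumed synthetic data (the lecture's own data set is not reproduced here): fit polynomials of degree 1 to 10 on a random training half and score each on the validation half.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=400)
    y = 1 + x - 0.5 * x**2 + rng.normal(0, 1, size=400)  # quadratic truth, for illustration

    # Single random split into training and validation halves
    idx = rng.permutation(len(x))
    train, val = idx[:200], idx[200:]

    for degree in range(1, 11):
        coefs = np.polyfit(x[train], y[train], degree)  # fit polynomial on the training half
        pred = np.polyval(coefs, x[val])                 # predict on the validation half
        mse = np.mean((y[val] - pred) ** 2)
        print(degree, round(mse, 3))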
Drawbacks of the Validation Set Approach
K-Fold Cross-Validation: how it works (a code sketch follows the steps):
1. Randomly divide the data into K parts (folds) of roughly equal size.
2. Hold out one of the parts to serve as the “test” data.
3. Use all remaining data to train the model.
4. Predict on the held-out part and compute the error rate.
5. Repeat steps 2-4 for each of the K parts.
6. Average all “test” errors to obtain the cross-validation error.
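The steps above, written out as a minimal sketch. The data, the model (a quadratic polynomial fit), and K = 10 are illustrative assumptions; the fold-size weights match the weighted-average formula on the next slide.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=300)
    y = 1 + x - 0.5 * x**2 + rng.normal(0, 1, size=300)

    K = 10
    folds = np.array_split(rng.permutation(len(x)), K)   # step 1: K roughly equal parts

    fold_mses, fold_sizes = [], []
    for k in range(K):
        test_idx = folds[k]                               # step 2: hold out fold k
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        coefs = np.polyfit(x[train_idx], y[train_idx], 2) # step 3: train on the rest
        pred = np.polyval(coefs, x[test_idx])             # step 4: predict on the held-out fold
        fold_mses.append(np.mean((y[test_idx] - pred) ** 2))
        fold_sizes.append(len(test_idx))

    # step 6: average the fold errors, weighting each by its share of the data
    cv_error = np.average(fold_mses, weights=fold_sizes)
    print(cv_error)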
K-Fold Cross-Validation Algorithm
CV(K) = Σ_{k=1}^{K} (n_k/n) MSE_k, where MSE_k = Σ_{i∈C_k} (y_i − ŷ_i)² / n_k, and ŷ_i is the fit for observation i, obtained with the fold containing observation i held out.
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
%!
%!
%!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
Example: Polynomial Regression
[Figure: Mean Squared Error against degree of polynomial (1-10). Left panel: LOOCV; right panel: 10-fold CV.]
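The same kind of comparison can be run with scikit-learn's cross_val_score; the library, the pipeline, and the synthetic data below are assumptions for illustration, not the code used to produce the lecture figure.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(200, 1))
    y = 1 + x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 1, size=200)

    for degree in range(1, 11):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        # LOOCV: one held-out observation per fold
        loo_mse = -cross_val_score(model, x, y, cv=LeaveOneOut(),
                                   scoring="neg_mean_squared_error").mean()
        # 10-fold CV: ten roughly equal folds
        kfold_mse = -cross_val_score(model, x, y, cv=KFold(10, shuffle=True, random_state=1),
                                     scoring="neg_mean_squared_error").mean()
        print(degree, round(loo_mse, 3), round(kfold_mse, 3))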
Ĉ(x) = Σ_{i=1}^{k} w_i · y_i,   0 < w_i < 1
We can use kernels that are functions of the distance between points. This
is another hyperparameter in addition to k.
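As a sketch, the weights w_i can come from a kernel applied to the distances of the k nearest neighbours. The Gaussian kernel and its bandwidth below are illustrative assumptions, not the lecture's choices.

    import numpy as np

    def weighted_knn_predict(X_train, y_train, x_new, k=5, bandwidth=1.0):
        # Distances from the query point to every training point
        dists = np.linalg.norm(X_train - x_new, axis=1)
        nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbours
        # Gaussian kernel turns distances into weights in (0, 1]
        w = np.exp(-(dists[nearest] ** 2) / (2 * bandwidth ** 2))
        w = w / w.sum()                                  # normalise so the weights sum to 1
        return np.sum(w * y_train[nearest])              # weighted average of neighbour outcomes

    # Tiny illustrative example
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(50, 2))
    y_train = X_train[:, 0] + rng.normal(0, 0.1, size=50)
    print(weighted_knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5, bandwidth=0.5))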
Algorithm Hyperparameters: K-D Tree vs. Brute Force
[Figure: a scatter of observations in two-dimensional feature space, recursively partitioned into bounding boxes by a K-D tree.]
Continue to split the feature space along a random dimension until you
have no more than m observations in each bounding box. In this case
m = 2.
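In scikit-learn, for example, this choice is exposed as the algorithm hyperparameter of the nearest-neighbour estimators, and the per-box observation count m roughly corresponds to leaf_size; the comparison below is a sketch under those assumptions, with arbitrary synthetic data.

    import time
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20000, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    for algo in ["kd_tree", "brute"]:
        # leaf_size = 2 echoes the m = 2 in the figure (only used by the tree-based search)
        clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo, leaf_size=2)
        clf.fit(X, y)
        start = time.perf_counter()
        clf.predict(X[:2000])
        print(algo, round(time.perf_counter() - start, 3), "seconds")

Both settings return the same predictions; only the time spent searching for neighbours differs.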
Algorithm Hyperparameters: K-D Tree vs. Brute Force