
MY474: Applied Machine Learning for Social Science
Lecture 4: Gradient Descent, Bootstrap, Cross-Validation,
Hyperparameters

Blake Miller

26 January 2023
Agenda

1. Gradient Descent
2. The Bootstrap
3. Cross-Validation
4. Hyperparameter optimization
I Grid search
I Random search
I Bayesian search
Gradient Descent
Gradient Descent

The gradient, represented by the blue arrows, denotes the direction of greatest change of
a scalar function. The values of the function are shown in greyscale and increase in
value from white (low) to dark (high). Source: Wikimedia Commons
Gradient Descent visualized with Simple Linear Regression

Source: Alykhan Tejani


Gradient Descent
Our goal is to estimate β by minimizing the negative log-likelihood loss −ln(ℓ(β; y_i, x_i)):

  β̂ = arg min_β −ln(ℓ(β))

I A loss function measures how far away the predictions ŷ are from the true values y.
I The gradient ∇f(·) is the vector of partial derivatives; its direction gives the greatest rate of increase of the function.
I For a convex function, such as many loss functions, moving in the negative direction of the gradient takes you toward the minimum of that function.
I In other words, the gradient of the loss gives the direction of the prediction error, and we can move in the opposite direction to minimize that error.
I How much we move at each step is a parameter λ called the learning rate.
Learning Rate

I λ too small: the algorithm approaches the minimum very slowly
I λ too big: the algorithm jumps around the minimum and converges very slowly (if at all)
I We want to choose a λ that minimizes the time to convergence and still gives us the right answer.
(Batch) Gradient Descent

Algorithm 1: (Batch) Gradient Descent

Set tolerance parameter ε to a positive real value;
Set learning rate parameter λ to a positive real value;
Initialize β̂ to a random non-zero vector;
while ‖∇L(ŷ, y)‖ > ε do
    β̂ ← β̂ − λ∇L(ŷ, y);
end
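A minimal Python sketch of this loop (not from the slides), applied to the logistic-regression loss above; the learning rate, tolerance, and data conventions are illustrative assumptions:

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, tol=1e-6, max_iter=10_000):
    """Minimize the logistic-regression negative log-likelihood by batch gradient descent."""
    beta = np.random.randn(X.shape[1]) * 0.01      # random non-zero starting vector
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))            # predicted probabilities for all observations
        grad = X.T @ (p - y) / len(y)              # gradient of the mean loss (full batch)
        if np.linalg.norm(grad) <= tol:            # stop once the gradient norm is below tolerance
            break
        beta -= lr * grad                          # step in the negative gradient direction
    return beta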
Stochastic Gradient Descent

I With large training data, batch gradient descent can become very computationally expensive.
I We must sum over all examples of the training data at each step of the iteration:

  ln(ℓ(β; y_i, x_i)) = Σ_{i=1}^{n} [ y_i x_i′β − log(1 + exp(x_i′β)) ]

I Instead of computing the full batch gradient (with all training data), we compute a stochastic version of the gradient with a single randomly selected training observation.
I This is very efficient in practice; it cuts down training time significantly while performing similarly to (and sometimes better than) batch gradient descent.
Stochastic Gradient Descent

Algorithm 2: Stochastic Gradient Descent

Set learning rate parameter λ to a positive real value;
Initialize β̂ to a random non-zero vector;
for e ← 0 to E do
    Shuffle D to prevent cycles;
    foreach (x_i, y_i) ∈ D do
        Compute ŷ_i ← h(x_i);
        Compute the loss L_i(ŷ_i, y_i);
        Compute the gradient of the loss ∇L_i(ŷ_i, y_i);
        Update parameters β̂ ← β̂ − λ∇L_i(ŷ_i, y_i);
    end
end
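A minimal Python sketch of the same idea (not from the slides), again for the logistic-regression loss; the epoch count and learning rate are illustrative assumptions:

import numpy as np

def sgd(X, y, lr=0.01, epochs=20, seed=0):
    """Stochastic gradient descent: one randomly ordered observation per update."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.01, size=X.shape[1])  # random non-zero starting vector
    for _ in range(epochs):
        for i in rng.permutation(len(y)):           # shuffle D to prevent cycles
            p_i = 1 / (1 + np.exp(-X[i] @ beta))    # prediction ŷ_i for a single observation
            grad_i = (p_i - y[i]) * X[i]            # gradient of that observation's loss
            beta -= lr * grad_i                     # update the parameters
    return beta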
Cross-Validation and the Bootstrap

I In this section we discuss two resampling methods: cross-validation and the bootstrap.
I These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
I For example, they provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.
The Bootstrap
Fundamentals of Bootstrap

I Applied when there is no theory to compute standard errors, confidence intervals, etc., or the theory cannot be trusted
I May not always work either, but works quite generally
I Fundamental idea: pretend the observed data is the population
I Resample the observed data to create multiple samples
I From each sample, estimate parameters and assess variability
History of the Bootstrap

The phrase “pull oneself up by one’s bootstraps” is thought to come from “The
Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe: The Baron had
fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to
pick himself up by his own bootstraps.

I The name comes from the phrase “pull oneself up by one’s bootstraps.”
I The bootstrap was invented by Bradley Efron in 1979 as an extension of the jackknife, another resampling method.
I Since then, Efron and others have developed several extensions of the bootstrap (e.g. Bayesian, parametric, etc.)
The Bootstrap
Figure: a training sample Z = (z_1, z_2, . . . , z_N) is resampled into B bootstrap samples Z*1, Z*2, . . . , Z*B, and from each we compute a bootstrap replication S(Z*1), S(Z*2), . . . , S(Z*B).
The Bootstrap

1. Start with the training set Z = (z_1, . . . , z_N); z_i = (x_i, y_i).
2. Randomly draw B bootstrap samples Z*, each of size N, with replacement from the training data.
3. Fit a model to each Z*, obtaining an estimate α̂* for each of the B bootstrap samples.
4. Characterize the behavior of the fits over the B replications with a statistic S(α̂*).

S(α̂*) could be, for example, the bootstrap standard error

  SE_B(α̂) = sqrt( (1 / (B − 1)) Σ_{r=1}^{B} (α̂*_r − ᾱ*)² ),

where ᾱ* is the mean of the B bootstrap estimates (a code sketch of this procedure follows below).
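A minimal Python sketch of this procedure (not from the slides); the statistic here is the sample mean, chosen purely for illustration:

import numpy as np

def bootstrap_se(z, stat=np.mean, B=1000, seed=0):
    """Bootstrap standard error of a statistic computed on the sample z."""
    rng = np.random.default_rng(seed)
    n = len(z)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # draw n observations with replacement
        reps[b] = stat(z[idx])             # estimate on the b-th bootstrap sample
    return reps.std(ddof=1)                # standard deviation across the B replications

z = np.array([2.1, 4.3, 5.3, 3.8, 1.9])    # illustrative data
print(bootstrap_se(z))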
Example with 3 Observations
Original data (Z):

  Obs    X     Y
  1      4.3   2.4
  2      2.1   1.1
  3      5.3   2.8

Bootstrap samples are drawn from Z with replacement, e.g. Z*1 = {obs 3, obs 1, obs 3} and Z*2 = {obs 2, obs 3, obs 1}, up to Z*B; each bootstrap sample yields an estimate α̂*1, α̂*2, . . . , α̂*B.

A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.
Bootstrap World vs. Real World
Cross-Validation
Finding g(x) ≈ f(x)

I We aim to find a function g(x) that approximates an unknown function f(x).
I There are, however, many model specifications to choose from.
I Two important goals:
I Model selection: estimating the performance of different models in order to choose the best one.
I Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
I For model assessment, we would ideally have a large designated test set that is never used to train models; this is often not feasible.
I For model selection, we need to train several models and don’t want to throw away data.
Partitioning Data to Learn g(x)

Train | Validation | Test

A typical split of the data: 50% for training, and 25% each for validation and testing.

I Training set: a subsample used to fit a model (e.g. the training set can be used to estimate model parameters).
I Validation set: a subsample used for model selection.
I Test set: a subsample used only for model assessment, i.e. to measure the generalization error of a fully specified model.
Validation-Set Approach

I Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
I The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
I The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.
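A minimal scikit-learn sketch of the validation-set approach (not from the slides); the simulated data and the choice of linear regression are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # illustrative features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)   # illustrative response

# Randomly divide the sample into a training set and a validation (hold-out) set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)                  # fit on the training half
val_mse = mean_squared_error(y_val, model.predict(X_val))   # validation-set estimate of test error
print(val_mse)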
The Validation Process

A random splitting into two halves: left part is training set, right
part is validation set

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

%!!""!! #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!& !
Example: Polynomial Regression
I Goal: Compare performance of linear vs higher-order
polynomial terms in a linear regression
Validation-set MSE plotted against the degree of the polynomial (degrees 1–10). The left panel shows a single train/validation split; the right panel shows multiple different random splits.
Drawbacks of the Validation Set Approach

I The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
I In the validation approach, only a subset of the observations (those included in the training set rather than the validation set) are used to fit the model.
I This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set. Why?
K-fold Cross-Validation

I Estimates of generalization error from one train/validation split can be noisy, so we shuffle the data and average over K distinct validation partitions instead.
I K-fold cross-validation is a widely used approach for estimating test error.
I The estimates can be used to select the best model, and to give an idea of the test error of the final chosen model.
K-fold Cross-Validation

How it works:
1. Randomly divide the data into K parts of (roughly) equal
length
2. Hold out one of the parts to be the “test” data.
3. Use all remaining data to train the model.
4. Predict on the held-out part and compute the error rate.
5. Repeat steps 2-4 for each of the K parts.
6. Average all “test” errors to obtain the cross-validation error.
K-Fold Cross-Validation Algorithm

Algorithm 3: K-Fold Cross-Validation

Randomly partition the data D into K parts of (roughly) equal length, d_1, . . . , d_K;
Initialize an empty error vector E ← (e_1, . . . , e_K);
for k ← 1 to K do
    Tr_k ← D − d_k;
    Te_k ← d_k;
    Train model g(Tr_k);
    Predict the outcome of Te_k using model g(Tr_k), yielding ŷ_{Te_k};
    Compute error e_k from ŷ_{Te_k} and y_{Te_k};
    Insert e_k into the error vector E;
end
return Σ_{k=1}^{K} (n_k / n) e_k;
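A minimal Python sketch of this algorithm (not from the slides), assuming scikit-learn is available and using mean squared error as the per-fold error e_k; the model and data are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def kfold_cv_error(model, X, y, K=5, seed=0):
    """Weighted average of the K held-out-fold errors."""
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    total, n = 0.0, len(y)
    for train_idx, test_idx in kf.split(X):
        fit = model.fit(X[train_idx], y[train_idx])                      # train on the other K-1 parts
        e_k = mean_squared_error(y[test_idx], fit.predict(X[test_idx]))  # error on the held-out part
        total += (len(test_idx) / n) * e_k                               # weight by n_k / n
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # illustrative data
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)
print(kfold_cv_error(LinearRegression(), X, y))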
K-Fold Cross-Validation Details

Divide data into K roughly equal-sized parts (K = 5 here)


K-Fold Cross-Validation Details

I Let the K parts be C_1, C_2, . . . , C_K, where C_k denotes the indices of the observations in part k. There are n_k observations in part k: if n is a multiple of K, then n_k = n/K.
I Compute

  CV_(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,

  where MSE_k = Σ_{i∈C_k} (y_i − ŷ_i)² / n_k, and ŷ_i is the fit for observation i, obtained from the data with part k removed.

I Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
More About Cross-Validation

I K can be anything; popular values are 5, 10, and n.
I The cross-validated error rate tends to be closer to the true error rate than the apparent (training) error rate is.
I The computational cost can become a concern, particularly if there are many tuning parameters.
Leave One Out Cross-Validation (LOOCV)

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
%!
%!
%!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
Example: Polynomial Regression

LOOCV (left panel) and 10-fold CV (right panel): MSE plotted against the degree of the polynomial (degrees 1–10). On the right, each line represents a different partition of the data.


Other Issues with Cross-Validation

I Since each training set is only (K − 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. Why?
I This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier.
I K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
I Like our choice of models, our CV decision involves a bias-variance tradeoff.
Hyperparameter Selection
Hyperparameters

I Hyperparameters, also called tuning parameters, refer to specifications that change the behavior of the learning model (e.g. k in KNN) or the learning algorithm (e.g. the learning rate λ for gradient descent).
I Cross-validation is used to select the hyperparameters that will give us a g(x) that best generalizes to new data.
I The cross-validation procedure is performed within a hyperparameter search, which can be conducted in several ways.
Hyperparameters vs. Parameters

I While parameters like β in a regression are set during training, hyperparameters are set before training.
I Hyperparameters come in two main types:
I Model hyperparameters are attributes of models and usually refer to the flexibility of the fit.
I Algorithm hyperparameters refer to how the model is trained; they don’t affect model performance directly, but rather the quality and speed of learning.
Model Hyperparameter Example: Kernel KNN
I k in KNN is a model hyperparameter.
I KNN can also be extended to use additional model
hyperparameters.
I Ordinary KNN gives equal weight to all neighbors, the
equivalent of a “uniform” kernel.
I We could potentially improve performance by adding a weight
to each of these neighbors based on their distance from the
reference observation.
I Weighted KNN downweights neighbors farther from our target point:

  Ĉ(x) = Σ_{i=1}^{k} w_i · y_i,   with 0 < w_i < 1

I We want the weight w_i to be
I small if the distance to our reference point is large
I large if the distance to our reference point is small
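A minimal Python sketch of one way to implement this for classification (not from the slides); the Gaussian kernel and bandwidth are illustrative choices of weighting function:

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, bandwidth=1.0):
    """Classify x by a kernel-weighted vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)              # distance from x to every training point
    nn = np.argsort(dists)[:k]                               # indices of the k nearest neighbours
    w = np.exp(-(dists[nn] / bandwidth) ** 2)                # smaller distance => larger weight
    classes = np.unique(y_train)
    scores = [w[y_train[nn] == c].sum() for c in classes]    # weighted vote for each class
    return classes[np.argmax(scores)]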
Model Hyperparameter Example: Kernel KNN

We can use kernels that are functions of the distance between points. This
is another hyperparameter in addition to k.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

I Recall the KNN algorithm: in order to find the k nearest neighbors, we have to calculate the distance between a reference point and all of our training observations.
I This can be VERY expensive in a large dataset.
I As with many other learning algorithms, there are often computational shortcuts that get you close to the same answer (recall stochastic gradient descent).
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Say we have 2-dimensional training data like the following.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

First split a randomly selected feature at the median.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

Now you have a tree with two nodes.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

Continue to split the feature space along a random dimension until you
have no more than m observations in each bounding box. In this case
m = 2.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Each of these bounding boxes is represented by a terminal node on the following tree.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

When predicting the class of a new observation, we first determine which bounding box it is in.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Instead of calculating the distance between all observations in the training data, we restrict our search to the bounding box where the query/reference point resides.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Unfortunately, we oftentimes do not find the true nearest neighbor, but we get close!
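In scikit-learn, for example, the neighbour-search strategy is exposed as an algorithm hyperparameter of the KNN estimator; a brief sketch (not from the slides, data illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))                  # illustrative 2-dimensional training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Same model hyperparameter k; different algorithm hyperparameters.
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)
knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=30).fit(X, y)

# scikit-learn's k-d tree search is exact, so the predictions agree;
# the benefit is a faster neighbour search on large, low-dimensional data.
x_new = np.array([[0.2, -0.1]])
print(knn_brute.predict(x_new), knn_tree.predict(x_new))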
Grid Search

Here we see an example of a grid search over two hyperparameters of a support vector machine.

I Exhaustive search of all potential parameter combinations (see the code sketch below).
I Parameters, even when continuous, must be discretized.
I Suffers from the curse of dimensionality. Why?
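A minimal scikit-learn sketch of such a grid search (not from the slides); the grid values are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

# Every combination of C and gamma is tried (4 x 4 = 16 models),
# each evaluated with 5-fold cross-validation.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)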
Problems with Grid Search

I The search space can become extremely large as the number of hyperparameters or feature selection/preprocessing steps increases.
I One solution is to randomly sample a subset of possible models using a randomized hyperparameter search.
I Another solution is to use a surrogate model in sequential model-based optimization (SMBO).
Randomized Hyperparameter Search

I In randomized search, we treat each hyperparameter as a distribution and sample values from it at random.
I Randomized search does not exhaustively search the space of possible hyperparameter combinations, instead taking repeated random samples from distributions of hyperparameters.
I Because of this, a computational budget (the number of random samples to be drawn) can be set ahead of time.
I Randomized search can discretize parameters, but does not necessarily have to.
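A minimal scikit-learn sketch of randomized search (not from the slides); the distributions and the budget n_iter are illustrative:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

# Each hyperparameter is a distribution to sample from; n_iter fixes the budget.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)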
Randomized Hyperparameter Search

On the left we see the difference in the sampling of parameters in grid search vs. random search. The green distribution represents a hyperparameter that is important if tuned. The high peak represents the area of the optimal hyperparameter value.
Problems with Randomized Hyperparameter Search

I It can become quite expensive to check many different model configurations.
I Each new sample of hyperparameters must be used to train a new model on the training data, make predictions on the validation data, and calculate our performance metric.
I With randomized search, the algorithm has no memory of past successes and failures, and wanders randomly through the hyperparameter space.
I An alternative, Bayesian hyperparameter search, uses past records of hyperparameters and performance metrics to decide where to look next.
Bayesian Hyperparameter Search

I Uses a probabilistic model of hyperparameter values, mapping these hyperparameters probabilistically to a score on a chosen performance metric.
I The approach works as follows (a code sketch appears below):
1. Build a surrogate probability model that predicts how hyperparameter values will impact model performance, based on prior knowledge.
2. Find the hyperparameters that perform best on the surrogate.
3. Train the model with these hyperparameters on the training data; evaluate on the validation set.
4. Update the surrogate model, incorporating the new information about how hyperparameter values impact the model’s performance.
5. Repeat steps 2–4 until the maximum number of iterations or the time budget is reached.
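One widely used implementation of this sequential model-based idea is the Optuna library; a minimal sketch (not from the slides), assuming Optuna is installed and tuning an SVM's C and gamma as illustrative hyperparameters:

import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

def objective(trial):
    # The sampler proposes hyperparameter values; we return the validation score.
    C = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e0, log=True)
    return cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()

# Optuna's default sampler uses past trials to decide where to look next.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)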
AutoML

I Why not just take humans out of the loop altogether?
I ML algorithms (KNN, logistic regression, neural networks) can be treated like hyperparameters (choose the algorithm that minimizes validation error).
I Features and preprocessing steps can also be treated like hyperparameters (choose the feature representations that minimize validation error).
Quiz Review and Discussion

1. What is the difference between model and algorithm hyperparameters?
2. Why might it be better to use cross-validation rather than the
validation approach?
3. What do we mean by “resampling methods?”
4. How does the choice of k amount to a bias-variance tradeoff?
5. What is a learning rate? What happens if it is too big/small?
