
MY474: Applied Machine Learning for Social Science
Lecture 4: Gradient Descent, Bootstrap, Cross-Validation,
Hyperparameters

Blake Miller

26 January 2023
Agenda

1. Gradient Descent
2. The Bootstrap
3. Cross-Validation
4. Hyperparameter optimization
I Grid search
I Random search
I Bayesian search
Gradient Descent
Gradient Descent

The gradient, represented by the blue arrows, denotes the direction of greatest change of
a scalar function. The values of the function are shown in greyscale and increase in
value from white (low) to dark (high). Source: Wikimedia Commons
Gradient Descent visualized with Simple Linear Regression

Source: Alykhan Tejani


Gradient Descent
Our goal is to estimate β by minimizing the negative log-likelihood loss −ln(ℓ(β; y_i, x_i)):

  β̂ = arg min_β −ln(ℓ(β))

I A loss function measures how far away the predictions ŷ are from the true values y.
I The gradient ∇f(·) is the vector of partial derivatives; its direction gives the greatest rate of increase of the function.
I For a convex function, such as many loss functions, moving in the negative direction of the gradient takes you toward the minimum of that function.
I In other words, the gradient of the loss gives the direction of the prediction error, and we can move in the opposite direction to minimize that error.
I How much we move at each step is a parameter λ called the learning rate.
Learning Rate

I λ too small: the algorithm approaches the minimum very slowly
I λ too big: the algorithm jumps around the minimum and converges very slowly (if at all)
I We want to choose a λ that minimizes the time to convergence and still gives us the right answer.
(Batch) Gradient Descent

Algorithm 1: (Batch) Gradient Descent

Set tolerance parameter ε to a positive real value;
Set learning rate parameter λ to a positive real value;
Initialize β̂ to a random non-zero vector;
while ‖∇L(ŷ, y)‖ > ε do
    β̂ ← β̂ − λ∇L(ŷ, y);
end
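A minimal Python sketch of this loop (not from the slides), applied to the logistic-regression loss above; the learning rate, tolerance, and data conventions are illustrative assumptions:

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, tol=1e-6, max_iter=10_000):
    """Minimize the logistic-regression negative log-likelihood by batch gradient descent."""
    beta = np.random.randn(X.shape[1]) * 0.01      # random non-zero starting vector
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))            # predicted probabilities for all observations
        grad = X.T @ (p - y) / len(y)              # gradient of the mean loss (full batch)
        if np.linalg.norm(grad) <= tol:            # stop once the gradient norm is below tolerance
            break
        beta -= lr * grad                          # step in the negative gradient direction
    return beta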
Stochastic Gradient Descent

I With large training data, batch gradient descent can become very computationally expensive.
I We must sum over all examples of the training data at each step of the iteration:

  ln(ℓ(β; y_i, x_i)) = Σ_{i=1}^{n} [ y_i x_i′β − log(1 + exp(x_i′β)) ]

I Instead of computing the full batch gradient (with all training data), we compute a stochastic version of the gradient with a single randomly selected training observation.
I This is very efficient in practice; it cuts down training time significantly while performing similarly to (and sometimes better than) batch gradient descent.
Stochastic Gradient Descent

Algorithm 2: Stochastic Gradient Descent

Set learning rate parameter λ to a positive real value;
Initialize β̂ to a random non-zero vector;
for e ← 0 to E do
    Shuffle D to prevent cycles;
    foreach (x_i, y_i) ∈ D do
        Compute ŷ_i ← h(x_i);
        Compute the loss L_i(ŷ_i, y_i);
        Compute the gradient of the loss ∇L_i(ŷ_i, y_i);
        Update parameters β̂ ← β̂ − λ∇L_i(ŷ_i, y_i);
    end
end
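A minimal Python sketch of the same idea (not from the slides), again for the logistic-regression loss; the epoch count and learning rate are illustrative assumptions:

import numpy as np

def sgd(X, y, lr=0.01, epochs=20, seed=0):
    """Stochastic gradient descent: one randomly ordered observation per update."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.01, size=X.shape[1])  # random non-zero starting vector
    for _ in range(epochs):
        for i in rng.permutation(len(y)):           # shuffle D to prevent cycles
            p_i = 1 / (1 + np.exp(-X[i] @ beta))    # prediction ŷ_i for a single observation
            grad_i = (p_i - y[i]) * X[i]            # gradient of that observation's loss
            beta -= lr * grad_i                     # update the parameters
    return beta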
Cross-Validation and the Bootstrap

I In this section we discuss two resampling methods: cross-validation and the bootstrap.
I These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
I For example, they provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.
The Bootstrap
Fundamentals of Bootstrap

I Applied when there is no theory to compute standard errors, confidence intervals, etc., or the theory cannot be trusted
I May not always work either, but works quite generally
I Fundamental idea: pretend the observed data is the population
I Resample the observed data to create multiple samples
I From each sample, estimate parameters and assess variability
History of the Bootstrap

The phrase “pull oneself up by one’s bootstraps” is thought to come from “The
Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe: The Baron had
fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to
pick himself up by his own bootstraps.

I The name comes from the phrase “pull oneself up by one’s bootstraps.”
I The bootstrap was invented by Bradley Efron in 1979 as an extension of the jackknife, another resampling method.
I Since then, Efron and others have developed several extensions of the bootstrap (e.g. Bayesian, parametric, etc.)
The Bootstrap
Figure: a training sample Z = (z_1, z_2, . . . , z_N) is resampled into B bootstrap samples Z*1, Z*2, . . . , Z*B, and from each we compute a bootstrap replication S(Z*1), S(Z*2), . . . , S(Z*B).
The Bootstrap

1. Start with the training set Z = (z_1, . . . , z_N); z_i = (x_i, y_i).
2. Randomly draw B bootstrap samples Z*, each of size N, with replacement from the training data.
3. Fit a model to each Z*, obtaining an estimate α̂* for each of the B bootstrap samples.
4. Characterize the behavior of the fits over the B replications with a statistic S(α̂*).

S(α̂*) could be, for example, the bootstrap standard error

  SE_B(α̂) = sqrt( (1 / (B − 1)) Σ_{r=1}^{B} (α̂*_r − ᾱ*)² ),

where ᾱ* is the mean of the B bootstrap estimates (a code sketch of this procedure follows below).
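A minimal Python sketch of this procedure (not from the slides); the statistic here is the sample mean, chosen purely for illustration:

import numpy as np

def bootstrap_se(z, stat=np.mean, B=1000, seed=0):
    """Bootstrap standard error of a statistic computed on the sample z."""
    rng = np.random.default_rng(seed)
    n = len(z)
    reps = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # draw n observations with replacement
        reps[b] = stat(z[idx])             # estimate on the b-th bootstrap sample
    return reps.std(ddof=1)                # standard deviation across the B replications

z = np.array([2.1, 4.3, 5.3, 3.8, 1.9])    # illustrative data
print(bootstrap_se(z))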
Example with 3 Observations
Original data (Z):

  Obs    X     Y
  1      4.3   2.4
  2      2.1   1.1
  3      5.3   2.8

Bootstrap samples are drawn from Z with replacement, e.g. Z*1 = {obs 3, obs 1, obs 3} and Z*2 = {obs 2, obs 3, obs 1}, up to Z*B; each bootstrap sample yields an estimate α̂*1, α̂*2, . . . , α̂*B.

A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of α.
Bootstrap World vs. Real World
Cross-Validation
Finding g(x) ≈ f(x)

I We aim to find a function g(x) that approximates an unknown function f(x).
I There are, however, many model specifications to choose from.
I Two important goals:
I Model selection: estimating the performance of different models in order to choose the best one.
I Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
I For model assessment, we would ideally have a large designated test set that is never used to train models; this is often not feasible.
I For model selection, we need to train several models and don’t want to throw away data.
Partitioning Data to Learn g(x)

Train | Validation | Test

A typical split of the data: 50% for training, and 25% each for validation and testing.

I Training set: a subsample used to fit a model (e.g. the training set can be used to estimate model parameters).
I Validation set: a subsample used for model selection.
I Test set: a subsample used only for model assessment, i.e. to measure the generalization error of a fully specified model.
Validation-Set Approach

I Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
I The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
I The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.
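A minimal scikit-learn sketch of the validation-set approach (not from the slides); the simulated data and the choice of linear regression are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # illustrative features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)   # illustrative response

# Randomly divide the sample into a training set and a validation (hold-out) set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)                  # fit on the training half
val_mse = mean_squared_error(y_val, model.predict(X_val))   # validation-set estimate of test error
print(val_mse)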
The Validation Process

A random splitting into two halves: left part is training set, right
part is validation set

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

%!!""!! #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!& !
Example: Polynomial Regression
I Goal: Compare performance of linear vs higher-order
polynomial terms in a linear regression
Validation-set MSE plotted against the degree of the polynomial (degrees 1–10). The left panel shows a single train/validation split; the right panel shows multiple different random splits.
Drawbacks of the Validation Set Approach

I The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
I In the validation approach, only a subset of the observations (those included in the training set rather than the validation set) are used to fit the model.
I This suggests that the validation-set error may tend to overestimate the test error for the model fit on the entire data set. Why?
K-fold Cross-Validation

I Estimates of generalization error from one train/validation split can be noisy, so we shuffle the data and average over K distinct validation partitions instead.
I K-fold cross-validation is a widely used approach for estimating test error.
I The estimates can be used to select the best model, and to give an idea of the test error of the final chosen model.
K-fold Cross-Validation

How it works:
1. Randomly divide the data into K parts of (roughly) equal
length
2. Hold out one of the parts to be the “test” data.
3. Use all remaining data to train the model.
4. Predict on the held-out part and compute the error rate.
5. Repeat steps 2-4 for each of the K parts.
6. Average all “test” errors to obtain the cross-validation error.
K-Fold Cross-Validation Algorithm

Algorithm 3: K-Fold Cross-Validation

Randomly partition the data D into K parts of (roughly) equal length, d_1, . . . , d_K;
Initialize an empty error vector E ← (e_1, . . . , e_K);
for k ← 1 to K do
    Tr_k ← D − d_k;
    Te_k ← d_k;
    Train model g(Tr_k);
    Predict the outcome of Te_k using model g(Tr_k), yielding ŷ_{Te_k};
    Compute error e_k from ŷ_{Te_k} and y_{Te_k};
    Insert e_k into the error vector E;
end
return Σ_{k=1}^{K} (n_k / n) e_k;
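A minimal Python sketch of this algorithm (not from the slides), assuming scikit-learn is available and using mean squared error as the per-fold error e_k; the model and data are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def kfold_cv_error(model, X, y, K=5, seed=0):
    """Weighted average of the K held-out-fold errors."""
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    total, n = 0.0, len(y)
    for train_idx, test_idx in kf.split(X):
        fit = model.fit(X[train_idx], y[train_idx])                      # train on the other K-1 parts
        e_k = mean_squared_error(y[test_idx], fit.predict(X[test_idx]))  # error on the held-out part
        total += (len(test_idx) / n) * e_k                               # weight by n_k / n
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                              # illustrative data
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)
print(kfold_cv_error(LinearRegression(), X, y))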
K-Fold Cross-Validation Details

Divide data into K roughly equal-sized parts (K = 5 here)


K-Fold Cross-Validation Details

I Let the K parts be C_1, C_2, . . . , C_K, where C_k denotes the indices of the observations in part k. There are n_k observations in part k: if n is a multiple of K, then n_k = n/K.
I Compute

  CV_(K) = Σ_{k=1}^{K} (n_k / n) MSE_k,

  where MSE_k = Σ_{i∈C_k} (y_i − ŷ_i)² / n_k, and ŷ_i is the fit for observation i, obtained from the data with part k removed.

I Setting K = n yields n-fold or leave-one-out cross-validation (LOOCV).
More About Cross-Validation

I K can be anything; popular values are 5, 10, and n.
I The cross-validated error rate tends to be closer to the true error rate than the apparent (training) error rate is.
I The computational cost can become a concern, particularly if there are many tuning parameters.
Leave One Out Cross-Validation (LOOCV)

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!

!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
%!
%!
%!
!"!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!$!
Example: Polynomial Regression

LOOCV (left panel) and 10-fold CV (right panel): MSE plotted against the degree of the polynomial (degrees 1–10). On the right, each line represents a different partition of the data.


Other Issues with Cross-Validation

I Since each training set is only (K − 1)/K as big as the original training set, the estimates of prediction error will typically be biased upward. Why?
I This bias is minimized when K = n (LOOCV), but this estimate has high variance, as noted earlier.
I K = 5 or 10 provides a good compromise for this bias-variance tradeoff.
I Like our choice of models, our CV decision involves a bias-variance tradeoff.
Hyperparameter Selection
Hyperparameters

I Hyperparameters, also called tuning parameters, refer to specifications that change the behavior of the learning model (e.g. k in KNN) or the learning algorithm (e.g. the learning rate λ for gradient descent).
I Cross-validation is used to select the hyperparameters that will give us a g(x) that best generalizes to new data.
I The cross-validation procedure is performed within a hyperparameter search, which can be conducted in several ways.
Hyperparameters vs. Parameters

I While parameters like β in a regression are set during training, hyperparameters are set before training.
I Hyperparameters come in two main types:
I Model hyperparameters are attributes of models and usually refer to the flexibility of the fit.
I Algorithm hyperparameters refer to how the model is trained; they don’t affect model performance directly, but rather the quality and speed of learning.
Model Hyperparameter Example: Kernel KNN
I k in KNN is a model hyperparameter.
I KNN can also be extended to use additional model
hyperparameters.
I Ordinary KNN gives equal weight to all neighbors, the
equivalent of a “uniform” kernel.
I We could potentially improve performance by adding a weight
to each of these neighbors based on their distance from the
reference observation.
I Weighted KNN downweights neighbors farther from our target point:

  Ĉ(x) = Σ_{i=1}^{k} w_i · y_i,   with 0 < w_i < 1

I We want the weight w_i to be
I small if the distance to our reference point is large
I large if the distance to our reference point is small
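A minimal Python sketch of one way to implement this for classification (not from the slides); the Gaussian kernel and bandwidth are illustrative choices of weighting function:

import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5, bandwidth=1.0):
    """Classify x by a kernel-weighted vote among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)              # distance from x to every training point
    nn = np.argsort(dists)[:k]                               # indices of the k nearest neighbours
    w = np.exp(-(dists[nn] / bandwidth) ** 2)                # smaller distance => larger weight
    classes = np.unique(y_train)
    scores = [w[y_train[nn] == c].sum() for c in classes]    # weighted vote for each class
    return classes[np.argmax(scores)]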
Model Hyperparameter Example: Kernel KNN

We can use kernels that are functions of the distance between points. This
is another hyperparameter in addition to k.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

I Recall the KNN algorithm: in order to find the k nearest neighbors, we have to calculate the distance between a reference point and all of our training observations.
I This can be VERY expensive in a large dataset.
I As with many other learning algorithms, there are often computational shortcuts that get you close to the same answer (recall stochastic gradient descent).
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Say we have 2-dimensional training data like the following.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

First split a randomly selected feature at the median.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

Now you have a tree with two nodes.


Algorithm Hyperparameters: K-D Tree vs. Brute Force

Continue to split the feature space along a random dimension until you
have no more than m observations in each bounding box. In this case
m = 2.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Each of these bounding boxes is represented by a terminal node on the following tree.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

When predicting the class of a new observation, we first determine which bounding box it is in.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Instead of calculating the distance between all observations in the training data, we restrict our search to the bounding box where the query/reference point resides.
Algorithm Hyperparameters: K-D Tree vs. Brute Force

Unfortunately, we oftentimes do not find the true nearest neighbor, but we get close!
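In scikit-learn, for example, the neighbour-search strategy is exposed as an algorithm hyperparameter of the KNN estimator; a brief sketch (not from the slides, data illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))                  # illustrative 2-dimensional training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Same model hyperparameter k; different algorithm hyperparameters.
knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute").fit(X, y)
knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree", leaf_size=30).fit(X, y)

# scikit-learn's k-d tree search is exact, so the predictions agree;
# the benefit is a faster neighbour search on large, low-dimensional data.
x_new = np.array([[0.2, -0.1]])
print(knn_brute.predict(x_new), knn_tree.predict(x_new))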
Grid Search

Here we see an example of a grid search over two hyperparameters of a support vector machine.

I Exhaustive search of all potential parameter combinations (see the code sketch below).
I Parameters, even when continuous, must be discretized.
I Suffers from the curse of dimensionality. Why?
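A minimal scikit-learn sketch of such a grid search (not from the slides); the grid values are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

# Every combination of C and gamma is tried (4 x 4 = 16 models),
# each evaluated with 5-fold cross-validation.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)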
Problems with Grid Search

I The search space can become extremely large as the number of hyperparameters or feature selection/preprocessing steps increases.
I One solution is to randomly sample a subset of possible models using a randomized hyperparameter search.
I Another solution is to use a surrogate model in sequential model-based optimization (SMBO).
Randomized Hyperparameter Search

I In randomized search, we treat each hyperparameter as a distribution and sample values from it at random.
I Randomized search does not exhaustively search the space of possible hyperparameter combinations, instead taking repeated random samples from distributions of hyperparameters.
I Because of this, a computational budget (the number of random samples to be drawn) can be set ahead of time.
I Randomized search can discretize parameters, but does not necessarily have to.
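A minimal scikit-learn sketch of randomized search (not from the slides); the distributions and the budget n_iter are illustrative:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

# Each hyperparameter is a distribution to sample from; n_iter fixes the budget.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)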
Randomized Hyperparameter Search

On the left we see the difference in the sampling of parameters in grid search vs. random search. The green distribution represents a hyperparameter that is important if tuned. The high peak represents the area of the optimal hyperparameter value.
Problems with Randomized Hyperparameter Search

I It can become quite expensive to check many different model configurations.
I Each new sample of hyperparameters must be used to train a new model on the training data, make predictions on the validation data, and calculate our performance metric.
I With randomized search, the algorithm has no memory of past successes and failures, and wanders randomly through the hyperparameter space.
I An alternative, Bayesian hyperparameter search, uses past records of hyperparameters and performance metrics to decide where to look next.
Bayesian Hyperparameter Search

I Uses a probabilistic model of hyperparameter values, mapping these hyperparameters probabilistically to a score on a chosen performance metric.
I The approach works as follows (a code sketch appears below):
1. Build a surrogate probability model that predicts how hyperparameter values will impact model performance, based on prior knowledge.
2. Find the hyperparameters that perform best on the surrogate.
3. Train the model with these hyperparameters on the training data; evaluate on the validation set.
4. Update the surrogate model, incorporating the new information about how hyperparameter values impact the model’s performance.
5. Repeat steps 2–4 until the maximum number of iterations or the time budget is reached.
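One widely used implementation of this sequential model-based idea is the Optuna library; a minimal sketch (not from the slides), assuming Optuna is installed and tuning an SVM's C and gamma as illustrative hyperparameters:

import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # illustrative data

def objective(trial):
    # The sampler proposes hyperparameter values; we return the validation score.
    C = trial.suggest_float("C", 1e-2, 1e2, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1e0, log=True)
    return cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()

# Optuna's default sampler uses past trials to decide where to look next.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)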
AutoML

I Why not just take humans out of the loop altogether?
I ML algorithms (KNN, logistic regression, neural networks) can be treated like hyperparameters (choose the algorithm that minimizes validation error).
I Features and preprocessing steps can also be treated like hyperparameters (choose the feature representations that minimize validation error).
Quiz Review and Discussion

1. What is the difference between model and algorithm hyperparameters?
2. Why might it be better to use cross-validation rather than the
validation approach?
3. What do we mean by “resampling methods?”
4. How does the choice of k amount to a bias-variance tradeoff?
5. What is a learning rate? What happens if it is too big/small?
