Best Practices
• We’ve spent a lot of time talking about the formal principles of machine learning.
• In this module, I will discuss some of the more empirical aspects you encounter in practice.
Choose your own adventure
Hypothesis class:
    fw(x) = sign(w · φ(x))
    Feature extractor φ: linear, quadratic
    Architecture: number of layers, number of hidden units
Training objective:
    (1/|Dtrain|) Σ(x,y)∈Dtrain Loss(x, y, w) + Reg(w)
    Loss function: hinge, logistic
    Regularization: none, L2
Optimization algorithm:
    Algorithm: stochastic gradient descent
        Initialize w = [0, ..., 0]
        For t = 1, ..., T:
            For (x, y) ∈ Dtrain:
                w ← w − η∇w Loss(x, y, w)
    Number of epochs
    Step size: constant, decreasing, adaptive
    Initialization: amount of noise, pre-training
    Batch size
    Dropout
• Recall that there are three design decisions for setting up a machine learning algorithm: the hypothesis class, the training objective, and the
optimization algorithm.
• For the hypothesis class, there are two knobs you can turn. The first is the feature extractor φ (linear features, quadratic features, indicator features on regions, etc.). The second is the architecture of the predictor: linear (one layer) or a neural network with multiple layers, and in the case of neural networks, how many hidden units k to use.
• The second design decision is to specify the training objective, which we do by choosing the loss function depending on how we want the predictor to fit our data, and also whether we want to regularize the weights to guard against overfitting.
• The final design decision is how to optimize the predictor. Even the basic stochastic gradient descent algorithm has at least two knobs: how
long to train (number of epochs) and how aggressively to update (the step size). On top of that are many enhancements and tweaks common
to training deep neural networks: changing the step size over time, perhaps adaptively, how we initialize the weights, whether we update on
batches (say of 16 examples) instead of 1, and whether we apply dropout to guard against overfitting.
• So it is really a choose-your-own machine learning adventure. Some decisions can be made thoughtfully via prior knowledge (e.g., features that capture periodic trends). But in many (even most) cases, we don't really know what the proper values should be. Instead, we want a way to set them automatically (a minimal sketch of a training loop with these knobs exposed follows below).
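• To make these knobs concrete, here is a minimal sketch (plain Python with NumPy, not the course codebase) of a training setup in which each design decision appears as an explicit argument; the feature extractors, hinge loss, and toy data are illustrative placeholders, not a prescribed implementation.

import numpy as np

# Hypothesis class: choice of feature extractor phi
def phi_linear(x):
    return np.array(x)

def phi_quadratic(x):
    x = np.array(x)
    return np.concatenate([x, x ** 2])

# Training objective: hinge loss with optional L2 regularization
def hinge_loss_grad(w, phi_x, y):
    # Gradient of max(0, 1 - y * (w . phi(x))) with respect to w
    margin = y * np.dot(w, phi_x)
    return -y * phi_x if margin < 1 else np.zeros_like(w)

# Optimization algorithm: stochastic gradient descent with its knobs exposed
def train(train_examples, phi, num_epochs=10, eta=0.1, reg_lambda=0.0):
    d = len(phi(train_examples[0][0]))
    w = np.zeros(d)                          # initialization: all zeros
    for t in range(num_epochs):              # knob: number of epochs
        for x, y in train_examples:
            grad = hinge_loss_grad(w, phi(x), y) + reg_lambda * w  # knob: regularization
            w = w - eta * grad               # knob: step size eta
    return w

def predict(w, phi, x):
    return 1 if np.dot(w, phi(x)) >= 0 else -1

# Toy usage: each example is (raw input, label in {+1, -1})
train_examples = [([1.0, 2.0], +1), ([2.0, -1.0], -1), ([0.5, 0.5], +1)]
w = train(train_examples, phi_quadratic, num_epochs=20, eta=0.05, reg_lambda=0.01)
print(predict(w, phi_quadratic, [1.0, 1.5]))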
Hyperparameters
Definition: hyperparameters
The design decisions (hypothesis class, training objective, optimization algorithm) that need to be made before running the learning algorithm.
• Each of these many design decisions is a hyperparameter.
• We could choose the hyperparameters to minimize the training loss. However, this would lead to a degenerate solution. For example, by
adding additional features, we can always decrease the training loss, so we would just end up adding all the features in the world, leading to
a model that wouldn’t generalize. We would turn off all regularization, because that just gets in the way of minimizing the training loss.
• What if we instead chose hyperparameters to minimize the test loss? This might lead to good hyperparameters, but it is problematic because you then lose the ability to measure how well you're doing. Recall that the test set is supposed to be a surrogate for unseen examples, and the more you optimize over it, the less unseen it becomes.
Validation set
A validation set is taken out of the training set and used to optimize hyperparameters.
• The solution is to invent something that looks like a test set. There’s no other data lying around, so we’ll have to steal it from the training
set. The resulting set is called the validation set.
• The size of the validation set should be large enough to give you a reliable estimate, but you don’t want to take away too many examples
from the training set.
• With this validation set, now we can simply try out a bunch of different hyperparameters and choose the setting that yields the lowest error
on the validation set. Which hyperparameter values should we try? Generally, you should start by getting the right order of magnitude (e.g.,
λ = 0.0001, 0.001, 0.01, 0.1, 1, 10) and then refining if necessary.
• In K-fold cross-validation, you divide the training set into K parts. Repeat K times: train on K − 1 of the parts and use the remaining part as a validation set. You then get K validation errors, from which you can compute and report both the mean and the variance, which gives you more reliable information. (A minimal sketch of hyperparameter tuning on a validation set and of K-fold cross-validation follows below.)
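• A minimal sketch of both procedures, assuming generic train(examples, hyperparams) and error(predictor, examples) functions as placeholders (they stand in for whatever training and evaluation code you have; they are not part of any particular library):

import random

def tune_on_validation(examples, train, error, lambdas, val_fraction=0.2):
    # Hold out a validation set and pick the regularization strength
    # with the lowest validation error.
    examples = examples[:]
    random.shuffle(examples)
    num_val = int(val_fraction * len(examples))
    val, train_part = examples[:num_val], examples[num_val:]
    best_lam, best_err = None, float('inf')
    for lam in lambdas:                      # e.g. [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
        predictor = train(train_part, {'reg_lambda': lam})
        err = error(predictor, val)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

def k_fold_errors(examples, train, error, hyperparams, K=5):
    # K-fold cross-validation: train on K-1 folds, validate on the held-out fold,
    # and report the mean and variance of the K validation errors.
    examples = examples[:]
    random.shuffle(examples)
    folds = [examples[i::K] for i in range(K)]
    errors = []
    for i in range(K):
        val = folds[i]
        train_part = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predictor = train(train_part, hyperparams)
        errors.append(error(predictor, val))
    mean = sum(errors) / K
    variance = sum((e - mean) ** 2 for e in errors) / K
    return mean, variance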
Model development strategy
• This slide represents the most important yet most overlooked part of machine learning: how to actually apply it in practice.
• We have so far talked about the mathematical foundation of machine learning (loss functions and optimization), and discussed some of the
conceptual issues surrounding overfitting, generalization, and the size of hypothesis classes. But what actually takes most of your time is not
writing new algorithms, but going through a development cycle, where you iteratively improve your system.
• The key is to stay connected with the data and the model, and have intuition about what’s going on. Make sure to empirically examine the
data before proceeding to the actual machine learning. It is imperative to understand the nature of your data in order to understand the
nature of your problem.
• First, maintain data hygiene. Hold out a test set from your data that you don’t look at until you’re done. Start by looking at the (training or
validation) data to get intuition. You can start to brainstorm what features / predictors you will need. You can compute some basic statistics.
• Then you enter a loop: implement a new model architecture or feature template. There are three things to look at: error rates, weights, and predictions. First, sanity check the error rates and weights to make sure you don't have an obvious bug. Then do an error analysis to see which examples your predictor is actually getting wrong (a sketch of such a loop follows below). The art of practical machine learning is turning these observations into new features.
• Finally, run your system once on the test set and report the number you get. If your test error is much higher than your validation error, then
you probably did too much tweaking and were overfitting (at a meta-level) the validation set.
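• Here is a minimal sketch of such an error-analysis loop, assuming a predictor callable that maps an input to a label and a weights dictionary mapping feature names to values (both hypothetical placeholders for your own model):

def error_analysis(predictor, examples, max_show=10):
    # Print examples the predictor gets wrong, to look for patterns
    # that suggest new feature templates.
    mistakes = [(x, y, predictor(x)) for x, y in examples if predictor(x) != y]
    print('error rate: {:.3f}'.format(len(mistakes) / len(examples)))
    for x, y, y_hat in mistakes[:max_show]:
        print('input:', x, ' true:', y, ' predicted:', y_hat)
    return mistakes

def inspect_weights(weights, k=10):
    # Show the most negative and most positive feature weights as a sanity check.
    ranked = sorted(weights.items(), key=lambda kv: kv[1])
    print('most negative:', ranked[:k])
    print('most positive:', ranked[-k:])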
Model development strategy example
[code]
• Let's try out the model development strategy on the task of training a classifier to predict whether a string is a person or not (excluding the first and last context words).
• First, let us look at the data (names.train). Starting simple, we define the empty feature template, which gets horrible error.
• Then we define a single feature template "entity is ___". Look at the weights (person names have positive weight, city names have negative weight) and do an error analysis.
• Based on the errors, let us add "left is ___" and "right is ___" feature templates (e.g., the word "said" in the context is indicative of a person). Look at the weights ("the" showing up on the left indicates not a person) and do an error analysis.
• Let us add the feature template "entity contains ___". Look at the weights and do an error analysis.
• Let us add the feature templates "entity contains prefix ___" and "entity contains suffix ___". Look at the weights and do an error analysis.
• Finally, we run it on the test set. (A sketch of these feature templates as code follows below.)
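• A possible sketch of these feature templates as code; it assumes each example provides the entity string plus its immediate left and right context words, and the prefix/suffix length of 4 is an arbitrary illustrative choice:

def extract_features(left, entity, right):
    # Sparse feature vector as a dict mapping feature name to value.
    phi = {}
    phi['entity is ' + entity] = 1
    phi['left is ' + left] = 1
    phi['right is ' + right] = 1
    for word in entity.split():
        phi['entity contains ' + word] = 1
        phi['entity contains prefix ' + word[:4]] = 1
        phi['entity contains suffix ' + word[-4:]] = 1
    return phi

# Example: '... said Jane Smith about ...'
print(extract_features('said', 'Jane Smith', 'about'))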
Tips
Start simple:
• Run on small subsets of your data or synthetic data
• Start with a simple baseline model
• Sanity check: can you overfit 5 examples?
Log everything:
• Track training loss and validation loss over time
• Record hyperparameters, statistics of data, model, and predictions
• Organize experiments (each run goes in a separate folder)
Report your results:
• Run each experiment multiple times with different random seeds
• Compute multiple metrics (e.g., error rates for minority groups)
• There is more to be said about the practice of machine learning. Here are some pieces of advice. Note that many of these are simply good software engineering practices.
• First, don't start out by coding up a large complex model and trying to run it on a million examples. Start simple, both with the data (a small number of examples) and the model (e.g., a linear classifier). Sanity check that things are working before increasing the complexity. This will help you debug in a regime where things are more interpretable and also run faster. One sanity check is to train a sufficiently expressive model on very few examples and see if the model can overfit them (get zero training error). This does not produce a useful model, but it is a diagnostic to see if the optimization is working. If you can't overfit 5 examples, then you have a problem: maybe the hypothesis class is too small, the data is too noisy, or the optimization isn't working. (A sketch of this check appears after these notes.)
• Second, log everything so you can diagnose problems. Monitor the losses over epochs. It is also important to track the training loss so that
if you get bad results, you can find out if it is due to bad optimization or overfitting. Record all the hyperparameters, so that you have a full
record of how to reproduce the results.
• Third, when you report your results, you should run each experiment multiple times with different randomness to see how stable the results are, and report error bars. Finally, if it makes sense for your application, report more than just a single test accuracy: for example, report error rates for minority groups and check whether your model is treating every group fairly.
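• Here is a minimal sketch of the overfit-a-few-examples sanity check and of reporting results over multiple random seeds, again assuming placeholder train(examples, hyperparams) and error(predictor, examples) functions:

import random
import statistics

def overfit_sanity_check(train, error, examples, hyperparams, n=5):
    # Train on a handful of examples and check that training error goes to ~0.
    # If it does not, suspect a bug, a too-small hypothesis class, or a broken optimizer.
    tiny = examples[:n]
    predictor = train(tiny, hyperparams)
    err = error(predictor, tiny)
    print('training error on {} examples: {:.3f} (expect ~0)'.format(n, err))
    return err

def run_with_seeds(train, error, train_examples, test_examples, hyperparams, seeds=(0, 1, 2, 3, 4)):
    # Repeat the experiment with different random seeds and report the mean
    # and standard deviation of the test error (error bars).
    errors = []
    for seed in seeds:
        random.seed(seed)   # also seed any other sources of randomness you use
        predictor = train(train_examples, hyperparams)
        errors.append(error(predictor, test_examples))
    print('test error: {:.3f} +/- {:.3f}'.format(statistics.mean(errors), statistics.stdev(errors)))
    return errors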
Summary
Start simple!
Practice!
• To summarize, we’ve talked about the practice of machine learning.
• First, make sure you follow good data hygiene: separate out the test set and don't look at it until the very end.
• But you should look at the training or validation set to get intuition about your data before you start.
• Then, start simple and make sure you understand how things are working.
• Beyond that, there are a lot of design decisions to be made (hyperparameters). So the most important thing is to practice, so that you can develop more intuition and a set of best practices that works for you.