Machine Learning and Pattern Recognition Week 2
Models
You now have enough knowledge to be dangerous. You could read some data into matrices/arrays
in a language like Matlab or Python, and attempt to fit a variety of linear
regression models. They may not be sensible models, and so the results may not mean very
much, but you can do it.
You could also use more sophisticated and polished methods baked into existing software.
Many Kaggle competitions are won using easy-to-use and fast gradient boosting methods
(such as XGBoost, LightGBM, or CatBoost), or for image data using ‘convolutional neural
nets’, which you can fit with several packages, for example Keras.
How do you decide which method(s) to use, and work out what you can conclude from the
results?
1 Baselines
It is almost always a good idea to find the simplest possible algorithm that could possibly
give an answer to your problem, and run it. In a regression task, you could report the
mean square error of a constant function, f(x) = b. If you’re not beating that model on your
training data, there is probably a bug in your code. If a fancy method does not generalize to
new data as well as the baseline, then maybe the problem is simply too hard. . . or there are
more subtle bugs in your code.
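As a concrete illustration, here is a minimal sketch in Python/NumPy of the constant baseline f(x) = b. The arrays y_train and y_test are hypothetical stand-ins for your own data; the constant that minimizes the training square error is just the training mean.

```python
import numpy as np

# Hypothetical regression targets; replace with your own data.
rng = np.random.default_rng(0)
y_train = rng.normal(size=100)
y_test = rng.normal(size=50)

# The constant that minimizes training square error is the training mean.
b = y_train.mean()

# Baseline mean square errors: any fancier model should beat these.
baseline_train_mse = np.mean((y_train - b) ** 2)
baseline_test_mse = np.mean((y_test - b) ** 2)
print(f"baseline train MSE: {baseline_train_mse:.3f}")
print(f"baseline test MSE:  {baseline_test_mse:.3f}")
```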
It is also usual to implement stronger baselines. For example, why use a neural network if
straightforward linear regression works better? If you are proposing a new method for an
established task, you should try to compare to the existing state-of-the-art if possible. (If a
paper does not contain enough detail to be reproducible, and does not come with code, it
may not deserve comparison.)
I’ve heard the following piece of advice for finding a machine learning algorithm to solve a
practical problem1 :
‘‘Find a recent paper at a top machine learning conference, like NeurIPS or ICML, that’s
on the sort of task you need to solve. Look at the baseline that they compare their method
to. . . and implement that.”
If they used a method for comparison, it should be somewhat sensible, but it’s probably a
lot simpler and better tested than the new method. And for many applications, you won’t actually
need this year’s advance to achieve what you need.
[The website version of this note has a question here.]
1. I wish I knew who to credit. The “quote” given is a paraphrase of something I’ve heard in passing.
where L(y, f(x)) is the error we record if we predict f(x) but the output was actually y. For
example, if we care about square error, then L(y, f(x)) = (y − f(x))^2.
We don’t have access to the distribution of data examples p(x, y). If we did, we wouldn’t
need to do learning! However, we might assume that our test set contains M samples from
this distribution. (That’s a big assumption, which is usually wrong, because the world will
change, and future examples will come from a different distribution.) Given this assumption,
an empirical (observed) average can be used as what’s called a Monte Carlo estimate of the
formal average in the integral above:
\[
\text{Average test error} = \frac{1}{M} \sum_{m=1}^{M} L\big(y^{(m)}, f(x^{(m)})\big),
\qquad x^{(m)}, y^{(m)} \sim p(x, y). \tag{2}
\]
If the test items really are drawn from the distribution that we are interested in, the expected
value of the test error is equal to the generalization error. (Check you can see why.) That
means it’s what’s called an unbiased estimate: sometimes it’s too big, sometimes too small, but
on average our estimate is correct.
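A minimal sketch of this Monte Carlo estimate, assuming square loss and a hypothetical fitted prediction function f; the names X_test and y_test are placeholders for your own held-out data:

```python
import numpy as np

def average_test_error(f, X_test, y_test):
    """Monte Carlo estimate of generalization error under square loss.

    f takes an (M, D) array of inputs and returns M predictions.
    The estimate is unbiased if (X_test, y_test) really are draws
    from the distribution we care about.
    """
    preds = f(X_test)
    return np.mean((y_test - preds) ** 2)

# Example with a hypothetical constant predictor:
# avg_err = average_test_error(lambda X: np.zeros(X.shape[0]), X_test, y_test)
```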
[Figure: curves fitted to training data, with held-out points shown in red; horizontal axis: x, input.]
The red dots are validation data: held out points that weren’t used to fit the curves. Training
and validation errors were then plotted for polynomials of different orders:
[Figure: training and validation mean square error plotted against polynomial order p, from 0 to 15.]
As expected for nested models, the training error falls monotonically as we increase the
complexity of the model. However, the fifth order polynomial has the lowest validation error,
so we would normally pick that model. Looking at the validation curve, the average errors
of the fourth to seventh order polynomials all seem similar. If choosing a model by hand, I
might instead pick the simpler fourth order polynomial.
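A sketch of the model-selection loop behind a plot like this, using NumPy’s polyfit and polyval; the 1D data-generating function below is made up purely for illustration:

```python
import numpy as np

# Hypothetical noisy 1D data; a deliberately small training set overfits quickly.
rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, size=20)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=20)
x_valid = rng.uniform(-1, 1, size=100)
y_valid = np.sin(3 * x_valid) + 0.1 * rng.normal(size=100)

orders = range(16)
train_mse, valid_mse = [], []
for p in orders:
    coeffs = np.polyfit(x_train, y_train, deg=p)   # least-squares polynomial fit
    train_mse.append(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
    valid_mse.append(np.mean((y_valid - np.polyval(coeffs, x_valid)) ** 2))

best_p = int(np.argmin(valid_mse))   # order with the lowest validation error
print("chosen polynomial order:", best_p)
```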
[The website version of this note has a question here.]
4 What to report
After choosing our final model, we evaluate and report its test error. We would normally also
report the test error on a small number of reference models to put this number in context:
often a simple baseline of our own, and/or a previously-reported state-of-the-art result.
We don’t evaluate other intermediate models (with different model sizes or regularization
constants) on the test set!
It is often also useful to report training and validation losses, which we can safely report
for more models. Researchers following your work can then compare to your intermediate
results during their development. If you only give them a test performance, other researchers
will be tempted to consult the test performance of their own systems early and often during
development.
It can also be useful to look at the training and validation losses during your own model
development. If they’re similar, you might be underfitting, and you could consider more
complicated models. If the training score is good, but the validation score is poor, that could
be a sign of overfitting, so you could try regularizing your model more.
We usually report the average loss on each of the training, validation and test sets, rather than
the sum of losses. An average is usually more interpretable as the ‘typical’ loss encountered
per example. It is also easier to compare training and validation performance using average
errors: because the sets have different sizes, we wouldn’t expect the sum of losses to be
comparable.
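As a toy illustration (with made-up per-example losses), the sums scale with the set sizes, while the means are directly comparable:

```python
import numpy as np

rng = np.random.default_rng(2)
train_losses = rng.exponential(1.0, size=900)   # 900 training examples
valid_losses = rng.exponential(1.0, size=100)   # 100 validation examples

# Sums scale with the set sizes, so they can't be compared directly.
print("sum of losses: train %.1f, valid %.1f" % (train_losses.sum(), valid_losses.sum()))
# Means estimate the typical loss per example, so they are comparable.
print("mean loss:     train %.2f, valid %.2f" % (train_losses.mean(), valid_losses.mean()))
```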
6 K-fold cross-validation
A more elaborate procedure called K-fold cross-validation is appropriate for small datasets,
where holding out enough validation data to select model choices leaves too little data for
training a good model. The data that we can fit to is split into K parts. Each model is fitted K
times, each with a different part used as a validation set, and the remaining K − 1 parts used
for training. We select the model choices with the lowest validation error, when averaged
over the K folds. If we want a model for prediction, we then might refit a model with those
choices using all the data.
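A minimal NumPy sketch of this procedure, using ridge regression as the model and the L2 regularization constant as the choice being selected; the dataset (X, y) and the candidate values of λ are hypothetical:

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Least-squares weights with L2 regularization constant lam."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def kfold_mse(X, y, lam, K=5):
    """Average validation MSE over K folds for a given lam."""
    N = X.shape[0]
    # Fixed seed so every lam is compared on the same folds.
    idx = np.random.default_rng(0).permutation(N)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit_ridge(X[train], y[train], lam)
        errs.append(np.mean((y[valid] - X[valid] @ w) ** 2))
    return np.mean(errs)

# Hypothetical data; replace with your own.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=40)

# Pick the regularization constant with the lowest average validation error,
# then refit on all the data with that choice.
lams = [0.01, 0.1, 1.0, 10.0]
best_lam = min(lams, key=lambda lam: kfold_mse(X, y, lam))
w_final = fit_ridge(X, y, best_lam)
print("chosen lambda:", best_lam)
```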
Making statistically rigorous statements based on K-fold cross-validation performance is
difficult. (Beyond this course, and I believe an open research area.)
If there’s a really small dataset, a paper might report a K-fold score in place of a test set
error. In your first project in machine learning, I would try to avoid following such work.
To see some of the difficulties, imagine I want to report how good a model class is (e.g., a
neural network) compared to a baseline (e.g., linear regression) for some task. But the data I
have to do the comparison is small, so I report an average test performance across K-folds,
each containing a training and test set.
What do I do within each fold? Assume I need to make a model choice or choose a
“hyperparameter” for each model class, such as the L2 regularization constant λ. As we have
already seen, we can’t fit such choices to the training data. But fitting them to the test data
2. In the toy example above, we artificially used a small fraction of the data for training, so that we’d quickly get
overfitting for demonstration purposes.
10 Further Reading
Bishop’s book Sections 1.1 and 1.3 summarize much of what we’ve done so far.
The following two book sections contain more (optional) details. These sections may assume
some knowledge we haven’t covered yet.
4. To give one example: Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission
(Caruana et al., SIGKDD 2015) includes a case study where models suggest that having Asthma reduces health-
related risks. This spurious result was driven by selection effects in the data.
‘‘A hypothesis f is said to overfit the data if there exists some alternative hypothesis f′
such that f has a smaller training error than f′, but f′ has a smaller generalization error
than f.”
However, I think almost all models f that are ever fitted would satisfy these criteria! Thus
the definition doesn’t seem useful to me.
If we thought we had enough data to get a good fit to a system everywhere, then we would
expect the training and validation errors to be similar. Thus a possible indication of overfitting
is that the average validation error is much worse than the average training error. However,
there is no hard rule here either. For example, if we know that our data are noiseless, we
know exactly what the function should be at our training locations. It is therefore reasonable
that we will get lower error at or close to the training locations than at test locations in other
places.
I have seen people reject a model with the best validation error because it’s “overfitting”,
by which they mean the model had a bigger gap between its training and validation score
than other models.5 Such a decision is misguided. It’s the validation error that gives an
indication of generalization performance, not the difference between training and validation
error. Moreover, it’s common not to have enough data to get a perfect fit everywhere. So
while it’s informative to inspect the training and validation errors, we shouldn’t insist they
are similar.
The reality is that there is no single continuum from underfit models to overfit models, that
passes through an ideal model somewhere in between. I believe most large models will have
some aspects that are overly simplified, yet other aspects that result from an unreasonable
attempt to explain noise in the training items.
While I can’t define overfitting precisely, the concept is still useful. Adding flexibility to
models does often make test performance drop. It is a useful shorthand to say that we will
select a smaller model, or apply a form of regularization, to prevent overfitting.
5. On questioning these people, I’ve found in some cases it’s due to a misinterpretation of Goodfellow et al.’s
excellent Deep Learning textbook, which contains the sentence: “Overfitting occurs when the gap between the
training error and test error is too large.” Please don’t lock onto this idea naively. There is no controversy that what
we actually want is good test or generalization error!