Machine Learning and Pattern Recognition Bayesian Complexity Control
Fully Bayesian procedures can’t suffer from “overfitting” exactly, because parameters aren’t
fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not
optimizing. However, the predictions can depend heavily on the model and on the choice of prior.
Previously we chose models — and parameters that controlled the complexity of models — by
cross-validation. The Bayesian framework offers an alternative, using marginal likelihoods.
[Figure: straight-line fits sampled from the posterior, plotted over observations that clearly follow a curve; x-axis from −3 to 3.]
Here, we assumed a line model. But the given observations clearly follow a bent curve,
so the line model is too simple. While it looks like there is just one line, we’ve drawn 12
lines from the posterior almost on top of each other, indicating that the posterior is sharply
peaked.
A sharply peaked posterior represents a belief with low uncertainty. As a result, we can be
very confident about properties of a model, even if running some checks (such as looking at
residuals) would show that the model is obviously in strong disagreement with the data.
Strong correlations between residuals are one indication that this is the case.
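As a concrete illustration, the minimal sketch below (with made-up data and illustrative settings) fits a straight line to data from a curve and measures the correlation between neighbouring residuals:

```python
import numpy as np

# Hypothetical data: y follows a gentle curve, but we fit a straight line.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Least-squares straight-line fit, y ~ a*x + b.
A = np.stack([x, np.ones_like(x)], axis=1)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# Residuals from a well-specified model should look like independent noise.
# Here neighbouring residuals are strongly correlated, flagging the misfit.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"lag-1 residual correlation: {lag1_corr:.2f}")  # close to 1, not 0
```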
We can also have problems when we use a model that’s too complicated. As an extreme
example for illustration, we could imagine fitting a function on x ∈ [0, 1] with a million
RBFs spaced evenly over that range with bandwidths ∼ 10^{-6}. We can closely represent any
reasonable function with this representation. However, given (say) 20 observations, most of
the basis functions will be many bandwidths away from all of the observations. Thus, the
posterior distribution over most of the coefficients will be similar to the prior (check: can you see why?).
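A scaled-down numerical sketch of this thought experiment (hundreds of RBFs rather than a million; all settings illustrative) confirms the check: for basis functions many bandwidths from every datapoint, the posterior variance of the weight matches the prior variance.

```python
import numpy as np

# Scaled-down sketch: many narrow RBFs, few observations.
rng = np.random.default_rng(1)
N, K, bandwidth = 20, 500, 1e-3          # 20 points, 500 RBFs (not a million)
sigma_w, sigma_y = 1.0, 0.1

X = rng.uniform(0, 1, size=N)
centres = np.linspace(0, 1, K)
Phi = np.exp(-(X[:, None] - centres[None, :])**2 / bandwidth**2)  # N x K
y = np.sin(2 * np.pi * X) + rng.normal(scale=sigma_y, size=N)

# Gaussian posterior covariance over weights for Bayesian linear regression:
#   V_N = (Phi^T Phi / sigma_y^2 + I / sigma_w^2)^{-1}
V_N = np.linalg.inv(Phi.T @ Phi / sigma_y**2 + np.eye(K) / sigma_w**2)

# Most basis functions are many bandwidths from every datapoint, so the data
# say nothing about their weights: posterior variance equals prior variance.
frac_unchanged = np.mean(np.isclose(np.diag(V_N), sigma_w**2, rtol=1e-3))
print(f"fraction of weights whose posterior matches the prior: {frac_unchanged:.2f}")
```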
One Bayesian approach to comparing models \mathcal{M} scores each model by its marginal likelihood, the probability it assigns to the observed data \mathcal{D} once its parameters are integrated out:

p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \mathbf{w}, \mathcal{M}) \, p(\mathbf{w} \mid \mathcal{M}) \, \mathrm{d}\mathbf{w},    (1)

where the parameters w are assumed unknown. Narrowly focussed models, where the
mass of the prior distribution p(w | M) is concentrated on simple curves, will assign higher
probability to datasets lying close to those curves than models that spread their prior mass over many more possible datasets.
4 Application to hyperparameters
The main challenge with model selection is often setting real-valued parameters like the
noise level, the typical spread of the weights (their prior standard deviation), and the widths
of some radial basis functions. These values are harder to cross-validate than simple discrete
choices, and if we have too many of these parameters, we can’t cross-validate them all.
Incidentally, in the million narrow RBFs example, the main problem wasn’t that there
were a million RBFs; it was that they were narrow. Linear regression will make reasonable
predictions with many RBFs and only a few datapoints if the bandwidth parameter is broad
and we regularize. So we usually don’t worry about picking a precise number of basis
functions.1
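A rough illustration of that claim (again with made-up numbers): the sketch below fits hundreds of broad RBFs to 20 points with an L2 penalty, and the resulting curve stays sensible rather than blowing up.

```python
import numpy as np

# Sketch: many RBFs with a *broad* bandwidth, plus L2 regularization,
# make sensible predictions from only a few datapoints.
rng = np.random.default_rng(1)
N, K, bandwidth = 20, 500, 0.2
sigma_w, sigma_y = 1.0, 0.1

X = rng.uniform(0, 1, size=N)
centres = np.linspace(0, 1, K)
def features(x):
    return np.exp(-(x[:, None] - centres[None, :])**2 / bandwidth**2)
y = np.sin(2 * np.pi * X) + rng.normal(scale=sigma_y, size=N)

# Ridge / MAP weights; lam = (sigma_y/sigma_w)^2 matches the Gaussian prior.
lam = (sigma_y / sigma_w)**2
w_fit = np.linalg.solve(features(X).T @ features(X) + lam * np.eye(K),
                        features(X).T @ y)

x_grid = np.linspace(0, 1, 101)
f_grid = features(x_grid) @ w_fit        # a smooth curve through the data
print(f"max |prediction| on grid: {np.max(np.abs(f_grid)):.2f}")  # stays O(1)
```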
The full Bayesian approach to prediction was described in the previous note. We integrate
over all parameters we don’t know. In a fully Bayesian approach, that integral would
include noise levels, the standard deviation of the weights in the prior, and the widths
of basis functions, because the best setting of each of these values is unknown. However,
computing integrals over all of these quantities can be difficult (we will return to methods
for approximating such difficult integrals later in the course).
A simpler approach is to fit some of these parameters by maximizing their marginal likelihood:
the likelihood with the remaining parameters integrated out. For example, in a linear regression
model with prior
p(\mathbf{w} \mid \sigma_w) = \mathcal{N}(\mathbf{w}; \mathbf{0}, \sigma_w^2 \mathbb{I}),    (2)
and likelihood
p(y \mid \mathbf{x}, \mathbf{w}, \sigma_y) = \mathcal{N}(y; \mathbf{w}^\top \mathbf{x}, \sigma_y^2),    (3)
we can fit the hyperparameters \sigma_w and \sigma_y, parameters which specify the model, to maximize
their marginal likelihood:

p(\mathbf{y} \mid X, \sigma_w, \sigma_y) = \int p(\mathbf{y}, \mathbf{w} \mid X, \sigma_w, \sigma_y) \, \mathrm{d}\mathbf{w} = \int p(\mathbf{y} \mid X, \mathbf{w}, \sigma_y) \, p(\mathbf{w} \mid \sigma_w) \, \mathrm{d}\mathbf{w}.    (4)
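For this Gaussian model the integral in (4) is available in closed form: marginalizing \mathbf{w} gives \mathbf{y} \mid X \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 X X^\top + \sigma_y^2 \mathbb{I}). The sketch below (synthetic data; grids and sizes are illustrative assumptions) evaluates this log marginal likelihood and picks \sigma_w and \sigma_y by a crude grid search.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluating (4) in closed form: marginalizing w gives
#   y | X  ~  N(0, sigma_w^2 X X^T + sigma_y^2 I),
# so the marginal likelihood needs no explicit integration.
rng = np.random.default_rng(2)
N, D = 30, 3
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

def log_marginal(sigma_w, sigma_y):
    cov = sigma_w**2 * (X @ X.T) + sigma_y**2 * np.eye(N)
    return multivariate_normal.logpdf(y, mean=np.zeros(N), cov=cov)

# Crude grid search over the two hyperparameters (cf. footnote 2).
grid = np.logspace(-2, 1, 20)
scores = [(log_marginal(sw, sy), sw, sy) for sw in grid for sy in grid]
best_score, best_sw, best_sy = max(scores)
print(f"best sigma_w = {best_sw:.2f}, sigma_y = {best_sy:.2f}")
```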
1. When we cover Gaussian processes, we will have an infinite number of basis functions and still be able to make
sensible predictions!
6 Further Reading
Bishop Section 3.4 and Murphy Section 7.6.4 are on Bayesian model selection. For keen
students, earlier sections of Murphy give mathematical detail for a Bayesian treatment of the
noise variance.
For keen students: Chapter 28 of MacKay’s book has a lengthier discussion of Bayesian
model comparison. Some time ago, Iain wrote a note discussing one of the figures in that
chapter.
For very keen students: It can be difficult to put sensible priors on models with many
parameters. In these situations it can sometimes be better to start out with a model class
that we know is too simple, and only swap to a complex model when we have a lot of data.
Bayesian model comparison can fail to tell us the best time to switch to a more complex
model. The paper Catching up faster by switching sooner (van Erven et al., 2012) has a nice
language modelling example, and Iain’s thoughts are in the discussion of the paper.
For very keen students, Gelman et al.’s Bayesian Data Analysis book is a good starting point
for reading about model checking and criticism. All models are wrong, but we want to
improve parts of a model that are most strongly in disagreement with the data.
2. Bayes’ rule tells us that: p(\mathbf{w} \mid \mathcal{D}) = p(\mathbf{w}) \, p(\mathbf{y} \mid \mathbf{w}, X) / p(\mathbf{y} \mid X). The Bayesian regression note identified all of
the distributions in this equation except for p(\mathbf{y} \mid X), so we can simply rearrange it to write p(\mathbf{y} \mid X) as a fraction
containing three Gaussian distributions. The identity is true for any \mathbf{w}, so we can use any \mathbf{w} (e.g., \mathbf{w} = \mathbf{0}, or \mathbf{w} = \mathbf{w}_N)
and we will get the same answer. We could optimize the hyperparameters by grid search.
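To make footnote 2 concrete, here is a numerical check (synthetic data; all settings illustrative) that evaluating the rearranged Bayes' rule at \mathbf{w} = \mathbf{0} reproduces the closed-form marginal likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Check of footnote 2: rearranging Bayes' rule and evaluating at w = 0 gives
#   log p(y|X) = log p(w=0) + log p(y|w=0,X) - log p(w=0|D).
rng = np.random.default_rng(3)
N, D, sigma_w, sigma_y = 30, 3, 1.0, 0.1
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + sigma_y * rng.normal(size=N)

# Posterior over weights: V_N = (X^T X/sy^2 + I/sw^2)^-1, w_N = V_N X^T y/sy^2.
V_N = np.linalg.inv(X.T @ X / sigma_y**2 + np.eye(D) / sigma_w**2)
w_N = V_N @ X.T @ y / sigma_y**2

w0 = np.zeros(D)
log_prior = mvn.logpdf(w0, mean=np.zeros(D), cov=sigma_w**2 * np.eye(D))
log_lik = mvn.logpdf(y, mean=X @ w0, cov=sigma_y**2 * np.eye(N))
log_post = mvn.logpdf(w0, mean=w_N, cov=V_N)
lml_bayes = log_prior + log_lik - log_post

# Direct closed form for comparison: y | X ~ N(0, sw^2 X X^T + sy^2 I).
lml_direct = mvn.logpdf(y, mean=np.zeros(N),
                        cov=sigma_w**2 * (X @ X.T) + sigma_y**2 * np.eye(N))
assert np.isclose(lml_bayes, lml_direct)  # same answer either way
```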