A Tutorial Introduction To The Minimum Description Length Principle
Peter Grünwald
Centrum voor Wiskunde en Informatica
Kruislaan 413, 1098 SJ Amsterdam
The Netherlands
[email protected]
www.grunwald.nl
Abstract
Contents

1 Introducing MDL
1.1 Introduction and Overview
1.2 The Fundamental Idea: Learning as Data Compression
1.2.1 Kolmogorov Complexity and Ideal MDL
1.2.2 Practical MDL
1.3 MDL and Model Selection
1.4 Crude and Refined MDL
1.5 The MDL Philosophy
1.6 MDL and Occam's Razor
1.7 History
1.8 Summary and Outlook

2 Tutorial on MDL
2.1 Plan of the Tutorial
2.2 Information Theory I: Probabilities and Codelengths
2.2.1 Prefix Codes
2.2.2 The Kraft Inequality - Codelengths & Probabilities, I
2.2.3 The Information Inequality - Codelengths & Probabilities, II
2.3 Statistical Preliminaries and Example Models
2.4 Crude MDL
2.4.1 Description Length of Data given Hypothesis
2.4.2 Description Length of Hypothesis
2.5 Information Theory II: Universal Codes and Models
2.5.1 Two-part Codes as Simple Universal Codes
2.5.2 From Universal Codes to Universal Models
2.5.3 NML as an Optimal Universal Model
2.6 Simple Refined MDL and its Four Interpretations
2.6.1 Compression Interpretation
2.6.2 Counting Interpretation
2.6.3 Bayesian Interpretation
2.6.4 Prequential Interpretation
2.7 General Refined MDL: Gluing it All Together
2.7.1 Model Selection with Infinitely Many Models
2.7.2 The Infinity Problem
2.7.3 The General Picture
2.8 Beyond Parametric Model Selection
2.9 Relations to Other Approaches to Inductive Inference
2.9.1 What is 'MDL'?
2.9.2 MDL and Bayesian Inference
2.9.3 MDL, Prequential Analysis and Cross-Validation
2.9.4 Kolmogorov Complexity and Structure Function; Ideal MDL
2.10 Problems for MDL?
2.10.1 Conceptual Problems: Occam's Razor
2.10.2 Practical Problems with MDL
2.11 Conclusion
Chapter 1

Introducing MDL
1. Occam’s Razor MDL chooses a model that trades off goodness-of-fit on the observed data with ‘complexity’ or ‘richness’ of the model. As such, MDL embodies
a form of Occam’s Razor, a principle that is both intuitively appealing and infor-
mally applied throughout all the sciences.
4. No need for ‘underlying truth’ In contrast to other statistical methods, MDL
procedures have a clear interpretation independent of whether or not there exists
some underlying ‘true’ model.
5. Predictive interpretation Because data compression is formally equivalent to a
form of probabilistic prediction, MDL methods can be interpreted as searching
for a model with good predictive performance on unseen data.
In this chapter, we introduce the MDL Principle in an entirely non-technical way,
concentrating on its most important applications, model selection and avoiding over-
fitting. In Section 1.2 we discuss the relation between learning and data compression.
Section 1.3 introduces model selection and outlines a first, ‘crude’ version of MDL that
can be applied to model selection. Section 1.4 indicates how these ‘crude’ ideas need
to be refined to tackle small sample sizes and differences in model complexity between
models with the same number of parameters. Section 1.5 discusses the philosophy underlying MDL; Section 1.6 considers its relation to Occam’s Razor. Section 1.7 briefly discusses the history of MDL. All this is summarized in Section 1.8.
Regularity . . . Consider the following three sequences. We assume that each se-
quence is 10000 bits long, and we just list the beginning and the end of each sequence.
00010001000100010001 . . . 0001000100010001000100010001 (1.1)
01110100110100100110 . . . 1010111010111011000101100010 (1.2)
00011000001010100000 . . . 0010001000010000001000110000 (1.3)
The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the
sequence looks regular; there seems to be a simple ‘law’ underlying it; it might make
sense to conjecture that future data will also be subject to this law, and to predict
that future data will behave according to this law. The second sequence has been
generated by tosses of a fair coin. It is intuitively speaking as ‘random as possible’, and
in this sense there is no regularity underlying it. Indeed, we cannot seem to find such a
regularity either when we look at the data. The third sequence contains approximately
four times as many 0s as 1s. It looks less regular, more random than the first; but it
looks less random than the second. There is still some discernible regularity in these
data, but of a statistical rather than of a deterministic kind. Again, noticing that such
a regularity is there and predicting that future data will behave according to the same
regularity seems sensible.
...and Compression We claimed that any regularity detected in the data can be used
to compress the data, i.e. to describe it in a short manner. Descriptions are always
relative to some description method which maps descriptions D′ in a unique manner to
data sets D. A particularly versatile description method is a general-purpose computer
language like C or Pascal. A description of D is then any computer program that
prints D and then halts. Let us see whether our claim works for the three sequences
above. Using a language similar to Pascal, we can write a program
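    for i = 1 to 2500; print ‘0001’; next; halt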
which prints sequence (1) but is clearly a lot shorter. Thus, sequence (1) is indeed
highly compressible. On the other hand, we show in Section 2.2, that if one generates
a sequence like (2) by tosses of a fair coin, then with extremely high probability, the
shortest program that prints (2) and then halts will look something like this:
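    print ‘01110100110100100110...1010111010111011000101100010’; halt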
This program’s size is about equal to the length of the sequence. Clearly, it does nothing
more than repeat the sequence.
The third sequence lies in between the first two: generalizing n = 10000 to arbitrary
length n, we show in Section 2.2 that the first sequence can be compressed to O(log n)
bits; with overwhelming probability, the second sequence cannot be compressed at all;
and the third sequence can be compressed to some length αn, with 0 < α < 1.
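These claims can be checked empirically, at least in rough outline. The following sketch (our own illustration, not part of the original argument) packs three such sequences into bytes and feeds them to the general-purpose compressor zlib. Since zlib is far from an optimal code, the numbers only loosely track the idealized codelengths, but the ordering comes out as predicted.

import random, zlib

random.seed(1)
n = 10000  # length in bits

def pack(bits):
    # Pack a string of '0'/'1' characters into raw bytes, so that the
    # compressor sees the bits themselves rather than ASCII characters.
    return int(bits, 2).to_bytes((len(bits) + 7) // 8, 'big')

seq1 = '0001' * (n // 4)                                  # 2500-fold repetition
seq2 = ''.join(random.choice('01') for _ in range(n))     # fair coin tosses
seq3 = ''.join(random.choice('00001') for _ in range(n))  # ~1/5 ones

for name, s in [('(1)', seq1), ('(2)', seq2), ('(3)', seq3)]:
    raw = pack(s)
    print(name, len(raw), '->', len(zlib.compress(raw, 9)), 'bytes')
# Typically, (1) shrinks to a few dozen bytes, (2) does not shrink at all,
# and (3) shrinks to very roughly three quarters of its packed size.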
The Number π Evidently, there exists a computer program for generating the first
n digits of π – such a program could be based, for example, on an infinite se-
ries expansion of π. This computer program has constant size, except for the
specification of n which takes no more than O(log n) bits. Thus, when n is very
large, the size of the program generating the first n digits of π will be very small
compared to n: the π-digit sequence is deterministic, and therefore extremely
regular.
Physics Data Consider a two-column table where the first column contains numbers
representing various heights from which an object was dropped. The second col-
umn contains the corresponding times it took for the object to reach the ground.
Assume both heights and times are recorded to some finite precision. In Sec-
tion 1.3 we illustrate that such a table can be substantially compressed by first
describing the coefficients of the second-degree polynomial H that expresses New-
ton’s law; then describing the heights; and then describing the deviation of the
time points from the numbers predicted by H.
Natural Language Most sequences of words are not valid sentences according to the
English language. This fact can be exploited to substantially compress English
text, as long as it is syntactically mostly correct: by first describing a grammar
for English, and then describing an English text D with the help of that grammar
[Grünwald 1996], D can be described using far fewer bits than are needed without the assumption that word order is constrained.
is to scale down Solomonoff’s approach so that it does become applicable. This is
achieved by using description methods that are less expressive than general-purpose
computer languages. Such description methods C should be restrictive enough so that
for any data sequence D, we can always compute the length of the shortest description
of D that is attainable using method C; but they should be general enough to allow us
to compress many of the intuitively ‘regular’ sequences. The price we pay is that, using
the ‘practical’ MDL Principle, there will always be some regular sequences which we
will not be able to compress. But we already know that there can be no method for
inductive inference at all which will always give us all the regularity there is — simply
because there can be no automated method which for any sequence D finds the shortest
computer program that prints D and then halts. Moreover, it will often be possible to
guide a suitable choice of C by a priori knowledge we have about our problem domain.
For example, below we consider a description method C that is based on the class of
all polynomials, such that with the help of C we can compress all data sets which can
meaningfully be seen as points on some polynomial.
This idea can be applied to all sorts of inductive inference problems, but it turns out to
be most fruitful in (and its development has mostly concentrated on) problems of model
selection and, more generally, dealing with overfitting. Here is a standard example (we
explain the difference between ‘model’ and ‘hypothesis’ after the example).
Example 1.2 [Model Selection and Overfitting] Consider the points in Figure 1.1.
We would like to learn how the y-values depend on the x-values. To this end, we may
want to fit a polynomial to the points. Straightforward linear regression will give
us the leftmost polynomial - a straight line that seems overly simple: it does not
capture the regularities in the data well. Since for any set of n points there exists a
polynomial of the (n − 1)-st degree that goes exactly through all these points, simply
looking for the polynomial with the least error will give us a polynomial like the one
in the second picture. This polynomial seems overly complex: it reflects the random
fluctuations in the data rather than the general pattern underlying it. Instead of picking
the overly simple or the overly complex polynomial, it seems more reasonable to prefer
Figure 1.1: A simple, a complex and a trade-off (3rd degree) polynomial.
a relatively simple polynomial with small but nonzero error, as in the rightmost picture.
This intuition is confirmed by numerous experiments on real-world data from a broad
variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-
degree polynomial to a small sample (set of data points), then one obtains a very good
fit to the data. Yet if one tests the inferred polynomial on a second set of data coming
from the same source, it typically fits this test data very badly in the sense that there
is a large distance between the polynomial and the new data points. We say that the
polynomial overfits the data. Indeed, all model selection methods that are used in
practice either implicitly or explicitly choose a trade-off between goodness-of-fit and
complexity of the models involved. In practice, such trade-offs lead to much better
predictions of test data than one would get by adopting the ‘simplest’ (one degree)
or most ‘complex’3 ((n − 1)-degree) polynomial. MDL provides one particular means of
achieving such a trade-off.
It will be useful to make a precise distinction between ‘model’ and ‘hypothesis’:
responding parameters; it is a ‘model selection problem’ if we are mainly interested in
selecting the degree.
• L(D|H) is the length, in bits, of the description of the data when encoded
with the help of the hypothesis.
The best model to explain D is the smallest model containing the selected H.
Example 1.3 [Polynomials, cont.] In our previous example, the candidate hypothe-
ses were polynomials. We can describe a polynomial by describing its coefficients in a
certain precision (number of bits per parameter). Thus, the higher the degree of a polynomial or the precision, the more bits5 we need to describe it and the more ‘complex’
it becomes. A description of the data ‘with the help of’ a hypothesis means that the
better the hypothesis fits the data, the shorter the description will be. A hypothesis
that fits the data well gives us a lot of information about the data. Such information
can always be used to compress the data (Section 2.2). Intuitively, this is because we
only have to code the errors the hypothesis makes on the data rather than the full data.
In our polynomial example, the better a polynomial H fits D, the fewer bits we need
to encode the discrepancies between the actual y-values yi and the predicted y-values
H(xi ). We can typically find a very complex point hypothesis (large L(H)) with a very
good fit (small L(D|H)). We can also typically find a very simple point hypothesis
(small L(H)) with a rather bad fit (large L(D|H)). The sum of the two description
lengths will be minimized at a hypothesis that is quite (but not too) ‘simple’, with a
good (but not perfect) fit.
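To make the trade-off concrete, the following sketch (our own; the data, the noise model with its ML-fitted variance, and the choice of 4 bits per coefficient are all assumptions made for illustration) computes a two-part codelength for polynomials of each degree and picks the minimizing degree. As Section 1.4 explains, the choice of the code for hypotheses – here, the precision – is precisely the problematic part of crude MDL.

import numpy as np

def two_part_codelength(x, y, degree, precision_bits=4):
    # L(H): a fixed (and admittedly arbitrary) precision per coefficient.
    # L(D|H): minus base-2 log-likelihood of the residuals under a Gaussian
    # whose variance is fit by ML (a simplification: we do not encode it).
    coeffs = np.polyfit(x, y, degree)
    res = y - np.polyval(coeffs, x)
    L_H = precision_bits * (degree + 1)
    L_D_given_H = 0.5 * len(x) * np.log2(2 * np.pi * np.e * np.mean(res**2))
    return L_H + L_D_given_H

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 3 * x**3 - 2 * x + rng.normal(scale=0.3, size=20)
print(min(range(8), key=lambda d: two_part_codelength(x, y, d)))
# The sum L(H) + L(D|H) is minimized at a moderate degree (typically 3 here):
# higher degrees barely shorten L(D|H) but steadily lengthen L(H).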
1.4 Crude and Refined MDL
Crude MDL picks the H minimizing the sum L(H) + L(D|H). To make this procedure
well-defined, we need to agree on precise definitions for the codes (description methods)
giving rise to lengths L(D|H) and L(H). We now discuss these codes in more detail.
We will see that the definition of L(H) is problematic, indicating that we somehow
need to ‘refine’ our crude MDL Principle.
Definition of L(D|H) Consider a two-part code as described above, and assume for
the time being that all H under consideration define probability distributions. If H is
a polynomial, we can turn it into a distribution by making the additional assumption
that the Y -values are given by Y = H(X) + Z, where Z is a normally distributed noise
term.
For each H we need to define a code with length L(· | H) such that L(D|H)
can be interpreted as ‘the codelength of D when encoded with the help of H’. It
turns out that for probabilistic hypotheses, there is only one reasonable choice for
this code. It is the so-called Shannon-Fano code, satisfying, for all data sequences D, L(D|H) = − log P (D|H), where P (D|H) is the probability mass or density of D according to H – such a code always exists (Section 2.2).
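For a quick worked illustration (our own numbers, not from the original text): let H be the Bernoulli hypothesis with P (1) = 0.8, and let D be a sequence of 100 outcomes of which 80 are 1s. Then

$$L(D \mid H) = -\log P(D \mid H) = -80\log_2 0.8 - 20\log_2 0.2 \approx 72.2 \text{ bits},$$

which is less than the 100 bits needed to encode D literally; the worse-fitting hypothesis P (1) = 0.5 would give exactly 100 bits, and hypotheses fitting worse still would give more.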
Refined MDL In refined MDL, we associate a code for encoding D not with a single
H ∈ H, but with the full model H. Thus, given model H, we encode data not in
two parts but we design a single one-part code with lengths L̄(D|H). This code is
designed such that whenever there is a member of (parameter in) H that fits the data
well, in the sense that L(D | H) is small, then the codelength L̄(D|H) will also be
small. Codes with this property are called universal codes in the information-theoretic
literature [Barron, Rissanen, and Yu 1998]. Among all such universal codes, we pick
the one that is minimax optimal in a sense made precise in Section 2.5. For example, the
set H(3) of third-degree polynomials is associated with a code with lengths L̄(· | H(3) ) such that, the better the data D are fit by the best-fitting third-degree polynomial, the shorter the codelength L̄(D | H(3) ). L̄(D | H) is called the stochastic complexity of the data given the model H.
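Anticipating Section 2.5.3, we note (as a pointer, not a full definition) that this minimax-optimal universal code is the normalized maximum likelihood (NML) code due to Shtarkov; for a finite sample space its lengths are

$$\bar{L}(D \mid \mathcal{H}) = -\log \frac{P(D \mid \hat{\theta}(D))}{\sum_{D'} P(D' \mid \hat{\theta}(D'))},$$

where θ̂(D) is the ML estimator within the model. The normalizing sum is the same for every D; its logarithm quantifies the ‘parametric complexity’ of the model.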
an instance of Dawid’s [1984] prequential model validation and also relates it to
cross-validation methods.
Refined MDL allows us to compare models of different functional form. It even accounts
for the phenomenon that different models with the same number of parameters may
not be equally ‘complex’:
Example 1.4 Consider two models from psychophysics describing the relationship be-
tween physical dimensions (e.g., light intensity) and their psychological counterparts
(e.g. brightness) [Myung, Balasubramanian, and Pitt 2000]: y = ax^b + Z (Stevens’ model) and y = a ln(x + b) + Z (Fechner’s model), where Z is a normally distributed
noise term. Both models have two free parameters; nevertheless, it turns out that in
a sense, Stevens’ model is more flexible or complex than Fechner’s. Roughly speaking,
this means there are a lot more data patterns that can be explained by Stevens’ model
than can be explained by Fechner’s model. Myung, Balasubramanian, and Pitt [2000]
generated many samples of size 4 from Fechner’s model, using some fixed parameter
values. They then fitted both models to each sample. In 67% of the trials, Stevens’
model fitted the data better than Fechner’s, even though the latter generated the data.
Indeed, in refined MDL, the ‘complexity’ associated with Stevens’ model is much larger
than the complexity associated with Fechner’s, and if both models fit the data equally
well, MDL will prefer Fechner’s model.
Summarizing, refined MDL removes the arbitrary aspect of crude, two-part code MDL
and associates parametric models with an inherent ‘complexity’ that does not depend
on any particular description method for hypotheses. We should, however, warn the
reader that we only discussed a special, simple situation in which we compared a finite
number of parametric models that satisfy certain regularity conditions. Whenever the
models do not satisfy these conditions, or if we compare an infinite number of models,
then the refined ideas have to be extended. We then obtain a ‘general’ refined MDL
Principle, which employs a combination of one-part and two-part codes.
be entertaining but quite irrelevant to the task at hand, namely, to learn useful
properties from the data.”
Based on such ideas, Rissanen has developed a radical philosophy of learning and
statistical inference that is considerably different from the ideas underlying mainstream
statistics, both frequentist and Bayesian. We now describe this philosophy in more
detail:
3. We Have Only the Data Many (but not all6 ) other methods of inductive
inference are based on the idea that there exists some ‘true state of nature’, typically
a distribution assumed to lie in some model H. The methods are then designed as a
means to identify or approximate this state of nature based on as little data as possible.
According to Rissanen7 , such methods are fundamentally flawed. The main reason is
that the methods are designed under the assumption that the true state of nature is
in the assumed model H, which is often not the case. Therefore, such methods only
admit a clear interpretation under assumptions that are typically violated in practice.
Many cherished statistical methods are designed in this way - we mention hypothesis
testing, minimum-variance unbiased estimation, several non-parametric methods, and
even some forms of Bayesian inference – see Example 2.22. In contrast, MDL has a
clear interpretation which depends only on the data, and not on the assumption of any
underlying ‘state of nature’.
Example 1.5 [Models that are Wrong, yet Useful] Even though the models
under consideration are often wrong, they can nevertheless be very useful. Ex-
amples are the successful ‘Naive Bayes’ model for spam filtering, Hidden Markov
Models for speech recognition (is speech a stationary ergodic process? probably
not), and the use of linear models in econometrics and psychology. Since these
models are evidently wrong, it seems strange to base inferences on them using
methods that are designed under the assumption that they contain the true distri-
bution. To be fair, we should add that domains such as spam filtering and speech
recognition are not what the fathers of modern statistics had in mind when they
designed their procedures – they were usually thinking about much simpler do-
mains, where the assumption that some distribution P ∗ ∈ H is ‘true’ may not be
so unreasonable.
Two-part code MDL with ‘clever’ codes achieves good rates of convergence in this sense
(Barron and Cover [1991], complemented by [Zhang 2004], show that in many situa-
tions, the rates are minimax optimal). The same seems to be true for refined one-part
code MDL [Barron, Rissanen, and Yu 1998], although there is at least one surprising
exception where inference based on the NML and Bayesian universal model behaves
abnormally – see [Csiszár and Shields 2000] for the details.
Summarizing this section, the MDL philosophy is quite agnostic about whether any of
the models under consideration is ‘true’, or whether something like a ‘true distribution’
even exists. Nevertheless, it has been suggested [Webb 1996; Domingos 1999] that
MDL embodies a naive belief that ‘simple models’ are ‘a priori more likely to be true’
than complex models. Below we explain why such claims are mistaken.
‘2. Occam’s Razor is false’ It is often claimed that Occam’s razor is false - we
often try to model real-world situations that are arbitrarily complex, so why should we
favor simple models? In the words of G. Webb8 : ‘What good are simple models of a
complex world?’
The short answer is: even if the true data generating machinery is very complex,
it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL
(and the corresponding form of Occam’s razor) is a strategy for inferring models from
data (“choose simple models at small sample sizes”), not a statement about how the
world works (“simple models are more likely to be true”) – indeed, a strategy cannot
be true or false, it is ‘clever’ or ‘stupid’. And the strategy of preferring simpler models
is clever even if the data generating process is highly complex, as illustrated by the
following example:
Example 1.6 [‘Infinitely’ Complex Sources] Suppose that data are subject to the
law Y = g(X) + Z where g is some continuous function and Z is some noise term
with mean 0. If g is not a polynomial, but X only takes values in a finite interval,
say [−1, 1], we may still approximate g arbitrarily well by taking higher and higher
degree polynomials. For example, let g(x) = exp(x). Then, if we use MDL to learn
a polynomial for data D = ((x1 , y1 ), . . . , (xn , yn )), the degree of the polynomial f̈(n) selected by MDL at sample size n will increase with n, and with high probability, f̈(n) converges to g(x) = exp(x) in the sense that max_{x∈[−1,1]} |f̈(n)(x) − g(x)| → 0. Of
course, if we had better prior knowledge about the problem we could have tried to learn
g using a model class M containing the function y = exp(x). But in general, both our
imagination and our computational resources are limited, and we may be forced to use
imperfect models.
If, based on a small sample, we choose the best-fitting polynomial fˆ within the set
of all polynomials, then, even though fˆ will fit the data very well, it is likely to be
quite unrelated to the ‘true’ g, and fˆ may lead to disastrous predictions of future
data. The reason is that, for small samples, the set of all polynomials is very large
compared to the set of possible data patterns that we might have observed. Therefore,
any particular data pattern can only give us very limited information about which
high-degree polynomial best approximates g. On the other hand, if we choose the
best-fitting fˆ◦ in some much smaller set such as the set of second-degree polynomials,
then it is highly probable that the prediction quality (mean squared error) of fˆ◦ on
future data is about the same as its mean squared error on the data we observed: the
size (complexity) of the contemplated model is relatively small compared to the set of
possible data patterns that we might have observed. Therefore, the particular pattern
that we do observe gives us a lot of information on what second-degree polynomial best
approximates g.
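A small simulation (our own sketch; the noise level, sample sizes and degrees are arbitrary choices) illustrates both points at once:

import numpy as np

rng = np.random.default_rng(42)

def sample(n):
    # The 'infinitely complex' source of Example 1.6: y = exp(x) + noise.
    x = rng.uniform(-1, 1, n)
    return x, np.exp(x) + rng.normal(scale=0.2, size=n)

x_tr, y_tr = sample(10)       # a small training sample
x_te, y_te = sample(1000)     # fresh data from the same source

for degree in (2, 9):         # a modest vs. a (nearly) interpolating degree
    c = np.polyfit(x_tr, y_tr, degree)
    print(degree,
          np.mean((y_tr - np.polyval(c, x_tr))**2),   # train error: tiny for 9
          np.mean((y_te - np.polyval(c, x_te))**2))   # test error: huge for 9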
Thus, (a) fˆ◦ typically leads to better predictions of future data than fˆ; and (b) unlike fˆ, fˆ◦ is reliable in that it gives a correct impression of how well it will predict future data even if the ‘true’ g is ‘infinitely’ complex. This idea does not just
appear in MDL, but is also the basis of Vapnik’s [1998] Structural Risk Minimization
approach and many standard statistical methods for non-parametric inference. In such
approaches one acknowledges that the data generating machinery can be infinitely com-
plex (e.g., not describable by a finite degree polynomial). Nevertheless, it is still a good
strategy to approximate it by simple hypotheses (low-degree polynomials) as long as
the sample size is small. Summarizing:
The Inherent Difference between Under- and Overfitting
If we choose an overly simple model for our data, then the best-fitting point hy-
pothesis within the model is likely to be almost the best predictor, within the
simple model, of future data coming from the same source. If we overfit (choose
a very complex model) and there is noise in our data, then, even if the complex
model contains the ‘true’ point hypothesis, the best-fitting point hypothesis within
the model is likely to lead to very bad predictions of future data coming from the
same source.
This statement is very imprecise and is meant more to convey the general idea than to be completely true. As will become clear in Section 2.10.1, it becomes provably true if we use MDL’s measure of model complexity, measure prediction quality by logarithmic loss, and assume that one of the distributions in H actually generates the data.
1.7 History
The MDL Principle has mainly been developed by J. Rissanen in a series of papers
starting with [Rissanen 1978]. It has its roots in the theory of Kolmogorov or algorith-
mic complexity [Li and Vitányi 1997], developed in the 1960s by Solomonoff [1964],
Kolmogorov [1965] and Chaitin [1966, 1969]. Among these authors, Solomonoff (a
former student of the famous philosopher of science, Rudolf Carnap) was explicitly in-
terested in inductive inference. The 1964 paper contains explicit suggestions on how the
underlying ideas could be made practical, thereby foreshadowing some of the later work
on two-part MDL. While Rissanen was not aware of Solomonoff’s work at the time,
Kolmogorov’s [1965] paper did serve as an inspiration for Rissanen’s [1978] development
of MDL.
Another important inspiration for Rissanen was Akaike’s [1973] AIC method for
model selection, essentially the first model selection method based on information-
theoretic ideas. Even though Rissanen was inspired by AIC, both the actual method
and the underlying philosophy are substantially different from MDL.
MDL is much more closely related to the Minimum Message Length Principle, devel-
oped by Wallace and his co-workers in a series of papers starting with the ground-
breaking [Wallace and Boulton 1968]; other milestones are [Wallace and Boulton 1975]
and [Wallace and Freeman 1987]. Remarkably, Wallace developed his ideas without be-
ing aware of the notion of Kolmogorov complexity. Although Rissanen became aware of
Wallace’s work before the publication of [Rissanen 1978], he developed his ideas mostly
independently, being influenced rather by Akaike and Kolmogorov. Indeed, despite
the close resemblance of both methods in practice, the underlying philosophy is quite
different - see Section 2.9.
The first publications on MDL only mention two-part codes. Important progress
was made by Rissanen [1984], in which prequential codes are employed for the first
time and [Rissanen 1987], introducing the Bayesian mixture codes into MDL. This led
to the development of the notion of stochastic complexity as the shortest codelength
of the data given a model [Rissanen 1986; Rissanen 1987]. However, the connection to
Shtarkov’s normalized maximum likelihood code was not made until 1996, and this pre-
vented the full development of the notion of ‘parametric complexity’. In the meantime,
in his impressive Ph.D. thesis, Barron [1985] showed how a specific version of the two-
part code criterion has excellent frequentist statistical consistency properties. This was
extended by Barron and Cover [1991] who achieved a breakthrough for two-part codes:
they gave clear prescriptions on how to design codes for hypotheses, relating codes with
good minimax codelength properties to rates of convergence in statistical consistency
theorems. Some of the ideas of Rissanen [1987] and Barron and Cover [1991] were, as
it were, unified when Rissanen [1996] introduced a new definition of stochastic com-
plexity based on the normalized maximum likelihood code (Section 2.5). The resulting
theory was summarized for the first time by Barron, Rissanen, and Yu [1998], and is
called ‘refined MDL’ in the present overview.
Notes
1. See Section 2.9.2, Example 2.22.
2. By this we mean that a universal Turing Machine can be implemented in it [Li and Vitányi 1997].
3. Strictly speaking, in our context it is not very accurate to speak of ‘simple’ or ‘complex’ polyno-
mials; instead we should call the set of first degree polynomials ‘simple’, and the set of 100-th degree
polynomials ‘complex’.
4. The terminology ‘crude MDL’ is not standard. It is introduced here for pedagogical reasons,
to make clear the importance of having a single, unified principle for designing codes. It should
be noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such prin-
ciples, albeit in a slightly different form than in their recent papers. Early practical applications
[Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are ‘crude’
in the sense defined here.
5. See the previous note.
6. For example, cross-validation cannot easily be interpreted in such terms of ‘a method hunting for
the true distribution’.
7. The present author’s own views are somewhat milder in this respect, but this is not the place to
discuss them.
8. Quoted with permission from KDD Nuggets 96:28, 1996.
Chapter 2

Tutorial on MDL
and stochastic complexity. It gives an asymptotic expansion of these quanti-
ties and interprets them from a compression, a geometric, a Bayesian and a
predictive point of view.
– Section 2.7 extends refined MDL to harder model selection problems, and
in doing so reveals the general, unifying idea, which is summarized in Fig-
ure 2.4.
– Section 2.8 briefly discusses how to extend MDL to applications beyond model selection.
• Having defined ‘refined MDL’ in Sections 2.6–2.8, the next two sections place it
in context:
Reader’s Guide
Throughout the text, paragraph headings reflect the most important concepts.
Boxes summarize the most important findings. Together, paragraph headings and
boxes provide an overview of MDL theory.
It is possible to read this chapter without having read the non-technical overview
of Chapter 1. However, we strongly recommend reading at least Sections 1.3 and 1.4 before embarking on the present chapter.
distributed according to P , then the code with lengths − log P achieves the minimum
expected codelength. Throughout the section we give examples relating our findings to
our discussion of regularity and compression in Section 1.2 of Chapter 1.
We already designed a code C for the elements in X . The natural thing to do is to
encode (x1 , . . . , xn ) by the concatenated string C(x1 )C(x2 ) . . . C(xn ). In order for this method to succeed for all n and all (x1 , . . . , xn ) ∈ X n , the resulting procedure must define a code, i.e. the function C (n) mapping (x1 , . . . , xn ) to C(x1 )C(x2 ) . . . C(xn ) must be
invertible. If it were not, we would have to use some marker such as a comma to
separate the codewords. We would then really be using a ternary rather than a binary
alphabet.
Since we always want to construct codes for sequences rather than single symbols,
we only allow codes C such that the extension C (n) defines a code for all n. We say
that such codes have ‘uniquely decodable extensions’. It is easy to see that (a) every
prefix code has uniquely decodable extensions. Conversely, although this is not at all easy to see, it turns out that (b), for every code C with uniquely decodable extensions, there exists a prefix code C0 such that for all n and all xn ∈ X n , $L_{C^{(n)}}(x^n) = L_{C_0^{(n)}}(x^n)$ [Cover and Thomas 1991]. Since in MDL we are only interested in code-lengths, and
never in actual codes, we can restrict ourselves to prefix codes without loss of generality.
Thus, the restriction to prefix code may also be understood as a means to send
concatenated messages while avoiding the need to introduce extra symbols into the
alphabet.
Whenever in the sequel we speak of ‘code’, we really mean ‘prefix code’. We call a
prefix code C for a set X complete if there exists no other prefix code that compresses at least one x more and no x less than C, i.e. if there exists no code C ′ such that for all x, LC ′ (x) ≤ LC (x), with strict inequality for at least one x.
We showed that (a) the first sequence - an n-fold repetition of 0001 - could be sub-
stantially compressed if we use as our code a general-purpose programming language
(assuming that valid programs must end with a halt-statement or a closing bracket,
such codes satisfy the prefix property). We also claimed that (b) the second sequence, n
independent outcomes of fair coin tosses, cannot be compressed, and that (c) the third
sequence could be compressed to αn bits, with 0 < α < 1. We are now in a position
to prove statement (b): strings which are ‘intuitively’ random cannot be substantially
compressed. Let us take some arbitrary but fixed description method over the data
alphabet consisting of the set of all binary sequences of length n. Such a code maps
binary strings to binary strings. There are $2^n$ possible data sequences of length n. Only two of these can be mapped to a description of length 1 (since there are only two binary strings of length 1: ‘0’ and ‘1’). Similarly, only a subset of at most $2^m$ sequences can have a description of length m. This means that at most $\sum_{i=1}^{m} 2^i < 2^{m+1}$ data sequences can have a description length ≤ m. The fraction of data sequences of length n that can be compressed by more than k bits is therefore at most $2^{-k}$ and as such decreases exponentially in k. If data are generated by n tosses of a fair coin, then all $2^n$ possibilities for the data are equally probable, so the probability that we can compress the data by more than k bits is smaller than $2^{-k}$. For example, the probability that
we can compress the data by more than 20 bits is smaller than one in a million.
We note that after the data (2.2) has been observed, it is always possible to design a
code which uses arbitrarily few bits to encode this data - the actually observed sequence
may be encoded as ‘1’ for example, and no other sequence is assigned a codeword. The
point is that with a code that has been designed before seeing any data, it is virtually
impossible to substantially compress randomly generated data.
The example demonstrates that achieving a short description length for the data is
equivalent to identifying the data as belonging to a tiny, very special subset out of all
a priori possible data sequences.
Probability Mass Functions correspond to Codelength Functions
Let Z be a finite or countable set and let P be a probability distribution on Z. Then
there exists a prefix code C for Z such that for all z ∈ Z, LC (z) = ⌈− log P (z)⌉.
C is called the code corresponding to P .
Similarly, let C ′ be a prefix code for Z. Then there exists a (possibly defective)
probability distribution P ′ such that for all z ∈ Z, − log P ′ (z) = LC ′ (z). P ′ is
called the probability distribution corresponding to C ′ .
Moreover, C ′ is a complete prefix code iff P ′ is proper, i.e. $\sum_z P'(z) = 1$.
Thus, large probability according to P means small code length according to the
code corresponding to P and vice versa.
We are typically concerned with cases where Z represents sequences of n outcomes;
that is, Z = X n (n ≥ 1) where X is the sample space for one observation.
element. We can arrive at a code corresponding to PU as follows. First, order and num-
ber the elements in Z as 0, 1, . . . , M −1. Then, for each z with number j, set C(z) to be
equal to j represented as a binary number with ⌈log M ⌉ bits. The resulting code has,
for all z ∈ Z, LC (z) = ⌈log M ⌉ = ⌈− log PU (z)⌉. This is a code corresponding to PU
(Figure 2.1). In general, there exist several codes corresponding to PU , one for each ordering of Z. But all these codes share the same length function LU (z) := ⌈− log PU (z)⌉; therefore, LU (z) is the unique codelength function corresponding to PU .
For example, if M = 4, Z = {a, b, c, d}, we can take C(a) = 00, C(b) = 01, C(c) =
10, C(d) = 11 and then LU (z) = 2 for all z ∈ Z. In general, codes corresponding to
uniform distributions assign fixed lengths to each z and are called fixed-length codes.
To map a non-uniform distribution to a corresponding code, we have to use a more
intricate construction [Cover and Thomas 1991].
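For the uniform case, the construction just described takes only a few lines; here is a sketch (our own illustration):

import math

def uniform_code(elements):
    # The fixed-length code of Example 2.3: number the elements
    # 0..M-1 and write each number with ceil(log2 M) bits.
    m = math.ceil(math.log2(len(elements)))
    return {z: format(j, f'0{m}b') for j, z in enumerate(elements)}

print(uniform_code(['a', 'b', 'c', 'd']))  # {'a': '00', 'b': '01', 'c': '10', 'd': '11'}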
In practical applications, we almost always deal with probability distributions P and
strings xn such that P (xn ) decreases exponentially in n; for example, this will typically be the case if data are i.i.d., such that $P(x^n) = \prod_i P(x_i)$. Then − log P (xn ) increases
linearly in n and the effect of rounding off − log P (xn ) becomes negligible. Note that
the code corresponding to the product distribution of P on X n does not have to be the
n-fold extension of the code for the original distribution P on X – if we were to require
that, the effect of rounding off would be on the order of n . Instead, we directly design a
code for the distribution on the larger space Z = X n . In this way, the effect of rounding
changes the codelength by at most 1 bit, which is truly negligible. For this and other4
reasons, we henceforth simply neglect the integer requirement for codelengths. This
simplification allows us to identify codelength functions and (defective) probability
New Definition of Code Length Function
In MDL we are NEVER concerned with actual encodings; we are only concerned
with code length functions. The set of all codelength functions for finite or count-
able sample space Z is defined as:
$$\mathcal{L}_Z = \Big\{\, L : Z \to [0, \infty] \;\Big|\; \sum_{z \in Z} 2^{-L(z)} \leq 1 \,\Big\}, \qquad (2.4)$$

or equivalently, $\mathcal{L}_Z$ is the set of those functions L on Z such that there exists a function Q with $\sum_z Q(z) \leq 1$ and, for all z, L(z) = − log Q(z) (Q(z) = 0 corresponds to L(z) = ∞).
Again, Z usually represents a sample of n outcomes: Z = X n (n ≥ 1) where X is
the sample space for one observation.
mass functions, such that a short codelength corresponds to a high probability and
vice versa. Furthermore, as we will see, in MDL we are not interested in the details of
actual encodings C(z); we only care about the code lengths LC (z). It is so useful to
think about these as log-probabilities, and so convenient to allow for non-integer non-
probabilities, that we will simply redefine prefix code length functions as (defective)
probability mass functions that can have non-integer code lengths – see Figure 2.2.
The following example illustrates idealized codelength functions and at the same time
introduces a type of code that will be frequently used in the sequel:
Example 2.4 ‘Almost’ Uniform Code for the Positive Integers Suppose we
want to encode a number k ∈ {1, 2, . . .}. In Example 2.3, we saw that in order to
encode a number between 1 and M , we need log M bits. What if we cannot determine
the maximum M in advance? We cannot just encode k using the uniform code for
{1, . . . , k}, since the resulting code would not be prefix. So in general, we will need
more than log k bits. Yet there exists a prefix-free code which performs ‘almost’ as well
as log k. The simplest of such codes works as follows. k is described by a codeword
starting with ⌈log k⌉ 0s. This is followed by a 1, and then k is encoded using the
uniform code for {1, . . . , $2^{\lceil \log k \rceil}$}. With this protocol, a decoder can first reconstruct
⌈log k⌉ by counting all 0’s before the leftmost 1 in the encoding. He then has an upper
bound on k and can use this knowledge to decode k itself. This protocol uses less than
2⌈log k⌉ + 1 bits. Working with idealized, non-integer code-lengths we can simplify
this to 2 log k + 1 bits. To see this, consider the function $P(x) = 2^{-2\log x - 1}$. An easy
calculation gives

$$\sum_{x=1,2,\ldots} P(x) \;=\; \sum_{x=1,2,\ldots} 2^{-2\log x - 1} \;=\; \frac{1}{2}\sum_{x=1,2,\ldots} x^{-2} \;<\; \frac{1}{2} + \frac{1}{2}\sum_{x=2,3,\ldots} \frac{1}{x(x-1)} \;=\; 1,$$

so P is a (defective) probability mass function and, by the correspondence of Figure 2.2, the idealized codelength 2 log k + 1 is indeed achievable.
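Here is a minimal implementation sketch (our own; it uses the classic Elias-gamma layout, a close variant of the protocol above with ⌊log k⌋ rather than ⌈log k⌉ leading 0s):

def encode_int(k):
    # floor(log2 k) zeros, then k in binary (which starts with a 1):
    # 2*floor(log2 k) + 1 bits in total.
    assert k >= 1
    b = bin(k)[2:]
    return '0' * (len(b) - 1) + b

def decode_int(code):
    # Count leading zeros to learn the length, then read k itself.
    n = code.index('1')
    return int(code[n:2 * n + 1], 2), code[2 * n + 1:]  # (k, remaining bits)

print(encode_int(13))         # '0001101'
print(decode_int('0001101'))  # (13, '')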
Example 2.5 [Example 1.2 and 2.2, Continued.] We are now also in a posi-
tion to prove the third and final claim of Examples 1.2 and 2.2. Consider the three sequences (2.1), (2.2) and (2.3) again. It remains to investigate how much the third sequence can be compressed. Assume for concreteness that, before seeing the sequence, we are told that the sequence contains a fraction of 1s equal to 1/5 + ε for some small unknown ε. By the Kraft inequality, Figure 2.1, for all distributions P , there exists some code on sequences of length n such that for all xn ∈ X n , L(xn ) = ⌈− log P (xn )⌉. The fact that the fraction of 1s is approximately equal to 1/5 suggests to model xn as independent outcomes of a coin with bias 1/5. The corresponding distribution P0 satisfies
$$-\log P_0(x^n) = -\log\left[\left(\tfrac{1}{5}\right)^{n_{[1]}}\left(\tfrac{4}{5}\right)^{n_{[0]}}\right] = n\left[-\left(\tfrac{1}{5}+\varepsilon\right)\log\tfrac{1}{5} - \left(\tfrac{4}{5}-\varepsilon\right)\log\tfrac{4}{5}\right] = n\left[\log 5 - \tfrac{8}{5} + 2\varepsilon\right],$$
where n[j] denotes the number of occurrences of symbol j in xn . For small enough ε, the part between brackets is smaller than 1, so that, using the code L0 with lengths − log P0 , the sequence can be encoded using αn bits where α satisfies 0 < α < 1. Thus, using the code L0 , the sequence can be compressed by a linear amount if we use a specially designed code that assigns short codelengths to sequences with about four times as many 0s as 1s.
We note that after the data (2.3) has been observed, it is always possible to design
a code which uses arbitrarily few bits to encode xn - the actually observed sequence
may be encoded as ‘1’ for example, and no other sequence is assigned a codeword.
The point is that with a code that has been designed before seeing the actual sequence, given only the knowledge that the sequence will contain approximately four times as many 0s as 1s, the sequence is guaranteed to be compressed by an amount linear in n.
Continuous Sample Spaces How does the correspondence work for continuous-
valued X ? In this tutorial we only consider P on X such that P admits a density5 .
The P that corresponds to L minimizes expected codelength
Let P be a distribution on (finite, countable or continuous-valued) Z and let L be
defined by

$$L := \arg\min_{L' \in \mathcal{L}_Z} E_P[L'(Z)]. \qquad (2.5)$$

Then L is the codelength function corresponding to P , i.e. L(z) = − log P (z).
Whenever in the following we make a general statement about sample spaces X and distributions P , X may be finite, countable or any subset of $R^l$, for any integer l ≥ 1,
and P (x) represents the probability mass function or density of P , as the case may
be. In the continuous case, all sums should be read as integrals. The correspon-
dence between probability distributions and codes may be extended to distributions on
continuous-valued X : we may think of L(xn ) := − log P (xn ) as a code-length function
corresponding to Z = X n encoding the values in X n at unit precision; here P (xn ) is
the density of xn according to P . We refer to [Cover and Thomas 1991] for further
details.
In this form, the result is known as the information inequality. It is easily proved using
concavity of the logarithm [Cover and Thomas 1991].
The information inequality says the following: suppose Z is distributed according
to P (‘generated by P ’). Then, among all possible codes for Z, the code with lengths
− log P (Z) ‘on average’ gives the shortest encodings of outcomes of P . Why should
we be interested in the average? The law of large numbers [Feller 1968] implies that,
for large samples of data distributed according to P , with high P -probability, the code
that gives the shortest expected lengths will also give the shortest actual codelengths,
which is what we are really interested in. This will hold if data are i.i.d., but also more
generally if P defines a ‘stationary and ergodic’ process.
Example 2.6 Let us briefly illustrate this. Let P ∗ , QA and QB be three probability distributions on X , extended to Z = X n by independence. Hence $P^*(x^n) = \prod_i P^*(x_i)$, and similarly for QA and QB . Suppose we obtain a sample generated by P ∗ . Mr. A and Mrs. B both want to encode the sample using as few bits as possible, but neither knows that P ∗ has actually been used to generate the sample. A decides to use the code corresponding to distribution QA and B decides to use the code corresponding to QB . Suppose that EP ∗ [− log QA (X)] < EP ∗ [− log QB (X)]. Then, by the law of large numbers, with P ∗ -probability 1, $n^{-1}[-\log Q_j(X_1,\ldots,X_n)] \to E_{P^*}[-\log Q_j(X)]$ for both j ∈ {A, B} (note that $-\log Q_j(X^n) = -\sum_{i=1}^{n} \log Q_j(X_i)$). It follows that, with probability 1, Mr. A will need fewer bits (by an amount linear in n) to encode X1 , . . . , Xn than Mrs. B.
The qualitative content of this result is not so surprising: in a large sample generated
by P , the frequency of each x ∈ X will be approximately equal to the probability
P (x). In order to obtain a short codelength for xn , we should use a code that assigns a
small codelength to those symbols in X with high frequency (probability), and a large
codelength to those symbols in X with low frequency (probability).
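A simulation in the spirit of Example 2.6 (our own sketch; the particular distributions are invented for illustration): data come from P ∗ with P ∗ (1) = 0.8, Mr. A codes with QA (1) = 0.7 and Mrs. B with QB (1) = 0.5.

import math, random

random.seed(0)
p_star, q = 0.8, {'A': 0.7, 'B': 0.5}
n = 100000
xs = [1 if random.random() < p_star else 0 for _ in range(n)]

for name, q1 in q.items():
    bits = sum(-math.log2(q1 if x == 1 else 1.0 - q1) for x in xs)
    print(name, round(bits / n, 4), 'bits per outcome')
# A's average approaches E[-log Q_A] = 0.8*log2(1/0.7) + 0.2*log2(1/0.3) ~ 0.759,
# while B needs exactly 1 bit per outcome, so A saves about 0.24*n bits in total.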
Summary and Outlook In this section we introduced (prefix) codes and thoroughly
discussed the relation between probabilities and codelengths. We are now almost ready
to formalize a simple version of MDL – but first we need to review some concepts of
statistics.
θ → P (· | θ) is smooth (appropriately defined), we call M a parametric model or family.
For example, the model M of all normal distributions on X = R is a parametric model
that can be parameterized by θ = (µ, σ 2 ) where µ is the mean and σ 2 is the variance of
the distribution indexed by θ. The family of all Markov chains of all orders is a model,
but not a parametric model. We call a model M an i.i.d. model if, according to all
P ∈ M, X1 , X2 , . . . are i.i.d. We call M k-dimensional if k is the smallest integer such that M can be smoothly parameterized by some Θ ⊆ $R^k$.
For a given model M and sample D = xn , the maximum likelihood (ML) P is
the P ∈ M maximizing P (xn ). For a parametric model with parameter space Θ,
the maximum likelihood estimator θ̂ is the function that, for each n, maps xn to the
θ ∈ Θ that maximizes the likelihood P (xn | θ). The ML estimator may be viewed as
a ‘learning algorithm’. This is a procedure that, when input a sample xn of arbitrary
length, outputs a parameter or hypothesis Pn ∈ M. We say a learning algorithm is
consistent relative to distance measure d if for all P ∗ ∈ M, if data are distributed
according to P ∗ , then the output Pn converges to P ∗ in the sense that d(P ∗ , Pn ) → 0
with P ∗ -probability 1. Thus, if P ∗ is the ‘true’ state of nature, then given enough data,
the learning algorithm will learn a good approximation of P ∗ with very high probability.
Example 2.7 [Markov and Bernoulli models] Recall that a k-th order Markov
chain on X = {0, 1} is a probabilistic source such that for every n > k,

$$P(X_n \mid X_{n-1}, \ldots, X_1) = P(X_n \mid X_{n-1}, \ldots, X_{n-k}). \qquad (2.6)$$

That is, the probability distribution of Xn depends only on the k symbols preceding n. Thus, there are $2^k$ possible distributions of Xn , and each such distribution is identified with a state of the Markov chain. To fully identify the chain, we also need to specify the starting state, defining the first k outcomes X1 , . . . , Xk . The k-th order Markov model is the set of all k-th order Markov chains, i.e. all sources satisfying (2.6) equipped with a starting state.
The special case of the 0-th order Markov model is the Bernoulli or biased coin model, which we denote by B (0) . We can parameterize the Bernoulli model by a parameter θ ∈ [0, 1] representing the probability of observing a 1. Thus, B (0) = {P (· | θ) | θ ∈ [0, 1]}, with P (xn | θ) by definition equal to

$$P(x^n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{n_{[1]}} (1-\theta)^{n_{[0]}},$$

where n[1] stands for the number of 1s, and n[0] for the number of 0s in the sample. Note that the Bernoulli model is i.i.d. The log-likelihood is given by
$$\log P(x^n \mid \theta) = n_{[1]} \log \theta + n_{[0]} \log(1-\theta). \qquad (2.7)$$
Taking the derivative of (2.7) with respect to θ, we see that for fixed xn , the log-
likelihood is maximized by setting the probability of 1 equal to the observed frequency.
Since the logarithm is a monotonically increasing function, the likelihood is maximized
at the same value: the ML estimator is given by θ̂(xn ) = n[1] /n.
Similarly, the first-order Markov model B (1) can be parameterized by a vector θ = (θ[1|0] , θ[1|1] ) ∈ $[0, 1]^2$ together with a starting state in {0, 1}. Here θ[1|j] represents the
probability of observing a 1 following the symbol j. The log-likelihood is given by
log P (xn | θ) = n[1|1] log θ[1|1] + n[0|1] log(1 − θ[1|1] ) + n[1|0] log θ[1|0] + n[0|0] log(1 − θ[1|0] ),
n[i|j] denoting the number of times outcome i is observed in state (previous outcome) j. This is maximized by setting θ̂ = (θ̂[1|0] , θ̂[1|1] ) with θ̂[i|j] = n[ji] /n[j] , the conditional frequency of i preceded by j. In general, a k-th order Markov chain has $2^k$ parameters and the corresponding likelihood is maximized by setting the parameter θ[i|j] equal to the number of times i was observed in state j divided by the number of times the chain was in state j.
Suppose now we are given data D = xn and we want to find the Markov chain that
best explains D. Since we do not want to restrict ourselves to chains of fixed order, we
run a large risk of overfitting: simply picking, among all Markov chains of each order,
the ML Markov chain that maximizes the probability of the data, we typically end up
with a chain of order n − 1 with starting state given by the sequence x1 , . . . , xn−1 , and
P (Xn = xn | X n−1 = xn−1 ) = 1. Such a chain will assign probability 1 to xn . Below
we show that MDL makes a more reasonable choice.
• L(D|H) is the length, in bits, of the description of the data when encoded
with the help of the hypothesis.
The best model to explain D is the smallest model containing the selected H.
In this section, we implement this crude MDL Principle by giving a precise definition
of the terms L(H) and L(D|H). To make the first term precise, we must design a
code C1 for encoding hypotheses H such that L(H) = LC1 (H). For the second term,
we must design a set of codes C2,H (one for each H ∈ M) such that for all D ∈ X n ,
L(D|H) = LC2,H (D). We start by describing the codes C2,H .
1. With this choice, the code length L(xn | H) is equal to minus the log-likelihood
of xn according to H, which is a standard statistical notion of ‘goodness-of-fit’.
2. If the data turn out to be distributed according to P , then the code L(· | H) will
uniquely minimize the expected code length (Section 2.2).
The second item implies that our choice is, in a sense, the only reasonable choice10 .
To see this, suppose M is a finite i.i.d. model containing, say, M distributions.
Suppose we assign an arbitrary but finite code length L(H) to each H ∈ M. Sup-
pose X1 , X2 , . . . are actually distributed i.i.d. according to some ‘true’ H ∗ ∈ M.
By the reasoning of Example 2.6, we see that MDL will select the true distribution
P (· | H ∗ ) for all large n, with probability 1. This means that MDL is consistent
for finite M. If we were to assign codes to distributions in some other manner
not satisfying L(D | H) = − log P (D | H), then there would exist distributions P (· | H) such that L(D|H) ≠ − log P (D|H). But by Figure 2.1, there must be some distribution P (· | H ′ ) with L(·|H) = − log P (· | H ′ ). Now let M = {H, H ′ }
and suppose data are distributed according to P (· | H ′ ). Then, by the reasoning
of Example 2.6, MDL would select H rather than H ′ for all large n! Thus, MDL
would be inconsistent even in this simplest of all imaginable cases – there would
then be no hope for good performance in the considerably more complex situations
we intend to use it for11 .
with all parameters equal to 0 or 1. Since there are only a finite number ($2^{n-1}$) of these, this is possible. But then, for each n, xn ∈ X n , MDL would select the ML
Markov chain of order n − 1. Thus, MDL would coincide with ML and, no matter
how large n, we would overfit.
Example 2.8 [A Very Crude Code for the Markov Chains] We can describe a Markov chain of order k by first describing k, and then describing a parameter vector $\theta \in [0,1]^{k'}$ with $k' = 2^k$. We describe k using our simple code for the integers (Example 2.4). This takes 2 log k + 1 bits. We now have to describe the k′ -component parameter vector. We saw in Example 2.7 that for any xn , the best-fitting (ML) k-th order Markov chain can be identified with k′ frequencies. It is not hard to see that these frequencies are uniquely determined by the counts n[1|0...00] , n[1|0...01] , . . . , n[1|1...11] . Each individual count must be in the (n + 1)-element set {0, 1, . . . , n}. Since we assume n is given in advance13 , we may use a simple fixed-length code to encode this count, taking log(n + 1) bits (Example 2.3). Thus, once k is fixed, we can describe such a Markov chain by a uniform code using k′ log(n + 1) bits. With the code just defined we get for any P ∈ B, indexed by parameter θ (k) ,

$$L(P) = L(k, \theta^{(k)}) = 2\log k + 1 + k'\log(n+1),$$

so that with these codes, MDL tells us to pick the k, θ (k) minimizing

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = 2\log k + 1 + k'\log(n+1) - \log P(D \mid k, \theta^{(k)}), \qquad (2.8)$$
where the θ (k) that is chosen will be equal to the ML estimator for M(k) .
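As an illustration (our own sketch, not from the original text), the criterion (2.8) is straightforward to compute:

import math
from collections import Counter

def crude_mdl_markov(x, max_k=6):
    # A sketch of criterion (2.8), with two simplifications (our own):
    # we condition on the first k symbols instead of encoding a starting
    # state, and we consider orders k >= 1 only, since the integer code
    # of Example 2.4 encodes k >= 1.
    n = len(x)
    best = None
    for k in range(1, max_k + 1):
        trans = Counter((x[i - k:i], x[i]) for i in range(k, n))
        states = Counter(x[i - k:i] for i in range(k, n))
        neg_ll = -sum(c * math.log2(c / states[s]) for (s, _), c in trans.items())
        total = 2 * math.log2(k) + 1 + 2**k * math.log2(n + 1) + neg_ll
        if best is None or total < best[1]:
            best = (k, total)
    return best

print(crude_mdl_markov('0001' * 250))
# Order 3 wins here: it is the smallest order that makes the repeating
# pattern deterministic, so -log P(D | k, ML parameters) drops to 0.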
Why (not) this code? We may ask two questions about this code. First, why did
we only reserve codewords for θ that are potentially ML estimators for the given data?
The reason is that, given $k' = 2^k$, the codelength L(D | k, θ (k) ) is minimized by θ̂ (k) (D), the ML estimator. Reserving codewords for $\theta \in [0,1]^{k'}$ that cannot be ML estimates would only serve to lengthen L(D | k, θ (k) ) and can never shorten L(k, θ (k) ).
Thus, the total description length needed to encode D will increase. Since our stated
goal is to minimize description lengths, this is undesirable.
However, by the same logic we may also ask whether we have not reserved too many codewords for $\theta \in [0,1]^{k'}$. And in fact, it turns out that we have: the distance between two adjacent ML estimators is O(1/n). Indeed, if we had used a coarser precision, only reserving codewords for parameters with distances O(1/√n), we would obtain smaller code lengths – (2.8) would become

$$L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = -\log P(D \mid k, \hat{\theta}^{(k)}) + \frac{k'}{2}\log n + c_k, \qquad (2.9)$$
where ck is a small constant depending on k, but not n [Barron and Cover 1991]. In
Section 2.6 we show that (2.9) is in some sense ‘optimal’.
The Good News and the Bad News The good news is (1) we have found a
principled, non-arbitrary manner to encode data D given a probability distribution
H, namely, to use the code with lengths − log P (D | H); and (2), asymptotically,
any code for hypotheses will lead to a consistent criterion. The bad news is that we
have not found clear guidelines to design codes for hypotheses H ∈ M. We found
some intuitively reasonable codes for Markov chains, and we then reasoned that these
could be somewhat ‘improved’, but what is conspicuously lacking is a sound theoretical
principle for designing and improving codes.
We take the good news to mean that our idea may be worth pursuing further. We take
the bad news to mean that we do have to modify or extend the idea to get a meaningful,
non-arbitrary and practically relevant model selection method. Such an extension was
already suggested in Rissanen’s early works [Rissanen 1978; Rissanen 1983] and refined
by Barron and Cover [1991]. However, in these works, the principle was still restricted
to two-part codes. To get a fully satisfactory solution, we need to move to ‘universal
codes’, of which the two-part codes are merely a special case.
2.5 Information Theory II: Universal Codes and Models

A universal code, relative to a given collection of candidate codes L, is a code that compresses every sequence x^n nearly as well as the code in L that compresses that particular sequence most. Two-part codes are universal (Section 2.5.1), but there exist other
universal codes such as the Bayesian mixture code (Section 2.5.2) and the Normalized
Maximum Likelihood (NML) code (Section 2.5.3). We also discuss universal models,
which are just the probability distributions corresponding to universal codes. In this
section, we are not concerned with learning from data; we only care about compressing
data as much as possible. We reconnect our findings with learning in Section 2.6.
Example 2.9 Suppose we think that our sequence can be reasonably well-compressed
by a code corresponding to some biased coin model. For simplicity, we restrict ourselves
to a finite number of such models. Thus, let L = {L1 , . . . , L9 } where L1 is the code
length function corresponding to the Bernoulli model P (· | θ) with parameter θ = 0.1,
L2 corresponds to θ = 0.2 and so on. From (2.7) we see that, for example,
L8 (xn ) = − log P (xn |0.8) = −n[0] log 0.2 − n[1] log 0.8
L9 (xn ) = − log P (xn |0.9) = −n[0] log 0.1 − n[1] log 0.9.
Both L8(x^n) and L9(x^n) are linearly increasing in the number of 1s in x^n. However, if the frequency n[1]/n is approximately 0.8, then min_{L∈L} L(x^n) will be achieved for L8. If n[1]/n ≈ 0.9 then min_{L∈L} L(x^n) is achieved for L9. More generally, if n[1]/n ≈ j/10 then Lj achieves the minimum15. We would like to send x^n using a code L̄ such that for all x^n, we need at most L̂(x^n) bits, where L̂(x^n) is defined as L̂(x^n) := min_{L∈L} L(x^n).
Since − log is monotonically decreasing, L̂(xn ) = − log P (xn | θ̂(xn )). We already gave
an informal explanation as to why a code with lengths L̂ does not exist. We can now
explain this more formally as follows: if such a code were to exist, it would correspond
to some distribution P̄ . Then we would have for all xn , L̄(xn ) = − log P̄ (xn ). But,
by definition, for all xn ∈ X n , L̄(xn ) ≤ L̂(xn ) = − log P (xn |θ̂(xn )) where θ̂(xn ) ∈
{0.1, . . . , 0.9}. Thus we get for all x^n, − log P̄(x^n) ≤ − log P(x^n | θ̂(x^n)), or P̄(x^n) ≥ P(x^n | θ̂(x^n)), so that, since |L| > 1,
∑_{x^n} P̄(x^n) ≥ ∑_{x^n} P(x^n | θ̂(x^n)) = ∑_{x^n} max_θ P(x^n | θ) > 1, (2.10)
where the last inequality follows because for any two θ1, θ2 with θ1 ≠ θ2, there is at
least one xn with P (xn | θ1 ) > P (xn | θ2 ). (2.10) says that P̄ is not a probability
distribution. It follows that L̄ cannot be a codelength function. The argument can
be extended beyond the Bernoulli model of the example above: as long as |L| > 1,
and all codes in L correspond to a non-defective distribution, (2.10) must still hold,
so that there exists no code L̄ with L̄(xn ) = L̂(xn ) for all xn . The underlying reason
that no such code exists is the fact that probabilities must sum up to something ≤ 1;
or equivalently, that there exists no coding scheme assigning short code words to many
different messages – see Example 2.2.
Since there exists no code which, no matter what xn is, always mimics the best code for
xn , it may make sense to look for the next best thing: does there exist a code which,
for all xn ∈ X n , is ‘nearly’ (in some sense) as good as L̂(xn )? It turns out that in
many cases, the answer is yes: there typically exist codes L̄ such that no matter what
xn arrives, L̄(xn ) is not much larger than L̂(xn ), which may be viewed as the code
that is best ‘with hindsight’ (i.e., after seeing xn ). Intuitively, codes which satisfy this
property are called universal codes - a more precise definition follows below. The first
(but perhaps not foremost) example of a universal code is the two-part code that we
have encountered in Section 2.4.
2.5.1 Two-part Codes as simple Universal Codes

Example 2.10 [A Two-part Code for the Bernoulli Codes] Consider again the set L = {L1, . . . , L9} of Example 2.9. We can design a two-part code L̄2-p as follows: to encode x^n, we first encode the index j of a code Lj ∈ L using a uniform code, taking log 9 bits; we then encode x^n itself using Lj, taking Lj(x^n) bits. For each x^n, we pick the j minimizing the total codelength. Thus, for every possible x^n ∈ X^n, we obtain
L̄2-p(x^n) ≤ min_{L∈L} L(x^n) + log 9 = L̂(x^n) + log 9.
For all L ∈ L, min_{x^n} L(x^n) grows linearly in n: min_{θ,x^n} − log P(x^n | θ) = −n log 0.9 ≈ 0.15n. Unless n is very small, no matter what x^n arises, the extra number of bits we need using L̄2-p compared to L̂(x^n) is negligible.
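As a numeric illustration of Examples 2.9 and 2.10 (again a sketch of our own): the two-part code pays exactly log 9 extra bits, while the ‘ideal’ lengths L̂ do not correspond to any code, since the implied probabilities 2^{−L̂(x^n)} sum to more than one, in accordance with (2.10).

import math
from itertools import product

thetas = [i / 10 for i in range(1, 10)]

def L_theta(xs, theta):
    """Codelength -log2 P(x^n | theta) for a Bernoulli code."""
    n1 = sum(xs)
    return -(n1 * math.log2(theta) + (len(xs) - n1) * math.log2(1 - theta))

def L_hat(xs):
    return min(L_theta(xs, t) for t in thetas)

def L_two_part(xs):
    # uniform code for the index (log2 9 bits) + best code in the set
    return math.log2(len(thetas)) + L_hat(xs)

# (2.10): the 'probabilities' 2^(-L_hat) sum to more than one, so no
# single code achieves L_hat; check for n = 8:
print(sum(2 ** -L_hat(xs) for xs in product((0, 1), repeat=8)))   # > 1
xs = [1, 0, 1, 1, 1, 0, 1, 1]
print(L_two_part(xs) - L_hat(xs))   # exactly log2(9) ~ 3.17 bits overhead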
More generally, let L = {L1 , . . . , LM } where M can be arbitrarily large, and the Lj
can be any codelength functions we like; they do not necessarily represent Bernoulli
distributions any more. By the reasoning of Example 2.10, there exists a (two-part) code such that for all x^n ∈ X^n,
L̄2-p(x^n) ≤ min_{L∈L} L(x^n) + log M. (2.11)
In most applications min_{L∈L} L(x^n) grows linearly in n, and we see from (2.11) that, as soon as n becomes substantially larger than log M, the relative difference in performance between our universal code and L̂(x^n) becomes negligible. In general, we do not always want to use a uniform code for the elements in L; note that any arbitrary code on L will give us an analogue of (2.11), but with a worst-case overhead larger than log M - corresponding to the largest codelength assigned to any of the elements in L.
Example 2.11 [Countably Infinite L] We can also construct a 2-part code for ar-
bitrary countably infinite sets of codes L = {L1 , L2 , . . .}: we first encode some k using
our simple code for the integers (Example 2.4). With this code we need 2 log k + 1 bits
to encode integer k. We then encode xn using the code Lk . L̄2-p is now defined as the
code we get if, for any xn , we encode xn using the Lk minimizing the total two-part
description length 2 log k + 1 + Lk (xn ).
In contrast to the case of finite L, there no longer exists a constant c such that for all n, x^n ∈ X^n, L̄2-p(x^n) ≤ inf_{L∈L} L(x^n) + c. Instead we have the following weaker, but still remarkable property: for all k, all n, all x^n, L̄2-p(x^n) ≤ Lk(x^n) + 2 log k + 1, so that also, for every k,
L̄2-p(x^n) ≤ min_{L∈{L1,...,Lk}} L(x^n) + 2 log k + 1.
For any k, as n grows larger, the code L̄2-p starts to mimic whatever L ∈ {L1 , . . . , Lk }
compresses the data most. However, the larger k, the larger n has to be before this
happens.
2.5.2 From Universal Codes to Universal Models
The reasoning is now as follows: we think that one of the P ∈ M will assign a high
likelihood to the data to be observed. Therefore we would like to design a code that,
for all xn we might observe, performs essentially as well as the code corresponding to
the best-fitting, maximum likelihood (minimum codelength) P ∈ M for xn . Similarly,
we can think of universal codes such as the two-part code in terms of the (possibly defective, see Section 2.2 and Figure 2.1) distributions corresponding to them. Such distributions corresponding to universal codes are called universal models. The use of mapping universal codes back to distributions is illustrated by the Bayesian universal model, which we now introduce.
Universal model: Twice Misleading Terminology The words ‘universal’ and
‘model’ are somewhat of a misnomer: first, these codes/models are only ‘universal’
relative to a restricted ‘universe’ M. Second, the use of the word ‘model’ will be
very confusing to statisticians, who (as we also do in this paper) call a family of
distributions such as M a ’model’. But the phrase originates from information
theory, where a ‘model’ often refers to a single distribution rather than a family.
Thus, a ‘universal model’ is a single distribution, representing a statistical ‘model’
M.
Fix a prior distribution W on the (finite or countable) parameter set Θ, and define the Bayesian mixture
P̄Bayes(x^n) := ∑_{θ∈Θ} P(x^n | θ) W(θ). (2.12)
Then for every θ ∈ Θ,
− log P̄Bayes(x^n) = − log ∑_{θ′∈Θ} P(x^n | θ′) W(θ′) ≤ − log P(x^n | θ) − log W(θ) = − log P(x^n | θ) + c_θ, (2.13)
where the inequality follows because a sum is at least as large as each of its terms, and c_θ = − log W(θ) depends on θ but not on n. Thus, P̄Bayes is a universal model
or equivalently, the code with lengths − log P̄Bayes is a universal code. Note that the
derivation in (2.13) only works if Θ is finite or countable; the case of continuous Θ is
treated in Section 2.6.
Bayes is Better than Two-part The Bayesian model is in a sense superior to the
two-part code. Namely, in the two-part code we first encode an element of M or its
parameter set Θ using some code L0. Such a code must correspond to some ‘prior’ distribution W on M, with L0(θ) = − log W(θ), so that the two-part code gives codelengths
L̄2-p(x^n) = min_{θ∈Θ} [− log P(x^n | θ) − log W(θ)], (2.14)
where W depends on the specific code L0 that was used. Using the Bayes code with prior W, we get as in (2.13),
− log P̄Bayes(x^n) = − log ∑_{θ∈Θ} P(x^n | θ) W(θ) ≤ min_{θ∈Θ} [− log P(x^n | θ) − log W(θ)].
The inequality becomes strict whenever P (xn |θ) > 0 for more than one value of θ.
Comparing to (2.14), we see that in general the Bayesian code is preferable over the
two-part code: for all xn it never assigns codelengths larger than L̄2-p (xn ), and in many
cases it assigns strictly shorter codelengths for some xn . But this raises two important
issues: (1) what exactly do we mean by ‘better’ anyway? (2) can we say that ‘some
prior distributions are better than others’ ? These questions are answered below.
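Before answering them, the comparison itself can be verified numerically; the following sketch (our addition) uses the nine Bernoulli distributions of Example 2.9 with a uniform prior W.

import math

thetas = [i / 10 for i in range(1, 10)]
W = {t: 1 / len(thetas) for t in thetas}      # uniform prior

def p_seq(xs, theta):
    n1 = sum(xs)
    return theta ** n1 * (1 - theta) ** (len(xs) - n1)

def L_bayes(xs):
    # mixture codelength: -log2 of sum_theta P(x^n | theta) W(theta)
    return -math.log2(sum(p_seq(xs, t) * W[t] for t in thetas))

def L_two_part(xs):
    # -log2 W(theta) bits for the parameter, then -log2 P(x^n | theta)
    return min(-math.log2(W[t]) - math.log2(p_seq(xs, t)) for t in thetas)

xs = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(L_bayes(xs), L_two_part(xs))  # the Bayes codelength is never larger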
The regret of P̄ relative to M for x^n is the additional number of bits needed to encode x^n using the code/distribution P̄, as compared to the number of bits that had been needed if we had used the code/distribution in M that was optimal (‘best-fitting’) with hindsight:
R(P̄, x^n) := − log P̄(x^n) − inf_{θ∈Θ} [− log P(x^n | θ)]. (2.15)
For simplicity, from now on we tacitly assume that for all the models M we work with, there is a single θ̂(x^n) maximizing the likelihood for every x^n ∈ X^n. In that case (2.15) simplifies to
R(P̄, x^n) = − log P̄(x^n) + log P(x^n | θ̂(x^n)).
We would like to measure the quality of a universal model P̄ in terms of its regret.
However, P̄ may have small (even < 0) regret for some xn , and very large regret for
other xn . We must somehow find a measure of quality that takes into account all
xn ∈ X n . We take a worst-case approach, and look for universal models P̄ with small
worst-case regret, where the worst-case is over all sequences. Formally, the maximum
or worst-case regret of P̄ relative to M is defined as
42
If we use Rmax as our quality measure, then the ‘optimal’ universal model relative to
M, for given sample size n, is the distribution minimizing
where the minimum is over all defective distributions on X n . The P̄ minimizing (2.16)
corresponds to the code minimizing the additional number of bits compared to code in
M that is best in hindsight in the worst-case over all possible xn . It turns out that we
can solve for P̄ in (2.16). To this end, we first define the complexity of a given model M as
COMPn(M) := log ∑_{x^n∈X^n} P(x^n | θ̂(x^n)). (2.17)
This quantity plays a fundamental role in refined MDL, Section 2.6. To get a first idea
of why COMPn is called model complexity, note that the more sequences xn with
large P (xn | θ̂(xn )), the larger COMPn (M). In other words, the more sequences that
can be fit well by an element of M, the larger M’s complexity.
Proposition 2.14 [Shtarkov 1987] Suppose that COMPn(M) is finite. Then the minimax regret (2.16) is uniquely achieved for the distribution P̄nml given by
P̄nml(x^n) := P(x^n | θ̂(x^n)) / ∑_{y^n∈X^n} P(y^n | θ̂(y^n)). (2.18)
The distribution P̄nml is known as the Shtarkov distribution or the normalized maximum likelihood (NML) distribution.
For every x^n ∈ X^n,
− log P̄nml(x^n) − {− log P(x^n | θ̂(x^n))} = R_max(P̄nml) = COMPn(M), (2.19)
so that P̄nml achieves the same regret, equal to COMPn(M), no matter what x^n actually obtains. Since every distribution P on X^n with P ≠ P̄nml must satisfy P(z^n) < P̄nml(z^n) for at least one z^n ∈ X^n, it follows that the regret of P at z^n, and hence R_max(P), strictly exceeds COMPn(M): P̄nml is the unique minimizer of (2.16).
Whenever X is finite, the sum COMPn (M) is finite so that the NML distribution
is well-defined. If X is countably infinite or continuous-valued, the sum COMPn (M)
may be infinite and then the NML distribution may be undefined. In that case, there
exists no universal model achieving constant regret as in (2.19). If M is parametric,
then P̄nml is typically well-defined as long as we suitably restrict the parameter space.
The parametric case forms the basis of ‘refined MDL’ and will be discussed at length
in the next section.
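For the Bernoulli model, the NML distribution (2.18) and COMPn (2.17) can be computed exactly by grouping sequences according to their number of ones. The sketch below (our addition; base-2 logarithms, relying on Python's convention 0**0 == 1 at the boundary) also exhibits the constant-regret property (2.19).

import math

def comp_bernoulli(n):
    # COMP_n of the Bernoulli model: log2 of the sum over all x^n of
    # P(x^n | theta_hat).  Group sequences by their count of ones n1,
    # since the maximized likelihood depends only on n1.
    total = 0.0
    for n1 in range(n + 1):
        th = n1 / n
        total += math.comb(n, n1) * (th ** n1) * ((1 - th) ** (n - n1))
    return math.log2(total)

def nml_codelength(xs):
    n, n1 = len(xs), sum(xs)
    th = n1 / n
    ml = -math.log2((th ** n1) * ((1 - th) ** (n - n1)))
    return ml + comp_bernoulli(n)   # regret equals COMP_n for every x^n

print(comp_bernoulli(100))          # grows like (1/2) log2 n, cf. (2.21)
print(nml_codelength([0, 1, 1, 0, 1]))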
2.6 Simple Refined MDL and its Four Interpretations

Suppose we want to select between a finite number of models M(1), M(2), . . . for data D = (x1, . . . , xn). Refined MDL tells us to pick the model maximizing the normalized maximum likelihood P̄nml(D | M(j)), or, by (2.18), equivalently, minimizing
− log P̄nml(D | M(j)) = − log P(D | θ̂^(j)(D)) + COMPn(M(j)). (2.20)
From a coding-theoretic point of view, we associate with each M(j) a code with lengths − log P̄nml(· | M(j)), and we pick the model minimizing the codelength of the data. The
codelength − log P̄nml (D | M(j) ) has been called the stochastic complexity of the data
D relative to model M(j) [Rissanen 1987], whereas COMPn (M(j) ) is called the para-
metric complexity or model cost of M(j) (in this survey we simply call it ‘complexity’).
We have already indicated in the previous section that COMPn (M(j) ) measures some-
thing like the ‘complexity’ of model M(j) . On the other hand, − log P (D | θ̂ (j)(D)) is
minus the maximized log-likelihood of the data, so it measures something like (minus)
fit or error – in the linear regression case, it can be directly related to the mean squared
error, Section 2.8. Thus, (2.20) embodies a trade-off between lack of fit (measured by
minus log-likelihood) and complexity (measured by COMPn(M(j))). The confidence in the decision is given by the codelength difference − log P̄nml(D | M(j)) − [− log P̄nml(D | M(j′))] between the selected model M(j) and its closest competitor M(j′).
In general, − log P̄nml (D | M) can only be evaluated numerically – the only exception
this author is aware of is when M is the Gaussian family, Example 2.20. In many cases
even numerical evaluation is computationally problematic. But the re-interpretations
of P̄nml we provide below also indicate that in many cases, − log P̄ (D | M) is relatively
easy to approximate.
Example 2.15 [Refined MDL and GLRT] Generalized likelihood ratio testing
[Casella and Berger 1990] tells us to pick the M(j) maximizing log P (D | θ̂ (j)(D)) +
c where c is determined by the desired type-I and type-II errors. In practice one
often applies a naive variation17 , simply picking the model M(j) maximizing log P (D |
θ̂ (j)(D)). This amounts to ignoring the complexity terms COMPn (M(j) ) in (2.20):
MDL tries to avoid overfitting by picking the model maximizing the normalized rather
than the ordinary likelihood. The more distributions in M that fit the data well, the
larger the normalization term.
The hope is that the normalization term COMPn (M(j) ) strikes the right balance
between complexity and fit. Whether it really does depends on whether COMPn is
a ‘good’ measure of complexity. In the remainder of this section we shall argue that
it is, by giving four different interpretations of COMPn and of the resulting trade-off
(2.20):
1. Compression interpretation.
2. Counting interpretation.
3. Bayesian interpretation.
4. Prequential (predictive) interpretation.
2.6.1 Compression Interpretation
Rissanen’s original goal was to select the model that detects the most regularity in
the data; he identified this with the ‘model that allows for the most compression of
data xn ’. To make this precise, a code is associated with each model. The NML code
with lengths − log P̄nml (· | M(j) ) seems to be a very reasonable choice for such a code
because of the following two properties:
1. The better the best-fitting distribution in M(j) fits the data, the shorter the
codelength − log P̄nml (D | M(j) ).
2. The more sequences that can be fit well by some distribution in M(j) (that is, the larger COMPn(M(j))), the longer the codelength − log P̄nml(D | M(j)).

2.6.2 Counting Interpretation

The parametric complexity COMPn(M) can be interpreted as a measure of how many distributions in M are distinguishable from one another at sample size n. A direct calculation for finite models will show that COMPn measures something like this. Consider a finite model M with
parameter set Θ = {θ1 , . . . , θM }. Note that
∑_{x^n∈X^n} P(x^n | θ̂(x^n)) = ∑_{j=1..M} ∑_{x^n: θ̂(x^n)=θj} P(x^n | θj) = ∑_{j=1..M} ( 1 − ∑_{x^n: θ̂(x^n)≠θj} P(x^n | θj) ) = M − ∑_j P(θ̂(x^n) ≠ θj | θj).
We may think of P(θ̂(x^n) ≠ θj | θj) as the probability, according to θj, that the data look as if they come from some θ ≠ θj. Thus, it is the probability that θj is mistaken for another distribution in Θ. Therefore, for finite M, Rissanen's model complexity is the logarithm of the number of distributions minus the summed probability that some θj is ‘mistaken’ for some θ ≠ θj. Now suppose M is i.i.d. By the law of large numbers [Feller 1968], we immediately see that the ‘sum of mistake probabilities’ ∑_j P(θ̂(x^n) ≠ θj | θj) tends to 0 as n grows. It follows that for large n, the model
complexity converges to log M . For large n, the distributions in M are ‘perfectly dis-
tinguishable’ (the probability that a sample coming from one is more representative
of another is negligible), and then the parametric complexity COMPn (M) of M is
simply the log of the number of distributions in M.
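The identity underlying this counting argument is easy to verify by brute-force enumeration for a small finite model; in the sketch below (our addition) M contains two Bernoulli distributions, and the mistake probabilities visibly vanish as n grows, pushing COMPn(M) towards log M.

import math
from itertools import product

thetas = [0.2, 0.8]                 # a finite model with M = 2 distributions

def p_seq(xs, t):
    n1 = sum(xs)
    return t ** n1 * (1 - t) ** (len(xs) - n1)

def ml_index(xs):
    return max(range(len(thetas)), key=lambda j: p_seq(xs, thetas[j]))

for n in (1, 5, 12):
    seqs = list(product((0, 1), repeat=n))
    lhs = sum(p_seq(xs, thetas[ml_index(xs)]) for xs in seqs)
    mistakes = sum(p_seq(xs, thetas[j])
                   for j in range(len(thetas))
                   for xs in seqs if ml_index(xs) != j)
    # identity: sum of maximized likelihoods = M - summed mistake probs
    print(n, lhs, len(thetas) - mistakes, math.log2(lhs))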
Example 2.16 [NML vs. Two-part Codes] Incidentally, this shows that for finite
i.i.d. M, the two-part code with uniform prior W on M is asymptotically minimax
optimal: for all n, the regret of the 2-part code is log M (Equation 2.11), whereas we
just showed that for n → ∞, R_max(P̄nml) = COMPn(M) → log M. However, for small n,
some distributions in M may be mistaken for one another; the number of distinguishable
distributions in M is then smaller than the actual number of distributions, and this is
reflected in COMPn (M) being (sometimes much) smaller than log M .
For the more interesting case of parametric models, containing infinitely many distribu-
tions, Balasubramanian [1997, 2004] has a somewhat different counting interpretation
of COMPn (M) as a ratio between two volumes. Rissanen and Tabus [2004] give a
more direct counting interpretation of COMPn (M). These extensions are both based
on the asymptotic expansion of P̄nml , which we now discuss.
Under regularity conditions on M, its parameterization, and the data, we have
COMPn(M) = (k/2) log(n/2π) + log ∫_{θ∈Θ} √|I(θ)| dθ + o(1). (2.21)
Here k is the number of parameters (degrees of freedom) in model M, n is the sample
size, and o(1) → 0 as n → ∞. |I(θ)| is the determinant of the k × k Fisher information
matrix18 I evaluated at θ. In case M is an i.i.d. model, I is given by
Iij(θ∗) := E_{θ∗}[ −(∂²/∂θi ∂θj) log P(X | θ) |_{θ=θ∗} ];
for more general models, it is given by
Iij(θ∗) := lim_{n→∞} (1/n) E_{θ∗}[ −(∂²/∂θi ∂θj) log P(X^n | θ) |_{θ=θ∗} ].
(2.21) only holds if the model M, its parameterization Θ and the sequence x1 , x2 , . . .
all satisfy certain conditions. Specifically, we require:
1. COMPn(M) < ∞ and ∫ √|I(θ)| dθ < ∞;
2. θ̂(x^n) does not come arbitrarily close to the boundary of Θ: for some ε > 0, for all large n, θ̂(x^n) remains farther than ε from the boundary of Θ.
More general conditions are given by Takeuchi and Barron [1997, 1998, 2000]. Essen-
tially, if M behaves ‘asymptotically’ like an exponential family, then (2.21) still holds.
For example, (2.21) holds for the Markov models and for AR and ARMA processes.
Example 2.17 [Complexity of the Bernoulli Model] The Bernoulli model B(0) can be parameterized in a 1-1 way by the unit interval (Example 2.7). Thus, k = 1. An easy calculation shows that the Fisher information is given by θ(1 − θ). Plugging this into (2.21) and calculating ∫ √(θ(1 − θ)) dθ gives
COMPn(B(0)) = (1/2) log n + (1/2) log(π/2) − 3 + o(1) = (1/2) log n − 2.674251935 + o(1).
Computing the integral of the Fisher determinant is not easy in general. Hanson and Fu [2004]
compute it for several practically relevant models.
Whereas for finite M, COMPn(M) remains finite, for parametric models it generally grows logarithmically in n. Since typically − log P(x^n | θ̂(x^n)) grows linearly in n, it is still the case that for fixed dimensionality k (i.e. for a fixed M that is k-dimensional) and large n, the part of the codelength − log P̄nml(x^n | M) due to the complexity of M is very small compared to the part needed to encode data x^n with θ̂(x^n). The term ∫_Θ √|I(θ)| dθ may be interpreted as the contribution of the functional
form of M to the model complexity [Balasubramanian 2004]. It does not grow with
n so that, when selecting between two models, it becomes irrelevant and can be ig-
nored for very large n. But for small n, it can be important, as can be seen from
Example 1.4, Fechner’s and Stevens’ model. Both models have two parameters, yet the ∫_Θ √|I(θ)| dθ term is much larger for Fechner’s than for Stevens’ model. In the
experiments of Myung, Balasubramanian, and Pitt [2000], the parameter set was re-
stricted to 0 < a < ∞, 0 < b < 3 for Stevens’ model and 0 < a < ∞, 0 < b < ∞ for
Fechner’s model. The variance of the error Z was set to 1 in both models. With these values, the difference in ∫_Θ √|I(θ)| dθ is 3.804, which is non-negligible for small sam-
ples. Thus, Stevens’ model contains more distinguishable distributions than Fechner’s,
and is better able to capture random noise in the data – as Townsend [1975] already
speculated almost 30 years ago. Experiments suggest that for regression models such as Stevens’ and Fechner’s, as well as for Markov models and general exponential families, the approximation (2.21) is reasonably accurate already for small samples. But this is certainly not true for general models.
Two-part codes and COMPn (M) We now have a clear guiding principle (mini-
max regret) which we can use to construct ‘optimal’ two-part codes that achieve the
minimax regret among all two-part codes. How do such optimal two-part codes compare
to the NML codelength? Let M be a k-dimensional model. By slightly adjusting the
arguments of [Barron and Cover 1991, Appendix], one can show that, under regularity
conditions, the minimax optimal two-part code P̄2-p achieves regret
− log P̄2-p(x^n | M) + log P(x^n | θ̂(x^n)) = (k/2) log(n/2π) + log ∫_{θ∈Θ} √|I(θ)| dθ + f(k) + o(1),
where f(k) is a constant that depends on k but not on n.
2.6.3 Bayesian Interpretation
The Bayes factor method is very closely related to the refined MDL approach. As-
suming uniform priors on models M(1) and M(2) , it tells us to select the model with
largest marginal likelihood P̄Bayes (xn | M(j) ), where P̄Bayes is as in (2.12), with the sum
replaced by an integral, and w(j) is the density of the prior distribution on M(j) :
P̄Bayes(x^n | M(j)) = ∫ P(x^n | θ) w^(j)(θ) dθ. (2.22)
M is Exponential Family Let now P̄Bayes = P̄Bayes (· | M) for some fixed model M.
Under regularity conditions on M, we can perform a Laplace approximation of the in-
tegral in (2.12). For the special case that M is an exponential family, we obtain the fol-
lowing expression for the regret [Jeffreys 1961; Schwarz 1978; Kass and Raftery 1995;
Balasubramanian 1997]:
− log P̄Bayes(x^n) − [− log P(x^n | θ̂(x^n))] = (k/2) log(n/2π) − log w(θ̂) + log √|I(θ̂)| + o(1). (2.23)
Let us compare this with (2.21). Under the regularity conditions needed for (2.21), the
quantity on the right of (2.23) is within O(1) of COMPn (M). Thus, the code length
achieved with P̄Bayes is within a constant of the minimax optimal − log P̄nml (xn ). Since
− log P (xn | θ̂(xn )) increases linearly in n, this means that if we compare two models
M(1) and M(2) , then for large enough n, Bayes and refined MDL select the same
model. If we equip the Bayesian universal model with a special prior known as the
Jeffreys-Bernardo prior [Jeffreys 1946; Bernardo and Smith 1994],
wJeffreys(θ) = √|I(θ)| / ∫_{θ∈Θ} √|I(θ)| dθ, (2.24)
then Bayes and refined NML become even more closely related: plugging (2.24) into
(2.23), we find that the right-hand side of (2.23) now simply coincides with (2.21). A
concrete example of Jeffreys’ prior is given in Example 2.19. Jeffreys introduced his
prior as a ‘least informative prior’, to be used when no useful prior knowledge about
the parameters is available [Jeffreys 1946]. As one may expect from such a prior, it
is invariant under continuous 1-to-1 reparameterizations of the parameter space. The
present analysis shows that, when M is an exponential family, then it also leads to
asymptotically minimax codelength regret: for large n, refined NML model selection
becomes indistinguishable from Bayes factor model selection with Jeffreys’ prior.
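For the Bernoulli model, Jeffreys’ prior (2.24) works out to the Beta(1/2, 1/2) density, a standard fact that we use here without derivation; the mixture codelength then has a closed form in terms of Beta functions. The sketch below (our addition) computes the resulting regret, which is indeed nearly constant over the data, as the minimax analysis predicts.

import math

def log2_beta(a, b):
    return (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)) / math.log(2)

def L_jeffreys(n1, n0):
    # -log2 of the Jeffreys (Beta(1/2,1/2)) mixture probability of a
    # sequence with n1 ones and n0 zeros
    return -(log2_beta(n1 + 0.5, n0 + 0.5) - log2_beta(0.5, 0.5))

n = 100
for n1 in (10, 50, 90):
    th = n1 / n
    ml = -(n1 * math.log2(th) + (n - n1) * math.log2(1 - th))
    print(n1, L_jeffreys(n1, n - n1) - ml)   # regret: nearly constant in n1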
If M is not an exponential family, the Laplace approximation yields an analogue of (2.23) in which I(θ̂) is replaced by Î(x^n). Here Î(x^n) is the so-called observed information, sometimes also called observed Fisher information; see [Kass and Voss 1997] for a definition. If M is an exponential family, then the observed Fisher information at x^n coincides with the Fisher information at θ̂(x^n), leading to (2.23). If M is not exponential, then if data are distributed according to one of the distributions in M, the observed Fisher information still converges with probability 1 to the expected Fisher information. If M is neither exponential, nor are the data actually generated by a distribution in M, then there may be O(1)-discrepancies between − log P̄nml and − log P̄Bayes even for large n.
2.6.4 Prequential Interpretation

By the definition of conditional probability, every distribution P on X^n can be decomposed into a product of sequential predictions,
P(x^n) = ∏_{i=1}^n P(xi | x1, . . . , x_{i−1}), (2.26)
so that also
− log P(x^n) = ∑_{i=1}^n − log P(xi | x^{i−1}). (2.27)
Let us abbreviate P (Xi = · | X i−1 = xi−1 ) to P (Xi | xi−1 ). P (Xi | xi−1 ) (capital
Xi ) is the distribution (not a single number) of Xi given xi−1 ; P (xi | xi−1 ) (lower case
xi ) is the probability (a single number) of actual outcome xi given xi−1 . We can think
of − log P (xi | xi−1 ) as the loss incurred when predicting Xi based on the conditional
distribution P (Xi | xi−1 ), and the actual outcome turned out to be xi . Here ‘loss’ is
measured using the so-called logarithmic score, also known simply as ‘log loss’. Note
that the more likely x is judged to be, the smaller the loss incurred when x actu-
ally obtains. The log loss has a natural interpretation in terms of sequential gambling
[Cover and Thomas 1991], but its main interpretation is still in terms of coding: by
(2.27), the codelength needed to encode xn based on distribution P is just the accu-
mulated log loss incurred when P is used to sequentially predict the i-th outcome based
on the previous i − 1 outcomes.
(2.26) gives a fundamental re-interpretation of probability distributions as predic-
tion strategies, mapping each individual sequence of past observations x1 , . . . , xi−1 to a
probabilistic prediction of the next outcome P (Xi | xi−1 ). Conversely, (2.26) also shows
that every probabilistic prediction strategy for sequential prediction of n outcomes may
be thought of as a probability distribution on X n : a strategy is identified with a function
mapping all potential initial segments xi−1 to the prediction that is made for the next
outcome Xi , after having seen xi−1 . Thus, it is a function S : ∪0≤i<n X i → PX , where
PX is the set of distributions on X . We can now define, for each i < n, all xi ∈ X i ,
P (Xi | xi−1 ):=S(xi−1 ). We can turn these partial distributions into a distribution on
X n by sequentially plugging them into (2.26).
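This correspondence is directly executable. In the sketch below (our addition), strategy is a hypothetical interface: a function mapping the past outcomes to the predicted probability that the next binary outcome is 1; accumulating its log loss as in (2.27) yields the codelength of the implied distribution on X^n.

import math

def codelength(strategy, xs):
    # accumulated log loss (2.27); strategy(past) returns the predicted
    # probability that the next binary outcome is 1
    total = 0.0
    for i, x in enumerate(xs):
        p1 = strategy(xs[:i])
        total += -math.log2(p1 if x == 1 else 1 - p1)
    return total

# a fixed-theta strategy recovers -log2 P(x^n | theta), per (2.26):
print(codelength(lambda past: 0.8, [1, 1, 0, 1]))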
Log Loss for Universal Models Let M be some parametric model and let P̄
be some universal model/code relative to M. What do the individual predictions
P̄ (Xi | xi−1 ) look like? Readers familiar with Bayesian statistics will realize that
for i.i.d. models, the Bayesian predictive distribution P̄Bayes (Xi | xi−1 ) converges to
the ML distribution P (· | θ̂(xi−1 )); Example 2.19 provides a concrete case. It seems
reasonable to assume that something similar holds not just for P̄Bayes but for universal
models in general. This in turn suggests that we may approximate the conditional
distributions P̄ (Xi | xi−1 ) of any ‘good’ universal model by the maximum likelihood
predictions P (· | θ̂(xi−1 )). Indeed, we can recursively define the ‘maximum likelihood
plug-in’ distribution P̄plug-in by setting, for i = 1 to n,
P̄plug-in(Xi = · | x^{i−1}) := P(Xi = · | θ̂(x^{i−1})). (2.28)
Then
− log P̄plug-in(x^n) := ∑_{i=1}^n − log P(xi | θ̂(x^{i−1})). (2.29)
For many parametric models (and suitable, possibly slightly modified, estimators; see below), the resulting P̄plug-in is itself a universal model, achieving
− log P̄plug-in(x^n) = − log P(x^n | θ̂(x^n)) + (k/2) log n + o(log n). (2.30)
In general, there is no need to use the ML estimator θ̂(xi−1 ) in the definition (2.28).
Instead, we may try some other estimator which asymptotically converges to the ML
estimator – it turns out that some estimators considerably outperform the ML estima-
tor in the sense that (2.29) becomes a much better approximation of − log P̄nml , see
Example 2.19. Irrespective of whether we use the ML estimator or something else,
we call model selection based on (2.29) the prequential form of MDL in honor of A.P.
Dawid’s ‘prequential analysis’, Section 2.9. It is also known as ‘predictive MDL’. The
validity of (2.30) was discovered independently by Rissanen [1984] and Dawid [1984].
The prequential view gives us a fourth interpretation of refined MDL model selec-
tion: given models M(1) and M(2) , MDL tells us to pick the model that minimizes the
accumulated prediction error resulting from sequentially predicting future outcomes
given all the past outcomes.
Example 2.18 [GLRT and Prequential Model Selection] How does this differ
from the naive version of the generalized likelihood ratio test (GLRT) that we intro-
duced in Example 2.15? In GLRT, we associate with each model the log-likelihood
(minus log loss) that can be obtained by the ML estimator. This is the predictor
within the model that minimizes log loss with hindsight, after having seen the data.
In contrast, prequential model selection associates with each model the log-likelihood
(minus log loss) that can be obtained by using a sequence of ML estimators θ̂(xi−1 )
to predict data xi. Crucially, the data on which the ML estimators are evaluated have not
been used in constructing the ML estimators themselves. This makes the prediction
scheme ‘honest’ (different data are used for training and testing) and explains why it
automatically protects us against overfitting.
Example 2.19 [Laplace and Jeffreys] Consider the prequential distribution for the
Bernoulli model, Example 2.7, defined as in (2.28). We show that if we take θ̂ in (2.28)
equal to the ML estimator n[1] /n, then the resulting P̄plug-in is not a universal model;
but a slight modification of the ML estimator makes P̄plug-in a very good universal
model. Suppose that n ≥ 3 and (x1 , x2 , x3 ) = (0, 0, 1) – a not-so-unlikely initial segment
according to most θ. Then P̄plug-in (X3 = 1 | x1 , x2 ) = P (X = 1 | θ̂(x1 , x2 )) = 0, so that
by (2.29),
− log P̄plug-in (xn ) ≥ − log P̄plug-in (x3 | x1 , x2 ) = ∞,
whence P̄plug-in is not universal. Now let us consider the modified ML estimator
θ̂λ(x^n) := (n[1] + λ)/(n + 2λ). (2.31)
If we take λ = 0, we get the ordinary ML estimator. If we take λ = 1, then an exercise
involving beta-integrals shows that, for all i, xi , P (Xi | θ̂1 (xi−1 )) = P̄Bayes (Xi | xi−1 ),
where P̄Bayes is defined relative to the uniform prior w(θ) ≡ 1. Thus θ̂1 (xi−1 ) corre-
sponds to the Bayesian predictive distribution for the uniform prior. This prediction
rule was advocated by the great probabilist P.S. de Laplace, co-originator of Bayesian
statistics. It may be interpreted as ML estimation based on an extended sample, con-
taining some ‘virtual’ data: an extra 0 and an extra 1.
Even better, a similar calculation shows that if we take λ = 1/2, the resulting plug-in prediction P(Xi | θ̂_{1/2}(x^{i−1})) is equal to P̄Bayes(Xi | x^{i−1}) defined relative to Jeffreys’ prior. Asymptotically,
P̄Bayes with Jeffreys’ prior achieves the same codelengths as P̄nml (Section 2.6.3). It
follows that P̄plug-in with the slightly modified ML estimator is asymptotically indistin-
guishable from the optimal universal model P̄nml !
For more general models M, such simple modifications of the ML estimator usually
do not correspond to a Bayesian predictive distribution; for example, if M is not convex
(closed under taking mixtures) then a point estimator (an element of M) typically does
not correspond to the Bayesian predictive distribution (a mixture of elements of M).
Nevertheless, modifying the ML estimator by adding some virtual data y1 , . . . , ym and
replacing P (Xi | θ̂(xi−1 )) by P (Xi | θ̂(xi−1 , y m )) in the definition (2.28) may still lead
to good universal models. This is of great practical importance, since, using (2.29),
− log P̄plug-in (xn ) is often much easier to compute than − log P̄Bayes (xn ).
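A minimal sketch (our addition) of the prequential plug-in codelength (2.29) for the Bernoulli model with the smoothed estimator (2.31): λ = 1/2 gives the Jeffreys predictor and λ = 1 Laplace’s rule of succession, while λ = 0, the plain ML plug-in, fails on the data of Example 2.19 (it divides by zero on the first outcome, and assigns probability 0, hence infinite codelength, to a 1 following an initial run of 0s).

import math

def plug_in_codelength(xs, lam):
    # prequential plug-in codelength (2.29) for the Bernoulli model,
    # with the smoothed estimator (2.31): virtual counts lam on each side
    total, n1 = 0.0, 0
    for i, x in enumerate(xs):
        p1 = (n1 + lam) / (i + 2 * lam)
        total += -math.log2(p1 if x == 1 else 1 - p1)
        n1 += x
    return total

xs = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
print(plug_in_codelength(xs, 0.5))   # lam = 1/2: the Jeffreys predictor
print(plug_in_codelength(xs, 1.0))   # lam = 1: Laplace's rule of succession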
Summary We introduced the refined MDL Principle for model selection in a re-
stricted setting. Refined MDL amounts to selecting the model under which the data
achieve the smallest stochastic complexity, which is the codelength according to the
minimax optimal universal model. We gave an asymptotic expansion of stochastic and
parametric complexity, and interpreted these concepts in four different ways.
2.7 General Refined MDL: Gluing it All Together

2.7.1 Model Selection with Infinitely Many Models

Refined MDL model selection extends to a countably infinite list of models M(1), M(2), . . .: we pick the model M(k) minimizing
L(k) − log P̄nml(x^n | M(k)), (2.32)
where L is the codelength function of some code for encoding model indices k. We would
typically choose the standard prior for the integers, L(k) = 2 log k +1, Example 2.4. By
using (2.32) we avoid the overfitting problem mentioned above: if M(1) = {P1 }, M(2) =
{P2 }, . . . where P1 , P2 , . . . is a list of all the rational-parameter Markov chains, (2.32)
would reduce to two-part code MDL (Section 2.4) which is asymptotically consistent.
On the other hand, if M(k) represents the set of k-th order Markov chains, the term L(k)
is typically negligible compared to COMPn (M(k) ), the complexity term associated
with M(k) that is hidden in − log P̄nml (M(k) ): thus, the complexity of M(k) comes
from the fact that for large k, M(k) contains many distinguishable distributions; not
from the much smaller term L(k) ≈ 2 log k.
To make our previous approach for a finite set of models compatible with (2.32),
we can reinterpret it as follows: we assign uniform codelengths (a uniform prior) to the
M(1) , . . . , M(M ) under consideration, so that for k = 1, . . . , M , L(k) = log M . We then
pick the model minimizing (2.32). Since L(k) is constant over k, it plays no role in the
minimization and can be dropped from the equation, so that our procedure reduces to
our original refined MDL model selection method. We shall henceforth assume that we
always encode the model index, either implicitly (if the number of models is finite) or
explicitly. The general principle behind this is explained in Section 2.7.3.
2.7.2 The Infinity Problem

Example 2.20 [The Gaussian Location Family] Let M be the family of normal densities with fixed variance σ² and varying mean µ,
P(x | µ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},
extended to sequences x1, . . . , xn by taking product densities. As is well-known [Casella and Berger 1990], the ML estimator µ̂(x^n) is equal to the sample mean: µ̂(x^n) = n^{−1} ∑_{i=1}^n xi. An easy calculation shows that
COMPn(M) = log ∫ P(x^n | µ̂(x^n)) dx^n = ∞,
where we abbreviated dx1 . . . dxn to dxn . Therefore, we cannot use basic MDL model
selection. It also turns out that I(µ) = σ^{−2}, so that
∫_Θ √|I(θ)| dθ = ∫_{µ∈R} √|I(µ)| dµ = ∞.
Thus, the Bayesian universal model approach with Jeffreys’ prior cannot be applied
either. Does this mean that our MDL model selection and complexity definitions break
down even in such a simple case? Luckily, it turns out that they can be repaired, as
we now show. Barron, Rissanen, and Yu [1998] and Foster and Stine [2001] show that,
for all intervals [a, b],
b−a √
Z
P (xn | µ̂(xn ))dxn = √ · n. (2.33)
xn :µ̂(xn )∈[a,b] 2πσ
Suppose for the moment that it is known that µ̂ lies in some set [−K, K] for some fixed K. Let MK be the set of conditional distributions thus obtained: MK = {P′(· | µ) | µ ∈ R}, where P′(x^n | µ) is the density of x^n according to the normal distribution with mean µ, conditioned on |n^{−1} ∑ xi| ≤ K. By (2.33), the ‘conditional’ minimax regret distribution P̄nml(· | MK) is well-defined for all K > 0. That is, for all x^n with |µ̂(x^n)| ≤ K,
P̄nml(x^n | MK) = P′(x^n | µ̂(x^n)) / ∫_{x^n: |µ̂(x^n)|≤K} P′(x^n | µ̂(x^n)) dx^n.
This suggests redefining the complexity of the full model M so that its regret depends
on the area in which µ̂ falls. The most straightforward way of achieving this is to define
a meta-universal model for M, combining the NML with a two-part code: we encode
data by first encoding some value for K. We then encode the actual data xn using the
code P̄nml(· | MK). The resulting code P̄meta is a universal code for M with lengths
− log P̄meta(x^n | M) = min_K [ L(K) − log P̄nml(x^n | MK) ]. (2.34)
The idea is now to base MDL model selection on P̄meta (·|M) as in (2.34) rather than
on the (undefined) P̄nml (·|M). To make this work, we need to choose L in a clever
manner. A good choice is to encode K ′ = log K as an integer, using the standard code
for the integers. To see why, note that for x^n with |µ̂(x^n)| ≤ K, the regret of P̄meta is now at most
COMPn(MK) + 2 log log K + 1, (2.35)
where K is the smallest encoded bound exceeding |µ̂(x^n)|.
If we had known a good bound K on |µ̂| a priori, we could have used the NML model
P̄nml (· | MK ). With ‘maximal’ a priori knowledge, we would have used the model
P̄nml (· | M|µ̂| ), leading to regret COMPn (M|µ̂| ). The regret achieved by P̄meta is
almost as good as this ‘smallest possible regret-with-hindsight’ COMPn (M|µ̂| ): the
difference is much smaller than, in fact logarithmic in, COMPn (M|µ̂| ) itself, no matter
what xn we observe. This is the underlying reason why we choose to encode K with
log-precision: the basic idea in refined MDL was to minimize worst-case regret, or
additional code-length compared to the code that achieves the minimal code-length
with hindsight. Here, we use this basic idea on a meta-level: we design a code such
that the additional regret is minimized, compared to the code that achieves the minimal
regret with hindsight.
This meta-two-part coding idea was introduced by Rissanen [1996]. It can be extended
to a wide range of models with COMPn (M) = ∞; for example, if the Xi represent
outcomes of a Poisson or geometric distribution, one can encode a bound on µ just like
in Example 2.20. If M is the full Gaussian model with both µ and σ 2 allowed to vary,
one has to encode a bound on µ̂ and a bound on σ̂ 2 . Essentially the same holds for
linear regression problems, Section 2.8.
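The following sketch (our addition) illustrates the meta-two-part computation for the Gaussian location family of Example 2.20, using (2.33) to evaluate the conditional complexity. The specific coding details (rounding K up to a power of two and encoding K′ = log2 K with the standard integer code) are one reasonable choice among several, not a prescription from the text.

import math

def cond_comp(n, K, sigma=1.0):
    # COMP_n of the location family restricted to |mean| <= K, via (2.33)
    return math.log2(2 * K * math.sqrt(n) / (math.sqrt(2 * math.pi) * sigma))

def meta_regret(n, mu_hat, sigma=1.0):
    # encode K' = log2 K as a positive integer (2 log K' + 1 bits) and
    # round K up to a power of two; then use the conditional NML code
    Kp = max(1, math.ceil(math.log2(max(abs(mu_hat), 1.0))))
    return (2 * math.log2(Kp) + 1) + cond_comp(n, 2.0 ** Kp, sigma)

n, mu_hat = 1000, 37.3
print(meta_regret(n, mu_hat))       # regret actually achieved
print(cond_comp(n, abs(mu_hat)))    # hindsight complexity COMP_n(M_|mu_hat|)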
Model selection between a finite set of models now proceeds by selecting the model
maximizing the re-normalized likelihood (2.36).
Region Indifference All the approaches considered thus far slightly prefer some
regions of the parameter space over others. In spite of its elegance, even the Rissanen
renormalization is slightly ‘arbitrary’ in this way: had we chosen the origin of the
real line differently, the same sequence xn would have achieved a different codelength
− log P̄rnml (xn | M). In recent work, Liang and Barron [2004a, 2004b] consider a
novel and quite different approach for dealing with infinite COMPn (M) that partially
addresses this problem. They make use of the fact that, while Jeffreys’ prior is improper (∫ √|I(θ)| dθ is infinite), using Bayes’ rule we can still compute Jeffreys’ posterior based
on the first few observations, and this posterior turns out to be a proper probability
measure after all. Liang and Barron use universal models of a somewhat different type
than P̄nml , so it remains to be investigated whether their approach can be adapted to
the form of MDL discussed here.
Example 2.21 [MDL and Local Maxima in the Likelihood] In practice we often
work with models for which the ML estimator cannot be calculated efficiently; or at
least, no algorithm for efficient calculation of the ML estimator is known. Examples are
finite and Gaussian mixtures and Hidden Markov models. In such cases one typically
resorts to methods such as EM or gradient descent, which find a local maximum of
the likelihood surface (function) P (xn | θ), leading to a local maximum likelihood
estimator (LML) θ̇(xn ). Suppose we need to select between a finite number of such
models.
GENERAL ‘REFINED’ MDL PRINCIPLE for Model Selection
Suppose we plan to select between models M(1) , M(2) , . . . for data D =
(x1 , . . . , xn ). MDL tells us to design a universal code P̄ for X n , in which the
index k of M(k) is encoded explicitly. The resulting code has two parts, the two
sub-codes being defined such that
1. All models M(k) are treated on the same footing, as far as possible: we assign
a uniform prior to these models, or, if that is not possible, a prior ‘close to’
uniform.
2. All distributions within each M(k) are treated on the same footing, as far
as possible: we use the minimax regret universal model P̄nml (xn | M(k) ). If
this model is undefined or too hard to compute, we instead use a different
universal model that achieves regret ‘close to’ the minimax regret for each
submodel of M(k) in the sense of (2.35).
We may be tempted to pick the model M maximizing the normalized likelihood
P̄nml (xn | M). However, if we then plan to use the local estimator θ̇(xn ) for predicting
future data, this is not the right thing to do. To see this, note that, if suboptimal
estimators θ̇ are to be used, the ability of model M to fit arbitrary data patterns may
be severely diminished! Rather than using P̄nml , we should redefine it to take into
account the fact that θ̇ is not the global ML estimator:
P̄′nml(x^n) := P(x^n | θ̇(x^n)) / ∑_{y^n∈X^n} P(y^n | θ̇(y^n)),
whose normalizing term log ∑_{y^n∈X^n} P(y^n | θ̇(y^n)) must, for every estimator θ̇ different from θ̂, be strictly smaller than COMPn(M).
Summary We have shown how to extend refined MDL beyond the restricted settings
of Section 2.6. This uncovered the general principle behind refined MDL for model
selection, given in Figure 2.4. General as it may be, it only applies to model selection
– in the next section we briefly discuss extensions to other applications.
2.8 Beyond Parametric Model Selection

Refined MDL also extends beyond a finite-dimensional parametric setting, to cases where a target distribution P∗ can only be arbitrarily well approximated by members of M(n), M(n+1), . . ., in the sense that lim_{n→∞} inf_{P∈M(n)} D(P∗‖P) = 0 [Barron, Rissanen, and Yu 1998]. Here D is the Kullback-Leibler divergence [Cover and Thomas 1991] between P∗ and P.
When the Codelength for xn Can Be Ignored
If all models under consideration represent conditional densities or probability mass
functions P (Y | X), then the codelength for X1 , . . . , Xn can be ignored in model
and parameter selection. Examples are applications of MDL in classification and
regression.
With this choice, the log-likelihood becomes a linear function of the squared error:
− log P(y^n | x^n, σ, h) = (1/(2σ²)) ∑_{i=1}^n (yi − h(xi))² + (n/2) log 2πσ². (2.40)
Let us now assume that H = ∪k≥1 H(k) where for each k, H(k) is a set of functions
h : X → Y. For example, H(k) may be the set of k-th degree polynomials.
With each model H(k) we can associate a set of densities (2.39), one for each (h, σ 2 )
with h ∈ H(k) and σ 2 ∈ R+ . Let M(k) be the resulting set of conditional distributions.
Each P(· | h, σ²) ∈ M(k) is identified by the parameter vector (α0, . . . , αk, σ²) such that h(x) := ∑_{j=0}^k αj x^j. By Section 2.7.1, MDL tells us to select the model minimizing
L(k) − log P̄(y^n | M(k), x^n) + L′(x^n), (2.41)
where L′ represents some code for X n . Since this codelength does not involve k, it
can be dropped from the minimization; see Figure 2.5. We will not go into the precise
definition of P̄ (y n | M(k) , xn ). Ideally, it should be an NML distribution, but just as in
Example 2.20, this NML distribution is not well-defined. We can get reasonable alter-
native universal models after all using any of the methods described in Section 2.7.2;
see [Barron, Rissanen, and Yu 1998] and [Rissanen 2000] for details.
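One concrete ‘reasonable alternative’ in the spirit of Section 2.7.2 is a prequential plug-in code for the polynomial models. The sketch below (our addition) assumes a known noise level σ = 1 and predicts with the trivial function h = 0 during the start-up phase; each degree is scored by its accumulated Gaussian log loss, i.e. the per-outcome version of (2.40).

import math
import numpy as np

def prequential_score(x, y, degree, sigma=1.0):
    # plug-in prequential codelength: predict y_i by least squares on
    # the previous points, accumulating Gaussian log loss; the first
    # degree+1 points are predicted by the trivial fit h = 0
    total = 0.0
    for i in range(len(y)):
        if i <= degree:
            pred = 0.0
        else:
            pred = np.polyval(np.polyfit(x[:i], y[:i], degree), x[i])
        total += (y[i] - pred) ** 2 / (2 * sigma ** 2 * math.log(2)) \
                 + 0.5 * math.log2(2 * math.pi * sigma ** 2)
    return total

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.0 - 2.0 * x + 2.0 * x ** 2 + rng.normal(0, 1, size=60)  # degree-2 truth
for k in range(5):
    print(k, round(prequential_score(x, y, k), 1))
# lower scores indicate a better fit/complexity trade-off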
Some researchers have tried to directly learn functions h ∈ H from the data, without making any probabilistic assumptions about the noise [Rissanen 1989; Barron 1990; Yamanishi 1998; Grünwald 1998; Grünwald 1999]. The idea is to learn a function h that leads to good
predictions of future data from the same source in the spirit of Vapnik’s [1998] statis-
tical learning theory. Here prediction quality is measured by some fixed loss function;
different loss functions lead to different instantiations of the procedure. Such a version
of MDL is meant to be more robust, leading to inference of a ‘good’ h ∈ H irrespective
of the details of the noise distribution. This loss-based approach has also been the
method of choice in applying MDL to classification problems. Here Y takes on values
in a finite set, and the goal is to match each feature X (for example, a bit map of
a handwritten digit) with its corresponding label or class (e.g., a digit). While sev-
eral versions of MDL for classification have been proposed [Quinlan and Rivest 1989;
Rissanen 1989; Kearns, Mansour, Ng, and Ron 1997], most of these can be reduced to
the same approach based on a 0/1-valued loss function [Grünwald 1998]. In recent
work [Grünwald and Langford 2004] we show that this MDL approach to classification
without making assumptions about the noise may behave suboptimally: we exhibit
situations where no matter how large n, MDL keeps overfitting, selecting an overly
complex model with suboptimal predictive behavior. Modifications of MDL suggested
by Barron [1990] and Yamanishi [1998] do not suffer from this defect, but they do not
admit a natural coding interpretation any longer. All in all, current versions of MDL
that avoid probabilistic assumptions are still in their infancy, and more research is
needed to find out whether they can be modified to perform well in more general and
realistic settings.
Summary In the previous sections, we have covered basic refined MDL (Section 2.6),
general refined MDL (Section 2.7), and several extensions of refined MDL (this section).
This concludes our technical description of refined MDL. It only remains to place MDL
in its proper context: what does it do compared to other methods of inductive inference?
And how well does it perform, compared to other methods? The next two sections are
devoted to these questions.
2.9 Relations to Other Approaches to Inductive Inference

The relations between MDL and Akaike’s AIC [Burnham and Anderson 2002] are subtle. They are discussed by, for example, Speed and Yu [1993].
2.9.2 MDL and Bayesian Inference
MDL as a Maximum Probability Principle For a more detailed analysis, we
need to distinguish between the two central tenets of modern Bayesian statistics: (1)
Probability distributions are used to represent uncertainty, and to serve as a basis for
making predictions; rather than standing for some imagined ‘true state of nature’. (2)
All inference and decision-making is done in terms of prior and posterior distributions.
MDL sticks with (1) (although here the ‘distributions’ are primarily interpreted as
‘codelength functions’), but not (2): MDL allows the use of arbitrary universal mod-
els such as NML and prequential universal models; the Bayesian universal model does
not have a special status among these. In this sense, Bayes offers the statistician less
freedom in choice of implementation than MDL. In fact, MDL may be reinterpreted as
a maximum probability principle, where the maximum is relative to some given model,
in the worst-case over all sequences (Rissanen [1987, 1989] uses the phrase ‘global max-
imum likelihood principle’). Thus, whenever the Bayesian universal model is used in
an MDL application, a prior should be used that minimizes worst-case codelength re-
gret, or equivalently, maximizes worst-case relative probability. There is no comparable
principle for choosing priors in Bayesian statistics, and in this respect, Bayes offers a
lot more freedom than MDL.
Example 2.22 There is a conceptual problem with Bayes’ use of prior distribu-
tions: in practice, we very often want to use models which we a priori know to be
wrong – see Example 1.5. If we use Bayes for such models, then we are forced to
put a prior distribution on a set of distributions which we know to be wrong - that
is, we have degree-of-belief 1 in something we know not to be the case. From an
MDL viewpoint, these priors are interpreted as tools to achieve short codelengths
rather than degrees-of-belief and there is nothing strange about the situation; but
from a Bayesian viewpoint, it seems awkward. To be sure, Bayesian inference often
gives good results even if the model M is known to be wrong; the point is that
(a) if one is a strict Bayesian, one would never apply Bayesian inference to such
misspecified M, and (b), the Bayesian theory offers no clear explanation of why
Bayesian inference might still give good results for such M. MDL provides both
codelength and predictive-sequential interpretations of Bayesian inference, which
help explain why Bayesian inference may do something reasonable even if M is
misspecified. To be fair, we should add that there exist variations of the Bayesian
philosophy (e.g. De Finetti [1974]’s) which avoid the conceptual problem we just
described.
MDL and BIC In the first paper on MDL, Rissanen [1978] used a two-part code and
showed that, asymptotically, and under regularity conditions, the two-part codelength
of xn based on a k-parameter model M with an optimally discretized parameter space
is given by
− log P(x^n | θ̂(x^n)) + (k/2) log n, (2.42)
thus ignoring O(1)-terms, which, as we have already seen, can be quite important.
In the same year Schwarz [1978] showed that, for large enough n, Bayesian model
selection between two exponential families amounts to selecting the model minimizing
(2.42), ignoring O(1)-terms as well. As a result of Schwarz’s paper, model selection
based on (2.42) became known as the BIC (Bayesian Information Criterion). Because it does not take into account the functional form of the model M, it often does not work very well in practice.
It has sometimes been claimed that MDL = BIC; for example, [Burnham and Anderson 2002,
page 286] write “Rissanen’s result is equivalent to BIC”. This is wrong, even for
the 1989 version of MDL that Burnham and Anderson refer to – as pointed out by
Foster and Stine [2004], the BIC approximation only holds if the number of parame-
ters k is kept fixed and n goes to infinity. If we select between nested families of models
where the maximum number of parameters k considered is either infinite or grows
with n, then model selection based on both P̄nml and on P̄Bayes tends to select quite
different models than BIC - if k gets closer to n, the contribution to COMPn (M) of
each additional parameter becomes much smaller than 0.5 log n [Foster and Stine 2004].
However, researchers who claim MDL = BIC have a good excuse: in early work, Rissa-
nen himself has used the phrase ‘MDL criterion’ to refer to (2.42), and unfortunately,
the phrase has stuck.
MDL and MML MDL shares some ideas with the Minimum Message Length (MML)
Principle which predates MDL by 10 years. Key references are [Wallace and Boulton 1968;
Wallace and Boulton 1975] and [Wallace and Freeman 1987]; a long list is in [Comley and Dowe 2004].
Just as in MDL, MML chooses the hypothesis minimizing the code-length of the data.
But the codes that are used are quite different from those in MDL. First of all, in MML
one always uses two-part codes, so that MML automatically selects both a model family
and parameter values. Second, while the MDL codes such as P̄nml minimize worst-case
relative code-length (regret), the two-part codes used by MML are designed to min-
imize expected absolute code-length. Here the expectation is taken over a subjective
prior distribution defined on the collection of models and parameters under considera-
tion. While this approach contradicts Rissanen’s philosophy, in practice it often leads
to similar results.
Indeed, Wallace and his co-workers stress that their approach is fully (subjective)
Bayesian. Strictly speaking, a Bayesian should report his findings by citing the full
posterior distribution. But sometimes one is interested in a single model, or hypothesis
for the data. A good example is the inference of phylogenetic trees in biological applica-
tions: the full posterior would consist of a mixture of several of such trees, which might
all be quite different from each other. Such a mixture is almost impossible to interpret
– to get insight into the data we need a single tree. In that case, Bayesians often use
the MAP (Maximum A Posteriori) hypothesis which maximizes the posterior, or the
posterior mean parameter value. The first approach has some unpleasant properties,
for example, it is not invariant under reparameterization. The posterior mean approach
cannot be used if different model families are to be compared with each other. The
MML method provides a theoretically sound way of proceeding in such cases.
[Figure 2.6 contrasts the three approaches. MDL uses general universal codes/models (two-part, Bayesian, NML, prequential), with codes optimal in a worst-case sense over all sequences. MML uses two-part codes only, selecting the H minimizing the total codelength of D, with codes optimal in expectation according to a subjective prior probability. Dawid's prequential approach uses predictive universal models for general loss functions, including but not limited to log loss, as part of a broader program of statistical inference.]
Figure 2.6: Rissanen’s MDL, Wallace’s MML and Dawid’s Prequential Approach.
2.9.4 Kolmogorov Complexity and Structure Function; Ideal MDL

In ‘ideal MDL’, hypotheses are described using a universal programming language such as C or Pascal. For
example, in one proposal [Barron and Cover 1991], given data D one picks the distri-
bution minimizing
K(P) + [− log P(D)], (2.43)
where the minimum is taken over all computable probability distributions, and K(P )
is the length of the shortest computer program that, when input (x, d), outputs P (x)
to d bits precision. While such a procedure is mathematically well-defined, it cannot
be used in practice. The reason is that in general, the P minimizing (2.43) cannot
be effectively computed. Kolmogorov himself used a variation of (2.43) in which one
adopts, among all P with K(P ) − log P (D) ≈ K(D), the P with smallest K(P ). Here
K(D) is the Kolmogorov complexity of D, that is, the length of the shortest computer
program that prints D and then halts. This approach is known as the Kolmogorov
structure function or minimum sufficient statistic approach [Vitányi 2004]. In this
approach, the idea of separating data and noise (Section 2.6.1) is taken as basic, and
the hypothesis selection procedure is defined in terms of it. The selected hypothesis may
now be viewed as capturing all structure inherent in the data - given the hypothesis,
the data cannot be distinguished from random noise. Therefore, it may be taken
as a basis for lossy data compression – rather than sending the whole sequence, one
only sends the hypothesis representing the ‘structure’ in the data. The receiver can
then use this hypothesis to generate ‘typical’ data for it - this data should then ‘look
just the same’ as the original data D. Rissanen views this separation idea as perhaps
the most fundamental aspect of ‘learning by compression’. Therefore, in recent work
he has tried to relate MDL (as defined here, based on lossless compression) to the
Kolmogorov structure function, thereby connecting it to lossy compression, and, as he
puts it, ‘opening up a new chapter in the MDL theory’ [Vereshchagin and Vitányi 2002;
Vitányi 2004; Rissanen and Tabus 2004].
Summary and Outlook We have shown that MDL is closely related to, yet distinct
from, several other methods for inductive inference. In the next section we discuss how
well it performs compared to such other methods.
2.10 Problems for MDL?

2.10.1 Conceptual Problems: Occam’s Razor
It would be wrong to dismiss these criticisms as being entirely mistaken. Based on our newly acquired technical knowledge of MDL, let us discuss them a little bit further:
Things get more subtle if we are interested not in model selection (find the best
order Markov chain for the data) but in infinite-dimensional estimation (find the
best Markov chain parameters for the data, among the set B of all Markov chains
of each order). In the latter case, if we are to apply MDL, we somehow have to
carve up B into subsets M(0) ⊆ M(1) ⊆ . . . ⊆ B. Suppose that we have already
chosen M(1) = B (1) as the set of 1-st order Markov chains. We normally take
M(0) = B (0) , the set of 0-th order Markov chains (Bernoulli distributions). But
we could also have defined M(0) as the set of all 1-st order Markov chains with
P (Xi+1 = 1 | Xi = 1) = P (Xi+1 = 0 | Xi = 0). This defines a one-dimensional
subset of B (1) that is not equal to B (0) . While there are several good reasons21 for
choosing B (0) rather than M(0) , there may be no indication that B (0) is somehow
a priori more likely than M(0) . While MDL tells us that we somehow have to
carve up the full set B, it does not give us precise guidelines on how to do this
– different carvings may be equally justified and lead to different inferences for
small samples. In this sense, there is indeed some form of arbitrariness in this
type of MDL applications. But this is unavoidable: we stress that this type of
arbitrariness is enforced by all combined model/parameter selection methods -
whether they be of the Structural Risk Minimization type [Vapnik 1998], AIC-
type [Burnham and Anderson 2002], cross-validation or any other type. The only
alternative is treating all hypotheses in the huge class B on the same footing, which
amounts to maximum likelihood estimation and extreme overfitting.
2. ‘Occam’s razor is false’ (page 17) We often try to model real-world situations
that can be arbitrarily complex, so why should we favor simple models? We gave an
informal answer on page 17 where we claimed that even if the true data generating
machinery is very complex, it may be a good strategy to prefer simple models for small
sample sizes.
We are now in a position to give one formalization of this informal claim: it is
simply the fact that MDL procedures, with their built-in preference for ‘simple’ models with small parametric complexity, are typically statistically consistent, achieving good rates of convergence (page 16), whereas methods such as maximum likelihood, which do not take model complexity into account, are typically inconsistent whenever they are
applied to complex enough models such as the set of polynomials of each degree or the
set of Markov chains of all orders. This has implications for the quality of predictions:
with complex enough models, no matter how many training data we observe, if we
use the maximum likelihood distribution to predict future data from the same source,
the prediction error we make will not converge to the prediction error that could be
obtained if the true distribution were known; if we use an MDL submodel/parameter
estimate (Section 2.8), the prediction error will converge to this optimal achievable
error.
Of course, consistency is not the only desirable property of a learning method, and it may be that in some particular settings, under some particular performance measures, some alternatives to MDL outperform MDL. Indeed this can happen; see below. Yet it remains the case that all methods the author knows of that successfully deal with models of arbitrary complexity have a built-in preference for selecting simpler models at small sample sizes: methods such as Vapnik’s [1998] Structural Risk Minimization, penalized minimum error estimators [Barron 1990] and the Akaike criterion [Burnham and Anderson 2002] all trade off complexity against error on the data, and in this way they invariably obtain good convergence properties. While these approaches measure ‘complexity’ differently from MDL, and attach different relative weights to error on the data and to complexity, the fundamental idea of trading off ‘error’ against ‘complexity’ remains.
2.10.2 Practical Problems with MDL
Apart from these conceptual issues, how does MDL fare in practice? In many applications it performs very well^22, but there have also been experiments in which MDL-based methods behaved quite badly^23. In several of these cases, the bad performance seems due to one of the following two reasons:
1. An asymptotic formula like (2.21) was used when the sample size was not large enough to justify it [Navarro 2004].
2. P̄nml was undefined for the models under consideration, and this was solved by cutting off the parameter ranges at ad hoc values [Lanterman 2004].
In these cases the problem probably lies with the use of invalid approximations rather than with the MDL idea itself. More research is needed to find out when the asymptotics and other approximations can be trusted, and what is the ‘best’ way to deal with undefined P̄nml. For the time being, we suggest avoiding (2.21) whenever possible, and never cutting off the parameter ranges at arbitrary values; instead, if COMP_n(M) becomes infinite, some of the methods described in Section 2.7.2 should be used. Given these restrictions, P̄nml and Bayesian inference with Jeffreys’ prior are the preferred methods, since they both achieve the minimax regret. If they are either ill-defined or computationally prohibitive for the models under consideration, one can use a prequential method or a sophisticated two-part code such as described by Barron and Cover [1991]. The sketch below illustrates, for the simple case of the Bernoulli model, how slowly the asymptotic approximation to COMP_n(M) becomes accurate.
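The Bernoulli model is one of the few for which COMP_n(M) can be computed exactly by brute force. The Python sketch below compares the exact parametric complexity, log Σ over x^n of P(x^n | θ̂(x^n)), with the Fisher-information expansion (k/2) log(n/(2π)) + log ∫ √(det I(θ)) dθ; we assume that this familiar expansion is what a formula ‘like (2.21)’ refers to, and we use the fact that for the Bernoulli model k = 1 and the Fisher-information integral equals π.

    import math

    def comp_exact(n):
        """Exact parametric complexity (nats) of the Bernoulli model:
        log of the NML normalizer, grouping sequences by their count n1
        of ones (all such sequences share the same ML probability)."""
        log_terms = []
        for n1 in range(n + 1):
            log_binom = (math.lgamma(n + 1) - math.lgamma(n1 + 1)
                         - math.lgamma(n - n1 + 1))
            log_ml = ((n1 * math.log(n1 / n) if n1 else 0.0)
                      + ((n - n1) * math.log((n - n1) / n) if n1 < n else 0.0))
            log_terms.append(log_binom + log_ml)
        m = max(log_terms)  # log-sum-exp for numerical stability
        return m + math.log(sum(math.exp(t - m) for t in log_terms))

    def comp_asymptotic(n):
        """Asymptotic expansion (k/2) log(n/(2 pi)) + log of the
        Fisher-information integral, which equals pi for Bernoulli."""
        return 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

    for n in (5, 10, 100, 1000):
        print(n, round(comp_exact(n), 3), round(comp_asymptotic(n), 3))

For small n the exact and asymptotic values differ by a non-negligible fraction of a bit, which is precisely the regime in which substituting the asymptotic formula into a model selection criterion can distort the comparison between models; for large n the two agree closely.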
2.11 Conclusion
MDL is a versatile method for inductive inference: it can be interpreted in at least four different ways, all of which indicate that it does something reasonable. It is typically asymptotically consistent, achieving good rates of convergence. It achieves all this without having been designed for consistency, being based on a philosophy that makes no metaphysical assumptions about the existence of ‘true’ distributions. All this strongly suggests that it is a good method to use in practice. Practical evidence shows that in many contexts it is; in other contexts its behavior can be problematic. In the author’s view, the main challenge for the future is to improve MDL for such cases by extending and further refining MDL procedures in a non-ad hoc manner. I am confident that this can be done, and that MDL will continue to play an important role in the development of statistical, and, more generally, inductive inference.
Rissanen’s [1989] book still serves as a good introduction to his radical but appealing philosophy, which it describes very eloquently.
Acknowledgments The author would like to thank Jay Myung, Mark Pitt, Steven
de Rooij and Teemu Roos, who read a preliminary version of this chapter and suggested
several improvements.
Notes
1. But see Section 2.9.4.
2. Working directly with distributions on infinite sequences is more elegant, but it requires measure
theory, which we want to avoid here.
3. Also known as instantaneous codes and called, perhaps more justifiably, ‘prefix-free’ codes in
[Li and Vitányi 1997].
4. For example, with non-integer codelengths the notion of ‘code’ becomes invariant to the size of the
alphabet in which we describe data.
5. As understood in elementary probability, i.e. with respect to Lebesgue measure.
6. Even if one adopts a Bayesian stance and postulates that an agent can come up with a (subjective) distribution for every conceivable domain, this problem remains: in practice, the adopted distribution may be so complicated that we cannot design the optimal code corresponding to it, and have to use some ad hoc code instead.
7. Henceforth, we simply use ‘model’ to denote probabilistic models; we typically use H to denote
sets of hypotheses such as polynomials, and reserve M for probabilistic models.
8. The terminology ‘crude MDL’ is not standard. It is introduced here for pedagogical reasons,
to make clear the importance of having a single, unified principle for designing codes. It should
be noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such prin-
ciples, albeit in a slightly different form than in their recent papers. Early practical applications
[Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are ‘crude’
in the sense defined here.
9. See the previous endnote.
10. But see [Grünwald 1998], Chapter 5, for more discussion.
11. See Section 1.5 of Chapter 1 for a discussion on the role of consistency in MDL.
12. See, for example, [Barron and Cover 1991; Barron 1985].
13. Strictly speaking, the assumption that n is given in advance (i.e., both encoder and decoder know
n) contradicts the earlier requirement that the code used for encoding hypotheses is not allowed to depend on n. Thus, strictly speaking, we should first encode some n explicitly, using 2 log n + 1 bits (Example 2.4), and then pick the n (typically, but not necessarily, equal to the actual sample size) that allows for the shortest three-part codelength of the data (first encode n, then (k, θ), then the data). In practice this will not significantly alter the chosen hypothesis, except for some quite special data sequences.
14. As explained in Figure 2.2, we identify these codes with their length functions, which is the only
aspect we are interested in.
15. The reason is that, in the full Bernoulli model with parameter θ ∈ [0, 1], the maximum likelihood estimator is given by n_1/n; see Example 2.7. Since the likelihood log P(x^n | θ) is a continuous function of θ, this implies that if the frequency n_1/n in x^n is approximately (but not precisely) j/10, then the ML estimator in the restricted model {0.1, . . . , 0.9} is still given by θ̂ = j/10. Then log P(x^n | θ) is maximized by θ̂ = j/10, so that the L ∈ L that minimizes codelength corresponds to θ = j/10.
16. What we call ‘universal model’ in this text is known in the literature as a ‘universal model in the
individual sequence sense’ – there also exist universal models in an ‘expected sense’, see Section 2.9.1.
These lead to slightly different versions of MDL.
17. To be fair, we should add that this naive version of GLRT is introduced here for educational
purposes only. It is not recommended by any serious statistician!
18. The standard definition of Fisher information [Kass and Voss 1997] is in terms of first derivatives
of the log-likelihood; for most parametric models of interest, the present definition coincides with the
standard one.
19. The author has heard many people say this at many conferences. The reasons are probably historical: while the underlying philosophy has always been different, until Rissanen introduced the use of P̄nml, most actual implementations of MDL ‘looked’ quite Bayesian.
20. The reason is that the Bayesian and plug-in models can be interpreted as probabilistic sources. The NML and two-part code models are not probabilistic sources, since P̄^{(n)} and P̄^{(n+1)} are not compatible in the sense of Section 2.2.
21. For example, B^{(0)} is easier to interpret.
22. We mention [Hansen and Yu 2000; Hansen and Yu 2001], reporting excellent behavior of MDL in regression contexts; and [Allen, Madani, and Greiner 2003; Kontkanen, Myllymäki, Silander, and Tirri 1999; Modha and Masry 1998], reporting excellent behavior of predictive (prequential) coding in Bayesian network model selection and regression. Also, ‘objective Bayesian’ model selection methods are frequently and successfully used in practice [Kass and Wasserman 1996]. Since these are based on non-informative priors such as Jeffreys’, they often coincide with a version of refined MDL and thus indicate successful performance of MDL.
23. But see Viswanathan, Wallace, Dowe, and Korb [1999], who point out that the problem of [Kearns, Mansour, Ng, and Ron 1997] disappears if a more reasonable coding scheme is used.
Bibliography
Chaitin, G. (1969). On the length of programs for computing finite binary sequences:
statistical considerations. Journal of the ACM 16, 145–159.
Clarke, B. (2002). Comparing Bayes and non-Bayes model averaging when model
approximation error cannot be ignored. Under submission.
Comley, J. W. and D. L. Dowe (2004). Minimum Message Length and generalised
Bayesian nets with asymmetric languages. In P. D. Grünwald, I. J. Myung, and
M. A. Pitt (Eds.), Advances in Minimum Description Length: Theory and Ap-
plications. MIT Press.
Cover, T. and J. Thomas (1991). Elements of Information Theory. New York: Wiley
Interscience.
Csiszár, I. and P. Shields (2000). The consistency of the BIC Markov order estimator.
The Annals of Statistics 28, 1601–1619.
Dawid, A. (1984). Present position and potential developments: Some personal views,
statistical theory, the prequential approach. Journal of the Royal Statistical So-
ciety, Series A 147 (2), 278–292.
Dawid, A. (1992). Prequential analysis, stochastic complexity and Bayesian infer-
ence. In J. Bernardo, J. Berger, A. Dawid, and A. Smith (Eds.), Bayesian Statis-
tics, Volume 4, pp. 109–125. Oxford University Press. Proceedings of the Fourth
Valencia Meeting.
Dawid, A. (1997). Prequential analysis. In S. Kotz, C. Read, and D. Banks (Eds.),
Encyclopedia of Statistical Sciences, Volume 1 (Update), pp. 464–470. Wiley-
Interscience.
Dawid, A. P. and V. G. Vovk (1999). Prequential probability: Principles and prop-
erties. Bernoulli 5, 125–162.
De Finetti, B. (1974). Theory of Probability. A critical introductory treatment. Lon-
don: John Wiley & Sons.
Domingos, P. (1999). The role of Occam’s razor in knowledge discovery. Data Mining
and Knowledge Discovery 3 (4), 409–425.
Feder, M. (1986). Maximum entropy as a special case of the minimum description
length criterion. IEEE Transactions on Information Theory 32 (6), 847–849.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol-
ume 1. Wiley. Third edition.
Foster, D. and R. Stine (1999). Local asymptotic coding and the minimum description
length. IEEE Transactions on Information Theory 45, 1289–1293.
Foster, D. and R. Stine (2001). The competitive complexity ratio. In Proceedings of
the 2001 Conference on Information Sciences and Systems. WP8 1-6.
Foster, D. P. and R. A. Stine (2004). The contribution of parameters to stochastic
complexity. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in
Minimum Description Length: Theory and Applications. MIT Press.
Gács, P., J. Tromp, and P. Vitányi (2001). Algorithmic statistics. IEEE Transactions
on Information Theory 47 (6), 2464–2479.
Grünwald, P. D. (1996). A minimum description length approach to grammar inference. In S. Wermter, E. Riloff, and G. Scheler (Eds.), Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, Number 1040 in Springer Lecture Notes in Artificial Intelligence, pp. 203–216.
Grünwald, P. D. (1998). The Minimum Description Length Principle and Reasoning
under Uncertainty. Ph. D. thesis, University of Amsterdam, The Netherlands.
Available as ILLC Dissertation Series 1998-03.
Grünwald, P. D. (1999). Viewing all models as ‘probabilistic’. In Proceedings of the
Twelfth Annual Workshop on Computational Learning Theory (COLT’ 99), pp.
171–182.
Grünwald, P. D. (2000). Maximum entropy and the glasses you are looking through.
In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelli-
gence (UAI 2000), pp. 238–246. Morgan Kaufmann Publishers.
Grünwald, P. D. and A. P. Dawid (2004). Game theory, maximum entropy, minimum
discrepancy, and robust Bayesian decision theory. Annals of Statistics 32 (4).
Grünwald, P. D. and J. Langford (2004). Suboptimal behaviour of Bayes and MDL
in classification under misspecification. In Proceedings of the Seventeenth Annual
Conference on Computational Learning Theory (COLT’ 04).
Grünwald, P. D., I. J. Myung, and M. A. Pitt (Eds.) (2004). Advances in Minimum
Description Length: Theory and Applications. MIT Press.
Hansen, M. and B. Yu (2000). Wavelet thresholding via MDL for natural images.
IEEE Transactions on Information Theory 46, 1778–1788.
Hansen, M. and B. Yu (2001). Model selection and the principle of minimum descrip-
tion length. Journal of the American Statistical Association 96 (454), 746–774.
Hanson, A. J. and P. C.-W. Fu (2004). Applications of MDL to selected families of
models. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in
Minimum Description Length: Theory and Applications. MIT Press.
Hjorth, U. (1982). Model selection and forward validation. Scandinavian Journal of
Statistics 9, 95–105.
Jaynes, E. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Edited by G. Larry Bretthorst.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation prob-
lems. Proceedings of the Royal Statistical Society (London) Series A 186, 453–461.
Jeffreys, H. (1961). Theory of Probability (Third ed.). London: Oxford University
Press.
Kass, R. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical
Association 90 (430), 773–795.
Kass, R. and P. Voss (1997). Geometrical Foundations of Asymptotic Inference. Wiley
Interscience.
Kass, R. and L. Wasserman (1996). The selection of prior distributions by formal
rules. Journal of the American Statistical Association 91, 1343–1370.
Kearns, M., Y. Mansour, A. Ng, and D. Ron (1997). An experimental and theoretical
comparison of model selection methods. Machine Learning 27, 7–50.
Kolmogorov, A. (1965). Three approaches to the quantitative definition of informa-
tion. Problems Inform. Transmission 1 (1), 1–7.
Kontkanen, P., P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri (2004). An
MDL framework for data clustering. In P. D. Grünwald, I. J. Myung, and M. A.
Pitt (Eds.), Advances in Minimum Description Length: Theory and Applications.
MIT Press.
Kontkanen, P., P. Myllymäki, T. Silander, and H. Tirri (1999). On supervised selec-
tion of Bayesian networks. In K. Laskey and H. Prade (Eds.), Proceedings of the
15th International Conference on Uncertainty in Artificial Intelligence (UAI’99).
Morgan Kaufmann Publishers.
Lanterman, A. D. (2004). Hypothesis testing for Poisson versus geometric distri-
butions using stochastic complexity. In P. D. Grünwald, I. J. Myung, and M. A.
Pitt (Eds.), Advances in Minimum Description Length: Theory and Applications.
MIT Press.
Lee, P. (1997). Bayesian Statistics – an introduction. Arnold & Oxford University
Press.
Li, M., X. Chen, X. Li, B. Ma, and P. Vitányi (2003). The similarity metric. In Proc.
14th ACM-SIAM Symp. Discrete Algorithms (SODA).
Li, M. and P. Vitányi (1997). An Introduction to Kolmogorov Complexity and Its
Applications (revised and expanded second ed.). New York: Springer-Verlag.
Liang, F. and A. Barron (2004a). Exact minimax predictive density estimation and
MDL. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in
Minimum Description Length: Theory and Applications. MIT Press.
Liang, F. and A. Barron (2004b). Exact minimax strategies for predictive density
estimation. To appear in IEEE Transactions on Information Theory.
Modha, D. S. and E. Masry (1998). Prequential and cross-validated regression esti-
mation. Machine Learning 33 (1), 5–39.
Myung, I. J., V. Balasubramanian, and M. A. Pitt (2000). Counting probability dis-
tributions: Differential geometry and model selection. Proceedings of the National
Academy of Sciences USA 97, 11170–11175.
Navarro, D. (2004). Misbehaviour of the Fisher information approximation to Mini-
mum Description Length. Under submission.
Pednault, E. (2003). Personal communication.
Quinlan, J. and R. Rivest (1989). Inferring decision trees using the minimum de-
scription length principle. Information and Computation 80, 227–248.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University
Press.
Rissanen, J. (1978). Modeling by the shortest data description. Automatica 14, 465–
471.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum de-
scription length. The Annals of Statistics 11, 416–431.
Rissanen, J. (1984). Universal coding, information, prediction and estimation. IEEE
Transactions on Information Theory 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics 14,
1080–1100.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society,
series B 49, 223–239. Discussion: pages 252–265.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific
Publishing Company.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transac-
tions on Information Theory 42 (1), 40–47.
Rissanen, J. (2000). MDL denoising. IEEE Transactions on Information The-
ory 46 (7), 2537–2543.
Rissanen, J. (2001). Strong optimality of the normalized ML models as universal
codes and information in data. IEEE Transactions on Information Theory 47 (5),
1712–1717.
Rissanen, J., T. Speed, and B. Yu (1992). Density estimation by stochastic complex-
ity. IEEE Transactions on Information Theory 38 (2), 315–323.
Rissanen, J. and I. Tabus (2004). Kolmogorov’s structure function in MDL theory and
lossy data compression. In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.),
Advances in Minimum Description Length: Theory and Applications. MIT Press.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statis-
tics 6 (2), 461–464.
Shafer, G. and V. Vovk (2001). Probability and Finance – It’s only a game! Wiley.
Shtarkov, Y. M. (1987). Universal sequential coding of single messages. (translated
from) Problems of Information Transmission 23 (3), 3–17.
Solomonoff, R. (1964). A formal theory of inductive inference, part 1 and part 2.
Information and Control 7, 1–22, 224–254.
Solomonoff, R. (1978). Complexity-based induction systems: comparisons and con-
vergence theorems. IEEE Transactions on Information Theory 24, 422–432.
Speed, T. and B. Yu (1993). Model selection and prediction: normal regression. Ann.
Inst. Statist. Math. 45 (1), 35–54.
Takeuchi, J. (2000). On minimax regret with respect to families of stationary stochas-
tic processes (in Japanese). In Proceedings IBIS 2000, pp. 63–68.
Takeuchi, J. and A. Barron (1997). Asymptotically minimax regret for exponential
families. In Proceedings SITA ’97, pp. 665–668.
Takeuchi, J. and A. Barron (1998). Asymptotically minimax regret by Bayes mix-
tures. In Proceedings of the 1998 International Symposium on Information Theory
(ISIT 98).
Townsend, P. (1975). The mind-body equation revisited. In C.-Y. Cheng (Ed.), Psy-
chological Problems in Philosophy, pp. 200–218. Honolulu: University of Hawaii
Press.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley.
Vereshchagin, N. and P. M. B. Vitányi (2002). Kolmogorov’s structure functions with
an application to the foundations of model selection. In Proc. 47th IEEE Symp.
Found. Comput. Sci. (FOCS’02).
Viswanathan, M., C. Wallace, D. Dowe, and K. Korb (1999). Finding cutpoints in noisy binary sequences - a revised empirical evaluation. In Proc. 12th Australian Joint Conf. on Artif. Intelligence, Volume 1747 of Lecture Notes in Artificial Intelligence (LNAI), Sydney, Australia, pp. 405–416.
Vitányi, P. M. (2004). Algorithmic statistics and Kolmogorov’s structure function.
In P. D. Grünwald, I. J. Myung, and M. A. Pitt (Eds.), Advances in Minimum
Description Length: Theory and Applications. MIT Press.
Wallace, C. and D. Boulton (1968). An information measure for classification. Com-
puting Journal 11, 185–195.
Wallace, C. and D. Boulton (1975). An invariant Bayes method for point estimation.
Classification Society Bulletin 3 (3), 11–34.
Wallace, C. and P. Freeman (1987). Estimation and inference by compact coding.
Journal of the Royal Statistical Society, Series B 49, 240–251. Discussion: pages
252–265.
Webb, G. (1996). Further experimental evidence against the utility of Occam’s razor.
Journal of Artificial Intelligence Research 4, 397–417.
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and
its applications to learning. IEEE Transactions on Information Theory 44 (4),
1424–1439.
Zhang, T. (2004). On the convergence of MDL density estimation. In Y. Singer and
J. Shawe-Taylor (Eds.), Proceedings of the Seventeenth Annual Conference on
Computational Learning Theory (COLT’ 04), Lecture Notes in Computer Science.
Springer-Verlag.