Bishop Valencia 07
Generative or Discriminative?
Getting the Best of Both Worlds
Christopher M. Bishop Julia Lasserre
Microsoft Research, UK Cambridge University, UK
Summary
For many applications of machine learning the goal is to predict the value of
a vector c given the value of a vector x of input features. In a classification
problem c represents a discrete class label, whereas in a regression problem
it corresponds to one or more continuous variables. From a probabilistic
perspective, the goal is to find the conditional distribution p(c|x). The most
common approach to this problem is to represent the conditional distribution
using a parametric model, and then to determine the parameters using a
training set consisting of pairs {xn , cn } of input vectors along with their
corresponding target output vectors. The resulting conditional distribution
can be used to make predictions of c for new values of x. This is known
as a discriminative approach, since the conditional distribution discriminates
directly between the different values of c.
An alternative approach is to find the joint distribution p(x, c), expressed
for instance as a parametric model, and then subsequently to use this joint
distribution to evaluate the conditional p(c|x) in order to make predictions of c
for new values of x. This is known as a generative approach since by sampling
from the joint distribution it is possible to generate synthetic examples of the
feature vector x. In practice, the generalization performance of generative
models is often found to be poorer than that of discriminative models due to
differences between the model and the true distribution of the data.
When labelled training data is plentiful, discriminative techniques are widely
used since they give excellent generalization performance. However, although
collection of data is often easy, the process of labelling it can be expensive.
Consequently there is increasing interest in generative methods since these
can exploit unlabelled data in addition to labelled data.
Although the generalization performance of generative models can often be
improved by ‘training them discriminatively’, they can then no longer make
use of unlabelled data. In an attempt to gain the benefit of both generative
and discriminative approaches, heuristic procedures have been proposed which
interpolate between these two extremes by taking a convex combination of
the generative and discriminative objective functions.
Julia Lasserre is funded by the Microsoft Research European PhD Scholarship programme.
Here we discuss a new perspective which says that there is only one correct
way to train a given model, and that a ‘discriminatively trained’ generative
model is fundamentally a new model (Minka, 2005). From this viewpoint,
generative and discriminative models correspond to specific choices for the
prior over parameters. As well as giving a principled interpretation of ‘dis-
criminative training’, this approach opens the door to very general ways of
interpolating between generative and discriminative extremes through alter-
native choices of prior. We illustrate this framework using both synthetic
data and a practical example in the domain of multi-class object recognition.
Our results show that, when the supply of labelled training data is limited,
the optimum performance corresponds to a balance between the purely gen-
erative and the purely discriminative. We conclude by discussing how to use
a Bayesian approach to find automatically the appropriate trade-off between
the generative and discriminative extremes.
1. INTRODUCTION
In many applications of machine learning the goal is to take a vector x of input
features and to assign it to one of a number of alternative classes labelled by a
vector c (for instance, if we have C classes, then c might be a C-dimensional binary
vector in which all elements are zero except the one corresponding to the class).
In the simplest scenario, we are given a training data set X comprising N
input vectors X = {x1 , . . . , xN } together with a set of corresponding labels C =
{c1 , . . . , cN }, in which we assume that the input vectors, and their labels, are drawn
independently from the same fixed distribution. Our goal is to predict the class $\hat{c}$
for a new input vector $\hat{x}$, and so we require the conditional distribution
\[ p(\hat{c} \mid \hat{x}, X, C). \tag{1} \]
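The likelihood function L(θ) can be combined with a prior p(θ) to give a joint distribution
p(θ)L(θ), which on normalization gives the posterior distribution
\[ p(\theta \mid X, C) = \frac{p(\theta) L(\theta)}{p(C \mid X)} \tag{4} \]
where
\[ p(C \mid X) = \int p(\theta) L(\theta) \, d\theta. \tag{5} \]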
Predictions for new inputs are then made by marginalizing the predictive distribu-
tion with respect to θ weighted by the posterior distribution
\[ p(\hat{c} \mid \hat{x}, X, C) = \int p(\hat{c} \mid \hat{x}, \theta) \, p(\theta \mid X, C) \, d\theta. \tag{6} \]
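As a minimal sketch of the marginalization in (6), the integral over θ can be approximated by a sum over a grid when θ is one-dimensional. The logistic model, the standard normal prior, the grid and the toy data below are all assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive(x_new, X, C, grid):
    """Approximate p(c=1 | x_new, X, C) by summing over a grid of theta values."""
    # Assumed prior: standard normal on theta (log density up to a constant).
    log_prior = -0.5 * grid**2
    # Assumed model: p(c=1 | x, theta) = sigmoid(theta * x); log-likelihood per grid point.
    p1 = sigmoid(np.outer(grid, X))                      # shape (G, N)
    log_lik = np.sum(np.where(C == 1, np.log(p1), np.log1p(-p1)), axis=1)
    # Normalized posterior weights on the grid, playing the role of p(theta | X, C).
    log_post = log_prior + log_lik
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    # Weighted average of the per-theta predictions, as in (6).
    return float(np.sum(w * sigmoid(grid * x_new)))

X = np.array([-2.0, -1.0, 0.5, 1.5, 2.5])   # toy inputs (hypothetical)
C = np.array([0, 0, 1, 1, 1])               # toy binary labels (hypothetical)
grid = np.linspace(-10.0, 10.0, 2001)
p = predictive(1.0, X, C, grid)
```

A grid is only practical for very low-dimensional θ; it is used here purely to make the weighting by the posterior explicit.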
where θ = {π, λ}. Since the data points are assumed to be independent, the joint
distribution is given by
\[ L_G(\theta) = p(X, C, \theta) = p(\theta) \prod_{n=1}^{N} p(x_n, c_n \mid \theta). \tag{9} \]
This can be maximized to determine the most probable (MAP) value of θ. Again,
since p(X, C, θ) = p(θ|X, C)p(X, C), this is equivalent to maximizing the posterior
distribution p(θ|X, C).
In order to improve the predictive performance of generative models it has been
proposed to use ‘discriminative training’ (Yakhnenko et al., 2005) which involves
maximizing
\[ L_D(\theta) = p(C, \theta \mid X) = p(\theta) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \tag{10} \]
in which we are conditioning on the input vectors instead of modelling their distri-
bution. Here we have used
\[ p(c \mid x, \theta) = \frac{p(x, c \mid \theta)}{\sum_{c'} p(x, c' \mid \theta)}. \tag{11} \]
Note that (10) is not the joint distribution for the original model defined by (9), and
so does not correspond to MAP for this model. The terminology of ‘discriminative
training’ is therefore misleading, since for a given model there is only one correct
way to train it. It is not the training method which has changed, but the model
itself.
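The difference between (9) and (10) can be checked numerically. The sketch below assumes a class-conditional model with one isotropic Gaussian per class and omits the prior term p(θ), which is common to both objectives. The parameter values and data are hypothetical. It shows that the two log objectives differ exactly by the sum of log marginals log p(x_n|θ).

```python
import numpy as np

def log_joint(X, theta):
    """log p(x_n, c=k | theta) for every point and class; shape (N, K)."""
    pi, mu, var = theta                # class priors (K,), means (K, D), variances (K,)
    D = X.shape[1]
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)       # (N, K)
    log_gauss = -0.5 * sq / var - 0.5 * D * np.log(2 * np.pi * var)
    return np.log(pi) + log_gauss

def log_LG(X, c, theta):
    """Log of (9) without the prior: sum_n log p(x_n, c_n | theta)."""
    lj = log_joint(X, theta)
    return lj[np.arange(len(c)), c].sum()

def log_LD(X, c, theta):
    """Log of (10) without the prior, with p(c_n | x_n, theta) given by (11)."""
    lj = log_joint(X, theta)
    log_px = np.logaddexp.reduce(lj, axis=1)       # log sum_c' p(x_n, c' | theta)
    return (lj[np.arange(len(c)), c] - log_px).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                        # toy inputs (hypothetical)
c = np.array([0, 1, 0, 1, 1, 0])                   # toy labels (hypothetical)
theta = (np.array([0.4, 0.6]),
         np.array([[0.0, 0.0], [1.0, 1.0]]),
         np.array([1.0, 2.0]))

log_px_total = np.logaddexp.reduce(log_joint(X, theta), axis=1).sum()
gap = log_LG(X, c, theta) - log_LD(X, c, theta)    # equals log_px_total
```

Because the two objectives differ by a term that depends on θ, they are in general maximized at different parameter values, which is why (10) defines a different model rather than a different training method.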
This concept of discriminative training has been taken a stage further (Yakhnenko,
Silvescu, and Honavar, 2005) by maximizing a function given by a convex combina-
tion of (9) and (10) of the form
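\[ q(x, c \mid \theta, \tilde{\theta}) = p(c \mid x, \theta) \, p(x \mid \tilde{\theta}) \tag{13} \]
where
\[ p(x \mid \tilde{\theta}) = \sum_{c'} p(x, c' \mid \tilde{\theta}). \tag{14} \]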
\[ q(X, C, \theta, \tilde{\theta}) = p(\theta, \tilde{\theta}) \prod_{n=1}^{N} p(c_n \mid x_n, \theta) \, p(x_n \mid \tilde{\theta}). \tag{15} \]
Now suppose we consider a special case in which the prior factorizes, so that
\[ p(\theta, \tilde{\theta}) = p(\theta) \, p(\tilde{\theta}). \tag{16} \]
We see that the resulting value of θ will be identical to that found by maximizing
(10), since it is the same function which is being maximized. Since it is θ and
not $\tilde{\theta}$ which determines the predictive distribution p(c|x, θ) we see that this model
is equivalent in its predictions to the ‘discriminatively trained’ generative model.
This gives a consistent view of training in which we always maximize the joint
distribution, and the distinction between generative and discriminative training lies
in the choice of model.
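To make this step explicit, taking the logarithm of (15) under the factorized prior (16) splits the objective into two groups of terms that share no parameters. This is a worked restatement of the equations above rather than an additional result:

```latex
\[
\ln q(X, C, \theta, \tilde{\theta})
  = \underbrace{\ln p(\theta) + \sum_{n=1}^{N} \ln p(c_n \mid x_n, \theta)}_{\text{depends only on } \theta}
  \; + \;
  \underbrace{\ln p(\tilde{\theta}) + \sum_{n=1}^{N} \ln p(x_n \mid \tilde{\theta})}_{\text{depends only on } \tilde{\theta}}
\]
```

The first group is the logarithm of (10), so maximizing the joint distribution with respect to θ reproduces the ‘discriminatively trained’ solution, while the second group only fits a model of the inputs that has no effect on predictions.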
The relationship between the generative model and the discriminative model is
illustrated using directed graphs in Fig. 1.
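[Figure 1: directed graphs for the original generative model p, with class label c, input x and parameters π and λ, and for the model q, with parameters θ = {λ, π} and θ′ = {λ′, π′}; the plates range over N images.]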
Now suppose instead that we consider a prior which enforces equality between
the two sets of parameters
\[ p(\theta, \tilde{\theta}) = p(\theta) \, \delta(\theta - \tilde{\theta}). \tag{18} \]
results. Any improvement from the discriminative approach must therefore be the
result of a mis-match between the model and the true distribution of the (process
which generates the) data. In other words, the benefit of ‘discriminative training’
is dependent on model mis-specification.
Conversely, the benefit of the generative approach is that it can make use of
unlabelled data to augment the labelled training set. Suppose we have a data set
comprising a set of inputs XL for which we have corresponding labels CL , together
with a set of inputs XU for which we have no labels. For the correctly trained
generative model, the function which is maximized is given by
\[ p(\theta) \prod_{n \in L} p(x_n, c_n \mid \theta) \prod_{m \in U} p(x_m \mid \theta). \tag{19} \]
We see that the unlabelled data influences the choice of θ and hence affects the
predictions of the model. By contrast, for the ‘discriminatively trained’ generative
model the function which is now optimized is again the product of the prior and the
likelihood function and so takes the form
\[ p(\theta) \prod_{n \in L} p(c_n \mid x_n, \theta) \tag{21} \]
and we see that the unlabelled data plays no role. Thus, in order to make use of
unlabelled data we cannot use a discriminative approach.
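The contrast between (19) and (21) can be illustrated with a small sketch. It assumes a one-dimensional model with one Gaussian per class, omits the prior p(θ), and uses hypothetical data and parameter values. Adding unlabelled points changes the generative objective but leaves the discriminative objective untouched.

```python
import numpy as np

def log_joint(X, theta):
    """log p(x_n, c=k | theta) for a 1-D Gaussian class-conditional model; shape (N, K)."""
    pi, mu, var = theta
    return np.log(pi) - 0.5 * (X[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def generative_objective(XL, cL, XU, theta):
    """Log of (19) without log p(theta): labelled joint terms plus unlabelled marginal terms."""
    lab = log_joint(XL, theta)[np.arange(len(cL)), cL].sum()
    unl = np.logaddexp.reduce(log_joint(XU, theta), axis=1).sum()
    return lab + unl

def discriminative_objective(XL, cL, XU, theta):
    """Log of (21) without log p(theta); XU is accepted but never used."""
    lj = log_joint(XL, theta)
    return (lj[np.arange(len(cL)), cL] - np.logaddexp.reduce(lj, axis=1)).sum()

theta = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
XL, cL = np.array([-1.2, 0.9]), np.array([0, 1])           # labelled points (hypothetical)
XU_none, XU_some = np.array([]), np.array([0.2, 3.0])      # unlabelled points (hypothetical)

g0 = generative_objective(XL, cL, XU_none, theta)
g1 = generative_objective(XL, cL, XU_some, theta)
d0 = discriminative_objective(XL, cL, XU_none, theta)
d1 = discriminative_objective(XL, cL, XU_some, theta)
```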
Now let us consider how a combination of labelled and unlabelled data can be
exploited from the perspective of our new approach defined by (15), for which the
joint distribution becomes
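\[ q(X_L, C_L, X_U, \theta, \tilde{\theta}) = p(\theta, \tilde{\theta}) \left[ \prod_{n \in L} p(c_n \mid x_n, \theta) \, p(x_n \mid \tilde{\theta}) \right] \left[ \prod_{m \in U} p(x_m \mid \tilde{\theta}) \right]. \tag{22} \]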
We see that the unlabelled data (as well as the labelled data) influences the parameters
$\tilde{\theta}$, which in turn influence θ via the soft constraint imposed by the prior.
In general, if the model is not a perfect representation of reality, and if we have
unlabelled data available, then we would expect the optimal balance to lie neither
at the purely generative extreme nor at the purely discriminative extreme.
As a simple example of a prior which interpolates smoothly between the gener-
ative and discriminative limits, consider the class of priors of the form
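\[ p(\theta, \tilde{\theta}) \propto p(\theta) \, p(\tilde{\theta}) \, \frac{1}{\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \|\theta - \tilde{\theta}\|^2 \right\}. \tag{23} \]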
If desired, we can relate σ to an α-like parameter by defining a map from (0, 1) to
(0, ∞), for example using
\[ \sigma(\alpha) = \left( \frac{\alpha}{1 - \alpha} \right)^2. \tag{24} \]
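A minimal sketch of how (22), (23) and (24) fit together is given below. It assumes that p(θ) and p(θ̃) are constant, that the parameters are stored as flat vectors, and that the model is a hypothetical one-dimensional one with two unit-variance Gaussian classes and equal class priors. The function names are illustrative and are not taken from the paper.

```python
import numpy as np

def sigma(alpha):
    """Map (24) from alpha in (0, 1) to sigma in (0, inf)."""
    return (alpha / (1.0 - alpha)) ** 2

def log_coupling(theta, theta_tilde, alpha):
    """Log of the factor (1/sigma) exp(-||theta - theta_tilde||^2 / (2 sigma^2)) in (23)."""
    s = sigma(alpha)
    diff = np.asarray(theta, dtype=float) - np.asarray(theta_tilde, dtype=float)
    return -np.log(s) - 0.5 * np.dot(diff, diff) / s**2

def toy_log_joint(X, theta):
    """Hypothetical model: log p(x, c=k | theta) with theta holding the two class means."""
    mu = np.asarray(theta, dtype=float)
    return np.log(0.5) - 0.5 * (X[:, None] - mu) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(XL, cL, XU, theta, theta_tilde, alpha, log_joint):
    """Log of (22), with p(theta) and p(theta_tilde) assumed constant and dropped."""
    lj = log_joint(XL, theta)                                  # label terms use theta
    log_cond = lj[np.arange(len(cL)), cL] - np.logaddexp.reduce(lj, axis=1)
    X_all = np.concatenate([XL, XU])                           # input terms use theta_tilde
    log_marg = np.logaddexp.reduce(log_joint(X_all, theta_tilde), axis=1)
    return log_cond.sum() + log_marg.sum() + log_coupling(theta, theta_tilde, alpha)

XL, cL = np.array([-1.2, 0.9]), np.array([0, 1])               # labelled points (hypothetical)
XU = np.array([0.2, 3.0, -0.7])                                # unlabelled points (hypothetical)
value = log_q(XL, cL, XU, [-1.0, 1.0], [-0.8, 1.1], 0.5, toy_log_joint)
```

Small α gives small σ and a heavy penalty on any difference between θ and θ̃, approaching the equality constraint (18). As α approaches 1, σ grows without bound and the coupling vanishes, approaching the factorized prior (16).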
3. ILLUSTRATION
We now illustrate the new framework for blending between generative and discrimi-
native approaches using an example based on synthetic data. This is chosen to be as
simple as possible, and so involves data vectors xn which live in a two-dimensional
Euclidean space for easy visualization, and which belong to one of two classes. Data
from each class is generated from a Gaussian distribution as illustrated in Fig. 2.
[Figure 2: synthetic data set, plot titled ‘Data distribution’.]
Here the scales on the axes are equal, and so we see that the class-conditional
densities are elongated in the horizontal direction.
We now consider a continuum of models which interpolate between purely gen-
erative and purely discriminative. To define this model we consider the generative
limit, and represent each class-conditional density using an isotropic Gaussian dis-
tribution. Since this does not capture the horizontally elongated nature of the true
class distributions, this represents a form of model mis-specification. The parame-
ters of the model are the means and variances of the Gaussians for each class, along
with the class prior probabilities.
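The kind of mis-specification described here can be reproduced with a short sketch: data are drawn from horizontally elongated Gaussians and then fitted with one isotropic Gaussian per class. The means and covariance below are assumed values chosen only to give elongated classes, since the text does not state the true parameters. The fit uses all labels for simplicity, unlike the experiment that follows.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[2.0, 0.0], [0.0, 0.2]])        # assumed: larger variance horizontally
means = np.array([[3.0, 2.0], [3.0, 4.0]])      # assumed class means
X = np.vstack([rng.multivariate_normal(m, cov, size=200) for m in means])
c = np.repeat([0, 1], 200)

def fit_isotropic(X, c, K=2):
    """Maximum-likelihood fit of class priors and one isotropic Gaussian per class."""
    pi = np.array([(c == k).mean() for k in range(K)])
    mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
    # One variance per class: squared deviation averaged over points and dimensions.
    var = np.array([((X[c == k] - mu[k]) ** 2).mean() for k in range(K)])
    return pi, mu, var

pi, mu, var = fit_isotropic(X, c)
```

The fitted variance for each class lies between the true horizontal and vertical variances, so the isotropic model cannot match the shape of either direction.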
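We consider a prior of the form (23) in which σ(α) is defined by (24). Here
we choose $p(\theta, \tilde{\theta} \mid \alpha) = p(\theta) \, \mathcal{N}(\theta \mid \tilde{\theta}, \sigma(\alpha))$, where p(θ) is the usual conjugate prior (a
Gaussian-gamma prior for the means and variances, and a Dirichlet prior for the class
label). The generative model consists of a spherical Gaussian per class, with mean
µ and a diagonal precision matrix ∆I, so that $\theta = \{\mu_k, \Delta_k\}$ and $\tilde{\theta} = \{\tilde{\mu}_k, \tilde{\Delta}_k\}$.
Specifically we have chosen
\[ p(\theta, \tilde{\theta} \mid \alpha) \propto \mathcal{N}(\theta \mid \tilde{\theta}, \sigma(\alpha)) \prod_k \left[ \mathcal{N}(\tilde{\mu}_k \mid 0, (10 \tilde{\Delta}_k)^{-1}) \, G(\tilde{\Delta}_k \mid 0.01, 100) \, G(\Delta_k \mid 0.01, 100) \right] \tag{25} \]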
where N (·|·, ·) denotes a Gaussian distribution and G(·|·, ·) denotes a gamma distri-
bution.
The training data set comprises 200 points from each class, of which just two
from each class are labelled, and the test set comprises 200 points all of which
are labelled. Experiments are run 10 times with differing random initializations
(including the random selection of which subset of training points to label) and the
results used to compute a mean and variance over the test set classification, which
are shown by ‘error bars’ in Fig. 3.
[Figure 3: classification performance against values of alpha.]
We see that the best generalization occurs for values of α intermediate between
the generative and discriminative extremes.
To gain insight into this behaviour we can plot the contours of density for each
class corresponding to different values of α, as shown in Fig. 4.
4. OBJECT RECOGNITION
We now apply our approach to a realistic application involving object recognition
in static images. This is a widely studied problem which has been tackled using a
range of different discriminative and generative models. The long term goal of such
research is to achieve near human levels of recognition accuracy across thousands of
object classes in the presence of wide variations in location, scale, orientation and
lighting, as well as changes due to intra-class variability and occlusion.
We used eight different classes: airplanes, bikes, cows, faces, horses, leaves, motor-
bikes, sheep (Bishop, 2006). Together these images exhibit a wide variety of poses,
colours, and illumination, as illustrated by the sample images shown in Fig. 5. The
goal is to assign images from the test set to one of the eight classes.
\[ p(c) = \prod_{k=1}^{C} \psi_k^{c_k}. \tag{26} \]
Figure 6: The generative model for object recognition expressed as a di-
rected acyclic graph, for unlabelled images, in which the boxes denote ‘plates’
(i.e. independent replicated copies). Only the patch feature vectors {xnj } are
observed, corresponding to the shaded node. The image class labels cn and
patch class labels τ nj are latent variables.
Given the overall class for the image, each patch is then drawn from either one
of the foreground classes or the background (k = C + 1) class. The probability of
generating a patch from a particular class is governed by a set of parameters πk ,
one for each class, such that πk > 0, constrained by the subset of classes actually
present in the image. Thus
\[ p(\tau_j \mid c) = \left( \sum_{l=1}^{C+1} c_l \pi_l \right)^{-1} \prod_{k=1}^{C+1} (c_k \pi_k)^{\tau_{jk}}. \tag{27} \]
Note that there is an overall undetermined scale to these parameters, which may be
removed by fixing one of them, e.g. πC+1 = 1.
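For a one-hot patch label, (27) reduces to normalizing the weights c_k π_k over the classes present in the image. The sketch below assumes that c is extended with a final background entry that is always 1, and uses hypothetical values for π with π_{C+1} = 1.

```python
import numpy as np

def patch_class_probs(c, pi):
    """Evaluate (27) for every one-hot patch label: p(tau_j = k | c) for k = 1..C+1."""
    weights = np.asarray(c, dtype=float) * np.asarray(pi, dtype=float)   # c_k * pi_k
    return weights / weights.sum()

# Three foreground classes plus background; the image contains class 2 (hypothetical values).
c = [0, 1, 0, 1]                 # last entry: background, assumed always present
pi = [2.0, 3.0, 4.0, 1.0]        # last entry fixed to 1 to remove the overall scale
probs = patch_class_probs(c, pi)
```

Classes absent from the image receive probability zero, and the remaining probability is shared between the image class and the background in proportion to their π values.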
For each class, the distribution of the patch feature vector x is governed by a
separate mixture of Gaussians which we denote by
\[ p(x_j \mid \tau_j) = \prod_{k=1}^{C+1} \phi_k(x_j; \lambda_k)^{\tau_{jk}} \tag{28} \]
where λk denotes the set of parameters (means, covariances and mixing coefficients)
associated with this mixture model.
If we assume N independent images, and for image n we have J patches drawn
Here we are assuming that each image has the same number J of patches, though
this restriction is easily relaxed if required.
The graph shown in Fig. 6 corresponds to unlabelled images in which only the
feature vectors {xnj } are observed, with both the image category and the classes
of each of the patches being latent variables. It is also possible to consider images
which are ‘weakly labelled’, that is each image is labelled according to the category
of object present in the image. This corresponds to the graphical model of Fig. 7 in
which the node cn is shaded.
Figure 7: Graphical model corresponding to Fig. 6 for weakly labelled images.
Of course, for a given size of data set, better performance is expected if all of
the images are ‘strongly labelled’, that is segmented images in which the region
occupied by the object or objects is known so that the patch labels τ nj become
observed variables. The graphical model for a set of strongly labelled images is
shown in Fig. 8.
Strong labelling requires hand segmentation of images, and so is a time consum-
ing and expensive process as compared with collection of the images themselves.
For a given level of effort it will always be possible to collect many unlabelled or
weakly labelled images for the same cost as a single strongly labelled image. Since
the variability of natural images and objects is so vast we will always be operating
in a regime in which the size of our data sets is statistically small (though they will
often be computationally large).
[Figure 8: graphical model for strongly labelled images.]
For this reason there is great interest in augmenting expensive strongly labelled
images with lots of cheap weakly labelled or unlabelled images in order to better
characterize the different forms of variability. Although the two stage hierarchical
model shown in Fig. 6 appears to be more complicated than in the simple example
shown in Fig. 1, it does in fact fall within the same framework. In particular, for
labelled images the observed data is {xn , cn , τ nj }, while for ‘unlabelled’ images only
{xn } are observed. The experiments described here could readily be extended to
consider arbitrary combinations of strongly labelled, weakly labelled and unlabelled
images if desired.
If we let θ = {ψk , πk , λk } denote the full set of parameters in the model, then
we can consider a model of the form (22) in which the prior is given by (23) with
σ(α) defined by (24), and the terms p(θ) and $p(\tilde{\theta})$ taken to be constant.
We use conjugate gradients to optimize the parameters. Due to lack of space we
do not write down all the derivatives of the log likelihood function required by the
conjugate gradient algorithm. However, the correctness of the mathematical deriva-
tion of these gradients, as well as their numerical implementation, can easily be
verified by comparison against numerical differentiation (Bishop, 1995). Conjugate
gradients is the most widely used technique for blending generative and
discriminative models, thanks to its flexibility. Indeed, because the discriminative
component p(cn |xn , θ) contains a normalizing factor, an algorithm such
as EM would require much more work, as nothing is directly tractable any more.
However, a comparison of the two methods is currently being investigated.
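The gradient check mentioned above can be sketched as follows, using central differences. The logistic log-likelihood and its analytic gradient are a hypothetical stand-in for the model's actual derivatives, which are not given in the text.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation to the gradient of f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

# Hypothetical example: log-likelihood of a logistic model and its analytic gradient.
X = np.array([[1.0, -0.5], [0.3, 2.0], [-1.2, 0.7], [0.8, 0.1]])
c = np.array([1.0, 0.0, 0.0, 1.0])

def log_lik(theta):
    a = X @ theta
    return np.sum(c * a - np.logaddexp(0.0, a))     # sum_n [c_n a_n - log(1 + exp(a_n))]

def analytic_grad(theta):
    return X.T @ (c - 1.0 / (1.0 + np.exp(-(X @ theta))))

theta0 = np.array([0.2, -0.4])
max_err = np.max(np.abs(numerical_gradient(log_lik, theta0) - analytic_grad(theta0)))
```

A large discrepancy in any component points to an error either in the derivation or in its implementation, and the check is cheap enough to run at a few random parameter values before optimization.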
4.4. Results
We use 50 training images per class (giving 400 training images in total) of which
five images per class (a total of 40) were fully labelled, i.e., both the image and the
individual patches have class labels. All the other images are left totally unlabelled,
i.e. not even the category they belong to is given. Note that this kind of training
data is (1) very cheap to get and (2) very unusual for a discriminative model. The
test set consists of 100 images per class (giving a total of 800 images); the task is to
label each image.
Experiments are run five times with differing random initializations and the
results used to compute a mean and variance over the test set classification, which
are shown by ‘error bars’ in Fig. 9.
Note that, since there are eight balanced classes, random guessing would give
12.5% correct on average. Again we see that the best performance is obtained with
a blend between generative and discriminative extremes.
5. CONCLUSIONS
In this paper we have shown that ‘discriminative training’ for generative models
can be re-cast in terms of standard training methods applied to a modified model.
This new viewpoint opens the door to a wide range of new models which interpolate
smoothly between generative and discriminative approaches and which can benefit
from the advantages of both. The main drawback of this framework is that the
number of parameters in the model is doubled leading to greater computational
cost.
Although we have focussed on classification problems, the framework is equally
applicable to regression problems in which c corresponds to a set of continuous
variables.
REFERENCES
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: University Press
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Berlin: Springer-Verlag
Bouchard, G. and Triggs, B. (2004). The trade-off between generative and discriminative
classifiers. IASC 16th International Symposium on Computational Statistics, Prague,
Czech Republic, 721–728.
Holub, A. and Perona, P. (2005). A discriminative framework for modelling object classes.
IEEE Conference on Computer Vision and Pattern Recognition, San Diego (California),
USA. IEEE Computer Society.
Jebara, T. (2004). Machine Learning: Discriminative and Generative. Dordrecht: Kluwer
Kapadia, S. (1998). Discriminative Training of Hidden Markov Models. PhD Thesis, Uni-
versity of Cambridge, UK.
Kasson, J. M. and Plouffe, W. (1992). An analysis of selected computer interchange color
spaces. ACM Transactions on Graphics 11, 373–405.
Minka, T. (2005). Discriminative models, not discriminative training. Tech. Rep., Microsoft
Research, Cambridge, UK.
Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative: A comparison of
logistic regression and naive Bayes. Advances in Neural Information Processing Sys-
tems 14, (T. G. Dietterich, S. Becker, and Z. Ghahramani, eds.) Cambridge, MA: The
MIT Press, 841–848.
Raina, R., Shen, Y., Ng, A. Y. and McCallum, A. (2003). Classification with hybrid gen-
erative/discriminative models. Advances in Neural Information Processing Systems 16
Cambridge, MA: The MIT Press, 545–552.
Ulusoy, I. and Bishop, C. M. (2005). Generative versus discriminative models for object
recognition. Proceedings IEEE International Conference on Computer Vision and Pat-
tern Recognition, CVPR., San Diego.
Varma, M. and Zisserman, A. (2005). A statistical approach to texture classification from
single images. IJCV 62, 61–81.
Winn, J., Criminisi, A. and Minka, T. (2005). Object categorization by learned universal
visual dictionary. IEEE International Conference on Computer Vision, Beijing, China.
IEEE Computer Society.
Yakhnenko, O., Silvescu, A. and Honavar, V. (2005). Discriminatively trained Markov model
for sequence classification. 5th IEEE International Conference on Data Mining, Houston
(Texas), USA. IEEE Computer Society.
DISCUSSION
HERBERT K. H. LEE (University of California, Santa Cruz, USA)
Let me start by congratulating the authors for this paper. In terms of ‘Getting the
best of both worlds’, this paper can also be seen as crossing between machine learning
and statistics, combining useful elements from both. I can only hope that our two
communities continue to interact, deepening our connections. The perspectives are
often complementary, which is an underlying theme of this discussion.
One of the obstacles to working across disciplines is that a very different vocab-
ulary may be used to describe the same concepts. Statisticians reading this paper
may find it helpful to think of supervised learning as classification, and unsuper-
vised learning as clustering. Discriminative learning is thus classification based on a
probability model (which it typically is in statistics) while the generative approach
is clustering based on a model (such as a mixture model approach).
While machine learning and statistics are really quite similar, there is a key
difference in perspective. In machine learning, the main goal is typically to achieve
good predictions, and while a probability model may be used, it is not explicitly
required. In contrast, most statistical analyses see the probability model as the
core of the analysis, with the idea that optimal predictions will arise from accurate
selection and fitting of the model. In particular, Bayesian analyses rely crucially
on a fully-specified probability model. Thus one of the core points of this paper,
that of a unique likelihood function, should seem natural to a Bayesian. Yet it is an
important insight in the context of the machine learning literature. Bringing these
machine learning algorithms into a coherent probabilistic framework is a big step
forward, and one that is not always fully valued. This is an important contribution
by these authors and their collaborators.
Uncertainty about the likelihood function can be dealt with by embedding the
functions under consideration into a larger family with one or more additional pa-
rameters, and this is exactly what has been done here. This follows a strong tradition
in statistics, such as Box-Cox transformations and model averaging, but represents
a relatively untapped potential in machine learning. In contrast, it is more common
in machine learning to use implicit expansions of the model class (or of the fitting
algorithm, when the model may not be explicitly stated). Examples include bagging
(Breiman, 1996), where individual predictions from over-fit models are averaged over
bootstrap samples to reduce over-fitting, and boosting (Freund and Schapire, 1997),
where overly-simple models are combined to create an improved ensemble predic-
tion. Such implicit expansion can work well in practice, but it can be difficult to
understand or describe the expanded class of models, and hence difficult to leverage
related knowledge from the literature.
On a related note, the authors argue that a key benefit of using discriminative
training for generative models is that it improves performance when the model is
mis-specified, as vividly demonstrated by the example in Section 3. In practice, this
is quite useful, as our parametric models are typically only approximations to reality,
and the approximations can be quite poor. But this does leave open the possibility
of explicit model expansion. A larger parametric family may encompass a model
which is close enough to reality. Or taking things even further, one could move to a
fully nonparametric approach. Then it becomes less clear what the trade-offs are.
Many highly innovative and creative ideas have arisen in machine learning, and
the field of statistics has gained by importing some of these ideas. Statistics, in
turn, can offer a substantial literature that can be applied once a machine learning
algorithm can be mapped to a probability model. From the model, one can draw
from the literature to better understand when the algorithm will work best, when
it might perform poorly, what diagnostics may be applicable, and possibly how to
further improve the algorithm. The key is connecting the algorithm to a probability
model, either finding the model which implicitly underlies the algorithm, or showing
that the algorithm approximates a particular probability model. These sorts of
connections benefit both fields.
Thus thinking more about Bayesian probability models, some possible further
directions for this current work come to mind. It would seem natural to put a prior
on α and to treat it as another parameter. At least from the experiments shown
so far, it appears that there may be some information about likely best ranges of
α, allowing the use of an informative prior, possibly in comparison to a flat prior.
In addition to the possibility of marginalizing over α, one could also estimate α to
obtain a ‘best’ fit.
Another natural possible direction would be to look at the full posterior, rather
than just getting a point estimate, such as the maximum a posteriori class estimate.
Knowing about the uncertainty of the classification can often be useful. It may also
be useful to move to a more hierarchical model, particularly for the image classes.
It would seem that images of horses and cows would be more similar to each other,
and images of bicycles and motorbikes would be similar to each other, but that these
two sets would be rather different from each other, and further different from faces
or leaves. Working within a hierarchical structure should be straightforward in a
fully Bayesian paradigm.
In terms of connections between machine learning and statistics, it seems unfor-
tunate that the machine learning literature takes little notice of related work in the
statistics literature. In particular, there has been extensive work on model-based
clustering, for example, Fraley and Raftery (2002) and even a poster presented at
this conference (Frühwirth-Schnatter and Pamminger, 2006). It would be great if
the world of machine learning were more cognizant of the statistical literature.
In summary, this paper presents a practical solution to a problem that is defi-
nitely in need of attention, and which has received relatively little attention in the
statistical literature. The likelihood-based approach is particularly promising. The
authors make a very positive contribution in helping to bridge the gap between
machine learning and statistics, and I hope to see more such work in the future.
with interpretations that are part of a generative process, often of a dynamic form.
Inevitably, our models are oversimplifications and we learn both from the activity
of parameter estimation and model choice and from careful examination and inter-
pretation of the posterior distribution of the parameters as well as from the form of
various predictive distributions, e.g., see Blei et al. (2003a,2003b), Erosheva (2003),
and Erosheva et al. (2004). When I teach statistical learning to graduate students in
the Machine Learning Department, I emphasize that the world of machine learning
would be enriched by taking on at least part of this broader perspective.
My second observation is closely related. While I very much appreciated the
BL’s goal and the boldness of their effort to formulate a different likelihood func-
tion to achieve an integration of the two perspectives, I think the effort would
be enhanced by consideration of some of the deeper implications of the subjective
Bayesian paradigm. As Bishop noted in his oral response to the discussion, machine
learning has been moving full force to adopt formal statistical ideas, and so it is
now common to see the incorporation of MAP and model averaging methodologies
directly into the classification setting. But as some of the other presentations at
this conference make clear, model assessment is a much more complex and subtle
activity which we can situate within the subjective Bayesian framework, cf., Draper
(1999). In particular, I commend a careful reading of the Lauritzen (2006) discus-
sion of the paper by Chakrabarti and Ghosh (2006), in which he emphasized that
we can only come to grips with the model choice problem by considering model
comparisons and specifications with respect to our own subjective Bayesian prior
distribution (thus taking advantage of the attendant coherence in the sense of de
Finetti (1937)) until such time as we need to step outside the formal framework and
reformulate our model class and likelihoods. Thus BL’s new replicate distribution
for θ′ possibly should be replaced by a real subjective prior distribution, perhaps
of a similar form, and then they could explore the formulation of the generative
model without introducing a likelihood that departs from the one that attempts to
describe the underlying phenomenon of interest. This, I suspect, would lead to an
even clearer formulation that is more fully rooted in both the machine learning and
statistical learning worlds.
field alone is vast, and the issue of vocabulary mentioned above is a further obstacle
to cross-fertilization.
It could even be argued that the relative independence of the two fields has
brought some benefits. For example, the use of large, highly parameterized black-
box models trained on large data sets, of the kind which characterized much of the
applied work in neural networks in the 1990s, did not fit well with the culture of the
statistics community at the time, and met with scepticism from some quarters. Yet
these efforts have led to numerous large-scale applications of substantial commercial
importance.
Nevertheless, it seems clear that greater cross-fertilization between the two communities
would be desirable. Conferences such as AI Statistics (Bishop 2003) explicitly
seek to achieve this, and several textbooks also span the divide (Hastie 2001
and Bishop 2006).
Increasingly, the focus in the machine learning community is not just on well-
defined probabilistic models, but on fully Bayesian treatments in which distributions
over unknown variables are maintained and updated. However, almost any model
which has sufficient complexity to be of practical interest will not have a closed-
form analytical solution, and hence approximation techniques are essential. For
many years the only general purpose approximation framework available was that
of Monte Carlo sampling, which is computationally demanding and which does not
scale to large problems.
A crucial advance, therefore, has been the development of a wide range of pow-
erful deterministic inference techniques. These include variational inference, expec-
tation propagation, loopy belief propagation, and others. Like Markov chain Monte
Carlo, these techniques have their origins in physics. However, they have primarily
been developed and applied within the machine learning community. They comple-
ment naturally the recent advances in probabilistic graphical models, such as the de-
velopment of factor graphs. Also, they are computationally much more efficient than
Monte Carlo methods, and have permitted, for instance, a fully Bayesian treatment
of the player ranking problem in on-line computer games through the TrueSkill™
system, in which millions of players are compared and ranked, with ranking updates
and team matching being done in real time (Herbrich 1996). Here the Bayesian
treatment, which maintains a distribution over player skill levels, leads to substantially
improved ranking accuracy compared to the traditional Elo method, which
used a point estimate and which can be viewed as a maximum likelihood technique.
This type of application would be inconceivable without the use of deterministic
approximation schemes. In our view, these methods represent the single most im-
portant advance in practical Bayesian statistics in the last 10 years.
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l’Institut Henri Poincaré 7, 1–68.
Draper, D. (1999). Model uncertainty yes, discrete model averaging maybe. (discussion of
‘Bayesian model averaging: a tutorial,’ by Hoeting et al.). Statist. Science 14, 405–409.
Erosheva, E. A. (2003). Bayesian estimation of the grade of membership model. Bayesian
Statistics 7 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman,
A. F. M. Smith and M. West, eds.) Oxford: University Press, 501–510.
Erosheva, E. A., Fienberg, S. E., and Lafferty, J. (2004). Mixed-membership models of
scientific publications. Proc. National Acad. Sci. 97, 11885–11892.
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and
density estimation. J. Amer. Statist. Assoc. 97, 611–631.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. J. Computer and System Sciences 55, 119–139.
Frühwirth-Schnatter, S. and Pamminger, C. (2006). Model-based clustering of discrete-valued
time series data. Tech. Rep., Johannes Kepler Universität Linz, Austria.
Lauritzen, S. (2006). Discussion of Chakrabarti and Ghosh. In this volume.