Culbertson and Sturtz - 2013 - Bayesian machine learning via category theory
Abstract
From the Bayesian perspective, the category of conditional probabilities (a variant of the Kleisli category of the Giry monad, whose objects are measurable spaces and arrows are Markov kernels) gives a nice framework for conceptualization and analysis of many aspects of machine learning. Using categorical methods, we construct models for parametric and nonparametric Bayesian reasoning on function spaces, thus providing a basis for the supervised learning problem. In particular, stochastic processes are arrows to these function spaces which serve as prior probabilities. The resulting inference maps can often be analytically constructed in this symmetric monoidal weakly closed category. We also show how to view general stochastic processes using functor categories and demonstrate the Kalman filter as an archetype for the hidden Markov model.
Contents
1 Introduction 2
6 Function Spaces 31
6.1 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6.2 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.3 GPs via Joint Normal Distributions . . . . . . . . . . . . . . . . . . . . . . . . 41
10 Final Remarks 69
13 References 72
1 Introduction
Speculation on the utility of categorical methods in machine learning (ML) has been expounded by numerous people, including the denizens of the n-Category Café blog [5] as early as 2007. Our approach to realizing categorical ML is based upon viewing ML from a probabilistic perspective and using categorical Bayesian probability. Several recent texts (e.g., [2, 19]), along with countless research papers on ML, have emphasized the subject from the perspective of Bayesian reasoning. Combining this viewpoint with the recent work [6], which provides a categorical framework for Bayesian probability, we develop a category-theoretic perspective on ML. The abstraction provided by category theory not only serves as a basis for organizing one's thoughts on the subject, but also provides an efficient graphical method for model building, in much the same way that probabilistic graphical modeling (PGM) has for Bayesian network problems.
In this paper, we focus entirely on the supervised learning problem, i.e., the regression
or function estimation problem. The general framework applies to any Bayesian machine
learning problem, however. For instance, the unsupervised clustering or density estimation
problems can be characterized in a similar way by changing the hypothesis space and
sampling distribution. For simplicity, we choose to focus on regression and leave the
other problems to the industrious reader. For us, then, the Bayesian learning problem
is to determine a function f ∶ X → Y which takes an input x ∈ X, such as a feature vector, and associates an output (or class) f(x) with x. Given a measurement (x, y), or a set of measurements {(x_i, y_i)}_{i=1}^N where each y_i is a labeled output (i.e., training data), we interpret this problem as an estimation problem for an unknown function f which lies in Y^X, the space of all measurable functions^1 from X to Y, such that f(x_i) ≈ y_i. When Y is a vector space, the space Y^X is also a vector space, and it is infinite dimensional when X is infinite. If we choose to allow all such functions (every function f ∈ Y^X is a valid model) then the problem is nonparametric. On the other hand, if we only allow functions from some subspace V ⊂ Y^X of finite dimension p, then we have a parametric model characterized by a measurable map i ∶ R^p → Y^X. The image of i is then the space of functions which we consider as valid models of the unknown function for the Bayesian estimation problem. Hence, the elements a ∈ R^p completely determine the valid modeling functions i(a) ∈ Y^X. Bayesian modeling splits the problem into two aspects: (1) specification of the hypothesis space, which consists of the “valid” functions f, and (2) a noisy measurement model such as y_i = f(x_i) + ε_i, where the noise component ε_i is often modeled by a Gaussian distribution. Bayesian reasoning with the hypothesis space taken as Y^X or any subspace V ⊂ Y^X (finite or infinite dimensional) and the noisy measurement model determining a sampling distribution can then be used to efficiently estimate (learn) the function f without overfitting the data.
We cast this whole process into a graphical formulation using category theory, which, like PGM, can in turn be used as a modeling tool itself. In fact, we view the components of these various models, which are just Markov kernels, as interchangeable parts. An important part of solving the ML problem with a Bayesian model consists of choosing the appropriate parts for a given setting. The close relationship between parametric and nonparametric models comes to the forefront in the analysis, with the measurable map i ∶ R^p → Y^X connecting the two different types of models. To illustrate this point, suppose we are given a normal distribution P on R^p as a prior probability on the unknown parameters. Then the push forward measure^2 of P by i is a Gaussian process, which is a basic tool in nonparametric modeling. When composed with a noisy measurement model, this provides the whole Bayesian model required for a complete analysis, and an inference map
1 Recall that a σ-algebra Σ_X on X is a collection of subsets of X that is closed under complements and countable unions (and hence countable intersections); the pair (X, Σ_X) is called a measurable space and any set A ∈ Σ_X is called a measurable set of X. A measurable function f ∶ X → Y is defined by the property that for any measurable set B in the σ-algebra of Y, the preimage f^{-1}(B) is in the σ-algebra of X. For example, all continuous functions are measurable with respect to the Borel σ-algebras.
2 A measure µ on a measurable space (X, Σ_X) is a nonnegative real-valued function µ ∶ Σ_X → R_{≥0} such that µ(∅) = 0 and µ(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ µ(A_i) for any sequence of pairwise disjoint measurable sets A_i. A probability measure is a measure with µ(X) = 1. In this paper, all measures are probability measures and the terminology “distribution” will be synonymous with “probability measure.”
can be analytically constructed.^3 Consequently, given any measurement (x, y), evaluating the inference map at (x, y) yields the updated prior probability, which is another normal distribution on R^p.
The ability to do Bayesian probability involving function spaces relies on the fact that
the category of measurable spaces, Meas, has the structure of a symmetric monoidal
closed category (SMCC). Through the evaluation map, this in turn provides the category
of conditional probabilities P with the structure of a symmetric monoidal weakly closed
category (SMwCC), which is necessary for modeling stochastic processes as probability
measures on function spaces. On the other hand, the ordinary product X × Y with
its product σ-algebra is used for the Bayesian aspect of updating joint (and marginal)
distributions. From a modeling viewpoint, the SMwCC structure is used for carrying
along a parameter space (along with its relationship to the output space through the
evaluation map). Thus we can describe training data and measurements as ordered pairs (x_i, y_i) ∈ X ⊗ Y, where X plays the role of a parameter space.
A few notes on the exposition. In this paper our intended audience consists of (1) the practicing ML engineer with only a passing knowledge of category theory (e.g., knowing about objects, arrows, and commutative diagrams), and (2) those knowledgeable about category theory with an interest in how ML can be formulated within this context. For
the ML engineer familiar with Markov kernels, we believe that the presentation of P and
its applications can serve as an easier introduction to categorical ideas and methods than
many standard approaches. While some terminology will be unfamiliar, the examples
should provide an adequate understanding to relate the knowledge of ML to the cate-
gorical perspective. If ML researchers find this categorical perspective useful for further
developments or simply for modeling purposes, then this paper will have achieved its goal.
In the categorical framework for Bayesian probability, Bayes’ equation is replaced
by an integral equation where the integrals are defined over probability measures. The
analysis requires these integrals be evaluated on arbitrary measurable sets and this is
often possible using the three basic rules provided in Appendix A. Detailed knowledge of
measure theory is not necessary outside of understanding these three rules and the basics
of σ-algebras and measures, which are used extensively for evaluating integrals in this
paper. Some proofs require more advanced measure-theoretic ideas, but the proofs can
safely be avoided by the unfamiliar reader and are provided for the convenience of those
who might be interested in such details.
For the category theorist, we hope the paper makes the fundamental ideas of ML
transparent, and conveys our belief that Bayesian probability can be characterized cate-
gorically and usefully applied to fields such as ML. We believe the further development
of categorical probability can be motivated by such applications and in the final remarks
we comment on one such direction that we are pursuing.
These notes are intended to be tutorial in nature, and so contain much more detail than would be reasonable for a standard research paper. As in this introductory section,
3 The inference map need not be unique.
basic definitions will be given as footnotes, while more important definitions, lemmas, and theorems will appear in the main text. Although an effort has been made to make the exposition as self-contained as possible, complete self-containment is clearly an unachievable goal. In the presentation,
we avoid the use of the terminology of random variables, for two reasons. (1) Formally, a random variable is a measurable function f ∶ X → Y, and a probability measure P on X gives rise to the distribution of the random variable, f_⋆(P), which is the push forward measure of P. In practice the random variable f itself is more often than not impossible to characterize functionally (consider the process of flipping a coin), while reference to the random variable using a binomial distribution, or any other distribution, is simply making reference to some probability measure. As a result, in practice the term “random variable” often does not refer to any measurable function f and pushforward measure at all, but rather just to a probability measure. (2) The term “random variable” has a connotation that, we believe, should be de-emphasized in a Bayesian approach to modeling uncertainty.
Thus, while a random variable can be modeled as a push forward probability measure within the framework presented, we feel no need to single them out as having any special relevance beyond the remark already given. In illustrating the application of categorical Bayesian probability we do, however, show how to translate the familiar language of random variables into the unfamiliar categorical framework for the particular case of Gaussian distributions, which are the most important application for ML since Gaussian processes are characterized on finite subsets by Gaussian distributions. This provides a particularly nice illustration of the non-uniqueness of conditional sampling distribution and inference pairs given a joint distribution.
(U ∘ T)(C ∣ x) = ∫_{y∈Y} U(C ∣ y) dT_x.
The integral of any real-valued measurable function f ∶ X → R with respect to any measure P on X is
E_P[f] = ∫_{x∈X} f(x) dP, (1)
called the P-expectation of f.
Consequently the composite (U ∘ T)(C ∣ x) is the T_x-expectation of U_C,
(U ∘ T)(C ∣ x) = E_{T_x}[U_C].
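For finite spaces, the composition rule and its reading as a T_x-expectation can be checked by direct summation. The dictionary encoding of Markov kernels below is our own illustration, not notation from the paper.

```python
# Finite Markov kernels T : X -> Y and U : Y -> Z, encoded (our choice)
# as nested dictionaries: T[x][y] = T({y} | x), a row-stochastic table.
T = {"x1": {"y1": 0.3, "y2": 0.7},
     "x2": {"y1": 0.9, "y2": 0.1}}
U = {"y1": {"z1": 0.5, "z2": 0.5},
     "y2": {"z1": 0.2, "z2": 0.8}}

def compose(U, T):
    """(U o T)({z} | x) = sum_y U({z} | y) T({y} | x):
    the T_x-expectation of y |-> U({z} | y)."""
    zs = next(iter(U.values()))  # output points of U
    return {x: {z: sum(T[x][y] * U[y][z] for y in T[x]) for z in zs}
            for x in T}

UT = compose(U, T)  # a Markov kernel X -> Z
```

Each row of `UT` is again a probability measure, as the category P requires.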
Let Meas denote the category of measurable spaces where the objects are measurable
spaces (X, ΣX ) and the arrows are measurable functions f ∶ X → Y . Every measurable
mapping f ∶ X → Y may be regarded as a P arrow δ_f ∶ X → Y defined by the Dirac (or one-point) measure
δ_f ∶ X × Σ_Y → [0, 1]
(B ∣ x) ↦ 1 if f(x) ∈ B, and 0 if f(x) ∉ B.
8 A perfect probability measure P on Y is a probability measure such that for any measurable function f ∶ Y → R there exists a real Borel set E ⊂ f(Y) satisfying P(f^{-1}(E)) = 1.
9 Specifically, the subsequent Theorem 1 is a constructive procedure which requires perfect probability measures. Corollary 2 then gives the inference map. Without the hypothesis of perfect measures a pathological counterexample can be constructed as in [9, Problem 10.26]. The paper by Faden [11] gives conditions on the existence of conditional probabilities and this constraint is explained in full detail in [6]. Note that the class of perfect measures is quite broad and includes all probability measures defined on Polish spaces.
2 THE CATEGORY OF CONDITIONAL PROBABILITIES 8
The relation between the Dirac measure and the characteristic (indicator) function 1 is
δ_f(B ∣ x) = 1_{f^{-1}(B)}(x).
In particular,
δ_{Id_X}(B ∣ x) = 1 if x ∈ B, and 0 if x ∉ B,
which is the identity morphism for X in P. Using standard notation we denote the identity mapping on any object X by 1_X = δ_{Id_X}, or for brevity simply by 1 if the space X is clear from the context. With these objects and arrows, law of composition, associativity, and identity, standard measure-theoretic arguments show that P forms a category.
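A minimal sketch (our own encoding, with hypothetical functions) of the Dirac kernel δ_f on discrete spaces: it is just the indicator of f^{-1}(B), and δ_{Id} acts as the identity.

```python
# delta_f(B | x) = 1_{f^{-1}(B)}(x): the deterministic kernel of a
# function f, with measurable sets B represented as Python sets.
def delta(f):
    return lambda B, x: 1.0 if f(x) in B else 0.0

square = delta(lambda n: n * n)   # hypothetical measurable map n |-> n^2
identity = delta(lambda n: n)     # delta_{Id}, the identity arrow of P
```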
There is a distinguished object in P that plays an important role in Bayesian probability. For any set Y with the indiscrete σ-algebra Σ_Y = {Y, ∅}, there is a unique arrow from any object X to Y, since any arrow P ∶ X → Y is completely determined by the fact that P_x must be a probability measure on Y. Hence Y is a terminal object, and we denote the unique arrow by !_X ∶ X → Y. Up to isomorphism, the canonical terminal object is the one-element set, which we denote by 1 = {⋆}, with the only possible σ-algebra. It follows that any arrow P ∶ 1 → X from the terminal object to any space X is an (absolute) probability measure on X; it is an “absolute” probability measure because there is no variability (conditioning) possible within the singleton set 1 = {⋆}.
Before proceeding to formulate precisely what the term “product space” means in P, we describe the categorical construct of a finite product space in any category.
Let C be an arbitrary category and X, Y ∈ob C. We say the product of X and Y exists if there is an object, which we denote by X × Y, along with two arrows p_X ∶ X × Y → X and p_Y ∶ X × Y → Y in C, such that given any other object T in C and arrows f ∶ T → X and g ∶ T → Y, there is a unique C arrow ⟨f, g⟩ ∶ T → X × Y making the diagram
X ← X × Y → Y (projections p_X, p_Y), with p_X ∘ ⟨f, g⟩ = f and p_Y ∘ ⟨f, g⟩ = g, (2)
commute. If the given diagram is a product then we often write the product as a triple
(X × Y, pX , pY ). We must not let the notation deceive us; the object X × Y could just
as well be represented by PX,Y . The important point is that it is an object in C that we
need to specify in order to show that binary products exist. Products are an example of a
universal construction in categories. The term “universal” implies that these constructions
are unique up to a unique isomorphism. Thus if (P_{X,Y}, p_X, p_Y) and (Q_{X,Y}, q_X, q_Y) are both products for the objects X and Y, then there exist unique arrows α ∶ P_{X,Y} → Q_{X,Y} and β ∶ Q_{X,Y} → P_{X,Y} in C such that β ∘ α = 1_{P_{X,Y}} and α ∘ β = 1_{Q_{X,Y}}, so that the objects P_{X,Y} and Q_{X,Y} are isomorphic.
If the product of every pair of objects X and Y exists in C then we say binary products exist in C. The existence of binary products implies the existence of arbitrary finite products in C. So if {X_i}_{i=1}^N is a finite set of objects in C then there is an object, which we denote by ∏_{i=1}^N X_i (in general, this need not be the cartesian product), as well as arrows {p_{X_j} ∶ ∏_{i=1}^N X_i → X_j}_{j=1}^N. Then if we are given an arbitrary T ∈ob C and a family of arrows f_j ∶ T → X_j in C, there exists a unique C arrow ⟨f_1, . . . , f_N⟩ such that for every integer j ∈ {1, 2, . . . , N} the diagram
p_{X_j} ∘ ⟨f_1, . . . , f_N⟩ = f_j
commutes. The arrows p_{X_i} defining a product space are often called the projection maps due to the analogy with the cartesian products in the category of sets, Set.
In Set, the product of two sets X and Y is the cartesian product X × Y consisting of all pairs (x, y) of elements with x ∈ X and y ∈ Y, along with the two projection mappings π_X ∶ X × Y → X sending (x, y) ↦ x and π_Y ∶ X × Y → Y sending (x, y) ↦ y. Given any pair of functions f ∶ T → X and g ∶ T → Y, the function ⟨f, g⟩ ∶ T → X × Y sending t ↦ (f(t), g(t)) clearly makes Diagram 2 commute. But it is also the unique such function, because if γ ∶ T → X × Y were any other function making the diagram commute then the equations
(p_X ∘ γ)(t) = f(t) and (p_Y ∘ γ)(t) = g(t) (3)
would also be satisfied. But since the function γ has codomain X × Y, which consists of ordered pairs (x, y), it follows that for each t ∈ T we have γ(t) = ⟨γ_1(t), γ_2(t)⟩ for some functions γ_1 ∶ T → X and γ_2 ∶ T → Y. Substituting γ = ⟨γ_1, γ_2⟩ into equations 3, it follows that
f(t) = (p_X ∘ ⟨γ_1, γ_2⟩)(t) = p_X(γ_1(t), γ_2(t)) = γ_1(t)
g(t) = (p_Y ∘ ⟨γ_1, γ_2⟩)(t) = p_Y(γ_1(t), γ_2(t)) = γ_2(t),
from which it follows that γ = ⟨γ_1, γ_2⟩ = ⟨f, g⟩, thereby proving that there exists at most one such function T → X × Y making the requisite Diagram 2 commute. If the requirement of the uniqueness of the arrow ⟨f, g⟩ in the definition of a product is dropped, then we have the definition of a weak product of X and Y.
Given the relationship between the categories P and Meas it is worthwhile to examine
products in Meas. Given X, Y ∈ob Meas the product X × Y is the cartesian product
X × Y of sets endowed with the smallest σ-algebra such that the two set projection maps
πX ∶ X × Y → X sending (x, y) ↦ x and πY ∶ X × Y → Y sending (x, y) ↦ y are measurable.
In other words, we take the smallest subset of the powerset of X × Y such that for all A ∈ Σ_X and for all B ∈ Σ_Y the preimages π_X^{-1}(A) = A × Y and π_Y^{-1}(B) = X × B are measurable. Since a σ-algebra requires that the intersection of any two measurable sets is also measurable, it follows that π_X^{-1}(A) ∩ π_Y^{-1}(B) = A × B must also be measurable. Measurable sets of the form A × B are called rectangles and generate the collection of all measurable sets defining the σ-algebra Σ_{X×Y}, in the sense that Σ_{X×Y} is equal to the intersection of all σ-algebras containing the rectangles. When the σ-algebra on a set is determined by a family of maps {p_k ∶ X × Y → Z_k}_{k∈K}, where K is some indexing set, such that all of these maps p_k are measurable, we say the σ-algebra is induced (or generated) by the family of maps {p_k}_{k∈K}.^10 The cartesian product X × Y with the σ-algebra induced
by the two projection maps πX and πY is easily verified to be a product of X and Y
since given any two measurable maps f ∶ Z → X and g∶ Z → Y the map ⟨f, g⟩∶ Z → X × Y
sending z ↦ (f (z), g(z)) is the unique measurable map satisfying the defining property
of a product for (X × Y, πX , πY ). This σ-algebra induced by the projection maps πX and
πY is called the product σ-algebra and the use of the notation X × Y in Meas will imply
the product σ-algebra on the set X × Y .
Having the product (X × Y, πX , πY ) in Meas and the fact that every measurable
function f ∈ar Meas determines an arrow δf ∈ar P, it is tempting to consider the triple
10 The terminology initial is also used in lieu of induced.
(X × Y, δ_{π_X}, δ_{π_Y}), asking whether, given probability measures P ∶ 1 → X and Q ∶ 1 → Y, there is a joint measure J ∶ 1 → X × Y satisfying
δ_{π_X} ∘ J = P and δ_{π_Y} ∘ J = Q, (4)
i.e., making the corresponding diagram commute. In particular, the tensor product measure, defined on rectangles by (P ⊗ Q)(A × B) = P(A)Q(B), extends to a joint probability measure on X × Y by
(P ⊗ Q)(ς) = ∫_{y∈Y} P(Γ_y^{-1}(ς)) dQ  ∀ς ∈ Σ_{X×Y}, (5)
or equivalently,
(P ⊗ Q)(ς) = ∫_{x∈X} Q(Γ_x^{-1}(ς)) dP  ∀ς ∈ Σ_{X×Y}. (6)
Here x̄ ∶ Y → X denotes the constant function at x, and Γ_x ∶ Y → X × Y, y ↦ (x, y), is the associated graph function, with ȳ and Γ_y defined similarly. The fact that Q ⊗ P = P ⊗ Q is Fubini’s Theorem;
by taking a rectangle ς = A × B ∈ Σ_{X×Y} the equality of these two measures is immediate, since Γ_y^{-1}(A × B) = A if y ∈ B and ∅ otherwise, giving
(P ⊗ Q)(A × B) = ∫_{y∈Y} P(Γ_y^{-1}(A × B)) dQ
= ∫_{y∈B} P(A) dQ (7)
= P(A) · Q(B)
= ∫_{x∈A} Q(B) dP
= ∫_{x∈X} Q(Γ_x^{-1}(A × B)) dP
= (Q ⊗ P)(A × B).
Since the rectangles generate the σ-algebra Σ_{X×Y} and are closed under intersection, two measures agreeing on the rectangles agree on all of Σ_{X×Y}, and Fubini’s Theorem follows.
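In the finite case the tensor product measure and the symmetry P ⊗ Q = Q ⊗ P can be verified by direct summation, even on non-rectangular sets. The encoding below is an illustrative sketch of ours, not the paper's.

```python
# Probability measures on finite spaces X and Y, as dictionaries of
# point masses (our own encoding).
P = {"x1": 0.25, "x2": 0.75}   # measure on X
Q = {"y1": 0.6, "y2": 0.4}     # measure on Y

def tensor(P, Q, varsigma):
    """(P ⊗ Q)(ς): sum the product masses over the pairs in ς."""
    return sum(P[x] * Q[y] for (x, y) in varsigma)

# A non-rectangular measurable set ς ⊆ X × Y.
varsigma = {("x1", "y1"), ("x2", "y2")}
lhs = tensor(P, Q, varsigma)
# Summing in the other order (the finite analogue of Q ⊗ P) agrees.
rhs = sum(Q[y] * P[x] for (x, y) in varsigma)
```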
It is clear that in P the uniqueness condition required in the definition of a product
of X and Y will always fail unless at least one of X and Y is a terminal object 1, and
consequently only weak products exist in P. However, it is the nonuniqueness of products in P that makes this category interesting. Instead of referring to weak products in P, we shall abuse terminology and simply refer to them as products, with the understanding that all products in P are weak.
Given a probability measure P_X ∶ 1 → X and a conditional probability h ∶ X → Y, we obtain the diagram
[Diagram (8): P_X ∶ 1 → X and J_h ∶ 1 → X × Y over the projections δ_{π_X}, δ_{π_Y}]
where J_h is the uniquely determined joint distribution on the product space X × Y, defined on the rectangles of the σ-algebra Σ_X × Σ_Y by
J_h(A × B) = ∫_A h_B dP_X. (9)
The marginal of Jh with respect to Y then satisfies δπY ○Jh = h○PX and the marginal of Jh
with respect to X is PX . By a symmetric argument, if we are given a probability measure
PY and conditional probability k∶ Y → X then we obtain a unique joint distribution Jk on
the product space X × Y given on the rectangles by
J_k(A × B) = ∫_B k_A dP_Y.
If both constructions are combined into the single diagram
[Diagram (10): J_h, J_k ∶ 1 → X × Y over the projections δ_{π_X}, δ_{π_Y}, with h ∶ X → Y and k ∶ Y → X]
then we have that J_h = J_k if and only if the compatibility condition
∫_A h_B dP_X = J(A × B) = ∫_B k_A dP_Y  ∀A ∈ Σ_X, ∀B ∈ Σ_Y (11)
is satisfied on the rectangles.
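The compatibility condition (11) admits a quick finite sanity check: from a joint J one recovers the sampling distribution h and the inference map k by normalizing over rows and columns, and the two iterated sums agree on every rectangle. The encoding and numbers below are our own illustration.

```python
# A joint distribution J on a 2x2 product space (hypothetical numbers).
J = {("x1", "y1"): 0.1, ("x1", "y2"): 0.3,
     ("x2", "y1"): 0.2, ("x2", "y2"): 0.4}
X = {"x1", "x2"}
Y = {"y1", "y2"}

# Marginals P_X and P_Y of J.
PX = {x: sum(J[(x, y)] for y in Y) for x in X}
PY = {y: sum(J[(x, y)] for x in X) for y in Y}

# Conditionals: h(B | x) = J({x} x B) / P_X({x})  (sampling distribution)
#               k(A | y) = J(A x {y}) / P_Y({y})  (inference map)
h = {x: {y: J[(x, y)] / PX[x] for y in Y} for x in X}
k = {y: {x: J[(x, y)] / PY[y] for x in X} for y in Y}

# Compatibility on the rectangle {x1} x {y2}:
lhs = PX["x1"] * h["x1"]["y2"]   # finite analogue of  ∫_{{x1}} h_{{y2}} dP_X
rhs = PY["y2"] * k["y2"]["x1"]   # finite analogue of  ∫_{{y2}} k_{{x1}} dP_Y
```

Both sides equal J({x1} × {y2}), as (11) requires.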
where ! represents the unique arrow from X to 1. If we are also given a probability measure P ∶ 1 → X, then we can calculate the joint distribution determined by P and h = Q ∘ ! as
J(A × B) = ∫_A (Q ∘ !)_B dP = P(A) · Q(B),
so that J = P ⊗ Q. In this situation we say that the marginals P and Q are independent.
Thus in P independence corresponds to a special instance of a conditional—one that
factors through the terminal object.
[Diagram (12): the joint distribution J ∶ 1 → X × Y, its marginal P_Y = δ_{π_Y} ∘ J ∶ 1 → Y, and an arrow f ∶ Y → X × Y with J = f ∘ P_Y.]
Proof. Since ΣX and ΣY are both countably generated, it follows that ΣX×Y is countably
generated as well. Let G be a countable generating set for ΣX×Y . For each A ∈ G, define
a measure µA on Y by
µ_A(B) = J(A ∩ π_Y^{-1}(B)).
Then µ_A is absolutely continuous with respect to P_Y, and hence we can let f̃_A = dµ_A/dP_Y, the Radon–Nikodym derivative. For each A ∈ G this Radon–Nikodym derivative is unique up to a set of measure zero, say Â. Let N = ∪_{A∈A} Â and E_1 = N^c. Then f̃_A∣_{E_1} is unique for all A ∈ A. Note that f̃_{X×Y} = 1 and f̃_∅ = 0 on E_1. The condition f̃_A ≤ 1 on E_1 for all A ∈ A then follows.
For all B ∈ Σ_Y and any countable union ∪_{i=1}^n A_i of disjoint sets of A we have
Thus the set function f̃ ∶ E × A → [0, 1] satisfies the condition that f̃(y, ⋅) is a probability measure on the algebra A. By the Caratheodory extension theorem there exists a unique extension of f̃(y, ⋅) to a probability measure f̂(y, ⋅) ∶ Σ_{X×Y} → [0, 1]. Now define a set function f ∶ Y × Σ_{X×Y} → [0, 1] by
f(y, A) = f̂(y, A) if y ∈ E, and f(y, A) = J(A) if y ∉ E.
Since each A ∈ Σ_{X×Y} can be written as the pointwise limit of an increasing sequence {A_n}_{n=1}^∞ of sets A_n ∈ A, it follows that f_A = lim_{n→∞} f_{A_n} is measurable. From this we also obtain the desired commutativity of the diagram:
(f ∘ P_Y)(A) = ∫_Y f_A dP_Y = ∫_E f_A dP_Y = lim_{n→∞} ∫_E f̃_{A_n} dP_Y
= lim_{n→∞} ∫_Y f̃_{A_n} dP_Y
= lim_{n→∞} J(A_n)
= J(A).
3 THE BAYESIAN PARADIGM USING P 15
We can use the result from Theorem 1 to obtain a broader understanding of the
situation.
Corollary 2. Let X and Y be countably generated measurable spaces and J a joint distri-
bution on X × Y with marginal distributions PX and PY on X and Y , respectively. Then
there exist P arrows f and g such that the diagram
[Diagram: g ∘ P_X = J = f ∘ P_Y on X × Y, over the projections δ_{π_X}, δ_{π_Y}, with composites δ_{π_Y} ∘ g ∶ X → Y and δ_{π_X} ∘ f ∶ Y → X]
commutes and
∫_U (δ_{π_Y} ∘ g)_V dP_X = J(U × V) = ∫_V (δ_{π_X} ∘ f)_U dP_Y.
Proof. From Theorem 1 there exists a P arrow f ∶ Y → X × Y satisfying J = f ∘ P_Y. Take the composite δ_{π_X} ∘ f and note that (δ_{π_X} ∘ f)_U(y) = f_y(U × Y), giving
[Diagram: the prior P_H ∶ 1 → H, the sampling distribution S ∶ H → D, and the inference map I ∶ D → H.]
[Diagram (14): the same arrows, together with a measurement µ ∶ 1 → D and the updated composite I ∘ µ ∶ 1 → H.]
where the solid lines indicate arrows given a priori, the dotted line indicates the arrow
determined using Theorem 1, and the dashed lines indicate the updating after a measure-
ment. Note that if there is no uncertainty in the measurement, then µ = δ{x} for some
x ∈ D, but in practice there is usually some uncertainty in the measurements themselves.
Consequently the posterior probability must be computed as a composite: the posterior probability of an event A ∈ Σ_H given a measurement µ is (I ∘ µ)(A) = ∫_D I_A(x) dµ.
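In the finite case the posterior against an uncertain measurement µ is simply a µ-weighted average of the inference map. A small sketch of ours with hypothetical numbers:

```python
# Inference map I : D -> H and an uncertain measurement mu on D
# (illustrative encoding and values, not from the paper).
I = {"d1": {"h1": 0.8, "h2": 0.2},
     "d2": {"h1": 0.3, "h2": 0.7}}
mu = {"d1": 0.5, "d2": 0.5}

# Posterior (I o mu)(A) = sum_d I({A} | d) mu({d}): the finite analogue
# of (I ∘ µ)(A) = ∫_D I_A(x) dµ.
posterior = {h: sum(mu[d] * I[d][h] for d in mu) for h in ("h1", "h2")}
```

With a deterministic measurement mu = δ_{x}, the sum collapses to the single row I({·} | x), as in the text.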
Following the calculation of the posterior probability, the sampling distribution is then
updated, if required. The process can then repeat: using the posterior probability and
the updated sampling distribution the updated joint probability distribution on the prod-
uct space is determined and the corresponding (updated) inference map determined (for
computational purposes the “entire map” I need not be determined if the measurements
are deterministic). We can then continue to iterate as long as new measurements are
received. For some problems, such as the standard urn problem with replacement of balls, the sampling distribution does not change from iteration to iteration, but the inference map is updated since the posterior probability on the hypothesis space changes with each measurement.

4 ELEMENTARY APPLICATIONS OF BAYESIAN PROBABILITY
Remark 3. Note that for countable spaces X and Y the compatibility condition reduces to the standard Bayes equation, since for any x ∈ X the singleton {x} ∈ Σ_X and similarly any y ∈ Y gives {y} ∈ Σ_Y, so that on {x} × {y} the joint distribution J ∶ 1 → X × Y reduces to the equation
h({y} ∣ x) · P_X({x}) = J({x} × {y}) = k({x} ∣ y) · P_Y({y}).
[Figure: Urn 1 contains 2 blue (B) and 3 red (R) balls; Urn 2 contains 3 blue and 1 red.]
You are given two draws and if you pull out a red ball you win a million dollars. You
are unable to see the two urns so you don’t know which urn you are drawing from and
the draw is done without replacement. The P diagram for both inference and calculating
sampling distributions is given by
11 This problem is taken from Peter Green’s tutorial on Bayesian inference, which can be viewed at http://videolectures.net/mlss2011_green_bayesian.
[Diagram: the prior P_U ∶ 1 → U, the sampling distribution S ∶ U → B, the marginal P_B ∶ 1 → B, and the inference map I ∶ B → U]
where the dashed arrows indicate morphisms to be calculated rather than morphisms determined by modeling, and
P_U = (1/2) δ_{u1} + (1/2) δ_{u2}.
The sampling distribution is the binomial distribution given by
S({b} ∣ u1) = 2/5  S({r} ∣ u1) = 3/5
S({b} ∣ u2) = 3/4  S({r} ∣ u2) = 1/4.
Suppose that on our first draw, we draw from one of the urns (which one is unknown)
and draw a blue ball. We ask the following questions:
1. (Inference) What is the probability that we made the draw from Urn 1 (Urn 2)?
2. (Prediction) What is the probability of drawing a red ball on the second draw (from
the same urn)?
3. (Decision) Given you have drawn a blue ball on the first draw should you switch
urns to increase the probability of drawing a red ball?
To answer these questions, we first construct the joint distribution J ∶ 1 → U × B determined by the prior P_U and the sampling distribution S, with marginals P_U and P_B = S ∘ P_U:
[Diagram: J over the projections δ_{π_U}, δ_{π_B}, with S ∶ U → B]
and then construct the inference map by requiring the compatibility condition, i.e., the integral equation (11). The marginal P_B is given by
P_B({b}) = ∫_U S({b} ∣ u) dP_U = (1/2)(2/5) + (1/2)(3/4) = 23/40
and similarly
P_B({r}) = 17/40.
To solve the inference problem, we need to compute the values of the inference map I using equation 17. This amounts to computing the joint distribution on all possible measurable sets,
I({u1} ∣ b) = 8/23  I({u2} ∣ b) = 15/23
I({u1} ∣ r) = 12/17  I({u2} ∣ r) = 5/17,
which answers question (1). The odds that one drew the blue ball from Urn 1 relative to Urn 2 are 8/15, so it is almost twice as likely that one made the draw from the second urn.
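The urn computations above can be reproduced with exact rational arithmetic; the encoding is ours, the numbers are the text's.

```python
from fractions import Fraction as F

# Prior on the urns and the sampling distribution from the text.
PU = {"u1": F(1, 2), "u2": F(1, 2)}
S = {"u1": {"b": F(2, 5), "r": F(3, 5)},
     "u2": {"b": F(3, 4), "r": F(1, 4)}}

# Marginal P_B = S o P_U on the colors.
PB = {c: sum(PU[u] * S[u][c] for u in PU) for c in ("b", "r")}

# Inference map from the compatibility condition:
# I({u} | c) = S({c} | u) P_U({u}) / P_B({c}).
I = {c: {u: PU[u] * S[u][c] / PB[c] for u in PU} for c in ("b", "r")}
```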
The Prediction Problem. Here we implicitly (or explicitly) need to construct the
product space U × B1 × B2 where Bi represents the ith drawing of a ball from the same
(unknown) urn. To do this we use the basic construction for joint distributions using a regular conditional probability, S2, which expresses the probability of drawing either a red or a blue ball from the same urn as the first draw (without replacement). This conditional probability is given by
S2({b} ∣ (u1, b)) = 1/4  S2({r} ∣ (u1, b)) = 3/4
S2({b} ∣ (u2, b)) = 2/3  S2({r} ∣ (u2, b)) = 1/3
S2({b} ∣ (u1, r)) = 1/2  S2({r} ∣ (u1, r)) = 1/2
S2({b} ∣ (u2, r)) = 1   S2({r} ∣ (u2, r)) = 0,
and determines the joint distribution K on (U × B1) × B2 with marginal P_{B2} = S2 ∘ J.
To answer the prediction question we calculate the odds of drawing a red versus a blue ball. Thus
K(U × {b} × {r}) = ∫_{U×{b}} S2({r} ∣ (u, β)) dJ, (18)
where the right-hand side follows from the definition (construction) of the iterated product space (U × B1) × B2. The computation of expression 18 yields
K(U × {b} × {r}) = 11/40, so the odds r/b = 11/12 and Pr({r} ∣ {b}) = 11/23.
The Decision Problem. To answer the decision problem we need to consider the conditional probability of switching urns on the second draw, which leads to the conditional
Ŝ2 ∶ U × B1 → B2
given by
Ŝ2({b} ∣ (u1, b)) = 3/4  Ŝ2({r} ∣ (u1, b)) = 1/4
Ŝ2({b} ∣ (u2, b)) = 2/5  Ŝ2({r} ∣ (u2, b)) = 3/5
Ŝ2({b} ∣ (u1, r)) = 3/4  Ŝ2({r} ∣ (u1, r)) = 1/4
Ŝ2({b} ∣ (u2, r)) = 2/5  Ŝ2({r} ∣ (u2, r)) = 3/5.
Carrying out the same computation as above, we find that the joint distribution K̂ on the product space (U × B1) × B2 constructed from J and Ŝ2 yields
K̂(U × {b} × {r}) = 11/40 = K(U × {b} × {r}),
which shows that it doesn’t matter whether you switch or not: you get the same probability of drawing a red ball.
The probability of drawing a blue ball is
K̂(U × {b} × {b}) = 12/40 = K(U × {b} × {b}),
so the odds of drawing a blue ball outweigh the odds of drawing a red ball by the ratio 12/11. The odds are against you.
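The prediction and decision computations can likewise be checked exactly; in this sketch of ours, the table S2 encodes drawing a second ball from the same urn without replacement, with the urn contents inferred from the sampling distribution.

```python
from fractions import Fraction as F

# Prior, first-draw sampling distribution, and the joint J on U x B1.
PU = {"u1": F(1, 2), "u2": F(1, 2)}
S = {"u1": {"b": F(2, 5), "r": F(3, 5)},
     "u2": {"b": F(3, 4), "r": F(1, 4)}}
J = {(u, c): PU[u] * S[u][c] for u in PU for c in ("b", "r")}

# Second draw from the same urn without replacement
# (urn 1: 2 blue / 3 red, urn 2: 3 blue / 1 red).
S2 = {("u1", "b"): {"b": F(1, 4), "r": F(3, 4)},
      ("u2", "b"): {"b": F(2, 3), "r": F(1, 3)},
      ("u1", "r"): {"b": F(1, 2), "r": F(1, 2)},
      ("u2", "r"): {"b": F(1, 1), "r": F(0, 1)}}

# K(U x {b} x {r}) and K(U x {b} x {b}) as mu-weighted sums over J.
K_br = sum(S2[(u, "b")]["r"] * J[(u, "b")] for u in PU)
K_bb = sum(S2[(u, "b")]["b"] * J[(u, "b")] for u in PU)
```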
Front R R G
Back R G G
We shuffle the deck, pull out a card and expose one face, which is red.^12 The prediction question is: what is the probability that the other side of the card is also red?
12 This problem is taken from David MacKay’s tutorial on information theory, which can be viewed at http://videolectures.net/mlss09uk_mackay_it/.
[Diagram: the prior P_C ∶ 1 → C, the sampling distribution S ∶ C → F, the inference map I ∶ F → C, and the marginal P_F ∶ 1 → F]
with the sampling distribution given by
S({r} ∣ 1) = 1   S({g} ∣ 1) = 0
S({r} ∣ 2) = 1/2  S({g} ∣ 2) = 1/2
S({r} ∣ 3) = 0   S({g} ∣ 3) = 1.
The prior on C is P_C = (1/3) δ_1 + (1/3) δ_2 + (1/3) δ_3. From this we can construct the joint distribution on C × F:
[Diagram: P_C ∶ 1 → C, P_F = S ∘ P_C ∶ 1 → F, and J ∶ 1 → C × F over the projections δ_{π_C}, δ_{π_F}]
Using
J(A × B) = ∫_{n∈A} S(B ∣ n) dP_C,
we find
J({1} × {r}) = 1/3  J({1} × {g}) = 0
J({2} × {r}) = 1/6  J({2} × {g}) = 1/6
J({3} × {r}) = 0   J({3} × {g}) = 1/3.
Now, like in the urn problem, to predict the next draw (flip of the card), it is necessary to
add another measurable set F2 and conditional probability S2 and construct the product
diagram and joint distribution K
[Diagram: J ∶ 1 → C × F1 and K ∶ 1 → C × F1 × F2 with marginal P_{F2} = S2 ∘ J, over the projections δ_{π_{C×F1}}, δ_{π_{F2}}, and S2 ∶ C × F1 → F2]
The twist now arises in that the conditional probability S2 is not uniquely defined: what are the values
S2({r} ∣ (1, g)) = ?  S2({g} ∣ (1, g)) = ?
The answer is that it doesn’t matter what we put down for these values, since the set they condition on has measure J({1} × {g}) = 0. We can still compute the desired quantity of interest, proceeding with these arbitrarily chosen values on the point sets of measure zero. Thus we choose
S2({g} ∣ (1, r)) = 0  S2({r} ∣ (1, r)) = 1
S2({g} ∣ (1, g)) = 1  S2({r} ∣ (1, g)) = 0  (doesn’t matter)
S2({g} ∣ (2, r)) = 1  S2({r} ∣ (2, r)) = 0
S2({g} ∣ (2, g)) = 0  S2({r} ∣ (2, g)) = 1
S2({g} ∣ (3, r)) = 0  S2({r} ∣ (3, r)) = 1  (doesn’t matter)
S2({g} ∣ (3, g)) = 1  S2({r} ∣ (3, g)) = 0.
We chose the arbitrary values such that S2 is a deterministic mapping, which seems appropriate since flipping a given card uniquely determines the color on the other side.
Now we can solve the prediction problem by computing the joint measure values
K(C × {r} × {r}) = ∫_{C×{r}} S_2({r} ∣ (n, c)) dJ
= S_2({r} ∣ (1, r)) ⋅ J({1} × {r}) + S_2({r} ∣ (2, r)) ⋅ J({2} × {r})
= 1 ⋅ (1/3) + 0 ⋅ (1/6)
= 1/3
and
K(C × {r} × {g}) = ∫_{C×{r}} S_2({g} ∣ (n, c)) dJ
= S_2({g} ∣ (1, r)) ⋅ J({1} × {r}) + S_2({g} ∣ (2, r)) ⋅ J({2} × {r})
= 0 ⋅ (1/3) + 1 ⋅ (1/6)
= 1/6,
so it is twice as likely to observe a red face upon flipping the card as a green face. Converting the odds g : r = 1 : 2 to a probability gives Pr({r} ∣ {r}) = 2/3.
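The computation above can be verified by direct enumeration. The following sketch is our own illustration (not part of the original text); it encodes the prior, the sampling distribution S, and the deterministic S_2 exactly as tabulated above and recovers Pr({r} ∣ {r}) = 2/3.

```python
from fractions import Fraction as F

# Prior on the three cards and the sampling distribution S({color} | card).
prior = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}
S = {1: {'r': F(1), 'g': F(0)},
     2: {'r': F(1, 2), 'g': F(1, 2)},
     3: {'r': F(0), 'g': F(1)}}

# Joint J on C x F1: J({n} x {c}) = S({c} | n) * P_C({n}).
J = {(n, c): S[n][c] * prior[n] for n in prior for c in 'rg'}

# Deterministic S2: the face revealed on flipping card n after first seeing color c
# (the values on the measure-zero pairs (1, g) and (3, r) are arbitrary).
flip = {(1, 'r'): 'r', (1, 'g'): 'g',
        (2, 'r'): 'g', (2, 'g'): 'r',
        (3, 'r'): 'r', (3, 'g'): 'g'}

# K(C x {r} x {c2}) = sum_n S2({c2} | (n, r)) * J({n} x {r}).
K_rr = sum(J[(n, 'r')] for n in prior if flip[(n, 'r')] == 'r')
K_rg = sum(J[(n, 'r')] for n in prior if flip[(n, 'r')] == 'g')
posterior_red = K_rr / (K_rr + K_rg)
assert posterior_red == F(2, 3)   # having seen red, the hidden face is red with probability 2/3
```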
To test one’s understanding of the categorical approach to Bayesian probability we
suggest the following problem.
5 THE TENSOR PRODUCT 25
Example 6. The Monty Hall Problem. You are a contestant in a game show in
which a prize is hidden behind one of three curtains. You will win a prize if you select
the correct curtain. After you have picked one curtain but before the curtain is lifted, the
emcee lifts one of the other curtains, revealing a goat, and asks if you would like to switch
from your current selection to the remaining curtain. How will your chances change if
you switch?
There are three components which need to be modeled in this problem:
D(oor) = {1, 2, 3}: the prize is behind this door.
C(hoice) = {1, 2, 3}: the door you chose.
O(pened door) = {1, 2, 3}: the door Monty Hall opens.
The prior on D is P_D = (1/3)δ_{d_1} + (1/3)δ_{d_2} + (1/3)δ_{d_3}. Your selection of a curtain, say curtain 1, gives
the deterministic measure PC = δC1 . There is a conditional probability from the product
space D × C to O
[Diagram: the joint J: 1 → (D × C) × O, built from the prior P_D ⊗ P_C: 1 → D × C and S: D × C → O, with P_O = S ∘ (P_D ⊗ P_C) and projections δ_{π_{D×C}}, δ_{π_O}.]
where the conditional probability S((i, j), {k}) represents the probability that Monty
opens door k given that the prize is behind door i and you have chosen door j. If you
have chosen curtain 1 then we have the partial data given by
S((1, 1), {1}) = 0    S((1, 1), {2}) = 1/2    S((1, 1), {3}) = 1/2
S((2, 1), {1}) = 0    S((2, 1), {2}) = 0    S((2, 1), {3}) = 1
S((3, 1), {1}) = 0    S((3, 1), {2}) = 1    S((3, 1), {3}) = 0.
From these values one can conclude that if Monty opens either curtain 2 or 3 it is in your best interest to switch doors.
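The intervening computation can be reproduced by enumeration. The following sketch is ours (not from the paper); it builds the joint from the prior on D and the partial table for S above, and computes the posterior over the prize door when Monty opens door 3.

```python
from fractions import Fraction as F

# Prior on the prize door and Monty's conditional S((i, 1), {k}),
# given that the contestant has chosen door 1 (table above).
prior_D = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}
S = {1: {1: F(0), 2: F(1, 2), 3: F(1, 2)},
     2: {1: F(0), 2: F(0), 3: F(1)},
     3: {1: F(0), 2: F(1), 3: F(0)}}

# Joint on D x O (with C = 1 fixed): J({i} x {k}) = S((i, 1), {k}) * P_D({i}).
J = {(i, k): S[i][k] * prior_D[i] for i in prior_D for k in (1, 2, 3)}

# Posterior over the prize door given that Monty opens door 3.
total = sum(J[(i, 3)] for i in prior_D)
posterior = {i: J[(i, 3)] / total for i in prior_D}

# Door 2 now carries probability 2/3, so switching doubles your chances.
assert posterior == {1: F(1, 3), 2: F(2, 3), 3: F(0)}
```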
f_* Σ_W = {C ∈ 2^Z ∣ f^{-1}(C) ∈ Σ_W}. (20)
This is in contrast to the smallest σ-algebra on X × Y , defined in Section 2.1 so that the
two projection maps {πX ∶ X × Y → X, πY ∶ X × Y → Y } are measurable. Such a σ-algebra is
said to be induced by the projection maps, or simply referred to as the initial σ-algebra.
The following result on coinduced σ-algebras is used repeatedly.
[Diagram: functions f_i: X_i → Y coinduce a σ-algebra on Y; a function g out of Y is then measurable iff each composite g ∘ f_i is measurable.]
[Diagram: the constant graph functions Γ_x: Y → X ⊗ Y and Γ_y: X → X ⊗ Y composed with the projections π_X: X ⊗ Y → X and π_Y: X ⊗ Y → Y.]
Figure 3: The commutativity of these diagrams, together with the measurability of the
constant functions and constant graph functions, implies the projection maps πX and πY
are measurable.
By the measurability of the projection maps and the universal property of the product, it follows that the identity mapping on the set X × Y yields a measurable function id: X ⊗ Y → X × Y,
and
(P ◯⋊ Q)(ς) = ∫_{x∈X} Q(Γ_x^{-1}(ς)) dP   ∀ς ∈ Σ_{X×Y},
which are equivalent on the product σ-algebra. Here we have introduced the new notation of left tensor ◯⋉ and right tensor ◯⋊ because these definitions can be extended to the tensor σ-algebra, though in general the equivalence of the two expressions may no longer hold there. These definitions can be extended to conditional probability measures P: Z → X and Q: Z → Y trivially by conditioning on a point z ∈ Z,
(P ◯⋉ Q)(ς ∣ z) = ∫_{y∈Y} P(Γ_y^{-1}(ς)) dQ_z   ∀ς ∈ Σ_{X⊗Y}   (21)
and
(P ◯⋊ Q)(ς ∣ z) = ∫_{x∈X} Q(Γ_x^{-1}(ς)) dP_z   ∀ς ∈ Σ_{X⊗Y},   (22)
which are equivalent on the product σ-algebra but not on the tensor σ-algebra. However, in the special case when Z = X and P = 1_X, Equations 21 and 22 do coincide on Σ_{X⊗Y}, because by Equation 21
(1_X ◯⋉ Q)(ς ∣ x) = ∫_{y∈Y} δ_x(Γ_y^{-1}(ς)) dQ_x   ∀ς ∈ Σ_{X⊗Y},
where
δ_x(Γ_y^{-1}(ς)) = 1 if (x, y) ∈ ς and 0 otherwise,   (23)
so that
(1_X ◯⋉ Q)(ς ∣ x) = ∫_{y∈Y} χ_{Γ_x^{-1}(ς)}(y) dQ_x = Q_x(Γ_x^{-1}(ς)),
while a similar computation using Equation 22 yields the same value. In this case we denote the common conditional by Γ_Q, called the graph of Q by analogy to the graph of a function, and this map gives the commutative diagram in Figure 4.
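On finite spaces the coincidence in Equation 23 can be verified exhaustively. The sketch below is our own illustration; the kernel Q is an invented example, encoded as a dictionary of probability rows, and both sides of the computation above are checked on every set ς ⊆ X × Y.

```python
from fractions import Fraction as F
from itertools import chain, combinations

X, Y = ['a', 'b'], [0, 1]
# A hypothetical Markov kernel Q: X -> Y (each row is a probability measure).
Q = {'a': {0: F(1, 4), 1: F(3, 4)},
     'b': {0: F(1, 2), 1: F(1, 2)}}

def subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def left_tensor(sigma, x):
    # (1_X left-tensor Q)(sigma | x) = sum_y delta_x(Gamma_y^{-1}(sigma)) * Q_x({y})
    total = F(0)
    for y in Y:
        gamma_y_pre = {u for u in X if (u, y) in sigma}     # Gamma_y^{-1}(sigma), a subset of X
        total += Q[x][y] * (1 if x in gamma_y_pre else 0)   # delta_x of that set
    return total

def graph_pullback(sigma, x):
    # Q_x(Gamma_x^{-1}(sigma))
    return sum(Q[x][y] for y in Y if (x, y) in sigma)

pairs = [(u, v) for u in X for v in Y]
for sigma in subsets(pairs):
    for x in X:
        assert left_tensor(set(sigma), x) == graph_pullback(set(sigma), x)
```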
[Figure 4 diagram: Γ_Q: X → X ⊗ Y with δ_{π_X} ∘ Γ_Q = 1_X and δ_{π_Y} ∘ Γ_Q = Q.]
and
(δ_{π_Y} ∘ Γ_Q)(B ∣ x) = ∫_{(u,v)∈X⊗Y} δ_{π_Y}(B ∣ (u, v)) d((Γ_Q)_x)
= ∫_{v∈Y} δ_{π_Y}(B ∣ (x, v)) dQ_x   (26)
= ∫_{v∈Y} χ_B(v) dQ_x
= Q(B ∣ x).
which makes the diagram in Figure 5 commute and justifies the notation 1X ⊗ P (and
explains also why the notation ΓQ for the graph map was used to distinguish it from this
map).
[Figure 5 diagram: 1_X ⊗ P: X ⊗ Z → X ⊗ Y commutes with the projections, δ_{π_X} ∘ (1_X ⊗ P) = 1_X ∘ δ_{π_X} and δ_{π_Y} ∘ (1_X ⊗ P) = P ∘ δ_{π_Z}.]
[Diagram: the composite δ_{Γ_x} ∘ P: Z → Y → X ⊗ Y,]
where given a measurable set A ∈ Σ_{X⊗Y} one pulls it back under the constant graph function Γ_x and then applies the conditional P to the pair (Γ_x^{-1}(A) ∣ z).
1. There is a bifunctor
◻: C × C → C
on objects: (X, Y) ↦ X ◻ Y
on arrows: ((f, g): (X, Y) → (X′, Y′)) ↦ (f ◻ g: X ◻ Y → X′ ◻ Y′)
where IdC is the identity functor on C. Hence for every triple X, Y, Z of objects,
there is an isomorphism
aX,Y,Z ∶ (X ◻ Y ) ◻ Z Ð→ X ◻ (Y ◻ Z)
2. There is an object 1 ∈ C such that for every object X ∈ob C there is a left unit isomorphism
l_X: 1 ◻ X → X
and a right unit isomorphism
r_X: X ◻ 1 → X.
[Pentagon diagram: the two composites of associators from ((X ◻ Y) ◻ W) ◻ Z to X ◻ (Y ◻ (W ◻ Z)), one through (X ◻ Y) ◻ (W ◻ Z) and one through (X ◻ (Y ◻ W)) ◻ Z and X ◻ ((Y ◻ W) ◻ Z), agree.]
commutes. This is called the associativity coherence condition.
[Diagrams: the unit triangle relating the unit isomorphisms r_X and l_X; the symmetry isomorphism s_{X,Y}: X ◻ Y → Y ◻ X; and the hexagon coherence condition relating the symmetry maps s and the associators a.]
The main example of a symmetric monoidal category is the category of sets, Set,
under the cartesian product with identity the terminal object 1 = {⋆}. Similarly, for the
categories Meas and P, the tensor product ⊗ along with the terminal object 1 acting as
the identity element make both (Meas, ⊗, 1) and (P, ⊗, 1) symmetric monoidal categories
with the above conditions straightforward to verify. This provides a good exercise for the
reader new to categorical methods.
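As a concrete instance of these tensors in P, on finite measurable spaces the right tensor of a measure with a kernel is simply their joint distribution. A small sketch (ours, not the authors'; the measure P and kernel Q are invented numbers):

```python
from fractions import Fraction as F

X, Y = ['x1', 'x2'], ['y1', 'y2']
P = {'x1': F(2, 3), 'x2': F(1, 3)}              # a probability measure on X
Q = {'x1': {'y1': F(1, 2), 'y2': F(1, 2)},      # a Markov kernel Q: X -> Y
     'x2': {'y1': F(1, 4), 'y2': F(3, 4)}}

def right_tensor(sigma):
    # (P right-tensor Q)(sigma) = sum_x P({x}) * Q(Gamma_x^{-1}(sigma) | x)
    return sum(P[x] * sum(Q[x][y] for y in Y if (x, y) in sigma) for x in X)

# The marginals of the joint recover P and the composite Q after P.
margX = {x: right_tensor({(x, y) for y in Y}) for x in X}
margY = {y: right_tensor({(x, y) for x in X}) for y in Y}
assert margX == P
assert margY == {'y1': F(5, 12), 'y2': F(7, 12)}   # (Q ∘ P)({y}) = sum_x Q({y}|x) P({x})
assert right_tensor({(x, y) for x in X for y in Y}) == 1
```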
6 Function Spaces
For X, Y ∈ob Meas let Y X denote the set of all measurable functions from X to Y endowed
with the σ-algebra induced by the set of all point evaluation maps {evx }x∈X , where
ev_x: Y^X → Y
f ↦ f(x).
Thus
Σ_{Y^X} = σ( ⋃_{x∈X} ev_x^{-1}(Σ_Y) ),   (28)
and observe that for every ⌜f⌝ ∈ Y^X the right hand Meas diagram in Figure 7 is commutative as a set mapping, f = ev_{X,Y} ∘ Γ_f.
[Figure 7 diagram: ev_{X,Y}: X ⊗ Y^X → Y, with ⌜f⌝: 1 → Y^X, Γ_f ≅ Id_X ⊗ ⌜f⌝: X ≅ X ⊗ 1 → X ⊗ Y^X, and f = ev_{X,Y} ∘ Γ_f.]
Figure 7: The defining characteristic property of the evaluation function ev for graphs.
By rotating the diagram in Figure 7 and also considering the constant graph functions
Γx , the right hand side of the diagram in Figure 8 also commutes for every x ∈ X.
[13] Having defined Y^X to be the set of all measurable functions f: X → Y it seems contradictory to then define ev_x as acting on "points" ⌜f⌝: 1 → Y^X rather than the functions f themselves! The apparently self-contradictory definition arises because we are interspersing categorical language with set theory; when defining a set function, like ev_x, it is implied that it acts on points, which are defined as "global elements" 1 → Y^X. A global element is a map with domain 1. This is the categorical way of defining points, rather than using the elementhood operator "∈". Thus, to be more formal, we could have defined ev_x, where x: 1 → X is any global element, by ev_x ∘ ⌜f⌝ = ⌜f(x)⌝: 1 → Y, where f(x) = f ∘ x.
[Figure 8 diagram: Γ_f: X → X ⊗ Y^X and Γ_x: Y^X → X ⊗ Y^X, with f = ev_{X,Y} ∘ Γ_f and ev_x = ev_{X,Y} ∘ Γ_x.]
Figure 8: The commutativity of both triangles, the measurability of f and evx , and the
induced σ-algebra of X ⊗ Y X implies the measurability of ev.
Since f and Γf are measurable, as are evx and Γx , it follows by Lemma 7 that evX,Y is
measurable since the constant graph functions generate the σ-algebra of X ⊗ Y X . More
generally, given any measurable function f ∶ X ⊗ Z → Y there exists a unique measurable
map f˜∶ Z → Y X defined by f˜(z) = ⌜f (⋅, z)⌝∶ 1 → Y X where f (⋅, z)∶ X → Y sends x ↦ f (x, z).
This map f̃ is measurable because the σ-algebra is generated by the point evaluation maps
ev_x and the diagram
[Diagram: ev_x ∘ f̃ = f ∘ Γ_x: Z → Y]
commutes so that f̃^{-1}(ev_x^{-1}(B)) = (f ∘ Γ_x)^{-1}(B) ∈ Σ_Z.
Conversely, given any measurable map g: Z → Y^X, the composite f = ev_{X,Y} ∘ (Id_X ⊗ g): X ⊗ Z → Y is measurable, as in the diagram in Figure 9.
[Figure 9 diagram: f = ev_{X,Y} ∘ (Id_X ⊗ f̃): X ⊗ Z → X ⊗ Y^X → Y, with f̃: Z → Y^X.]
Figure 9: The evaluation function ev sets up a bijective correspondence between the two
measurable maps f and f˜.
The measurable map f̃ is called the adjunct of f, and vice versa, so that the adjunct of f̃ is f itself. Whether we use the tilde notation for the map X ⊗ Z → Y or the map Z → Y^X is irrelevant; it simply indicates the map uniquely determined by the other map.
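In the special case of deterministic maps, this adjunction is ordinary currying, which can be sketched with plain Python callables standing in for measurable functions (our illustration; the example map f is invented):

```python
def tilde(f):
    """Adjunct Z -> Y^X of f: X x Z -> Y (currying)."""
    return lambda z: (lambda x: f(x, z))

def untilde(g):
    """Adjunct X x Z -> Y of g: Z -> Y^X, i.e. ev_{X,Y} composed with Id_X tensor g."""
    return lambda x, z: g(z)(x)

f = lambda x, z: x + 2 * z                  # a sample map f: X (x) Z -> Y
g = tilde(f)

assert g(3)(1) == f(1, 3)                   # ev_x(f~(z)) = f(x, z)
assert untilde(tilde(f))(1, 3) == f(1, 3)   # the adjunct of the adjunct is f
```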
The map evX,Y , which we will usually abbreviate to simply ev with the pair (X, Y )
obvious from context, is called a universal arrow because of this property; it mediates the
relationship between the two maps f and f˜. In the language of category theory using
functors, for a fixed object X in Meas, the collection of maps {ev_{X,Y}}_{Y∈ob Meas} forms the components of a natural transformation ev_{X,−}: (X ⊗ −) ∘ (−)^X → Id_Meas. In this situation we say the pair of functors {X ⊗ −, (−)^X} forms an adjunction, denoted X ⊗ − ⊣ (−)^X. This adjunction X ⊗ − ⊣ (−)^X is the defining property of a closed category. We previously showed Meas was
symmetric monoidal and combined with the closed category structure we conclude that
Meas is a symmetric monoidal closed category (SMCC). Subsequently we will show that
P satisfies a weak version of SMCC, where uniqueness cannot be obtained.
The Graph Map. Given the importance of graph functions when working with tensor spaces we define the graph map
Γ_•: Y^X → (X ⊗ Y)^X
⌜f⌝ ↦ ⌜Γ_f⌝.
[Figure 10 diagram: the graph map Γ_•: Y^X → (X ⊗ Y)^X together with Γ_f: X → X ⊗ Y, the point evaluations ev_x and êv_x, and the constant graph map Γ_x: Y → X ⊗ Y.]
Figure 10: The relationship between the graph map, point evaluations, and constant
graph maps.
We have used the notation êv_x simply to distinguish this map from the map ev_x, which has a different domain and codomain. The σ-algebra of (X ⊗ Y)^X is determined by these point evaluation maps êv_x so that they are measurable. The maps ev_x and Γ_x are both measurable, and hence their composite Γ_x ∘ ev_x = ⟨x, ev_x⟩ is also measurable.
To prove the measurability of the graph map we use the dual to Lemma 7 obtained
by reversing all the arrows in that lemma to give
[Diagram: f: X → Y is measurable iff each composite g_i ∘ f: X → Z_i is measurable, where the maps g_i: Y → Z_i generate Σ_Y.]
[Diagrams: a stochastic process P: 1 → Y^X and a parameterized stochastic process P: Z → Y^X.]
Just as we did for the category Meas, we seek a bijective correspondence between two
P maps, a stochastic process P and a corresponding conditional probability measure P .
In the P case, however, the two morphisms do not uniquely determine each other, and
we are only able to obtain a symmetric monoidal weakly closed category (SMwCC).
In Section 5.2 the tensor product 1_X ⊗ P was defined, and by replacing the space "Y" in that definition with a function space Y^X we obtain the tensor product map
1_X ⊗ P: X ⊗ Z → X ⊗ Y^X
[Figure 11 diagram: P: Z → Y^X, 1_X ⊗ P: X ⊗ Z → X ⊗ Y^X, and δ_ev: X ⊗ Y^X → Y, whose composite defines the conditional P: X ⊗ Z → Y.]
Figure 11: The defining characteristic property of the evaluation function ev for tensor
products of conditionals in P.
Thus
P(B ∣ (x, z)) = ∫_{(u,f)∈X⊗Y^X} (δ_ev)_B(u, f) d(1_X ⊗ P)_{(x,z)}
= ∫_{f∈Y^X} δ_ev(B ∣ Γ_x(f)) dP_z
= ∫_{f∈Y^X} χ_B(ev_x(f)) dP_z
= P(ev_x^{-1}(B) ∣ z)
and every parameterized stochastic process determines a conditional probability
P ∶ X ⊗ Z → Y.
Conversely, given a conditional probability P ∶ X ⊗ Z → Y , we wish to define a param-
eterized stochastic process P ∶ Z → Y X . We might be tempted to define such a stochastic
process by letting
P (evx−1 (B) ∣ z) = P (B ∣ (x, z)), (31)
but this does not give a well-defined measure for each z ∈ Z. Recall that a probability
measure cannot be unambiguously defined on an arbitrary generating set for the σ-algebra.
We can, however, uniquely define a measure on a π-system [14] and then use Dynkin's π-λ
theorem to extend to the entire σ-algebra (e.g., see [10]). This construction requires the
following definition.
Definition 10. Given a measurable space (X, ΣX ), we can define an equivalence relation
on X where x ∼ y if x ∈ A ⇔ y ∈ A for all A ∈ ΣX . We call an equivalence class of this
relation an atom of X. For an arbitrary set A ⊂ X, we say that A is
● separated if for any two points x, y ∈ A, there is some B ∈ ΣX with x ∈ B and y ∉ B
● unseparated if A is contained in some atom of X.
This notion of separation of points is important for finding a generating set on which we can define a parameterized stochastic process. The key lemma, which we state here without proof [15], is the following.
[14] A π-system on X is a nonempty collection of subsets of X that is closed under finite intersections.
[15] This lemma and additional work on symmetric monoidal weakly closed structures on P will appear in a future paper.
Remark 12. Even when such an expression does provide a well-defined measure as in
the case of finite spaces, it does not yield a unique P . Appendix B provides an elementary
example illustrating the failure of the bijective correspondence property in this case. Also
observe that the proposed defining Equation 31 can be extended to
P( ⋂_{i=1}^n ev_{x_i}^{-1}(B_i) ∣ z ) = ∏_{i=1}^n P(B_i ∣ (x_i, z)),
which does provide a well-defined measure by Lemma 11. However, it still does not provide a bijective correspondence, since the right hand side imposes an independence condition which a stochastic process need not satisfy; a bijective correspondence is obtained only if we impose this additional independence assumption. Alternatively, by imposing the condition that for each z ∈ Z, P_z is a Gaussian process, we can obtain a bijective correspondence. In Section 6.3 we illustrate in detail
how a joint normal distribution on a finite dimensional space gives rise to a stochastic
process, and in particular a GP.
Often, we are able to exploit the weak correspondence and use the conditional prob-
ability P ∶ X → Y rather than the stochastic process P ∶ 1 → Y X . While carrying less
information, the conditional probability is easier to reason with because of our famil-
iarity with Bayes’ rule (which uses conditional probabilities) and our unfamiliarity with
measures on function spaces.
Intuitively it is easier to work with the conditional probability P, as we can represent the graph of such functions. In Figure 12 the top diagram shows a prior probability P: 1 → R^{[0,10]}, which is a stochastic process, depicted by representing its adjunct, illustrating its expected value as well as its 2σ error bars on each coordinate. The bottom diagram in the same figure illustrates a parameterized stochastic process where the parameterization is over four measurements. Using the above notation, Z = ∏_{i=1}^4 (X × Y)_i and P(⋅ ∣ {(x_i, y_i)}_{i=1}^4) is a posterior probability measure given four measurements {(x_i, y_i)}_{i=1}^4.
These diagrams were generated under the hypothesis that the process is a GP.
Figure 12: The top diagram shows a (prior) stochastic process represented by its adjunct
P ∶ [0, 10] → R and characterized by its expected value and covariance. The bottom dia-
gram shows a parameterized stochastic process (the same process), also expressed by its
adjunct, where the parameterization is over four measurements.
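Diagrams like those of Figure 12 can be reproduced with a few lines of linear algebra. The sketch below is our illustration, assuming a zero prior mean and a squared-exponential covariance k (neither choice is prescribed by the text); it computes the prior band and the posterior mean and band on [0, 10] from four invented measurements.

```python
import numpy as np

def k(a, b, ell=1.0):
    """Squared-exponential covariance (an illustrative choice of kernel)."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

zs = np.linspace(0, 10, 101)              # grid on X = [0, 10]
m_prior = np.zeros_like(zs)               # prior mean m = 0 (assumption)
sd_prior = np.sqrt(np.diag(k(zs, zs)))    # prior 1-sigma band (constant 1)

# Four invented measurements (x_i, y_i), as in the bottom panel of Figure 12.
x0 = np.array([1.0, 4.0, 6.0, 9.0])
y0 = np.array([0.5, -1.0, 0.3, 1.2])

Kxx = k(x0, x0) + 1e-9 * np.eye(4)        # small jitter for numerical stability
Kzx = k(zs, x0)
m_post = Kzx @ np.linalg.solve(Kxx, y0)   # posterior mean on the grid
cov_post = k(zs, zs) - Kzx @ np.linalg.solve(Kxx, Kzx.T)
sd_post = np.sqrt(np.clip(np.diag(cov_post), 0.0, None))

# The posterior interpolates the data and its band collapses there.
assert abs(m_post[10] - 0.5) < 1e-4       # zs[10] == x0[0] == 1.0
assert sd_post[10] < 1e-3
```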
[Diagrams: an inclusion ι: X_0 → X and a point x: 1 → X_0 induce the restriction map
Y^ι: Y^X → Y^{X_0}
⌜f⌝ ↦ ⌜f ∘ ι⌝,
and composing a stochastic process P: 1 → Y^X with Y^ι restricts it to a process on X_0.]
1. k(x, z) ≥ 0,
J = (X, Y)ᵀ ∼ N( (μ_1, μ_2)ᵀ, [Σ_11 Σ_12; Σ_21 Σ_22] )
with Σ11 and Σ22 nonsingular.
Represented categorically, these random variables X and Y determine distributions, which we represent by P_1 and P_2, on two measurable spaces X = R^m and Y = R^n for some finite integers m and n, and the various relationships between the P maps are given by the diagram in Figure 15.
[Figure 15 diagram: the joint J: 1 → X × Y with marginals P_1 ∼ N(μ_1, Σ_11) and P_2 ∼ N(μ_2, Σ_22) via the projections δ_{π_X} and δ_{π_Y}, together with the sampling distribution S̄: X → Y and inference map Ī: Y → X,]
and the overline notation on the terms "S" and "I" is used to emphasize that the transposes of both of these conditionals are GPs, given by a bijective correspondence in Figure 16.
[Figure 16 diagram: the GP S: 1 → Y^X, its graph Γ_S: X ≅ X ⊗ 1 → X ⊗ Y^X, and the conditional S̄ = ev_{X,Y} ∘ Γ_S: X → Y.]
Figure 16: The defining characteristic property of the evaluation function ev for graphs.
In the random variable description, these conditionals S̄_x and Ī_y are often represented simply by
μ_{Y∣X} = μ_2 + Σ_21 Σ_11^{-1} (x − μ_1)    μ_{X∣Y} = μ_1 + Σ_12 Σ_22^{-1} (y − μ_2)
7 BAYESIAN MODELS FOR FUNCTION ESTIMATION 43
and
Σ_{Y∣X} = Σ_22 − Σ_21 Σ_11^{-1} Σ_12    Σ_{X∣Y} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.
It is easily verified that this pair {S, I} forms a sampling distribution/inference map
pair; i.e., the joint distribution can be expressed in terms of the prior X and sampling
distribution S or in terms of the prior Y and inference map I. It is clear from this
example that what one calls the sampling distribution and inference map depends upon
the perspective of what is being estimated.
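The sampling distribution and inference map of a joint normal can be computed directly from these formulas. The block below is an illustrative sketch with invented block parameters (m = 2, n = 1), not taken from the text:

```python
import numpy as np

# Block mean and covariance of a joint normal J on X x Y = R^2 x R^1.
mu1, mu2 = np.array([0.0, 1.0]), np.array([2.0])
S11 = np.array([[2.0, 0.3], [0.3, 1.0]])
S12 = np.array([[0.8], [0.2]])
S21, S22 = S12.T, np.array([[1.5]])

x = np.array([1.0, 0.5])                                  # observed value of X
mu_Y_X = mu2 + S21 @ np.linalg.solve(S11, x - mu1)        # mean of the sampling distribution
Sig_Y_X = S22 - S21 @ np.linalg.solve(S11, S12)           # Sigma_{Y|X}

y = np.array([2.5])                                       # observed value of Y
mu_X_Y = mu1 + S12 @ np.linalg.solve(S22, y - mu2)        # mean of the inference map
Sig_X_Y = S11 - S12 @ np.linalg.solve(S22, S21)           # Sigma_{X|Y}

# Conditioning always shrinks the covariance (these are Schur complements).
assert Sig_Y_X[0, 0] < S22[0, 0]
assert np.all(np.linalg.eigvalsh(Sig_X_Y) > 0)
```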
In subsequent developments, we do not assume a joint normal distribution on the
spaces X and Y . If such an assumption is reasonable, then the following constructions
are greatly simplified by the structure expressed in Figure 15. As noted previously, it is knowledge of the relationship between the distributions P_1 and P_2 which characterizes the joint, and this is the main modeling problem. Thus the two perspectives on the problem are
to find the conditionals, or equivalently, find the prior on Y X which specifies a function
X → Y along with the noise model which is “built into” the sampling distribution.
When one component of a left or right tensor product is a Dirac measure, then both the left and right tensors coincide and the choice of right or left tensor is irrelevant. In this case, we denote the common probability measure by δ_x ⊗ P. Moreover, a simple calculation shows the prior
[Figure 17 diagram: the prior Γ_P(⋅ ∣ x): 1 → X ⊗ Y^X, the sampling distribution S: X ⊗ Y^X → X ⊗ Y, and the measurement data d on X ⊗ Y.]
Figure 17: The generic nonparametric Bayesian model for stochastic processes.
By a nonparametric (Bayesian) model, we mean any model which fits into the scheme
of Figure 17. For all of our analysis purposes we take P ∼ GP(m, k). A data measurement
d, corresponding to a collection of sample data {xi , yi } is, in ML applications, generally
taken as a Dirac measure, d = δ(x,y) . As in all Bayesian problems, the measurement data
{(x_i, y_i)}_{i=1}^N can be analyzed either sequentially or as a single batch of data. For analysis purposes in Section 8, we consider the data one point at a time (sequentially).
[Diagram (Figure 18): S_nf = δ_ev ∘ (1_X ⊗ δ_{Γ_•}): X ⊗ Y^X → X ⊗ (X ⊗ Y)^X → X ⊗ Y.]
Using the commutativity of Figure 10, the noise free sampling distribution can also be written as S_nf(U ∣ (x, f)) = χ_{Γ_•^{-1}(êv_x^{-1}(U))}(f).
Precomposing the sampling distribution with this prior probability measure, the composite is
(S_nf ∘ Γ_P(⋅ ∣ x))(U) = ∫_{(u,f)∈X⊗Y^X} S_nf(U ∣ (u, f)) d(Γ_P(⋅ ∣ x))   for U ∈ Σ_{X⊗Y}, where Γ_P(⋅ ∣ x) = P Γ_x^{-1},
= ∫_{f∈Y^X} S_nf(U ∣ Γ_x(f)) dP   (34)
= ∫_{f∈Y^X} χ_{Γ_•^{-1}(êv_x^{-1}(U))}(f) dP
= P(Γ_•^{-1}(êv_x^{-1}(U))).
[Diagram: the composite 1 → Y^X → (X ⊗ Y)^X → X ⊗ Y of P ∼ GP(m, k), δ_{Γ_•}, and δ_{êv_x}; composing with δ_{π_Y} gives P ev_x^{-1} ∼ N(m(x), k(x, x)).]
Under this assumption P ∼ GP(m, k) the expected value of the probability measure (S_nf ∘ Γ_P(⋅ ∣ x)) on the real vector space X ⊗ Y is
E_{(S_nf ∘Γ_P(⋅∣x))}[Id_{X⊗Y}] = ∫_{(u,v)∈X⊗Y} (u, v) d(P Γ_•^{-1} êv_x^{-1})
= ∫_{g∈(X⊗Y)^X} êv_x(g) d(P Γ_•^{-1})
= ∫_{f∈Y^X} êv_x(Γ_f) dP
= ∫_{f∈Y^X} (x, f(x)) dP
= (x, m(x)),
where the last equation follows because, on the two components of the vector-valued integral, ∫_{f∈Y^X} f(x) dP = m(x) and ∫_{f∈Y^X} x dP = x as the integrand is constant. The variance is [18]
E_{(S_nf ∘Γ_P(⋅∣x))}[(Id_{X⊗Y} − E_{(S_nf ∘Γ_P(⋅∣x))}[Id_{X⊗Y}])²] = E_{(S_nf ∘Γ_P(⋅∣x))}[(Id_{X⊗Y} − (x, m(x)))²],
which when expanded gives
= ∫_{(u,v)∈X⊗Y} (Id_{X⊗Y} − (x, m(x)))²(u, v) d(P Γ_•^{-1} êv_x^{-1})
[Diagram: the noise measure M_y ∼ N(y, σ²): 1 → Y.]
Because the state y in Equation 35 is arbitrary, this additive noise model is representative of the P map M: Y → Y defined by
M(B ∣ y) = M_y(B)   ∀y ∈ Y, ∀B ∈ Σ_Y.
[18] The squaring operator in the variance is defined componentwise on the vector space X ⊗ Y.
Given a GP P ∼ GP(f, k) on Y X , it follows that for any x ∈ X, P evx−1 ∼ N (f (x), k(x, x))
and for any B ∈ ΣY , the composition
[Figure 21 diagram: the kernel N: Y^X → Y^X together with P ∼ GP(f, k): 1 → Y^X and the evaluation maps δ_{ev_x}.]
Figure 21: Construction of the generic Markov kernel N for modeling the Gaussian addi-
tive measurement noise.
With this Gaussian additive noise measurement model N our sampling distribution
Snf can easily be modified by incorporating the additional map N into the sequence in
Figure 18 to yield the Gaussian additive noise sampling distribution model Sn shown in
Figure 22.
[Figure 22 diagram: S_n = δ_ev ∘ (1_X ⊗ δ_{Γ_•}) ∘ (1_X ⊗ N): X ⊗ Y^X → X ⊗ Y^X → X ⊗ (X ⊗ Y)^X → X ⊗ Y.]
Figure 22: The sampling distribution model in P with additive Gaussian noise.
= N(ev_x^{-1}(Γ_x^{-1}(U)) ∣ f)
= N_f(ev_x^{-1}(Γ_x^{-1}(U))).
[Diagram: N_f(ev_x^{-1}(⋅)) ∼ N(f(x), σ²).]
while the term ((1_X ⊗ N) ∘ Γ_P(⋅ ∣ x)) = Γ_{N∘P}(⋅ ∣ x) follows from the commutativity of the diagram in Figure 24, where, as shown in Equation 36, M ∘ P ev_x^{-1} ∼ N(m(x), k(x, x) + σ²), which implies N ∘ P ∼ GP(m, k + k_N).
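The additivity of the variances asserted here can be checked by simulation. A sketch with invented values of m(x), k(x, x), and σ² (ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
m_x, k_xx, sigma2 = 0.7, 1.3, 0.25        # illustrative m(x), k(x,x), sigma^2

# P ev_x^{-1} ~ N(m(x), k(x,x)); the noise kernel M sends y to N(y, sigma^2).
f_x = rng.normal(m_x, np.sqrt(k_xx), size=200_000)   # draws of f(x)
noisy = rng.normal(f_x, np.sqrt(sigma2))             # draws from M composed with P ev_x^{-1}

# Empirically, M after P ev_x^{-1} is N(m(x), k(x,x) + sigma^2).
assert abs(noisy.mean() - m_x) < 0.02
assert abs(noisy.var() - (k_xx + sigma2)) < 0.03
```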
[Figure 24 diagram: composing Γ_P(⋅ ∣ x): 1 → X ⊗ Y^X with 1_X ⊗ N: X ⊗ Y^X → X ⊗ Y^X yields Γ_{N∘P}(⋅ ∣ x), compatibly with the projections δ_{π_X} and δ_{π_{Y^X}} and the point δ_x: 1 → X.]
Figure 24: The composite of the prior and noise measurement model is the graph of a GP
at x.
Using the fact S_n ∘ Γ_P(⋅ ∣ x) = S_nf ∘ Γ_{N∘P}(⋅ ∣ x), the variance of the composite S_n ∘ Γ_P(⋅ ∣ x) is readily shown to be
E_{(S_n∘Γ_P(⋅∣x))}[(Id_{X⊗Y} − E_{(S_n∘Γ_P(⋅∣x))}[Id_{X⊗Y}])²] = (0, k(x, x) + σ²).
i: R^p → Y^X
where R^p has the product σ-algebra with respect to the canonical projection maps onto the measurable space R with the Borel σ-algebra. Note that i(a) ∈ Y^X corresponds (via the SMwCC structure) to a function i(a): X → Y. [19] This parametric map i determines the deterministic P arrow δ_i: R^p → Y^X, which in turn determines the deterministic tensor
[Diagram (Figure 25): the composite S_n ∘ (1_X ⊗ δ_i): X ⊗ R^p → X ⊗ Y^X → X ⊗ Y, with prior Γ_P(⋅ ∣ x) and measurement data d.]
In the ML literature, one generally assumes complete certainty with regard to the input state x ∈ X. However, there are situations in which the input state x is itself uncertain. This occurs in object recognition problems where x is a feature vector which may be only partially observed because of obscuration, and such data is the only training data available.
For real world modeling applications there must be a noise model component asso-
ciated with a parametric model for it to make sense. For example we could estimate
an unknown function as a constant function, and hence have the 1 parameter model
i∶ R → Y X given by i(a) = a, the constant function on X with value a. Despite how crude
this approximation may be, we can still obtain a “best” such Bayesian approximation to
the function given measurement data, where "best" is defined in the Bayesian probabilistic sense: given a prior and a measurement, the posterior gives the best estimate under
[19] Note that the function i(a) is unique by our construction of the transpose of the function i(a) ∈ Y^X. The non-uniqueness aspect of the SMwCC structure only arises in the other direction: given a conditional probability measure there may be multiple functions satisfying the required commutativity condition.
the given modeling assumptions. Without a noise component, however, we cannot even account for the fact that our data differs from our model, and a model that cannot do so is worthless for analysis and prediction purposes.
Example 14. Affine Parametric Model. Let X = R^n and p = n + 1. The affine parametric model is given by considering the valid hypotheses to consist of affine functions
F_a: X → Y
x ↦ ∑_{j=1}^n a_j x_j + a_{n+1}   (39)
and the parametric map
i: R^{n+1} → Y^X
a ↦ i(a) = F_a
[Figure: "Separating Hyperplane".]
In this particular example where the class labels are integer valued, the resulting
function we are estimating will not be integer valued but, as usual, approximated by real
values.
Such parametric models are useful to avoid overfitting data because the number of parameters is finite and fixed with respect to the number of measurements, in contrast to nonparametric methods in which each measurement serves as a parameter defining the updated probability measure on Y^X.
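For the affine model with a Gaussian prior on the parameter vector a and additive Gaussian noise, the posterior over parameters is available in closed form. The following sketch is our own illustration (the data and the hyperparameters τ², σ² are invented) for a one-dimensional input:

```python
import numpy as np

tau2, sigma2 = 10.0, 0.1                       # prior variance and noise variance
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])
Phi = np.column_stack([X, np.ones_like(X)])    # design matrix for F_a(x) = a1*x + a2

# Posterior N(mu_post, Sig_post) over a = (a1, a2) with prior a ~ N(0, tau2 * I):
#   Sig_post = (Phi^T Phi / sigma2 + I / tau2)^{-1},  mu_post = Sig_post Phi^T y / sigma2.
Sig_post = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(2) / tau2)
mu_post = Sig_post @ Phi.T @ y / sigma2

# The weak prior barely shrinks the least-squares fit (slope ~2.03, intercept ~1.03).
assert abs(mu_post[0] - 2.029) < 0.01
assert abs(mu_post[1] - 1.029) < 0.01
```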
More generally, for any parametric map i take the canonical basis vectors e_j, which are the j-th unit vectors in R^p, and let the image of the basis elements {e_j}_{j=1}^p under the
[Figure: "Separating Ellipse".]
Returning to the general construction of the Bayesian model for the parametric model
we take the Gaussian additive noise model, Equation 38, and expand the diagram in
Figure 25 to the diagram in Figure 28, where the parametric model sampling distribution
can be readily determined on rectangles A × B ∈ ΣX⊗Y by
S(A × B ∣ (x, a)) = N( ev_x^{-1}(Γ_x^{-1}(A × B)) ∣ i(a) ),   where i(a) = F_a,
= N_{F_a}( ev_x^{-1}(Γ_x^{-1}(A × B)) )
= δ_x(A) ⋅ (1/(√(2π) σ)) ∫_{y∈B} e^{−(y−F_a(x))²/(2σ²)} dy.
[Figure 28 diagram: S_n ∘ (1_X ⊗ δ_i) = δ_ev ∘ (1_X ⊗ δ_{Γ_•}) ∘ (1_X ⊗ N) ∘ (1_X ⊗ δ_i): X ⊗ R^p → X ⊗ Y^X → X ⊗ Y^X → X ⊗ (X ⊗ Y)^X → X ⊗ Y, with prior Γ_P(⋅ ∣ x).]
Figure 28: The parametric model sampling distribution as a composite of four compo-
nents.
Here we have used the fact N_{F_a} ev_x^{-1} ∼ N(F_a(x), σ²), which follows from Equation 37
and the property that a GP evaluated on any coordinate is a normal distribution with
the mean and variance evaluated at that coordinate.
8 CONSTRUCTING INFERENCE MAPS
[Figure 29 diagram: S^x = δ_{ev_x}: Y^X → Y.]
Figure 29: The noise free sampling distributions S^x given the prior δ_x ⊗ P with the Dirac measure on the X component.
Using the property that δevx (B ∣ f ) = 1evx−1 (B) (f ) for all B ∈ ΣY and f ∈ Y X , the resulting
deterministic sampling distributions (one for each x ∈ X) are given by
This special case of the prior δx ⊗P , which is the most important one for many ML applica-
tions and the one implicitly assumed in ML textbooks, permits a complete mathematical
analysis.
Given the probability measure P ∼ GP(m, k) and S^x = δ_{ev_x}, it follows that the composite is the pushforward probability measure
S x ○ P = P evx−1 , (42)
which is the special case of Figure 14 with X0 = {x}. Using the fact that P projected
onto any coordinate is a normal distribution as shown in Figure 30, it follows that the
expected mean is
E_{P ev_x^{-1}}(Id_Y) = E_P(ev_x) = m(x)
while the expected variance is
These are precisely specified by the characterization P evx−1 ∼ N (m(x), k(x, x)).
[Figure 30 diagram: P ∼ GP(m, k): 1 → Y^X and S^x = δ_{ev_x}: Y^X → Y compose to P ev_x^{-1} ∼ N(m(x), k(x, x)); I^x: Y → Y^X denotes the corresponding inference map.]
Figure 30: The composite of the prior distribution P ∼ GP(m, k) and the sampling
distribution S x give the coordinate projections as priors on Y .
Recall that the corresponding inference map I x is any P map satisfying the necessary
and sufficient condition of Equation 11, i.e., for all A ∈ ΣY X and B ∈ ΣY ,
Since the σ-algebra of Y^X is generated by elements ev_z^{-1}(A), for z ∈ X and A ∈ Σ_Y, we can
take A = evz−1 (A) in the above expression to obtain the equivalent necessary and sufficient
condition on I x of
From Equation 41, S x (B ∣ f ) = 1evx−1 (B) (f ), so substituting this value into the left hand
side of this equation reduces that term to P (evx−1 (B) ∩ evz−1 (A)). Rearranging the order
of the terms it follows the condition on the inference map I x is
Since the left hand side of this expression is integrated with respect to the pushforward
probability measure P evx−1 it is equivalent to
∫_{y∈B} I^x(ev_z^{-1}(A) ∣ y) d(P ev_x^{-1}) = ∫_{f∈ev_x^{-1}(B)} I^x(ev_z^{-1}(A) ∣ ev_x(f)) dP
unique. However we can require that the posterior PY1 X be a GP specified by updated
mean and covariance functions m1 and k 1 respectively, which depend upon the condition-
ing value y, so PY1 X ∼ GP(m1 , k 1 ). To determine PY1 X , and hence the desired inference
map I x , we make a hypothesis about the updated mean and covariance functions m1 and
k 1 characterizing PY1 X given a measurement at the pair (x, y) ∈ X × Y . Let us assume the
updated mean function is of the form
m¹(z) = m(z) + (k(z, x)/k(x, x)) (y − m(x))   (44)
and the updated covariance function is of the form
k¹(w, z) = k(w, z) − k(w, x) k(x, z)/k(x, x).   (45)
To prove these updated functions suffice to specify the inference map I x (⋅ ∣ y) = PY1 X ∼
GP(m1 , k 1 ) satisfying the necessary and sufficient condition we simply evaluate
by substituting I^x(⋅ ∣ f(x)) = P¹_{Y^X} ∼ GP(m¹, k¹) and verifying that it yields P(ev_x^{-1}(B) ∩ ev_z^{-1}(A)). Since I^x ev_z^{-1}(⋅ ∣ f(x)) = P¹_{Y^X} ev_z^{-1} is a normal distribution of mean
m¹(z) = m(z) + (k(z, x)/k(x, x)) ⋅ (f(x) − m(x))
and covariance
k¹(z, z) = k(z, z) − k(z, x)²/k(x, x)
it follows that
∫_{f∈ev_x^{-1}(B)} I^x(ev_z^{-1}(A) ∣ ev_x(f)) dP = ∫_{f∈ev_x^{-1}(B)} ( (1/√(2π k¹(z, z))) ∫_{v∈A} e^{−(m¹(z)−v)²/(2k¹(z,z))} dv ) dP
and equals
∫_{y∈B} ( (1/√(2π k¹(z, z))) ∫_{v∈A} e^{−(m(z) + (k(z,x)/k(x,x))(y − m(x)) − v)²/(2k¹(z,z))} dv ) d(P ev_x^{-1}),
where
u = (y, v)ᵀ,    ū = (m(x), m(z))ᵀ,
and
Ω = [ k[x, x]  k[x, z]
      k[z, x]  k[z, z] ],
which we recognize as a normal distribution N(ū, Ω).
On the other hand, we claim that
[Figure 31 diagram: 1 → Y^X → Y^{X_0} → Y_x × Y_z via P, δ_{Y^ι}, and δ_{ev_x × ev_z}.]
Figure 31: Proving the joint distribution δ_{ev_x×ev_z} ∘ δ_{Y^ι} ∘ P = P(ev_x^{-1}(⋅) ∩ ev_z^{-1}(⋅)) is a normal distribution N(ū, Ω).
[21] Formally the arguments should be numbered in the given probability measure as P(ev_x^{-1}(#1) ∩ ev_z^{-1}(#2)) because ev_x^{-1}(A) ∩ ev_z^{-1}(B) ≠ ev_x^{-1}(B) ∩ ev_z^{-1}(A). However the subscripts can be used to identify which component measurable sets are associated with each argument.
The diagram in Figure 31 commutes because δ_{π_{Y_x}} ∘ δ_{ev_x×ev_z} = δ_{ev_x} and δ_{π_{Y_z}} ∘ δ_{ev_x×ev_z} = δ_{ev_z} while, using (ev_x × ev_z) ∘ Y^ι = (ev_x, ev_z),
Moreover, the covariance k of P(ev_x^{-1}(⋅) ∩ ev_z^{-1}(⋅)) is represented by the matrix Ω because, by definition of P in terms of m and k, its restriction to Y^{X_0} ≅ Y_x × Y_z has covariance k∣_{X_0} ≅ Ω.
Consequently the necessary and sufficient condition for I x = PY1 X ∼ GP(m1 , k 1 ) to
be an inference map is satisfied by the projection of PY1 X onto any single coordinate z
which corresponds to the restriction of PY1 X via the deterministic map Y ι ∶ Y X → Y X0
with X0 = {z} as in Figure 14. But this procedure immediately extends to all finite
subsets X0 ⊂ X using matrix algebra and consequently we conclude that the necessary
and sufficient condition for I x to be an inference map for the prior P and the noise free
sampling distribution S x is satisfied.
Writing the prior GP as P ∼ GP(m0 , k 0 ) the recursive updating equations are
k i (z, xi )
mi+1 (z ∣ (xi , yi )) = mi (z) + (yi − mi (xi )) for i = 0, . . . , N − 1 (46)
k (xi , xi )
i
and
k i (w, xi )k i (xi , z)
k i+1 ((w, z) ∣ (xi , yi )) = k i (w, z) − for i = 0, . . . , N − 1 (47)
k i (xi , xi )
where the terms on the left denote the posterior mean and covariance functions of m^i and k^i given a new measurement (x_i, y_i). These expressions coincide with the standard formulas written for N arbitrary measurements {(x_i, y_i)}_{i=1}^N, with X_0 = (x_0, …, x_{N−1}) a finite sequence of measurement points,
m̃(z ∣ X_0) = m(z) + K(z, X_0) K(X_0, X_0)^{-1} (y − m(X_0))   (48)
and
k̃((w, z) ∣ X_0) = k(w, z) − K(w, X_0) K(X_0, X_0)^{-1} K(X_0, z),   (49)
where K(w, X_0) is the row vector with components k(w, x_i), K(X_0, X_0) is the matrix with components k(x_i, x_j), and K(X_0, z) is a column vector with components k(x_i, z). [22]
The notation m̃ and k̃ is used to differentiate these standard expressions from ours above.
Equations 48 and 49 are a computationally efficient way to keep track of the updated
[22] When the points are not independent then one can use a perturbation method or other procedure to avoid degeneracy.
mean and covariance functions. One can easily verify the recursive equations determine
the standard equations using induction.
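The induction can also be checked numerically: applying Equations 46 and 47 one measurement at a time reproduces the batch formulas 48 and 49. A sketch with an illustrative squared-exponential prior covariance and invented data (our choices, not prescribed by the text):

```python
import numpy as np

def k0(a, b):
    """Prior covariance k^0 (squared exponential, an illustrative choice)."""
    return np.exp(-0.5 * (a - b) ** 2)

def update(m_old, k_old, xi, yi):
    """One step of the recursive updates, Equations 46 and 47."""
    def m_new(z):
        return m_old(z) + k_old(z, xi) / k_old(xi, xi) * (yi - m_old(xi))
    def k_new(w, z):
        return k_old(w, z) - k_old(w, xi) * k_old(xi, z) / k_old(xi, xi)
    return m_new, k_new

xs, ys = [0.5, 1.5, 3.0], [1.0, -0.5, 0.3]   # invented measurements
m, k = (lambda z: 0.0), k0                   # prior mean m^0 = 0 (assumption)
for xi, yi in zip(xs, ys):
    m, k = update(m, k, xi, yi)

# Batch update at a test point z (Equations 48 and 49).
z = 2.0
X0, y0 = np.array(xs), np.array(ys)
K = k0(X0[:, None], X0[None, :])
m_batch = k0(z, X0) @ np.linalg.solve(K, y0)
k_batch = k0(z, z) - k0(z, X0) @ np.linalg.solve(K, k0(X0, z))

# Sequential and batch posteriors agree.
assert abs(m(z) - m_batch) < 1e-8
assert abs(k(z, z) - k_batch) < 1e-8
```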
A review of the derivation of PY1 X indicates that the posterior PY1 X ∼ GP(m1 , k 1 ) is
actually parameterized by the measurement (x1 , y1 ) because the above derivation holds for
any measurement (x1 , y1 ) and this pair of values uniquely determines m1 and k 1 through
the Equations 46 and 47, or equivalently Equations 48 and 49, for a single measurement.
By the SMwCC structure of P each parameterized GP P¹_{Y^X} can be put into the bijective correspondence shown in Figure 32, where
[Figure 32 diagram: the parameterized GP P¹_{Y^X}: X ⊗ Y → Y^X corresponds to its adjunct P¹_{Y^X}: X ⊗ (X ⊗ Y) → Y.]
[Figure 33 diagrams: the Gaussian additive noise model with prior P ∼ GP(m, k): 1 → Y^X and sampling distribution S^x = δ_{ev_x} ∘ N: Y^X → Y^X → Y, for which δ_{ev_x} ∘ N ∘ P ∼ N(m(x), k(x, x) + k_N(x, x)) = N(m(x), κ(x, x)); it decomposes into a Bayesian model with prior P, kernel N, and inference map I_⋆, followed by a noise free model with prior N ∘ P, kernel δ_{ev_x}, and inference map Inf.]
Figure 33: Splitting the Gaussian additive noise Bayesian model (top diagram) into two
separate Bayesian models (bottom two diagrams) and composing the inference maps for
these two simple Bayesian models gives the inference map for the original Gaussian addi-
tive Bayesian model.
Observe that the composition of the two bottom diagrams is the top diagram. The
bottom diagram on the right is a noise free Bayesian model with GP prior N ○ P and
sampling distribution δx whose inference map Inf we have already determined analytically
in Section 8.1. Given a measurement y ∈ Y at x ∈ X, the inference map is given by the
updating Equations 44 and 45 for the mean and covariance functions characterizing the GP
on Y X . The resulting posterior GP on Y X can then be viewed as a measurement on Y X for
the bottom left diagram, which is a Bayesian model with prior P and sampling distribution
N. The inference map I⋆ for this diagram is the identity map on Y^X, I⋆ = δ_{Id_{Y^X}}. This is easy to verify using the Bayes product rule (Equation 13), ∫_{a∈A} N(B | a) dP = ∫_{f∈B} δ_{Id_{Y^X}}(A | f) d(N ∘ P), for any A, B ∈ Σ_{Y^X}. Composition of these two inference maps, Inf and I⋆,
then yields the resulting inference map for the Gaussian additive noise Bayesian model.
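Concretely, the composite inference map amounts to running the single-measurement GP update with κ = k + k_N in place of k. A numeric sketch (our own illustration, with RBF kernel, noise variance σ², and data as assumed values): iterating the noisy single-measurement update agrees with batch GP regression against the noisy Gram matrix K(X_0, X_0) + σ²I.

```python
import numpy as np

# Sketch: sequential noisy GP updating (kappa = k + sigma^2 on the
# diagonal) versus batch regression with the Gram matrix K + sigma^2 I.

def rbf(u, v):
    return np.exp(-0.5 * (u - v) ** 2)

sigma2 = 0.1  # assumed additive-noise variance k_N

def noisy_update(m, k, x, y):
    """One noisy measurement (x, y): divide by kappa(x, x) = k(x, x) + sigma^2."""
    kxx = k(x, x) + sigma2
    def m1(z, m=m, k=k, x=x, y=y, kxx=kxx):
        return m(z) + k(z, x) * (y - m(x)) / kxx
    def k1(u, v, k=k, x=x, kxx=kxx):
        return k(u, v) - k(u, x) * k(x, v) / kxx
    return m1, k1

X0 = np.array([0.0, 1.0, 2.5])
ys = np.array([0.7, -0.1, 0.4])

m, k = (lambda z: 0.0), rbf
for x, y in zip(X0, ys):
    m, k = noisy_update(m, k, x, y)

# Batch: m~(z | X0) = K(z, X0) (K(X0, X0) + sigma^2 I)^{-1} y.
Kn = rbf(X0[:, None], X0[None, :]) + sigma2 * np.eye(3)
z = 1.7
m_batch = rbf(z, X0) @ np.linalg.solve(Kn, ys)
assert abs(m(z) - m_batch) < 1e-8
```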
With this observation both of the recursive updating schemes given by Equations 46
and 47 are valid for the Gaussian additive noise model with k replaced by κ. The corre-
sponding standard expressions for the noisy model are then
m̃(z ∣ X0 ) = m(z) + K(z, X0 )K(X0 , X0 )−1 (y − m(X0 ))
and
κ̃((w, z) ∣ X0 ) = κ(w, z) − K(w, X0 )K(X0 , X0 )−1 K(X0 , z),
where the quantities like K(w, X0 ) are as defined previously (following Equation 49)
except now k is replaced by κ. For w ≠ z and neither among the measurements X0 these
Figure 34: The Gaussian additive noise parametric sampling distributions S_{px} = δ_{ev_x} ∘ N ∘ δ_i : R^p → Y viewed as a family of sampling distributions, one for each x ∈ X.
function k, it follows that the composite S_{px} ∘ P ∼ N(F_m(x), k(x, x) + σ²), while the inference map I_{px} satisfies, for all B ∈ Σ_Y and all A ∈ Σ_{R^p},

∫_{a∈A} S_{px}(B | a) dP = ∫_{y∈B} I_{px}(A | y) d(S_{px} ∘ P).
To determine this inference map Ipx it is necessary to require the parametric map
i : R^p → Y^X : a ↦ i_a
be an injective linear homomorphism. Under this condition, which can often be achieved
simply by eliminating redundant modeling parameters, we can explicitly determine the
inference map for the parameterized model, denoted Ipx , by decomposing it into two
inference maps as displayed in the diagram in Figure 35.
In Figure 35 the prior P : 1 → R^p pushes forward along δ_i to the prior P i^{-1} on Y^X, the sampling distribution factors as S_{px} = S_{nx} ∘ δ_i, and the inference map I_{nx} for the model with prior P i^{-1} and sampling distribution S_{nx} composes with the inference map I⋆ for the model with prior P and sampling distribution δ_i to give I_{px} = I⋆ ∘ I_{nx}.
Figure 35: The inference map for the parametric model is a composite of two inference
maps.
We first show that the stochastic process P i^{-1} is a GP; then, taking the sampling distribution S_{nx} = δ_{ev_x} ∘ N as the noisy measurement model, the result of the previous section provides the inference map I_{nx} in Figure 35.
Lemma 16. Let k be the matrix representation of the covariance function k. Then the push forward of P ∼ N(m, k) by i is a GP, P i^{-1} ∼ GP(i_m, k̂), where k̂(u, v) = u^T k v.
Proof. We need to show that the push forward of P i^{-1} by the restriction map Y^ι : Y^X → Y^{X_0} is a normal distribution for any finite subspace ι : X_0 ↪ X. Consider the commutative diagram in Figure 36, where Y_x is a copy of Y, X_0 = (x_1, ..., x_{n′}), and
ev_{x_1} × ⋯ × ev_{x_{n′}} : Y^{X_0} → ∏_{x∈X_0} Y_x.

Figure 36: the prior P : 1 → R^p followed by δ_i : R^p → Y^X, the restriction δ_{Y^ι} : Y^X → Y^{X_0}, and δ_{ev_{x_1} × ⋯ × ev_{x_{n′}}} : Y^{X_0} → ∏_{x∈X_0} Y_x.

From this it follows that the composite map δ_{ev_{x_1} × ⋯ × ev_{x_{n′}}} ∘ δ_{Y^ι} ∘ P i^{-1} ∼ N(X_0^T m, X_0^T k X_0).
Now consider the Bayesian model with prior P on R^p and sampling distribution δ_i : R^p → Y^X, whose pushforward is P i^{-1} and whose inference map is I⋆. Let e_j^T = (0, ..., 0, 1, 0, ..., 0) be the j-th unit vector of R^p and let i e_j = f_j ∈ Y^X. The elements {f_j}_{j=1}^p form a basis for the image of i by the assumed injective property of i. Let this finite basis have a dual basis {f_j^*}_{j=1}^p so that f_k^*(f_j) = δ_k(j).
Consider the measurable map f_1^* × ⋯ × f_p^* : Y^X → R^p.
Using the linearity of the parameter space R^p it follows that a = ∑_{i=1}^p a_i e_i and consequently (f_1^* × ⋯ × f_p^*)(i_a) = a, and hence (f_1^* × ⋯ × f_p^*) ∘ i = id_{R^p} in Meas. Now it follows that the corresponding inference map is I⋆ = δ_{f_1^* × ⋯ × f_p^*}, because the necessary and sufficient condition for I⋆ is given, for all ev_z^{-1}(B) ∈ Σ_{Y^X} (which generate Σ_{Y^X}) and all A ∈ Σ_{R^p}, by

∫_{a∈A} δ_i(ev_z^{-1}(B) | a) dP = ∫_{g∈ev_z^{-1}(B)} I⋆(A | g) d(P i^{-1}). (51)
On the other hand, using I⋆ = δ_{f_1^* × ⋯ × f_p^*}, the right hand term of Equation 51 also reduces to the same expression since

∫_{g∈ev_z^{-1}(B)} δ_{f_1^* × ⋯ × f_p^*}(A | g) d(P i^{-1}) = ∫_{a∈i^{-1}(ev_z^{-1}(B))} 1_{((f_1^* × ⋯ × f_p^*) ∘ i)^{-1}(A)}(a) dP,
which is the push forward measure of the GP I_{nx}(⋅ | y) ∼ GP(i_m^1, κ^1), where (as defined previously) κ = k + k_N and

i_m^1(z) = i_m(z) + (κ(z, x) / κ(x, x)) (y − i_m(x)) (53)

and

κ^1(u, v) = κ(u, v) − κ(u, x) κ(x, v) / κ(x, x). (54)
This GP projected onto any finite subspace ι : X_0 ↪ X is a normal distribution and, for X_0 = {x_1, x_2, ..., x_n}, composing I_{nx}(⋅ | y) ∼ GP(i_m^1, κ^1) with the restriction δ_{Y^ι} : Y^X → Y^{X_0} ≅ ∏_{i=1}^n Y_i ≅ R^n yields the normal distribution with mean i_m(X_0) + K(X_0, x)(y − i_m(x))/κ(x, x) and covariance κ(X_0, X_0) − K(X_0, x) K(X_0, x)^T / κ(x, x),
9 STOCHASTIC PROCESSES AS POINTS 65
where X0 is now viewed as the ordered set X0 = (x1 , . . . , xn ) and K(X0 , x) is the n-vector
with components κ(xj , x).
Iterating this updating procedure for N measurements {(x_i, y_i)}_{i=1}^N, the N-th posterior coincides with the analogous noisy measurement inference updating Equations 48 and 49 with κ in place of k.
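A standard concrete instance of this parametric construction (a sketch under our own assumptions: polynomial features φ, parameter prior N(0, S), and Gaussian additive noise, none of which are specified in the text) is Bayesian linear regression: the pushforward of the parameter prior along the linear injective map i is a GP with kernel k̂(u, v) = φ(u)^T S φ(v), and inference in weight space agrees with function-space (GP) inference.

```python
import numpy as np

# Sketch: weight-space posterior (Bayesian linear regression) versus the
# equivalent function-space GP posterior with kernel phi(u)^T S phi(v).

def phi(x):
    # Hypothetical feature map X -> R^3 chosen for illustration.
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

p, sigma2 = 3, 0.1          # parameter dimension, noise variance
S = np.eye(p)               # prior covariance of a ~ N(0, S)

X0 = np.array([-1.0, 0.0, 0.5, 2.0])
y = np.array([0.3, -0.2, 0.1, 1.5])
Phi = phi(X0)               # design matrix, shape (4, p)

# Weight-space posterior: a | y ~ N(mu_a, Sigma_a).
Sigma_a = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(S))
mu_a = Sigma_a @ Phi.T @ y / sigma2

# Function-space posterior mean with kernel k(u, v) = phi(u)^T S phi(v).
z = np.array([1.0])
Kxx = Phi @ S @ Phi.T + sigma2 * np.eye(len(X0))
kzx = phi(z) @ S @ Phi.T
m_gp = (kzx @ np.linalg.solve(Kxx, y))[0]

m_ws = (phi(z) @ mu_a)[0]
assert abs(m_gp - m_ws) < 1e-6
```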
hom_T(t_1, t_2) = { ⋆ iff t_1 ≤ t_2 ; ∅ otherwise }.
The functor category P T has as objects functors F ∶ (T, ≤) → P which play an im-
portant role in the theory of stochastic processes, and we formally give the following
definition.
From the modeling perspective we look at the image of the functor F ∈_ob P^T in the category P: given any sequence of ordered points {t_i}_{i=1}^∞ in T, their image under F is shown in Figure 37, where F_{t_i, t_{i+1}} = F(≤) is a P arrow.
Functoriality gives F_{t_i, t_{i+2}} = F_{t_{i+1}, t_{i+2}} ∘ F_{t_i, t_{i+1}}, which written out in terms of the Kleisli composition in P reads

F_{t_i, t_{i+2}}(B | x) = ∫_{y ∈ F(t_{i+1})} F_{t_{i+1}, t_{i+2}}(B | y) dF_{t_i, t_{i+1}}(⋅ | x)

for x ∈ F(t_i) (the "state" of the process at time t_i) and B ∈ Σ_{F(t_{i+2})}. This equation is called the Chapman-Kolmogorov relation and can be used, in the non-categorical characterization, to define a Markov process.
The important aspect to note about this definition of a Markov model is that the
measurable spaces F(ti ) can be distinct from the other measurable spaces F(tj ), for
j ≠ i, and of course the arrows Fti ,ti+1 are in general distinct. This simple definition
of a Markov transformation as a functor captures the property of an evolving process
being “memoryless” since if we know where the process F is at ti , say x ∈ F(ti ), then its
expectation at ti+1 (as well as higher order moments) can be determined without regard
to its “state” prior to ti .
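For finite state spaces a Markov kernel is a row-stochastic matrix and Kleisli composition is matrix multiplication, so the Chapman-Kolmogorov relation is just the matrix product. A small sketch (the matrices are illustrative assumptions) with distinct state spaces F(t_1), F(t_2), F(t_3), as the functorial definition allows:

```python
import numpy as np

# Toy Markov model on three times with *distinct* finite state spaces:
# |F(t1)| = 2, |F(t2)| = 3, |F(t3)| = 2.  Each kernel F(ti) -> F(tj) is a
# row-stochastic matrix; Chapman-Kolmogorov is the product F12 @ F23.

F12 = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])   # kernel F(t1) -> F(t2)
F23 = np.array([[0.5, 0.5],
                [0.9, 0.1],
                [0.2, 0.8]])        # kernel F(t2) -> F(t3)

F13 = F12 @ F23                     # composite kernel F(t1) -> F(t3)
assert F13.shape == (2, 2)
assert np.allclose(F13.sum(axis=1), 1.0)   # still row-stochastic

# Memorylessness: the distribution at t3 given state 0 at t1 depends only
# on the composite kernel, not on any earlier history.
mu3 = F13[0]
assert np.allclose(mu3, F12[0] @ F23)
```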
The arrows of the functor category P T are natural transformations η ∶ F → G, for
F, G ∈ob P T , and hence satisfy the commutativity relation given in Figure 38 for every
t1 , t2 ∈ T with t1 ≤ t2 .
Figure 38: the naturality square, with components η_{t_1} : F(t_1) → G(t_1) and η_{t_2} : F(t_2) → G(t_2) satisfying G_{t_1, t_2} ∘ η_{t_1} = η_{t_2} ∘ F_{t_1, t_2}.
The functor category P T has a terminal object 1 mapping t ↦ 1 for every t ∈ T and
this object 1 ∈ob P T allows us to generalize the definition of a stochastic process.23
Definition 18. Let X be any category. A stochastic process is a point in the category
P X , i.e., a P X arrow η ∶ 1 → F for some F ∈ob P X .24
The image of the chain t_1 ≤ t_2 ≤ t_3 ≤ ⋯ under the stochastic process µ : 1 → F gives the commutative diagram in Figure 39.
One can also observe that GPs can be defined using this generalized definition of a
stochastic process. For X a measurable space it follows for any finite subset X0 ⊂ X we
have the inclusion map ι ∶ X0 ↪ X which is a measurable function, using the subspace
σ-algbra for X0 , and we are led back to Diagram 14 with the stochastic process P ∶ 1 → Ŷ ,
where Ŷ is as defined in the paragraph above following Definition 18, which satisfies the
appropriate restriction property defining a GP.
These simple examples illustrate that different stochastic processes can be obtained by
either varying the structure of the category X and/or by placing additional requirements
on the projection maps, e.g., requiring the projections be normal distributions on finite
subspaces of the exponent category X.
The diagram 1 → F(t_1) → Y_{t_1}, given by the prior µ_{t_1} followed by S_{t_1}, together with a data point d_{t_1} : 1 → Y_{t_1}, characterizes a Bayesian model, where Y_{t_1} is a copy of a data measurement space Y, S_{t_1} is interpreted as a measurement model, and d_{t_1} is an actual data measurement on the "state" space F(t_1). This determines an inference map I_{t_1} so that
given a measurement dt1 the posterior probability on F(t1 ) is It1 ○ dt1 . Putting the two
measurement models together with the Markov transformation model F we obtain the
following diagram in Figure 40.
Figure 40: The hidden Markov model. The prior µ_{t_1} on F(t_1) and the measurement d_{t_1} : 1 → Y_{t_1} determine, through the inference map I_{t_1} for S_{t_1}, the posterior µ̂_{t_1} = I_{t_1} ∘ d_{t_1}; the Markov map F_{t_1, t_2} : F(t_1) → F(t_2) then carries it to the new prior F_{t_1, t_2} ∘ µ̂_{t_1}, to be updated in turn via S_{t_2} and I_{t_2}.
This is the hidden Markov process: given a prior probability µ_{t_1} on the space F(t_1), we can use the measurement d_{t_1} to update the prior to the posterior µ̂_{t_1} = I_{t_1} ∘ d_{t_1} on F(t_1). The posterior then composes with F_{t_1, t_2} to give the prior F_{t_1, t_2} ∘ µ̂_{t_1} on F(t_2), and now the process can be repeated indefinitely. The Kalman filter is an example in which the Markov map F_{t_1, t_2} describes the linear dynamics of some system under consideration (as in tracking a satellite), while the sampling distributions S_{t_1} model the noisy measurement process, which for the Kalman filter is Gaussian additive noise. Of course one can easily replace the linear dynamics with nonlinear dynamics and the Gaussian additive noise model with any other measurement model, obtaining an extended Kalman filter; the above form of the diagram does not change at all, only the P maps change.
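The loop just described can be sketched for a one-dimensional linear-Gaussian system, where the Markov map is the linear dynamics and the inference map is the conjugate Gaussian update (all numerical values below are illustrative assumptions, not from the paper):

```python
# One-dimensional Kalman filter as alternating Markov prediction and
# Bayesian update, a sketch of the Figure 40 loop.

F, Q = 0.9, 0.05     # dynamics x' = F x + w,  w ~ N(0, Q)
H, R = 1.0, 0.2      # measurement y = H x + v,  v ~ N(0, R)

def predict(m, P):
    """Push the posterior through the Markov map F_{t, t+1}."""
    return F * m, F * P * F + Q

def update(m, P, y):
    """Inference map I_t for a Gaussian additive-noise measurement y."""
    K = P * H / (H * P * H + R)              # Kalman gain
    return m + K * (y - H * m), (1.0 - K * H) * P

m, P = 0.0, 1.0                              # prior on F(t1)
for y in [0.8, 0.6, 0.7]:
    m, P = update(m, P, y)                   # posterior mu_hat = I_t(y)
    m, P = predict(m, P)                     # next prior F_{t,t+1} o mu_hat

assert 0.0 < P < 1.0                         # uncertainty reduced below prior
assert 0.0 < m < 1.0
```

Swapping `predict` for a nonlinear map with a linearized covariance step gives the extended Kalman filter; the shape of the loop is unchanged.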
10 Final Remarks
In closing, we would like to make a few comments on the use of category theory for ML,
where the largest potential payoff lies in exploiting the abstract framework that categorical
language provides. This section assumes a basic familiarity with monads and should be
viewed as only providing conceptual directions for future research which we believe are
relevant for the mathematical development of learning systems. Further details on the
theory of monads can be found in most category theory books, while the basics as they
relate to our discussion below can be found in our previous paper [6], in which we provide
the simplest possible example of a decision rule on a discrete space.
Seemingly all aspects of ML, including Dirichlet distributions and unsupervised learning (clustering), can be characterized using the category P. As an elementary example,
mixture models can be developed by consideration of the space of all (perfect) probability
measures PX on a measurable space X endowed with the coarsest σ-algebra such that
the evaluation maps evB ∶ PX → [0, 1] given by evB (P ) = P (B), for all B ∈ ΣX , are
measurable. This actually defines the object mapping of a functor P ∶ P → Meas which
sends a measurable space X to the space PX of probability measures on X. On arrows,
P sends the P-arrow f : X → Y to the measurable function Pf : PX → PY defined pointwise on Σ_Y by

Pf(P)(B) = ∫_X f_B dP.
This functor is called the Giry monad, denoted G, and the Kleisli category K(G) of the Giry monad is equivalent to P.^25 The reason we have chosen to present the material from the perspective of P rather than K(G) is that the existing literature on ML uses Markov kernels rather than the equivalent arrows in K(G). The Giry monad determines the nondeterministic P mapping ε_X : PX → X given by ε_X(P, B) = ev_B(P) = P(B) for all P ∈ P(X) and all B ∈ Σ_X. Using this construction, any probability measure P on PX then yields a mixture of probability measures on X through the composite map 1 → PX → X, so that ε_X ∘ P is a mixture model.
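Sampling from the composite ε_X ∘ P is two-stage sampling: draw a measure from P, then draw a point from it. A small sketch (the component weights and means are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# A probability measure P on PX: weight 0.3 on N(-2, 1), weight 0.7 on N(3, 1).
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])

# Sampling eps_X o P: first draw a component (a measure) from P, then draw
# a point from that measure; the marginal on X is the mixture.
comp = rng.choice(2, size=100_000, p=weights)
x = rng.normal(means[comp], 1.0)

# Mixture mean is 0.3 * (-2) + 0.7 * 3 = 1.5.
assert abs(x.mean() - 1.5) < 0.05
```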
2. Integration with respect to a push forward measure can be pulled back. Suppose f : X → Y is any measurable function, P is a probability measure on X, and φ : Y → R is any measurable function. Then

∫_Y φ d(P f^{-1}) = ∫_X (φ ∘ f) dP.

To prove this simply show that it holds for φ = 1_B, the characteristic function at B, then extend it to any simple function, and finally use the monotone convergence theorem to show it holds for any measurable function.
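On a finite space the identity can be checked exactly (a sketch with an arbitrary choice of P, f, and φ):

```python
from fractions import Fraction

# Exact check of  ∫_Y φ d(P f^{-1}) = ∫_X (φ ∘ f) dP  on a finite space.
P = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}  # measure on X

def f(x):
    return x * x % 3          # measurable map f : X -> Y

def phi(y):
    return y + 1              # test function φ : Y -> R

# Pushforward measure P f^{-1} on Y.
Pf = {}
for x, p in P.items():
    Pf[f(x)] = Pf.get(f(x), Fraction(0)) + p

lhs = sum(phi(y) * p for y, p in Pf.items())   # integral against P f^{-1}
rhs = sum(phi(f(x)) * p for x, p in P.items())  # pulled-back integral
assert lhs == rhs
assert lhs == Fraction(3, 2)
```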
P = Q as arrows (X, Σ_X) → (Y, Σ_Y), since

P({a} | 1) = 0 = Q({a} | 1), P({b} | 1) = .5 = Q({b} | 1), P({c} | 1) = .5 = Q({c} | 1)

and

P({a} | 2) = 0 = Q({a} | 2), P({b} | 2) = .5 = Q({b} | 2), P({c} | 2) = .5 = Q({c} | 2).
Since P ≠ Q the uniqueness condition required for the closedness property fails and only
the existence condition is satisfied.
13 References
[1] S. Abramsky, R. Blute, and P. Panangaden, Nuclear and trace ideals in tensored-∗ categories, Journal of Pure and Applied Algebra, Vol. 143, Issue 1-3, 1999, pp 3-47.
[2] David Barber, Bayesian Reasoning and Machine Learning, Cambridge University
Press, 2012.
[3] N. N. Cencov, Statistical decision rules and optimal inference, Volume 53 of Translations of Mathematical Monographs, American Mathematical Society, 1982.
[4] Bob Coecke and Robert Spekkens, Picturing classical and quantum Bayesian inference, Synthese, June 2012, Volume 186, Issue 3, pp 651-696. http://link.springer.com/article/10.1007/s11229-011-9917-5
[5] David Corfield, Category Theory in Machine Learning, n-category cafe blog. http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html
[6] Jared Culbertson and Kirk Sturtz, A Categorical Foundation for Bayesian Probability, Applied Categorical Structures, 2013. http://link.springer.com/article/10.1007/s10485-013-9324-9.
[8] E.E. Doberkat, Kleisli morphisms and randomized congruences for the Giry monad, J. Pure and Applied Algebra, Vol. 211, pp 638-664, 2007.
[9] R.M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Mathematics, no. 74, Cambridge University Press, 2002.
[10] R. Durrett, Probability: Theory and Examples, 4th ed., Cambridge University Press,
New York, 2010.
[11] A.M. Faden. The Existence of Regular Conditional Probabilities: Necessary and
Sufficient Conditions. The Annals of Probability, 1985, Vol. 13, No. 1, 288-298.
[13] Tobias Fritz, A presentation of the category of stochastic matrices, 2009. http://arxiv.org/pdf/0902.2554.pdf
[15] E.T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press,
2003.
[16] F.W. Lawvere, The category of probabilistic mappings. Unpublished seminar notes
1962.
[18] X. Meng, Categories of convex sets and of metric spaces, with applications to stochastic programming and related areas, Ph.D. Thesis, State University of New York at Buffalo, 1988.
[19] Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[20] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.