Bayesian Machine Learning via Category Theory

Jared Culbertson and Kirk Sturtz


December 6, 2013
arXiv:1312.1445v1 [math.CT] 5 Dec 2013

Abstract
From the Bayesian perspective, the category of conditional probabilities (a variant
of the Kleisli category of the Giry monad, whose objects are measurable spaces
and arrows are Markov kernels) gives a nice framework for conceptualization and
analysis of many aspects of machine learning. Using categorical methods, we construct
models for parametric and nonparametric Bayesian reasoning on function
spaces, thus providing a basis for the supervised learning problem. In particular,
stochastic processes are arrows to these function spaces which serve as prior
probabilities. The resulting inference maps can often be analytically constructed in this
symmetric monoidal weakly closed category. We also show how to view general
stochastic processes using functor categories and demonstrate the Kalman filter as
an archetype for the hidden Markov model.

Keywords: Bayesian machine learning, categorical probability, Bayesian probability

Contents
1 Introduction
2 The Category of Conditional Probabilities
    2.1 (Weak) Product Spaces and Joint Distributions
    2.2 Constructing a Joint Distribution Given Conditionals
    2.3 Constructing Regular Conditionals given a Joint Distribution
3 The Bayesian Paradigm using P
4 Elementary applications of Bayesian probability
5 The Tensor Product
    5.1 Graphs of Conditional Probabilities
    5.2 A Tensor Product of Conditionals
    5.3 Symmetric Monoidal Categories
6 Function Spaces
    6.1 Stochastic Processes
    6.2 Gaussian Processes
    6.3 GPs via Joint Normal Distributions
7 Bayesian Models for Function Estimation
    7.1 Nonparametric Models
        7.1.1 Noise Free Measurement Model
        7.1.2 Gaussian Additive Measurement Noise Model
    7.2 Parametric Models
8 Constructing Inference Maps
    8.1 The noise free inference map
    8.2 The noisy measurement inference map
    8.3 The inference map for parametric models
9 Stochastic Processes as Points
    9.1 Markov processes via Functor Categories
    9.2 Hidden Markov Models
10 Final Remarks
11 Appendix A: Integrals over probability measures.
12 Appendix B: The weak closed structure in P
13 References

1 Introduction
Speculation on the utility of using categorical methods in machine learning (ML) has
been expounded by numerous people, including by the denizens at the n-category cafe
blog [5] as early as 2007. Our approach to realizing categorical ML is based upon viewing
ML from a probabilistic perspective and using categorical Bayesian probability. Several
recent texts (e.g., [2, 19]), along with countless research papers on ML have emphasized
the subject from the perspective of Bayesian reasoning. Combining this viewpoint with
the recent work [6], which provides a categorical framework for Bayesian probability, we
develop a category theoretic perspective on ML. The abstraction provided by category
theory not only serves as a basis for organizing one's thoughts on the subject, but
also provides an efficient graphical method for model building in much the same way that
probabilistic graphical modeling (PGM) has provided for Bayesian network problems.
In this paper, we focus entirely on the supervised learning problem, i.e., the regression
or function estimation problem. The general framework applies to any Bayesian machine
learning problem, however. For instance, the unsupervised clustering or density estimation
problems can be characterized in a similar way by changing the hypothesis space and
sampling distribution. For simplicity, we choose to focus on regression and leave the
other problems to the industrious reader. For us, then, the Bayesian learning problem
is to determine a function f ∶ X → Y which takes an input x ∈ X, such as a feature
vector, and associates an output (or class) f(x) with x. Given a measurement (x, y),
or a set of measurements $\{(x_i, y_i)\}_{i=1}^N$ where each $y_i$ is a labeled output (i.e., training
data), we interpret this problem as an estimation problem of an unknown function f
which lies in $Y^X$, the space of all measurable functions¹ from X to Y, such that $f(x_i) \approx y_i$.
When Y is a vector space the space $Y^X$ is also a vector space that is infinite dimensional
when X is infinite. If we choose to allow all such functions (every function $f \in Y^X$ is
a valid model) then the problem is nonparametric. On the other hand, if we only allow
functions from some subspace $V \subset Y^X$ of finite dimension p, then we have a parametric
model characterized by a measurable map $i : \mathbb{R}^p \to Y^X$. The image of i is then the
space of functions which we consider as valid models of the unknown function for the
Bayesian estimation problem. Hence, the elements $a \in \mathbb{R}^p$ completely determine the valid
modeling functions $i(a) \in Y^X$. Bayesian modeling splits the problem into two aspects:
(1) specification of the hypothesis space, which consists of the “valid” functions f, and (2)
a noisy measurement model such as $y_i = f(x_i) + \epsilon_i$, where the noise component $\epsilon_i$ is often
modeled by a Gaussian distribution. Bayesian reasoning with the hypothesis space taken
as $Y^X$ or any subspace $V \subset Y^X$ (finite or infinite dimensional) and the noisy measurement
model determining a sampling distribution can then be used to efficiently estimate (learn)
the function f without overfitting the data.
We cast this whole process into a graphical formulation using category theory, which
like PGM, can in turn be used as a modeling tool itself. In fact, we view the components of
these various models, which are just Markov kernels, as interchangeable parts. An important
part of solving any ML problem with a Bayesian model consists of choosing
the appropriate parts for a given setting. The close relationship between parametric and
nonparametric models comes to the forefront in the analysis with the measurable map
$i : \mathbb{R}^p \to Y^X$ connecting the two different types of models. To illustrate this point suppose
we are given a normal distribution P on $\mathbb{R}^p$ as a prior probability on the unknown
parameters. Then the push forward measure² of P by i is a Gaussian process, which is a basic
tool in nonparametric modeling. When composed with a noisy measurement model, this
provides the whole Bayesian model required for a complete analysis and an inference map
can be analytically constructed.³ Consequently, given any measurement (x, y), taking the
inference map conditioned at (x, y) yields the updated prior probability, which is another
normal distribution on $\mathbb{R}^p$.

¹ Recall that a σ-algebra ΣX on X is a collection of subsets of X that is closed under complements
and countable unions (and hence intersections); the pair (X, ΣX) is called a measurable space and any
set A ∈ ΣX is called a measurable set of X. A measurable function f ∶ X → Y is defined by the property
that for any measurable set B in the σ-algebra of Y, we have that $f^{-1}(B)$ is in the σ-algebra of X. For
example, all continuous functions are measurable with respect to the Borel σ-algebras.

² A measure µ on a measurable space (X, ΣX) is a nonnegative real-valued function µ∶ ΣX → R≥0 such
that µ(∅) = 0 and $\mu(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mu(A_i)$ for every countable family of pairwise disjoint
measurable sets $A_i$. A probability measure is a measure where µ(X) = 1. In this
paper, all measures are probability measures and the terminology “distribution” will be synonymous with
“probability measure.”
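
To make the last claim concrete, the following is a minimal sketch under assumptions that are ours and not taken from the text above: the parametric map is taken to be linear in the parameters, i(a)(x) = a · φ(x) with the hypothetical feature vector φ(x) = (1, x), and the measurement noise is Gaussian with variance `noise_var`. Under these assumptions the update of a normal prior N(m, K) on $\mathbb{R}^p$ by one measurement (x, y) is the standard Gaussian conditioning formula, and the result is again normal. The function name `posterior` and all numerical values are illustrative only.

```python
from fractions import Fraction as F

def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def posterior(m, K, phi, y, noise_var):
    """Update a normal prior N(m, K) on the parameters after observing
    y = a . phi + noise, with noise ~ N(0, noise_var).

    Assumes K is symmetric. Returns the mean and covariance of the
    updated (again normal) distribution on the parameters.
    """
    K_phi = matvec(K, phi)
    s = dot(phi, K_phi) + noise_var          # predictive variance of y
    gain = [k / s for k in K_phi]
    residual = y - dot(phi, m)               # measurement minus prior prediction
    m_new = [mi + g * residual for mi, g in zip(m, gain)]
    p = len(m)
    K_new = [[K[i][j] - gain[i] * K_phi[j] for j in range(p)] for i in range(p)]
    return m_new, K_new

# Hypothetical example: p = 2, standard normal prior, one measurement (x, y) = (2, 3).
m0 = [F(0), F(0)]
K0 = [[F(1), F(0)], [F(0), F(1)]]
x, y = F(2), F(3)
phi_x = [F(1), x]                            # assumed features phi(x) = (1, x)
m1, K1 = posterior(m0, K0, phi_x, y, F(1))
```

Exact rational arithmetic is used only so that the result can be checked by hand; with floating point inputs the same function applies unchanged.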
The ability to do Bayesian probability involving function spaces relies on the fact that
the category of measurable spaces, Meas, has the structure of a symmetric monoidal
closed category (SMCC). Through the evaluation map, this in turn provides the category
of conditional probabilities P with the structure of a symmetric monoidal weakly closed
category (SMwCC), which is necessary for modeling stochastic processes as probability
measures on function spaces. On the other hand, the ordinary product X × Y with
its product σ-algebra is used for the Bayesian aspect of updating joint (and marginal)
distributions. From a modeling viewpoint, the SMwCC structure is used for carrying
along a parameter space (along with its relationship to the output space through the
evaluation map). Thus we can describe training data and measurements as ordered pairs
(xi , yi ) ∈ X ⊗ Y , where X plays the role of a parameter space.

A few notes on the exposition. In this paper our intended audience consists of (1)
the practicing ML engineer with only a passing knowledge of category theory (e.g., knowing
about objects, arrows and commutative diagrams), and (2) those knowledgeable of
category theory with an interest in how ML can be formulated within this context. For
the ML engineer familiar with Markov kernels, we believe that the presentation of P and
its applications can serve as an easier introduction to categorical ideas and methods than
many standard approaches. While some terminology will be unfamiliar, the examples
should provide an adequate understanding to relate the knowledge of ML to the categorical
perspective. If ML researchers find this categorical perspective useful for further
developments or simply for modeling purposes, then this paper will have achieved its goal.
In the categorical framework for Bayesian probability, Bayes’ equation is replaced
by an integral equation where the integrals are defined over probability measures. The
analysis requires these integrals be evaluated on arbitrary measurable sets and this is
often possible using the three basic rules provided in Appendix A. Detailed knowledge of
measure theory is not necessary outside of understanding these three rules and the basics
of σ-algebras and measures, which are used extensively for evaluating integrals in this
paper. Some proofs require more advanced measure-theoretic ideas, but the proofs can
safely be avoided by the unfamiliar reader and are provided for the convenience of those
who might be interested in such details.
For the category theorist, we hope the paper makes the fundamental ideas of ML
transparent, and conveys our belief that Bayesian probability can be characterized cate-
gorically and usefully applied to fields such as ML. We believe the further development
of categorical probability can be motivated by such applications and in the final remarks
we comment on one such direction that we are pursuing.
These notes are intended to be tutorial in nature, and so contain much more detail
than would be reasonable for a standard research paper. As in this introductory section,
basic definitions will be given as footnotes, while more important definitions, lemmas and
theorems are stated in the main text. Although an effort has been made to make the exposition as self-contained as
possible, complete self-containment is clearly an unachievable goal. In the presentation,
we avoid the use of the terminology of random variables for two reasons: (1) formally
a random variable is a measurable function f ∶ X → Y and a probability measure P on
X gives rise to the distribution of the random variable f⋆(P), which is the push forward
measure of P. In practice the random variable f itself is more often than not impossible
to characterize functionally (consider the process of flipping a coin), while reference to
the random variable using a binomial distribution, or any other distribution, is simply
making reference to some probability measure. As a result, in practice the term “random
variable” is often not making reference to any measurable function f and the pushforward
measure of some probability measure P at all but rather is just referring to a probability
measure; (2) the term “random variable” has a connotation that, we believe, should be
de-emphasized in a Bayesian approach to modeling uncertainty. Thus while a random
variable can be modeled as a push forward probability measure within the framework
presented, we feel no need to single them out as having any special relevance beyond the
remark already given. In illustrating the application of categorical Bayesian probability
we do however show how to translate the familiar language of random variables into the
unfamiliar categorical framework for the particular case of Gaussian distributions, which
are the most important application for ML since Gaussian Processes are characterized on
finite subsets by Gaussian distributions. This provides a particularly nice illustration of
the nonuniqueness of conditional sampling distribution and inference pairs given a joint
distribution.

³ The inference map need not be unique.

Organization. The paper is organized as follows: The theory of Bayesian probability in
P is first addressed and applied to elementary problems on finite spaces where the detailed
solutions to inference, prediction and decision problems are provided. If one understands
the “how and why” in solving these problems then the extension to solving problems
in ML is a simple step as one uses the same basic paradigm with only the hypothesis
space changed to a function space. Nonparametric modeling is presented next, and then
the parametric model can be seen as a submodel of the nonparametric model. We then
proceed to give a general definition of stochastic process as a special type of arrow in
a functor category $P^X$, and by varying the category X or placing conditions on the
projection maps onto subspaces one obtains the various types of stochastic processes such
as Markov processes or GP. Finally, we remark on the area where category theory may
have the biggest impact on applications for ML by integrating the probabilistic models
with decision theory into one common framework.
The results presented here derived from a categorical analysis of the ML problem(s)
will come as no surprise to ML professionals. We acknowledge and thank our colleagues
who are experts in the field who provided assistance and feedback.
2 The Category of Conditional Probabilities


The development of a categorical basis for probability was initiated by Lawvere [16],
and further developed by Giry [14] using monads to characterize the adjunction given
in Lawvere’s original work. The Kleisli category of the Giry monad G is what Lawvere
called the category of probabilistic mappings and what we shall refer to as the category
of conditional probabilities.⁴ Further progress was given in the unpublished dissertation of
Meng [18] which provides a wealth of information and a basis for thinking about
stochastic processes from a categorical viewpoint. While this work does not address the
Bayesian perspective it does provide an alternative “statistical viewpoint” toward solving
such problems using generalized metrics. Additional interesting work on this category is
presented in a seminar by Voevodsky, in Russian, available in an online video [22]. The
extension of categorical probability to the Bayesian viewpoint is given in the paper [6],
though Lawvere and Peter Huber were aware of a similar approach in the 1960’s.⁵ Coecke
and Spekkens [4] provide an alternative graphical language for Bayesian reasoning under
the assumption of finite spaces which they refer to as standard probability theory. In such
spaces the arrows can be represented by stochastic matrices [13]. More recently Fong [12]
has provided further applications of the category of conditional probabilities to Causal
Theories for Bayesian networks.

⁴ Monads had not yet been developed at the time of Lawvere’s work. However the adjunction
construction he provided was the Giry monad on measurable spaces.

⁵ In a personal communication Lawvere related that he and Peter Huber gave a seminar in Zurich
around 1965 on “Bayesian sections.” This refers to the existence of inference maps in the
Eilenberg–Moore category of G-algebras. These inference maps are discussed in Section 3, although we discuss them
only in the context of the category P.

Much of the material in this section is directly from [6], with some additional explanation
where necessary. The category⁶ of conditional probabilities, which we denote by
P, has countably generated⁷ measurable spaces (X, ΣX) as objects and an arrow between
two such objects

        (X, ΣX) ──T──→ (Y, ΣY)

is a Markov kernel (also called a regular conditional probability) assigning to each element
x ∈ X and each measurable set B ∈ ΣY the probability of B given x, denoted T(B ∣ x).
The term “regular” refers to the fact that the function T is conditioned on points
rather than measurable sets A ∈ ΣX. When (X, ΣX) is a countable set (either finite or
countably infinite) with the discrete σ-algebra then every singleton {x} is measurable and
the term “regular” is unnecessary. More precisely, an arrow T ∶ X → Y in P is a function
T ∶ ΣY × X → [0, 1] satisfying

1. for all B ∈ ΣY, the function T(B ∣ ⋅)∶ X → [0, 1] is measurable, and

2. for all x ∈ X, the function T(⋅ ∣ x)∶ ΣY → [0, 1] is a perfect probability measure⁸ on Y.

⁶ A category is a collection of (1) objects and (2) morphisms (or arrows) between the objects (including
a required identity morphism for each object), along with a prescribed method for associative composition
of morphisms.

⁷ A space (X, ΣX) is countably generated if there exists a countable set of measurable sets $\{A_i\}_{i=1}^{\infty}$
which generates the σ-algebra ΣX.

For technical reasons it is necessary that the probability measures in (2) constitute an
equiperfect family of probability measures to avoid pathological cases which prevent the
existence of inference maps necessary for Bayesian reasoning.⁹
The notation T(B ∣ x) is chosen as it coincides with the standard notation “p(H ∣ D)”
of conditional probability theory. For an arrow T ∶ (X, ΣX) → (Y, ΣY), we occasionally
denote the measurable function T(B ∣ ⋅)∶ X → [0, 1] by $T_B$ and the probability measure
T(⋅ ∣ x)∶ ΣY → [0, 1] by $T_x$. Hereafter, for notational brevity we write a measurable space
(X, ΣX) simply as X when referring to a generic σ-algebra ΣX.
Given two arrows

        X ──T──→ Y ──U──→ Z

the composition U ○ T ∶ ΣZ × X → [0, 1] is marginalization over Y defined by

$$(U \circ T)(C \mid x) = \int_{y \in Y} U(C \mid y) \, dT_x.$$

The integral of any real valued measurable function f ∶ X → R with respect to any
measure P on X is

$$E_P[f] = \int_{x \in X} f(x) \, dP, \tag{1}$$

called the P-expectation of f. Consequently the composite (U ○ T)(C ∣ x) is the
$T_x$-expectation of $U_C$,

$$(U \circ T)(C \mid x) = E_{T_x}[U_C].$$
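
On finite spaces with the discrete σ-algebra the composition integral reduces to a finite sum, so composition becomes multiplication of stochastic matrices. The following is a minimal sketch of that special case only; the representation (a kernel T stored as a list of rows with `T[x][y]` equal to T({y} ∣ x)), the function names, and the numerical kernels are our illustrative choices.

```python
from fractions import Fraction as F

def compose(U, T):
    """Return U ∘ T for finite kernels stored as row-stochastic matrices.

    T[x][y] = T({y} | x) and U[y][z] = U({z} | y). The integral over Y
    becomes the finite sum over y, i.e. the matrix product of T with U.
    """
    n_y = len(U)
    n_z = len(U[0])
    return [[sum(row[y] * U[y][z] for y in range(n_y)) for z in range(n_z)]
            for row in T]

def prob(K, C, x):
    """Probability K(C | x) of a subset C of the (finite) codomain."""
    return sum(K[x][z] for z in C)

# Hypothetical kernels: X has 2 points, Y has 3 points, Z has 2 points.
T = [[F(1, 2), F(1, 2), F(0)],
     [F(0), F(1, 4), F(3, 4)]]
U = [[F(1), F(0)],
     [F(1, 3), F(2, 3)],
     [F(0), F(1)]]

UT = compose(U, T)

# (U ∘ T)(C | x) as the T_x-expectation of U_C, here for C = {0} and x = 0.
expectation = sum(T[0][y] * prob(U, {0}, y) for y in range(3))
```

Each row of `UT` is again a probability vector, which is the finite counterpart of the statement that U ○ T is itself an arrow of P.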
Let Meas denote the category of measurable spaces where the objects are measurable
spaces (X, ΣX) and the arrows are measurable functions f ∶ X → Y. Every measurable
mapping f ∶ X → Y may be regarded as a P arrow

        X ──δf──→ Y

defined by the Dirac (or one point) measure

$$\delta_f : \Sigma_Y \times X \to [0, 1], \qquad \delta_f(B \mid x) = \begin{cases} 1 & \text{if } f(x) \in B \\ 0 & \text{if } f(x) \notin B. \end{cases}$$

The relation between the Dirac measure and the characteristic (indicator) function $\mathbb{1}$ is

$$\delta_f(B \mid x) = \mathbb{1}_{f^{-1}(B)}(x)$$

and this property is used ubiquitously in the analysis of integrals.

⁸ A perfect probability measure P on Y is a probability measure such that for any measurable function
f ∶ Y → R there exists a real Borel set E ⊂ f(Y) satisfying $P(f^{-1}(E)) = 1$.

⁹ Specifically, the subsequent Theorem 1 is a constructive procedure which requires perfect probability
measures. Corollary 2 then gives the inference map. Without the hypothesis of perfect measures a
pathological counterexample can be constructed as in [9, Problem 10.26]. The paper by Faden [11] gives
conditions on the existence of conditional probabilities and this constraint is explained in full detail in
[6]. Note that the class of perfect measures is quite broad and includes all probability measures defined
on Polish spaces.


Taking the measurable mapping f to be the identity map on X gives for each object
X the morphism $\delta_{Id_X} : X \to X$ given by

$$\delta_{Id_X}(B \mid x) = \begin{cases} 1 & \text{if } x \in B \\ 0 & \text{if } x \notin B \end{cases}$$

which is the identity morphism for X in P. Using standard notation we denote the identity
mapping on any object X by $1_X = \delta_{Id_X}$, or for brevity simply by 1 if the space X is clear
from the context. With these objects and arrows, law of composition, associativity, and
identity, standard measure-theoretic arguments show that P forms a category.
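
For finite discrete spaces the Dirac kernel of a function is a 0/1 matrix, and the identity arrow is the identity matrix. The sketch below checks, in that special case only, that δ respects composition and that the identity kernel is neutral; the helper names `dirac` and `compose` and the functions `f` and `g` are hypothetical choices made for illustration.

```python
def dirac(f, n_x, n_y):
    """Kernel delta_f for f: {0..n_x-1} -> {0..n_y-1}, stored as a 0/1 matrix
    whose entry [x][y] is delta_f({y} | x)."""
    return [[1 if f(x) == y else 0 for y in range(n_y)] for x in range(n_x)]

def compose(U, T):
    """U ∘ T for finite kernels: sum over the middle space."""
    return [[sum(T[x][y] * U[y][z] for y in range(len(U)))
             for z in range(len(U[0]))]
            for x in range(len(T))]

def prob(K, B, x):
    """K(B | x) for a subset B of the finite codomain."""
    return sum(K[x][y] for y in B)

f = lambda x: (x + 1) % 3        # X = {0, 1, 2} -> Y = {0, 1, 2}
g = lambda y: y // 2             # Y = {0, 1, 2} -> Z = {0, 1}

delta_f = dirac(f, 3, 3)
delta_g = dirac(g, 3, 2)
id_3 = dirac(lambda x: x, 3, 3)  # identity arrow on a 3-point space

# delta_f(B | x) agrees with the indicator of the preimage of B under f.
B = {0, 2}
indicator_ok = all(prob(delta_f, B, x) == (1 if f(x) in B else 0) for x in range(3))
```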
There is a distinguished object in P that plays an important role in Bayesian probability.
For any set Y with the indiscrete σ-algebra ΣY = {Y, ∅}, there is a unique arrow from
any object X to Y since any arrow P ∶ X → Y is completely determined by the fact that
$P_x$ must be a probability measure on Y. Hence Y is a terminal object, and we denote
the unique arrow by $!_X$ ∶ X → Y. Up to isomorphism, the canonical terminal object is the
one-element set which we denote by 1 = {⋆} with the only possible σ-algebra. It follows
that any arrow P ∶ 1 → X from the terminal object to any space X is an (absolute)
probability measure on X; it is “absolute” because no variability (conditioning) is
possible within the singleton set 1 = {⋆}.

        1 ──P──→ X

Figure 1: The representation of a probability measure in P.

We refer to any arrow P ∶ 1 → X with domain 1 as either a probability measure or a
distribution on X. If X is finite then X is isomorphic in P to a discrete space
m = {0, 1, 2, . . . , m − 1} with the discrete σ-algebra where the integer m corresponds to
the number of atoms in the σ-algebra ΣX. Consequently every finite space is, up to
isomorphism, just a discrete space and therefore every distribution P ∶ 1 → X is of the
form $P = \sum_{i=0}^{m-1} p_i \delta_i$ where $\sum_{i=0}^{m-1} p_i = 1$.

2.1 (Weak) Product Spaces and Joint Distributions


In Bayesian probability, determining the joint distribution on a “product space” is often
the problem to be solved. In many applications for which Bayesian reasoning is appropriate,
the problem reduces to computing a particular marginal or conditional probability;
these can be obtained in a straightforward way if the joint distribution is known. Before
proceeding to formulate precisely what the term “product space” means in P, we describe
the categorical construct of a finite product space in any category.
Let C be an arbitrary category and $X, Y \in_{ob} C$. We say the product of X and Y exists
if there is an object, which we denote by X × Y, along with two arrows $p_X$ ∶ X × Y → X
and $p_Y$ ∶ X × Y → Y in C such that given any other object T in C and arrows f ∶ T → X
and g ∶ T → Y there is a unique C arrow ⟨f, g⟩∶ T → X × Y that makes the diagram

                        T
             f ↙        ↓ ⟨f, g⟩        ↘ g                      (2)
        X ←──pX── X × Y ──pY──→ Y

commute. If the given diagram is a product then we often write the product as a triple
$(X × Y, p_X, p_Y)$. We must not let the notation deceive us; the object X × Y could just
as well be represented by $P_{X,Y}$. The important point is that it is an object in C that we
need to specify in order to show that binary products exist. Products are an example of a
universal construction in categories. The term “universal” implies that these constructions
are unique up to a unique isomorphism. Thus if $(P_{X,Y}, p_X, p_Y)$ and $(Q_{X,Y}, q_X, q_Y)$ are
both products for the objects X and Y then there exist unique arrows $\alpha : P_{X,Y} \to Q_{X,Y}$
and $\beta : Q_{X,Y} \to P_{X,Y}$ in C such that $\beta \circ \alpha = 1_{P_{X,Y}}$ and $\alpha \circ \beta = 1_{Q_{X,Y}}$ so that the objects $P_{X,Y}$
and $Q_{X,Y}$ are isomorphic.
If the product of all object pairs X and Y exists in C then we say binary products
exist in C. The existence of binary products implies the existence of arbitrary finite
products in C. So if $\{X_i\}_{i=1}^N$ is a finite set of objects in C then there is an object which we
denote by $\prod_{i=1}^N X_i$ (in general, this need not be the cartesian product) as well as arrows
$\{p_{X_j} : \prod_{i=1}^N X_i \to X_j\}_{j=1}^N$. Then if we are given an arbitrary $T \in_{ob} C$ and a family of arrows
$f_j : T \to X_j$ in C there exists a unique C arrow $\langle f_1, \ldots, f_N \rangle$ such that for every integer
j ∈ {1, 2, . . . , N} the diagram

                        T
           f_j ↙        ↓ ⟨f_1, …, f_N⟩
        X_j ←──p_{X_j}── ∏_{i=1}^N X_i

commutes. The arrows $p_{X_i}$ defining a product space are often called the projection maps
due to the analogy with the cartesian products in the category of sets, Set.
In Set, the product of two sets X and Y is the cartesian product X × Y consisting of
all pairs (x, y) of elements with x ∈ X and y ∈ Y along with the two projection mappings
πX ∶ X × Y → X sending (x, y) ↦ x and πY ∶ X × Y → Y sending (x, y) ↦ y. Given any
pair of functions f ∶ T → X and g∶ T → Y the function ⟨f, g⟩∶ T → X × Y sending
t ↦ (f(t), g(t)) clearly makes Diagram 2 commute (with $p_X = \pi_X$ and $p_Y = \pi_Y$). But it is also the unique such function
because if γ∶ T → X × Y were any other function making the diagram commute then the
equations

        (πX ○ γ)(t) = f(t)   and   (πY ○ γ)(t) = g(t)                      (3)

would also be satisfied. But since the function γ has codomain X × Y, which consists
of ordered pairs (x, y), it follows that for each t ∈ T we have γ(t) = (γ1(t), γ2(t)) for some
functions γ1∶ T → X and γ2∶ T → Y. Substituting γ = ⟨γ1, γ2⟩ into equations 3 it follows
that

        f(t) = (πX ○ ⟨γ1, γ2⟩)(t) = πX(γ1(t), γ2(t)) = γ1(t)
        g(t) = (πY ○ ⟨γ1, γ2⟩)(t) = πY(γ1(t), γ2(t)) = γ2(t)

from which it follows γ = ⟨γ1, γ2⟩ = ⟨f, g⟩, thereby proving that there exists at most one
such function T → X × Y making the requisite Diagram 2 commute. If the requirement of
the uniqueness of the arrow ⟨f, g⟩ in the definition of a product is dropped then we have
the definition of a weak product of X and Y.
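
The pairing construction in Set can be written down directly. The following sketch uses a hypothetical finite set T and two hypothetical functions f and g, builds ⟨f, g⟩, and checks that the two triangles of Diagram 2 commute pointwise; the names `pair`, `pi_X`, and `pi_Y` are ours.

```python
def pair(f, g):
    """Return the pairing <f, g>: T -> X x Y sending t to (f(t), g(t))."""
    return lambda t: (f(t), g(t))

def pi_X(p):
    """Projection X x Y -> X."""
    return p[0]

def pi_Y(p):
    """Projection X x Y -> Y."""
    return p[1]

# Hypothetical example data: T = {0, 1, 2, 3}, X = integers, Y = strings.
T = range(4)
f = lambda t: t * t
g = lambda t: "even" if t % 2 == 0 else "odd"

h = pair(f, g)

# Both triangles of the product diagram commute at every point of T.
commutes = all(pi_X(h(t)) == f(t) and pi_Y(h(t)) == g(t) for t in T)
```

Uniqueness corresponds to the observation that any function into X × Y is determined by its two components, so a function with the same projections as `h` must agree with `h` at every point.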
Given the relationship between the categories P and Meas it is worthwhile to examine
products in Meas. Given $X, Y \in_{ob}$ Meas the product X × Y is the cartesian product
X × Y of sets endowed with the smallest σ-algebra such that the two set projection maps
πX ∶ X × Y → X sending (x, y) ↦ x and πY ∶ X × Y → Y sending (x, y) ↦ y are measurable.
In other words, we take the smallest σ-algebra on X × Y such that for all
A ∈ ΣX and for all B ∈ ΣY the preimages $\pi_X^{-1}(A) = A \times Y$ and $\pi_Y^{-1}(B) = X \times B$ are
measurable. Since a σ-algebra requires that the intersection of any two measurable sets
is also measurable it follows that $\pi_X^{-1}(A) \cap \pi_Y^{-1}(B) = A \times B$ must also be measurable.
Measurable sets of the form A × B are called rectangles and generate the collection of
all measurable sets defining the σ-algebra ΣX×Y in the sense that ΣX×Y is equal to the
intersection of all σ-algebras containing the rectangles. When the σ-algebra on a set is
determined by a family of maps $\{p_k : X \times Y \to Z_k\}_{k \in K}$, where K is some indexing set, as the
smallest σ-algebra such that all of these maps $p_k$ are measurable, we say the σ-algebra is induced (or generated) by
the family of maps $\{p_k\}_{k \in K}$.¹⁰ The cartesian product X × Y with the σ-algebra induced
by the two projection maps πX and πY is easily verified to be a product of X and Y
since given any two measurable maps f ∶ Z → X and g∶ Z → Y the map ⟨f, g⟩∶ Z → X × Y
sending z ↦ (f(z), g(z)) is the unique measurable map satisfying the defining property
of a product for (X × Y, πX, πY). This σ-algebra induced by the projection maps πX and
πY is called the product σ-algebra and the use of the notation X × Y in Meas will imply
the product σ-algebra on the set X × Y.
¹⁰ The terminology initial is also used in lieu of induced.

Having the product (X × Y, πX, πY) in Meas and the fact that every measurable
function $f \in_{ar}$ Meas determines an arrow $\delta_f \in_{ar}$ P, it is tempting to consider the triple
$(X \times Y, \delta_{\pi_X}, \delta_{\pi_Y})$ as a potential product in P. However this triple fails to be
a product space of X and Y in P because the uniqueness condition fails; given two
probability measures P ∶ 1 → X and Q∶ 1 → Y there are many joint distributions J making
the diagram

                        1
             P ↙        ↓ J        ↘ Q                          (4)
        X ←──δπX── X × Y ──δπY──→ Y

commute. In particular, the tensor product measure defined on rectangles by
(P ⊗ Q)(A × B) = P(A)Q(B) extends to a joint probability measure on X × Y by

$$(P \otimes Q)(\varsigma) = \int_{y \in Y} P(\Gamma_{\bar{y}}^{-1}(\varsigma)) \, dQ \qquad \forall \varsigma \in \Sigma_{X \times Y} \tag{5}$$

or equivalently,

$$(P \otimes Q)(\varsigma) = \int_{x \in X} Q(\Gamma_{\bar{x}}^{-1}(\varsigma)) \, dP \qquad \forall \varsigma \in \Sigma_{X \times Y}. \tag{6}$$

Here $\bar{x} : Y \to X$ is the constant function at x and $\Gamma_{\bar{x}} : Y \to X \times Y$ is the associated graph
function, with $\bar{y}$ and $\Gamma_{\bar{y}}$ defined similarly. The fact that Q ⊗ P = P ⊗ Q is Fubini’s Theorem;
by taking a rectangle ς = A × B ∈ ΣX×Y the equality of these two measures is immediate
since

$$\begin{aligned}
(P \otimes Q)(A \times B) &= \int_{y \in Y} P(\Gamma_{\bar{y}}^{-1}(A \times B)) \, dQ, \qquad \text{where } \Gamma_{\bar{y}}^{-1}(A \times B) = \begin{cases} A & \text{iff } y \in B \\ \emptyset & \text{otherwise} \end{cases} \\
&= \int_{y \in B} P(A) \, dQ \\
&= P(A) \cdot Q(B) \\
&= \int_{x \in A} Q(B) \, dP \\
&= \int_{x \in X} Q(\Gamma_{\bar{x}}^{-1}(A \times B)) \, dP \\
&= (Q \otimes P)(A \times B).
\end{aligned} \tag{7}$$

Since the rectangles are closed under finite intersections and generate ΣX×Y, two probability
measures that agree on every rectangle agree on all of ΣX×Y, and Fubini’s Theorem follows.
It is clear that in P the uniqueness condition required in the definition of a product
of X and Y will always fail unless at least one of X and Y is a terminal object 1, and
consequently only weak products exist in P. However it is the nonuniqueness of products
in P that makes this category interesting. Instead of referring to weak products in P
we shall abuse terminology and simply refer to them as products with the understanding
that all products in P are weak.
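
The failure of uniqueness is easy to see on finite spaces. As a minimal sketch, with hypothetical two-point spaces X and Y and uniform marginals chosen by us, the following exhibits two different joint distributions J that both make Diagram 4 commute: the tensor product P ⊗ Q and a perfectly correlated joint.

```python
from fractions import Fraction as F

def marginals(J):
    """Return the marginals of a joint distribution J on X x Y,
    stored as a matrix with J[x][y] = J({(x, y)})."""
    p_x = [sum(row) for row in J]
    p_y = [sum(col) for col in zip(*J)]
    return p_x, p_y

def tensor(P, Q):
    """Tensor product measure: (P ⊗ Q)({(x, y)}) = P({x}) Q({y})."""
    return [[p * q for q in Q] for p in P]

# Hypothetical marginals on two-point spaces.
P = [F(1, 2), F(1, 2)]
Q = [F(1, 2), F(1, 2)]

J_independent = tensor(P, Q)            # every entry equals 1/4
J_correlated = [[F(1, 2), F(0)],
                [F(0), F(1, 2)]]        # x and y always coincide
```

Both matrices have marginals P and Q, yet they are different arrows 1 → X × Y, so the mediating arrow in the product diagram is not unique.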
2.2 Constructing a Joint Distribution Given Conditionals


We now show how marginals and conditionals can be used to determine joint distributions
in P. Given a conditional probability measure h∶ X → Y and a probability measure
PX ∶ 1 → X on X, consider the diagram

[Diagram (8): the arrows PX: 1 → X and Jh: 1 → X × Y, together with the projections δπX: X × Y → X and δπY: X × Y → Y.]

where Jh is the uniquely determined joint distribution on the product space X ×Y defined
on the rectangles of the σ-algebra ΣX × ΣY by

\[ J_h(A \times B) = \int_A h_B \, dP_X. \tag{9} \]

The marginal of Jh with respect to Y then satisfies δπY ○Jh = h○PX and the marginal of Jh
with respect to X is PX . By a symmetric argument, if we are given a probability measure
PY and conditional probability k∶ Y → X then we obtain a unique joint distribution Jk on
the product space X × Y given on the rectangles by

\[ J_k(A \times B) = \int_B k_A \, dP_Y. \]

However if we are given PX , PY , h, k as indicated in the diagram

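[Diagram (10): the arrows PX: 1 → X, PY: 1 → Y, Jh and Jk: 1 → X × Y, the projections δπX: X × Y → X and δπY: X × Y → Y, and the conditionals h: X → Y and k: Y → X.]
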

then we have that Jh = Jk if and only if the compatibility condition is satisfied on the
rectangles
\[ \int_A h_B \, dP_X = J(A \times B) = \int_B k_A \, dP_Y \qquad \forall A \in \Sigma_X, \ \forall B \in \Sigma_Y. \tag{11} \]

In the extreme case, suppose we have a conditional h: X → Y which factors through the terminal object 1 as

\[ h = Q \circ {!} \colon \quad X \xrightarrow{\ !\ } 1 \xrightarrow{\ Q\ } Y \]

where ! represents the unique arrow from X → 1. If we are also given a probability measure
P ∶ 1 → X, then we can calculate the joint distribution determined by P and h = Q○! as

\[ J(A \times B) = \int_A (Q \circ {!})_B \, dP = P(A) \cdot Q(B) \]

so that J = P ⊗ Q. In this situation we say that the marginals P and Q are independent.
Thus in P independence corresponds to a special instance of a conditional—one that
factors through the terminal object.
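
The constructions of this subsection can be sketched on finite spaces, where Equation 9 reduces to J_h({x} × {y}) = P_X({x}) · h({y} ∣ x). The spaces, the prior `P_X`, and the conditionals `h` and `Q` below are hypothetical values chosen for illustration only. The sketch builds J_h, recovers a conditional k in the opposite direction so that the compatibility condition (11) holds, and checks that a conditional factoring through 1 yields the product measure.

```python
from fractions import Fraction as F

# Hypothetical finite spaces and data, for illustration only.
X = ["a", "b"]
Y = ["s", "t", "u"]
P_X = {"a": F(1, 4), "b": F(3, 4)}
h = {"a": {"s": F(1, 2), "t": F(1, 2), "u": F(0)},
     "b": {"s": F(1, 3), "t": F(1, 3), "u": F(1, 3)}}

def joint_from_conditional(P, cond):
    """Equation (9) on singletons: J({x} x {y}) = P({x}) * cond({y} | x)."""
    return {(x, y): P[x] * cond[x][y] for x in P for y in cond[x]}

J_h = joint_from_conditional(P_X, h)

# Marginals: on X we recover P_X, on Y we obtain the composite h o P_X.
marg_X = {x: sum(J_h[(x, y)] for y in Y) for x in X}
P_Y = {y: sum(J_h[(x, y)] for x in X) for y in Y}

# A conditional k: Y -> X compatible with (P_X, h); every P_Y(y) is positive
# here, so k is determined everywhere.
k = {y: {x: J_h[(x, y)] / P_Y[y] for x in X} for y in Y}
J_k = {(x, y): P_Y[y] * k[y][x] for x in X for y in Y}

# Independence: a conditional that ignores x, i.e. factors as Q o !.
Q = {"s": F(1, 6), "t": F(1, 3), "u": F(1, 2)}
h_const = {x: Q for x in X}
J_const = joint_from_conditional(P_X, h_const)
```

In this finite setting `J_h == J_k` is exactly the compatibility condition, and `J_const` is the product measure P_X ⊗ Q.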

2.3 Constructing Regular Conditionals given a Joint Distribution

The following result is the theorem from which the inference maps in Bayesian probabil-
ity theory are constructed. The fact that we require equiperfect families of probability
measures is critical for the construction.

Theorem 1. Let X and Y be countably generated measurable spaces and (X × Y, ΣX×Y) the product in Meas with projection map πY. If J is a joint distribution on X × Y with marginal PY = δπY ○ J on Y, then there exists a P arrow f that makes the diagram

[Diagram (12): the arrows J: 1 → X × Y, PY: 1 → Y, δπY: X × Y → Y, and f: Y → X × Y.]

commute and satisfies

\[ \int_{A \times B} (\delta_{\pi_Y})_C \, dJ = \int_C f_{A \times B} \, dP_Y. \]
Moreover, this f is the unique P-morphism with these properties, up to a set of PY -
measure zero.

Proof. Since ΣX and ΣY are both countably generated, it follows that ΣX×Y is countably generated as well. Let G be a countable generating set for ΣX×Y and let 𝒜 denote the algebra generated by G, which is again countable. For each A ∈ 𝒜, define a measure µA on Y by

\[ \mu_A(B) = J(A \cap \pi_Y^{-1} B). \]

Then µA is absolutely continuous with respect to PY and hence we can let f̃A = dµA/dPY, the Radon–Nikodym derivative. For each A ∈ 𝒜 this Radon–Nikodym derivative is unique up to a set of measure zero, say Â. Let N = ∪A∈𝒜 Â and E1 = N^c. Then f̃A∣E1 is unique for all A ∈ 𝒜. Note that f̃X×Y = 1 and f̃∅ = 0 on E1. The condition f̃A ≤ 1 on E1 for all A ∈ 𝒜 then follows.
For all B ∈ ΣY and any countable union ∪_{i=1}^n Ai of disjoint sets of 𝒜 we have

\[
\begin{aligned}
\int_{B \cap E_1} \tilde{f}_{\cup_{i=1}^n A_i} \, dP_Y &= J\big((\cup_{i=1}^n A_i) \cap \pi_Y^{-1} B\big) \\
&= \sum_{i=1}^n J(A_i \cap \pi_Y^{-1} B) \\
&= \int_{B \cap E_1} \sum_{i=1}^n \tilde{f}_{A_i} \, dP_Y,
\end{aligned}
\]

with the last equality following from the Monotone Convergence Theorem and the fact that all of the f̃Ai are nonnegative. From the uniqueness of the Radon–Nikodym derivative it follows

\[ \tilde{f}_{\cup_{i=1}^n A_i} = \sum_{i=1}^n \tilde{f}_{A_i} \qquad P_Y\text{-a.e.} \]

Since there exist only a countable number of finite collections of sets of 𝒜 we can find a set E ⊂ E1 of PY-measure one such that the normalized set function f̃⋅(y): 𝒜 → [0, 1] is finitely additive on E.
These facts altogether show there exists a set E ∈ ΣY with PY -measure one where for
all y ∈ E,
1. 0 ≤ f̃A(y) ≤ 1 ∀A ∈ 𝒜,
2. f̃∅(y) = 0 and f̃X×Y(y) = 1, and
3. for any finite collection {Ai}_{i=1}^n of disjoint sets of 𝒜 we have f̃_{∪_{i=1}^n Ai}(y) = ∑_{i=1}^n f̃Ai(y).

Thus the set function f̃: E × 𝒜 → [0, 1] satisfies the condition that f̃(y, ⋅) is a probability measure on the algebra 𝒜. By the Carathéodory extension theorem there exists a unique extension of f̃(y, ⋅) to a probability measure f̂(y, ⋅): ΣX×Y → [0, 1]. Now define a set function f: Y × ΣX×Y → [0, 1] by

\[ f(y, A) = \begin{cases} \hat{f}(y, A) & \text{if } y \in E \\ J(A) & \text{if } y \notin E. \end{cases} \]
Since each A ∈ ΣX×Y can be written as the pointwise limit of an increasing sequence {An}_{n=1}^∞ of sets An ∈ 𝒜 it follows that fA = lim_{n→∞} fAn is measurable. From this we also obtain the desired commutativity of the diagram

\[
\begin{aligned}
(f \circ P_Y)(A) = \int_Y f_A \, dP_Y = \int_E f_A \, dP_Y &= \lim_{n \to \infty} \int_E \tilde{f}_{A_n} \, dP_Y \\
&= \lim_{n \to \infty} \int_Y \tilde{f}_{A_n} \, dP_Y \\
&= \lim_{n \to \infty} J(A_n) \\
&= J(A).
\end{aligned}
\]
We can use the result from Theorem 1 to obtain a broader understanding of the
situation.

Corollary 2. Let X and Y be countably generated measurable spaces and J a joint distri-
bution on X × Y with marginal distributions PX and PY on X and Y , respectively. Then
there exist P arrows f and g such that the diagram

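[Diagram: the arrows PX: 1 → X, PY: 1 → Y, the projections δπX: X × Y → X and δπY: X × Y → Y, g: X → X × Y, f: Y → X × Y, and the composites δπY ○ g: X → Y and δπX ○ f: Y → X.]

commutes and

\[ \int_U (\delta_{\pi_Y} \circ g)_V \, dP_X = J(U \times V) = \int_V (\delta_{\pi_X} \circ f)_U \, dP_Y. \]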
Proof. From Theorem 1 there exists a P arrow f: Y → X × Y satisfying J = f ○ PY. Take the composite δπX ○ f and note (δπX ○ f)U(y) = fy(U × Y), giving

\[
\begin{aligned}
\int_V (\delta_{\pi_X} \circ f)_U \, dP_Y &= \int_V f_{U \times Y} \, dP_Y \\
&= J\big((U \times Y) \cap \pi_Y^{-1} V\big) \\
&= J(U \times V).
\end{aligned}
\]

Similarly, using a P arrow g: X → X × Y satisfying J = g ○ PX gives

\[ \int_U (\delta_{\pi_Y} \circ g)_V \, dP_X = J(U \times V). \]

Note that if the joint distribution J is defined by a probability measure PX and a conditional h: X → Y using Diagram 8, then using the above result and notation it follows that h = δπY ○ g.

3 The Bayesian Paradigm using P


The categorical paradigm of Bayesian probability can be compactly summarized as follows. Let D and H be measurable spaces, which model a data and hypothesis space, respectively. For example, D might be a Euclidean space corresponding to some measurements that are being taken and H a parameterization of some decision that needs to be made.

[Diagram: the prior PH: 1 → H, the sampling distribution S: H → D, and the inference map I: D → H.]

Figure 2: The generic Bayesian model.

The notation S is used to emphasize the fact we think of S as a sampling distribution on D. In the context of Bayesian probability the (perfect) probability measure PH is often called a prior probability or, for brevity, just a prior. Given a prior PH and sampling distribution S, the joint distribution J: 1 → H × D can be constructed using Equation 9. Using the marginal PD = S ○ PH on D, it follows by Corollary 2 that there exists an arrow f: D → H × D satisfying J = f ○ PD. Composing this arrow f with the coordinate projection δπH gives an arrow I = δπH ○ f: D → H which we refer to as the inference map, and it satisfies

\[ \int_B I_A \, dP_D = J(A \times B) = \int_A S_B \, dP_H \qquad \forall A \in \Sigma_H, \ \forall B \in \Sigma_D \tag{13} \]

which is called the product rule.
With the above in mind we formally define a Bayesian model to consist of
(i) two measurable spaces H and D representing hypotheses and data, respectively,

(ii) a probability measure PH on the H space called the prior probability, and

(iii) a P arrow S: H → D called the sampling distribution.

The sampling distribution S and inference map I are often written as PD∣H and PH∣D, respectively, although using the notation P⋅∣⋅ for all arrows in the category, which are necessarily conditional probabilities, is notationally redundant and nondistinguishing (requiring the subscripts to distinguish arrows).
Given this model and a measurement µ, which is often just a point mass on D (i.e.,
µ = δd ∶ 1 → D), there is an update procedure that incorporates this measurement and
the prior probability. Thus the measurement µ can itself be viewed as a probability
measure on D, and the “posterior” probability measure can be calculated as P̂H = I ○ µ
on H provided the measurement µ is absolutely continuous with respect to PD , which we
write as µ ≪ PD . Informally, this means that the observed measurement is considered
“possible” with respect to prior assumptions.
Let us expand upon this condition µ ≪ PD more closely. We know from Theorem 1 that
the inference map I is uniquely determined by PH and S up to a set of PD -measure zero.
In general, there is no reason a priori that an arbitrary (perfect) probability measurement
µ∶ 1 → D is required to be absolutely continuous with respect to PD . If µ is not absolutely
continuous with respect to PD , then a different choice of inference map I ′ could yield
a different posterior probability—i.e., we could have I ○ µ ≠ I ′ ○ µ. Thus we make the
assumption that measurement probabilities on D are absolutely continuous with respect
to the prior probability PD on D.
In practice this condition is often not met. For example the probability measure PD
may be a normal distribution on R and consequently PD ({y}) = 0 for any point y ∈ R.
Since Dirac measurements do not satisfy δy ≪ PD , this could create a problem. However,
it is clear that the Dirac measures can be approximated arbitrarily closely by a limiting
process of sharply peaked normal distributions which do satisfy this absolute continuity
condition. Thus while the absolute continuity condition may not be satisfied precisely the
error in approximating the measurement by assuming a Dirac measure is negligible. Thus
it is standard to assume that measurements belong to a particular class of probability
measures on D which are broad enough to approximate measurements and known to be
absolutely continuous with respect to the prior.
In summary, the Bayesian process works in the following way. Given a prior probability
PH and sampling distribution S one determines the inference map I. (For computational
purposes the construction of the entire map I is in general not necessary.) Once a mea-
surement µ∶ 1 → D is taken, we then calculate the posterior probability by I ○ µ. This
updating procedure can be characterized by the diagram

[Diagram (14): the arrows PH: 1 → H, S: H → D, I: D → H, the measurement µ: 1 → D, and the posterior I ○ µ: 1 → H.]

where the solid lines indicate arrows given a priori, the dotted line indicates the arrow
determined using Theorem 1, and the dashed lines indicate the updating after a measure-
ment. Note that if there is no uncertainty in the measurement, then µ = δ{x} for some
x ∈ D, but in practice there is usually some uncertainty in the measurements themselves.
Consequently the posterior probability must be computed as a composite - so the posterior
probability of an event A ∈ ΣH given a measurement µ is (I ○ µ)(A) = ∫D IA (x) dµ.
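
On finite spaces the update I ○ µ is a finite sum, which makes the role of the condition µ ≪ PD easy to see. The following is a minimal sketch with hypothetical spaces `H` and `D`, prior `P_H`, and sampling distribution `S` (none of which come from the paper). The inference map is left undefined at data points of PD-measure zero, and the update refuses any measurement that puts mass there.

```python
from fractions import Fraction as F

# Hypothetical hypothesis and data spaces, for illustration only.
H = ["h0", "h1"]
D = ["d0", "d1", "d2"]
P_H = {"h0": F(1, 2), "h1": F(1, 2)}
S = {"h0": {"d0": F(1, 2), "d1": F(1, 2), "d2": F(0)},
     "h1": {"d0": F(1, 4), "d1": F(1, 4), "d2": F(1, 2)}}

def inference_map(P_H, S):
    """Return the marginal P_D = S o P_H and the inference map I on its support."""
    P_D = {d: sum(P_H[h] * S[h][d] for h in H) for d in D}
    I = {d: {h: P_H[h] * S[h][d] / P_D[d] for h in H}
         for d in D if P_D[d] > 0}
    return P_D, I

def posterior(I, mu):
    """The composite (I o mu)(h) = sum_d I(h | d) mu(d), for mu << P_D."""
    if any(m > 0 and d not in I for d, m in mu.items()):
        raise ValueError("measurement is not absolutely continuous w.r.t. P_D")
    return {h: sum(I[d][h] * m for d, m in mu.items() if m > 0) for h in H}

P_D, I = inference_map(P_H, S)
sharp = posterior(I, {"d0": F(1)})                  # a Dirac measurement
noisy = posterior(I, {"d0": F(1, 2), "d2": F(1, 2)})  # an uncertain measurement
```

A Dirac measurement simply reads off one row of I, while an uncertain measurement averages the rows with weights given by µ.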
Following the calculation of the posterior probability, the sampling distribution is then
updated, if required. The process can then repeat: using the posterior probability and
the updated sampling distribution the updated joint probability distribution on the prod-
uct space is determined and the corresponding (updated) inference map determined (for
computational purposes the “entire map” I need not be determined if the measurements
are deterministic). We can then continue to iterate as long as new measurements are
received. For some problems, such as with the standard urn problem with replacement of
balls, the sampling distribution does not change from iterate to iterate, but the inference
map is updated since the posterior probability on the hypothesis space changes with each
measurement.
Remark 3. Note that for countable spaces X and Y the compatibility condition reduces to
the standard Bayes equation since for any x ∈ X the singleton {x} ∈ ΣX and similarly any
element y ∈ Y implies {y} ∈ ΣY , so that the joint distribution J∶ 1 → X × Y on {x} × {y}
reduces to the equation

S({y} ∣ x)PX ({x}) = J({x} × {y}) = I({x} ∣ y)PY ({y}) (15)

which in more familiar notation is the Bayesian equation

P (y ∣ x)P (x) = P (x, y) = P (x ∣ y)P (y). (16)

4 Elementary applications of Bayesian probability


Before proceeding to show how the category P can be applied to ML where the
unknowns are functions, we illustrate its use to solve inference, prediction, and decision
processes in the more familiar setting where the unknown parameter(s) are real values.
We present two elementary problems illustrating basic model building using categorical
diagrams, much like that used in probabilistic graphical models for Bayesian networks,
which can serve to clarify the modeling aspect of any probabilistic problem.
To illustrate the inference-sampling distribution relationship and how we make com-
putations in the category P, we consider first an urn problem where we have discrete
σ-algebras. The discreteness condition is not critical as we will eventually see - it only
makes the analysis and computational aspect easier.
Example 4. Million dollar draw.11

[Figure: Urn 1 contains two blue (B) and three red (R) balls; Urn 2 contains three blue balls and one red ball.]

You are given two draws and if you pull out a red ball you win a million dollars. You
are unable to see the two urns so you don’t know which urn you are drawing from and
the draw is done without replacement. The P diagram for both inference and calculating
sampling distributions is given by
11 This problem is taken from Peter Green’s tutorial on Bayesian Inference, which can be viewed at http://videolectures.net/mlss2011_green_bayesian.

[Diagram: the arrows PU: 1 → U, PB: 1 → B, S: U → B, and I: B → U.]
where the dashed arrows indicate morphisms to be calculated rather than morphisms
determined by modeling,

\[ U = \{u_1, u_2\} = \{\text{Urn 1}, \text{Urn 2}\}, \qquad B = \{b, r\} = \{\text{blue}, \text{red}\} \]

and

\[ P_U = \tfrac{1}{2} \delta_{u_1} + \tfrac{1}{2} \delta_{u_2}. \]
The sampling distribution is the binomial distribution given by

\[
\begin{array}{ll}
S(\{b\} \mid u_1) = \frac{2}{5} & S(\{r\} \mid u_1) = \frac{3}{5} \\
S(\{b\} \mid u_2) = \frac{3}{4} & S(\{r\} \mid u_2) = \frac{1}{4}.
\end{array}
\]

Suppose that on our first draw, we draw from one of the urns (which one is unknown)
and draw a blue ball. We ask the following questions:

1. (Inference) What is the probability that we made the draw from Urn 1 (Urn 2)?

2. (Prediction) What is the probability of drawing a red ball on the second draw (from
the same urn)?

3. (Decision) Given you have drawn a blue ball on the first draw should you switch
urns to increase the probability of drawing a red ball?

To solve these problems, we implicitly or explicitly construct the joint distribution J via the standard construction given PU and the conditional S

[Diagram: the arrows PU: 1 → U, J: 1 → U × B, PB = S ○ PU: 1 → B, the projections δπU: U × B → U and δπB: U × B → B, and S: U → B.]

and then construct the inference map by requiring the compatibility condition, i.e., the integral equation

\[ \int_{u \in \mathcal{H}} S(\mathcal{B} \mid u) \, dP_U = J(\mathcal{H} \times \mathcal{B}) = \int_{c \in \mathcal{B}} I(\mathcal{H} \mid c) \, dP_B \qquad \forall \mathcal{B} \in \Sigma_B, \ \forall \mathcal{H} \in \Sigma_U \tag{17} \]

is satisfied. Since our problem is discrete the integral reduces to a sum.


Our first step is to calculate the prior on B which is the composite PB = S ○ PU , from
which we calculate
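\[
\begin{aligned}
P_B(\{b\}) &= (S \circ P_U)(\{b\}) \\
&= \int_{v \in U} S(\{b\} \mid v) \, dP_U \\
&= \int_{v \in U} S(\{b\} \mid v) \, d\big(\tfrac{1}{2} \delta_{u_1} + \tfrac{1}{2} \delta_{u_2}\big) \\
&= S(\{b\} \mid u_1) \cdot P_U(\{u_1\}) + S(\{b\} \mid u_2) \cdot P_U(\{u_2\}) \\
&= \tfrac{2}{5} \cdot \tfrac{1}{2} + \tfrac{3}{4} \cdot \tfrac{1}{2} \\
&= \tfrac{23}{40}
\end{aligned}
\]

and similarly

\[ P_B(\{r\}) = \tfrac{17}{40}. \]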
To solve the inference problem, we need to compute the values of the inference map
I using equation 17. This amounts to computing the joint distribution on all possible
measurable sets,

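\[
\begin{aligned}
\int_{\{u_1\}} S(\{b\} \mid u) \, dP_U &= J(\{u_1\} \times \{b\}) = \int_{\{b\}} I(\{u_1\} \mid c) \, dP_B \\
\int_{\{u_2\}} S(\{b\} \mid u) \, dP_U &= J(\{u_2\} \times \{b\}) = \int_{\{b\}} I(\{u_2\} \mid c) \, dP_B \\
\int_{\{u_1\}} S(\{r\} \mid u) \, dP_U &= J(\{u_1\} \times \{r\}) = \int_{\{r\}} I(\{u_1\} \mid c) \, dP_B \\
\int_{\{u_2\}} S(\{r\} \mid u) \, dP_U &= J(\{u_2\} \times \{r\}) = \int_{\{r\}} I(\{u_2\} \mid c) \, dP_B
\end{aligned}
\]

which reduce to the equations

\[
\begin{aligned}
S(\{b\} \mid u_1) \cdot P_U(\{u_1\}) &= I(\{u_1\} \mid b) \cdot P_B(\{b\}) \\
S(\{b\} \mid u_2) \cdot P_U(\{u_2\}) &= I(\{u_2\} \mid b) \cdot P_B(\{b\}) \\
S(\{r\} \mid u_1) \cdot P_U(\{u_1\}) &= I(\{u_1\} \mid r) \cdot P_B(\{r\}) \\
S(\{r\} \mid u_2) \cdot P_U(\{u_2\}) &= I(\{u_2\} \mid r) \cdot P_B(\{r\}).
\end{aligned}
\]

Substituting values for S, PB, and PU one determines

\[
\begin{array}{ll}
I(\{u_1\} \mid b) = \frac{8}{23} & I(\{u_2\} \mid b) = \frac{15}{23} \\
I(\{u_1\} \mid r) = \frac{12}{17} & I(\{u_2\} \mid r) = \frac{5}{17}
\end{array}
\]
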

which answers question (1). The odds that one drew the blue ball from Urn 1 relative to Urn 2 are 8/15, so it is almost twice as likely that one made the draw from the second urn.
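
The inference computation can be reproduced with exact rational arithmetic. The following is a small sketch using only the prior PU and sampling distribution S stated in the example; the variable names are illustrative.

```python
from fractions import Fraction as F

# Prior on urns and sampling distribution for the first draw, as in Example 4.
P_U = {"u1": F(1, 2), "u2": F(1, 2)}
S = {"u1": {"b": F(2, 5), "r": F(3, 5)},
     "u2": {"b": F(3, 4), "r": F(1, 4)}}
colours = ("b", "r")

# Joint distribution on U x B and its marginal P_B = S o P_U.
J = {(u, c): P_U[u] * S[u][c] for u in P_U for c in colours}
P_B = {c: sum(J[(u, c)] for u in P_U) for c in colours}

# Inference map from the discrete form of the compatibility condition:
# S({c} | u) * P_U({u}) = I({u} | c) * P_B({c}).
I = {c: {u: J[(u, c)] / P_B[c] for u in P_U} for c in colours}

odds_u1_vs_u2_given_blue = I["b"]["u1"] / I["b"]["u2"]
```

The resulting values of `P_B`, `I`, and the odds agree with the fractions 23/40, 17/40, 8/23, 15/23, 12/17, 5/17, and 8/15 obtained above.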
The Prediction Problem. Here we implicitly (or explicitly) need to construct the product space U × B1 × B2 where Bi represents the ith drawing of a ball from the same (unknown) urn. To do this we use the basic construction for joint distributions using a regular conditional probability, S2, which expresses the probability of drawing either a red or a blue ball from the same urn as the first draw. This conditional probability is given by

\[
\begin{array}{ll}
S_2(\{b\} \mid (u_1, b)) = \frac{1}{4} & S_2(\{r\} \mid (u_1, b)) = \frac{3}{4} \\
S_2(\{b\} \mid (u_2, b)) = \frac{2}{3} & S_2(\{r\} \mid (u_2, b)) = \frac{1}{3} \\
S_2(\{b\} \mid (u_1, r)) = \frac{1}{2} & S_2(\{r\} \mid (u_1, r)) = \frac{1}{2} \\
S_2(\{b\} \mid (u_2, r)) = 1 & S_2(\{r\} \mid (u_2, r)) = 0.
\end{array}
\]
Now we construct the joint distribution K on the product space (U × B1 ) × B2

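[Diagram: the arrows J: 1 → U × B1, K: 1 → U × B1 × B2, PB2 = S2 ○ J: 1 → B2, the projections δπU×B1: U × B1 × B2 → U × B1 and δπB2: U × B1 × B2 → B2, and S2: U × B1 → B2.]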

To answer the prediction question we calculate the odds of drawing a red versus a blue ball. Thus

\[ K(U \times \{b\} \times \{r\}) = \int_{U \times \{b\}} S_2(\{r\} \mid (u, \beta)) \, dJ, \tag{18} \]

where the right hand side follows from the definition (construction) of the iterated product space (U × B1) × B2. The computation of the expression 18 yields

\[
\begin{aligned}
K(U \times \{b\} \times \{r\}) &= \int_{U \times \{b\}} S_2(\{r\} \mid (u, \beta)) \, dJ \\
&= \underbrace{S_2(\{r\} \mid (u_1, b))}_{= \frac{3}{4}} \cdot \underbrace{J(\{u_1\} \times \{b\})}_{= \frac{1}{5}} + \underbrace{S_2(\{r\} \mid (u_2, b))}_{= \frac{1}{3}} \cdot \underbrace{J(\{u_2\} \times \{b\})}_{= \frac{3}{8}} \\
&= \tfrac{11}{40}.
\end{aligned}
\]

Similarly K(U × {b} × {b}) = 12/40. So the odds are

\[ \frac{r}{b} = \frac{11}{12}, \qquad Pr(\{r\} \mid \{b\}) = \frac{11}{23}. \]
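
The prediction step is again a finite sum, so it can be checked directly. This sketch uses the values of PU, S, and S2 given in the example; the helper name `K` mirrors the joint distribution K but is otherwise an illustrative choice.

```python
from fractions import Fraction as F

P_U = {"u1": F(1, 2), "u2": F(1, 2)}
S = {"u1": {"b": F(2, 5), "r": F(3, 5)},
     "u2": {"b": F(3, 4), "r": F(1, 4)}}
J = {(u, c): P_U[u] * S[u][c] for u in P_U for c in ("b", "r")}

# Second draw from the same urn, without replacement.
S2 = {("u1", "b"): {"b": F(1, 4), "r": F(3, 4)},
      ("u2", "b"): {"b": F(2, 3), "r": F(1, 3)},
      ("u1", "r"): {"b": F(1, 2), "r": F(1, 2)},
      ("u2", "r"): {"b": F(1), "r": F(0)}}

def K(first, second):
    """K(U x {first} x {second}) as the integral of S2 against J over U x {first}."""
    return sum(S2[(u, first)][second] * J[(u, first)] for u in P_U)

K_br = K("b", "r")
K_bb = K("b", "b")
pr_red_given_blue = K_br / (K_br + K_bb)
```

Here `K_br + K_bb` equals PB({b}) = 23/40, which is why normalizing by it converts the odds 11 : 12 into the probability 11/23.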
The Decision Problem. To answer the decision problem we need to consider the conditional probability of switching urns on the second draw, which leads to the conditional Ŝ2: U × B1 → B2 given by

\[
\begin{array}{ll}
\hat{S}_2(\{b\} \mid (u_1, b)) = \frac{3}{4} & \hat{S}_2(\{r\} \mid (u_1, b)) = \frac{1}{4} \\
\hat{S}_2(\{b\} \mid (u_2, b)) = \frac{2}{5} & \hat{S}_2(\{r\} \mid (u_2, b)) = \frac{3}{5} \\
\hat{S}_2(\{b\} \mid (u_1, r)) = \frac{3}{4} & \hat{S}_2(\{r\} \mid (u_1, r)) = \frac{1}{4} \\
\hat{S}_2(\{b\} \mid (u_2, r)) = \frac{2}{5} & \hat{S}_2(\{r\} \mid (u_2, r)) = \frac{3}{5}.
\end{array}
\]
Carrying out the same computation as above we find the joint distribution K̂ on the product space (U × B1) × B2 constructed from J and Ŝ2 yields

\[
\begin{aligned}
\hat{K}(U \times \{b\} \times \{r\}) &= \int_{U \times \{b\}} \hat{S}_2(\{r\} \mid (u, \beta)) \, dJ \\
&= \hat{S}_2(\{r\} \mid (u_1, b)) \, J(\{u_1\} \times \{b\}) + \hat{S}_2(\{r\} \mid (u_2, b)) \, J(\{u_2\} \times \{b\}) \\
&= \tfrac{1}{4} \cdot \tfrac{1}{5} + \tfrac{3}{5} \cdot \tfrac{3}{8} \\
&= \tfrac{11}{40},
\end{aligned}
\]

which shows that it doesn’t matter whether you switch or not: you get the same probability of drawing a red ball.
The probability of drawing a blue ball is

\[ \hat{K}(U \times \{b\} \times \{b\}) = \tfrac{12}{40} = K(U \times \{b\} \times \{b\}), \]

so the odds of drawing a blue ball outweigh the odds of drawing a red ball by the ratio 12/11. The odds are against you.
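
The comparison between staying and switching can be sketched side by side. The conditionals below restrict S2 and Ŝ2 to the case of a blue first draw, with the values given in the example; the dictionary names `stay` and `switch` are illustrative.

```python
from fractions import Fraction as F

P_U = {"u1": F(1, 2), "u2": F(1, 2)}
S = {"u1": {"b": F(2, 5), "r": F(3, 5)},
     "u2": {"b": F(3, 4), "r": F(1, 4)}}
J = {(u, c): P_U[u] * S[u][c] for u in P_U for c in ("b", "r")}

# Second-draw conditionals given a blue first draw from urn u:
# `stay` draws again from u (one blue ball removed),
# `switch` draws from the other, untouched urn.
stay = {"u1": {"b": F(1, 4), "r": F(3, 4)},
        "u2": {"b": F(2, 3), "r": F(1, 3)}}
switch = {"u1": {"b": F(3, 4), "r": F(1, 4)},
          "u2": {"b": F(2, 5), "r": F(3, 5)}}

def second_draw(cond, colour):
    """Joint mass of (blue first, `colour` second) under the given conditional."""
    return sum(cond[u][colour] * J[(u, "b")] for u in P_U)

red_if_stay = second_draw(stay, "r")
red_if_switch = second_draw(switch, "r")
```

Both strategies give the same joint mass 11/40 for (blue, then red), in agreement with the computation above.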

Here is an example illustrating that the regular conditional probabilities (inference or sampling distributions) are defined only up to sets of measure zero.

Example 5. We have a rather bland deck of three cards as shown

         Card 1   Card 2   Card 3
Front    R        R        G
Back     R        G        G

We shuffle the deck, pull out a card and expose one face which is red.12 The prediction question is

What is the probability the other side of the card is red?

12 This problem is taken from David MacKay’s tutorial on Information Theory, which can be viewed at http://videolectures.net/mlss09uk_mackay_it/.



To answer this note that this card problem is identical to the urn problem with urns
being cards and balls becoming the colored sides of each card. Thus we have an analogous
model in P for this problem. Let
C(ard) = {1, 2, 3}
F (ace Color) = {r, g}.
We have the P diagram

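[Diagram: the arrows PC: 1 → C, PF: 1 → F, S: C → F, and I: F → C.]

with the sampling distribution given by

\[
\begin{array}{ll}
S(\{r\} \mid 1) = 1 & S(\{g\} \mid 1) = 0 \\
S(\{r\} \mid 2) = \frac{1}{2} & S(\{g\} \mid 2) = \frac{1}{2} \\
S(\{r\} \mid 3) = 0 & S(\{g\} \mid 3) = 1.
\end{array}
\]

The prior on C is PC = 1/3 δ1 + 1/3 δ2 + 1/3 δ3. From this we can construct the joint distribution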
on C × F
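[Diagram: the arrows PC: 1 → C, J: 1 → C × F, PF = S ○ PC: 1 → F, and the projections δπC: C × F → C and δπF: C × F → F.]

Using

\[ J(A \times B) = \int_{n \in A} S(B \mid n) \, dP_C, \]

we find

\[
\begin{array}{ll}
J(\{1\} \times \{r\}) = \frac{1}{3} & J(\{1\} \times \{g\}) = 0 \\
J(\{2\} \times \{r\}) = \frac{1}{6} & J(\{2\} \times \{g\}) = \frac{1}{6} \\
J(\{3\} \times \{r\}) = 0 & J(\{3\} \times \{g\}) = \frac{1}{3}.
\end{array}
\]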
Now, like in the urn problem, to predict the next draw (flip of the card), it is necessary to add another measurable space F2 and conditional probability S2 and construct the product diagram and joint distribution K

[Diagram: the arrows J: 1 → C × F1, K: 1 → C × F1 × F2, PF2 = S2 ○ J: 1 → F2, the projections δπC×F1: C × F1 × F2 → C × F1 and δπF2: C × F1 × F2 → F2, and S2: C × F1 → F2.]

The twist now arises in that the conditional probability S2 is not uniquely defined. What are the values

\[ S_2(\{r\} \mid (1, g)) = \ ? \qquad S_2(\{g\} \mid (1, g)) = \ ? \]

The answer is it doesn’t matter what we put down for these values since they have measure J({1} × {g}) = 0. We can still compute the desired quantity of interest proceeding forth with these arbitrarily chosen values on the point sets of measure zero. Thus we choose

\[
\begin{array}{lll}
S_2(\{g\} \mid (1, r)) = 0 & S_2(\{r\} \mid (1, r)) = 1 & \\
S_2(\{g\} \mid (1, g)) = 1 & S_2(\{r\} \mid (1, g)) = 0 & \text{(doesn't matter)} \\
S_2(\{g\} \mid (2, r)) = 1 & S_2(\{r\} \mid (2, r)) = 0 & \\
S_2(\{g\} \mid (2, g)) = 0 & S_2(\{r\} \mid (2, g)) = 1 & \\
S_2(\{g\} \mid (3, r)) = 0 & S_2(\{r\} \mid (3, r)) = 1 & \text{(doesn't matter)} \\
S_2(\{g\} \mid (3, g)) = 1 & S_2(\{r\} \mid (3, g)) = 0. &
\end{array}
\]

We chose the arbitrary values such that S2 is a deterministic mapping, which seems appropriate since flipping a given card uniquely determines the color on the other side.
Now we can solve the prediction problem by computing the joint measure values

\[
\begin{aligned}
K(C \times \{r\} \times \{r\}) &= \int_{C \times \{r\}} (S_2)_{\{r\}}(n, c) \, dJ \\
&= S_2(\{r\} \mid (1, r)) \cdot J(\{1\} \times \{r\}) + S_2(\{r\} \mid (2, r)) \cdot J(\{2\} \times \{r\}) \\
&= 1 \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{6} \\
&= \tfrac{1}{3}
\end{aligned}
\]

and

\[
\begin{aligned}
K(C \times \{r\} \times \{g\}) &= \int_{C \times \{r\}} S_2(\{g\} \mid (n, c)) \, dJ \\
&= S_2(\{g\} \mid (1, r)) \cdot J(\{1\} \times \{r\}) + S_2(\{g\} \mid (2, r)) \cdot J(\{2\} \times \{r\}) \\
&= 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{6} \\
&= \tfrac{1}{6},
\end{aligned}
\]

so it is twice as likely to observe a red face upon flipping the card as to see a green face. Converting the odds of g/r = 1/2 to a probability gives Pr({r}∣{r}) = 2/3.
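
The claim that the values of S2 on J-null points are irrelevant can be tested numerically. The sketch below uses the prior and sampling distribution of this example; the parameter `free_value`, which sets the arbitrary probability of red at the null points (1, g) and (3, r), is a hypothetical device introduced only for this check.

```python
from fractions import Fraction as F

P_C = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}
S = {1: {"r": F(1), "g": F(0)},
     2: {"r": F(1, 2), "g": F(1, 2)},
     3: {"r": F(0), "g": F(1)}}
J = {(n, c): P_C[n] * S[n][c] for n in P_C for c in ("r", "g")}

def make_S2_red(free_value):
    """Probability that the hidden face is red, given (card, visible face).

    The entries at (1, 'g') and (3, 'r') sit on points of J-measure zero,
    so they are filled with an arbitrary `free_value`.
    """
    return {(1, "r"): F(1), (2, "r"): F(0), (2, "g"): F(1), (3, "g"): F(0),
            (1, "g"): free_value, (3, "r"): free_value}

def prob_red_given_red(S2_red):
    """Pr(hidden face red | visible face red), computed from K."""
    K_rr = sum(S2_red[(n, "r")] * J[(n, "r")] for n in P_C)
    K_rg = sum((1 - S2_red[(n, "r")]) * J[(n, "r")] for n in P_C)
    return K_rr / (K_rr + K_rg)

answers = [prob_red_given_red(make_S2_red(v)) for v in (F(0), F(1, 2), F(1))]
```

Every choice of the free value gives the same answer, 2/3, because the affected points carry zero mass under J.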
To test one’s understanding of the categorical approach to Bayesian probability we
suggest the following problem.

Example 6. The Monty Hall Problem. You are a contestant in a game show in which a prize is hidden behind one of three curtains. You will win a prize if you select the correct curtain. After you have picked one curtain but before the curtain is lifted, the emcee lifts one of the other curtains, revealing a goat, and asks if you would like to switch from your current selection to the remaining curtain. How will your chances change if you switch?
There are three components which need to be modeled in this problem:
D(oor) = {1, 2, 3}          The prize is behind this door.
C(hoice) = {1, 2, 3}        The door you chose.
O(pened door) = {1, 2, 3}   The door Monty Hall opens.

The prior on D is PD = 1/3 δd1 + 1/3 δd2 + 1/3 δd3. Your selection of a curtain, say curtain 1, gives the deterministic measure PC = δC1. There is a conditional probability from the product space D × C to O

[Diagram: the arrows PD ⊗ PC: 1 → D × C, J: 1 → (D × C) × O, PO = S ○ (PD ⊗ PC): 1 → O, and the projections δπD×C: (D × C) × O → D × C and δπO: (D × C) × O → O.]

where the conditional probability S((i, j), {k}) represents the probability that Monty opens door k given that the prize is behind door i and you have chosen door j. If you have chosen curtain 1 then we have the partial data given by

\[
\begin{array}{lll}
S((1, 1), \{1\}) = 0 & S((1, 1), \{2\}) = \frac{1}{2} & S((1, 1), \{3\}) = \frac{1}{2} \\
S((2, 1), \{1\}) = 0 & S((2, 1), \{2\}) = 0 & S((2, 1), \{3\}) = 1 \\
S((3, 1), \{1\}) = 0 & S((3, 1), \{2\}) = 1 & S((3, 1), \{3\}) = 0.
\end{array}
\]

Complete the table, as necessary, to compute the inference conditional I: O → D × C, and conclude that if Monty opens either curtain 2 or 3 it is in your best interest to switch doors.
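
One possible way to check an answer to this exercise numerically is sketched below. It assumes the contestant has chosen curtain 1, so that PD ⊗ PC is supported on pairs (d, 1) and the rows of S for other choices lie on a set of measure zero and can be left unspecified, as in Example 5. The entry S((1, 1), {3}) = 1/2 is taken so that the first row sums to one.

```python
from fractions import Fraction as F

P_D = {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)}
doors = (1, 2, 3)

# S[d][o]: probability Monty opens door o when the prize is behind door d
# and the contestant chose door 1 (the only choice with positive mass).
S = {1: {1: F(0), 2: F(1, 2), 3: F(1, 2)},
     2: {1: F(0), 2: F(0), 3: F(1)},
     3: {1: F(0), 2: F(1), 3: F(0)}}

# Joint mass of (prize door d, choice 1, opened door o) and the marginal on O.
J = {(d, o): P_D[d] * S[d][o] for d in doors for o in doors}
P_O = {o: sum(J[(d, o)] for d in doors) for o in doors}

# Inference map, defined only where P_O is positive (door 1 is never opened).
I = {o: {d: J[(d, o)] / P_O[o] for d in doors} for o in doors if P_O[o] > 0}

# If Monty opens door 3, switching means taking door 2, and vice versa.
win_by_switching = {3: I[3][2], 2: I[2][3]}
win_by_staying = {o: I[o][1] for o in I}
```

Under these assumptions switching wins with probability 2/3 whichever curtain Monty opens, while staying wins with probability 1/3.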

5 The Tensor Product


Given any function f: X → Y the graph of f is defined as the set function

\[ \Gamma_f \colon X \longrightarrow X \times Y, \qquad x \mapsto (x, f(x)). \]

By our previous notation Γf = ⟨IdX, f⟩. If g: Y → X is any function we also refer to the set function

\[ \Gamma_g \colon Y \longrightarrow X \times Y, \qquad y \mapsto (g(y), y) \]

as a graph function.
Any fixed x ∈ X determines a constant function x: Y → X sending every y ∈ Y to x. These functions are always measurable and consequently determine “constant” graph functions Γx: Y → X × Y. Similarly, every fixed y ∈ Y determines a constant graph function Γy: X → X × Y. Together, these constant graph functions can be used to define a σ-algebra on the set X × Y which is finer (larger) than the product σ-algebra ΣX×Y. Let X ⊗ Y denote the set X × Y endowed with the largest σ-algebra structure such that all the constant graph functions Γx: Y → X ⊗ Y and Γy: X → X ⊗ Y are measurable. We say this σ-algebra on X ⊗ Y is coinduced by the maps {Γx: Y → X × Y}x∈X and {Γy: X → X × Y}y∈Y. Explicitly, this σ-algebra is given by

\[ \Sigma_{X \otimes Y} = \bigcap_{x \in X} \Gamma_{x*} \Sigma_Y \ \cap \ \bigcap_{y \in Y} \Gamma_{y*} \Sigma_X, \tag{19} \]

where for any function f: W → Z,

\[ f_* \Sigma_W = \{ C \in 2^Z \mid f^{-1}(C) \in \Sigma_W \}. \tag{20} \]

This is in contrast to the smallest σ-algebra on X × Y , defined in Section 2.1 so that the
two projection maps {πX ∶ X × Y → X, πY ∶ X × Y → Y } are measurable. Such a σ-algebra is
said to be induced by the projection maps, or simply referred to as the initial σ-algebra.
The following result on coinduced σ-algebras is used repeatedly.

Lemma 7. Let the σ-algebra of Y be coinduced by a collection of maps {fi ∶ Xi → Y }i∈I .


Then any map g∶ Y → Z is measurable if and only if the composition g ○ fi is measurable
for each i ∈ I.

Proof. Consider the diagram

[Diagram: the arrows fi: Xi → Y, g: Y → Z, and the composite g ○ fi: Xi → Z.]

If B ∈ ΣZ then g−1(B) ∈ ΣY if and only if fi−1(g−1(B)) ∈ ΣXi for every i ∈ I.


This result is used frequently when Y in the above diagram is replaced by a tensor product space X ⊗ Y. For example, using this lemma it follows that the projection maps πY: X ⊗ Y → Y and πX: X ⊗ Y → X are both measurable because the diagrams in Figure 3 commute.

[Diagrams: on the left, Γx: Y → X ⊗ Y followed by πX: X ⊗ Y → X equals the constant function x: Y → X; on the right, Γy: X → X ⊗ Y followed by πY: X ⊗ Y → Y equals the constant function y: X → Y.]

Figure 3: The commutativity of these diagrams, together with the measurability of the
constant functions and constant graph functions, implies the projection maps πX and πY
are measurable.

By the measurability of the projection maps and the universal property of the product,
it follows the identity mapping on the set X × Y yields a measurable function

\[ X \otimes Y \xrightarrow{\ \mathrm{id}\ } X \times Y \]

called the restriction of the σ-algebra. In contrast, the identity function X × Y → X ⊗ Y
is not necessarily measurable. Given any probability measure P on X ⊗ Y the restriction
mapping induces the pushforward probability measure δid ○ P = P (id−1 (⋅)) on the product
σ-algebra.

5.1 Graphs of Conditional Probabilities


The tensor product of two probability measures P: 1 → X and Q: 1 → Y was defined in Equations 5 and 6 as the joint distribution on the product σ-algebra by either of the expressions

\[ (P \ltimes Q)(\varsigma) = \int_{y \in Y} P(\Gamma_y^{-1}(\varsigma)) \, dQ \qquad \forall \varsigma \in \Sigma_{X \times Y} \]

and

\[ (P \rtimes Q)(\varsigma) = \int_{x \in X} Q(\Gamma_x^{-1}(\varsigma)) \, dP \qquad \forall \varsigma \in \Sigma_{X \times Y} \]

which are equivalent on the product σ-algebra. Here we have introduced the new notation of left tensor ⋉ and right tensor ⋊ because we can extend these definitions to be defined on the tensor σ-algebra, though in general the equivalence of these two expressions may no longer hold true. These definitions can be extended to conditional probability measures P: Z → X and Q: Z → Y trivially by conditioning on a point z ∈ Z,

\[ (P \ltimes Q)(\varsigma \mid z) = \int_{y \in Y} P_z(\Gamma_y^{-1}(\varsigma)) \, dQ_z \qquad \forall \varsigma \in \Sigma_{X \otimes Y} \tag{21} \]

and

\[ (P \rtimes Q)(\varsigma \mid z) = \int_{x \in X} Q_z(\Gamma_x^{-1}(\varsigma)) \, dP_z \qquad \forall \varsigma \in \Sigma_{X \otimes Y} \tag{22} \]
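which are equivalent on the product σ-algebra but not on the tensor σ-algebra. However in the special case when Z = X and P = 1X, Equations 21 and 22 do coincide on ΣX⊗Y because by Equation 21

\[
\begin{aligned}
(1_X \ltimes Q)(\varsigma \mid x) &= \int_{y \in Y} \delta_x(\Gamma_y^{-1}(\varsigma)) \, dQ_x \qquad \forall \varsigma \in \Sigma_{X \otimes Y},
  \qquad \text{where } \delta_x(\Gamma_y^{-1}(\varsigma)) = \begin{cases} 1 & \text{iff } (x, y) \in \varsigma \\ 0 & \text{otherwise} \end{cases} \\
&= \int_{y \in Y} \chi_{\Gamma_x^{-1}(\varsigma)}(y) \, dQ_x \\
&= Q_x(\Gamma_x^{-1}(\varsigma)),
\end{aligned}
\tag{23}
\]

while by Equation 22

\[
\begin{aligned}
(1_X \rtimes Q)(\varsigma \mid x) &= \int_{u \in X} Q_x(\Gamma_u^{-1}(\varsigma)) \, d\underbrace{(\delta_{\mathrm{Id}_X})_x}_{= \delta_x} \qquad \forall \varsigma \in \Sigma_{X \otimes Y} \\
&= Q_x(\Gamma_x^{-1}(\varsigma)).
\end{aligned}
\tag{24}
\]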
In this case we denote the common conditional by ΓQ , called the graph of Q by analogy
to the graph of a function, and this map gives the commutative diagram in Figure 4.

[Diagram: the arrows 1X: X → X, ΓQ: X → X ⊗ Y, Q: X → Y, and the projections δπX: X ⊗ Y → X and δπY: X ⊗ Y → Y.]

Figure 4: The tensor product of a conditional with an identity map in P.

The commutativity of the diagram in Figure 4 follows from

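\[
\begin{aligned}
(\delta_{\pi_X} \circ \Gamma_Q)(A \mid x) &= \int_{(u, v) \in X \otimes Y} \delta_{\pi_X}(A \mid (u, v)) \, d\underbrace{(\Gamma_Q)_x}_{= Q_x \Gamma_x^{-1}} \\
&= \int_{v \in Y} \delta_{\pi_X}(A \mid \Gamma_x(v)) \, dQ_x \\
&= \int_{v \in Y} \delta_x(A) \, dQ_x \\
&= \delta_x(A) \int_Y dQ_x \\
&= 1_X(A \mid x)
\end{aligned}
\tag{25}
\]

and

\[
\begin{aligned}
(\delta_{\pi_Y} \circ \Gamma_Q)(B \mid x) &= \int_{(u, v) \in X \otimes Y} \delta_{\pi_Y}(B \mid (u, v)) \, d((\Gamma_Q)_x) \\
&= \int_{v \in Y} \delta_{\pi_Y}(B \mid (x, v)) \, dQ_x \\
&= \int_{v \in Y} \chi_B(v) \, dQ_x \\
&= Q(B \mid x).
\end{aligned}
\tag{26}
\]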
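
On finite discrete spaces the tensor and product σ-algebras coincide (both are the full power set), so the graph ΓQ and the two marginal computations above reduce to finite sums. The following sketch, with hypothetical spaces and a hypothetical conditional `Q`, represents ΓQ(· ∣ x) as the pushforward of Qx along v ↦ (x, v) and checks both triangles of Figure 4.

```python
from fractions import Fraction as F

# Hypothetical finite spaces and conditional Q: X -> Y, for illustration only.
X = ["a", "b"]
Y = ["s", "t"]
Q = {"a": {"s": F(1, 3), "t": F(2, 3)},
     "b": {"s": F(1), "t": F(0)}}

def graph(Q):
    """Gamma_Q(. | x): the pushforward of Q_x along the graph map v -> (x, v)."""
    return {x: {(u, v): (Q[x][v] if u == x else F(0)) for u in X for v in Y}
            for x in X}

G = graph(Q)

# Composites with the two projections, computed as marginals of Gamma_Q(. | x).
marg_X = {x: {u: sum(G[x][(u, v)] for v in Y) for u in X} for x in X}
marg_Y = {x: {v: sum(G[x][(u, v)] for u in X) for v in Y} for x in X}

# The identity arrow 1_X sends x to the Dirac measure at x.
identity = {x: {u: F(1) if u == x else F(0) for u in X} for x in X}
```

Here `marg_X == identity` and `marg_Y == Q` are the finite counterparts of Equations 25 and 26.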

5.2 A Tensor Product of Conditionals


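Given any conditional P: Z → Y in P we can define a tensor product 1X ⊗ P by

\[ (1_X \otimes P)(A \mid (x, z)) = P(\Gamma_x^{-1}(A) \mid z) \qquad \forall A \in \Sigma_{X \otimes Y} \]

which makes the diagram in Figure 5 commute and justifies the notation 1X ⊗ P (and explains also why the notation ΓQ for the graph map was used to distinguish it from this map).

[Diagram: top row, the projections δπX: X ⊗ Z → X and δπZ: X ⊗ Z → Z; vertical arrows 1X: X → X, 1X ⊗ P: X ⊗ Z → X ⊗ Y, and P: Z → Y; bottom row, the projections δπX: X ⊗ Y → X and δπY: X ⊗ Y → Y.]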

Figure 5: The tensor product of conditional 1X and P in P.

This tensor product 1X ⊗ P essentially comes from the diagram

\[ Z \xrightarrow{\ P\ } Y \xrightarrow{\ \delta_{\Gamma_x}\ } X \otimes Y, \]

where given a measurable set A ∈ ΣX⊗Y one pulls it back under the constant graph function Γx and then applies the conditional P to the pair (Γx−1(A) ∣ z).

5.3 Symmetric Monoidal Categories


A category C is said to be a monoidal category if it possesses the following three properties:

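1. There is a bifunctor

\[
\begin{array}{rccc}
\square \colon & C \times C & \to & C \\
\colon_{ob} & (X, Y) & \mapsto & X \,\square\, Y \\
\colon_{ar} & (X, Y) \xrightarrow{(f, g)} (X', Y') & \mapsto & X \,\square\, Y \xrightarrow{f \square g} X' \,\square\, Y'
\end{array}
\]
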
which is associative up to isomorphism,

◻ (◻ × IdC ) ≅ ◻(IdC × ◻)∶ C × C × C → C

where IdC is the identity functor on C. Hence for every triple X, Y, Z of objects,
there is an isomorphism

aX,Y,Z ∶ (X ◻ Y ) ◻ Z Ð→ X ◻ (Y ◻ Z)

which is natural in X, Y, Z. This condition is called the associativity axiom.

2. There is an object 1 ∈ob C such that for every object X ∈ob C there is a left unit isomorphism

lX: 1 ◻ X Ð→ X

and a right unit isomorphism

rX: X ◻ 1 Ð→ X.

These two conditions are called the unity axioms.

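3. For every quadruple of objects X, Y, W, Z the diagram

[Pentagon diagram: aX◻Y,W,Z: ((X ◻ Y) ◻ W) ◻ Z → (X ◻ Y) ◻ (W ◻ Z) followed by aX,Y,W◻Z: (X ◻ Y) ◻ (W ◻ Z) → X ◻ (Y ◻ (W ◻ Z)) along one side; aX,Y,W ◻ IdZ: ((X ◻ Y) ◻ W) ◻ Z → (X ◻ (Y ◻ W)) ◻ Z, then aX,Y◻W,Z: (X ◻ (Y ◻ W)) ◻ Z → X ◻ ((Y ◻ W) ◻ Z), then IdX ◻ aY,W,Z: X ◻ ((Y ◻ W) ◻ Z) → X ◻ (Y ◻ (W ◻ Z)) along the other.]
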
commutes. This is called the associativity coherence condition.

If C is a monoidal category under a bifunctor ◻ and identity 1 it is denoted (C, ◻, 1). A monoidal category (C, ◻, 1) is symmetric if for every pair of objects X, Y there exists an isomorphism

\[ s_{X,Y} \colon X \,\square\, Y \longrightarrow Y \,\square\, X \tag{27} \]
which is natural in X and Y , and the three diagrams in Figure 6 commute.

[Three commutative diagrams: (i) the hexagon relating sX,Y ◻ IdZ, IdY ◻ sX,Z, and sX,Y◻Z through the associators aX,Y,Z, aY,X,Z, and aY,Z,X; (ii) the triangle lX ○ sX,1 = rX on X ◻ 1; (iii) the triangle sY,X ○ sX,Y = Id on X ◻ Y.]

Figure 6: The additional conditions required for a symmetric monoidal category.

The main example of a symmetric monoidal category is the category of sets, Set,
under the cartesian product with identity the terminal object 1 = {⋆}. Similarly, for the
categories Meas and P, the tensor product ⊗ along with the terminal object 1 acting as
the identity element make both (Meas, ⊗, 1) and (P, ⊗, 1) symmetric monoidal categories
with the above conditions straightforward to verify. This provides a good exercise for the
reader new to categorical methods.

6 Function Spaces
For X, Y ∈ob Meas let Y^X denote the set of all measurable functions from X to Y endowed with the σ-algebra induced by the set of all point evaluation maps {evx}x∈X, where

\[ \mathrm{ev}_x \colon Y^X \longrightarrow Y, \qquad f \mapsto f(x). \]

Explicitly, the σ-algebra on Y^X is given by

\[ \Sigma_{Y^X} = \sigma\Big( \bigcup_{x \in X} \mathrm{ev}_x^{-1} \Sigma_Y \Big), \tag{28} \]

where for any function f: W → Z we have

\[ f^{-1} \Sigma_Z = \{ B \in 2^W \mid \exists C \in \Sigma_Z \text{ with } f^{-1}(C) = B \} \tag{29} \]

and σ(B) denotes the σ-algebra generated by any collection B of subsets.

6 FUNCTION SPACES 32

Formally we should use an alternative notation such as ⌜f ⌝ to distinguish between


the measurable function f ∶ X → Y and the point ⌜f ⌝∶ 1 → Y X of the function space Y X .13
However, it is common practice to let the context define which arrow we are referring to
and we shall often follow this practice unless the distinction is critical to avoid ambiguity
or awkward expressions.
An alternative notation to Y X is ∏x∈X Yx where each Yx is a copy of Y . The relation-
ship between these representations is that in the former we view the elements as functions
f while in the latter we view the elements as the indexed images of a function, {f (x)}x∈X .
Either representation determines the other since a function is uniquely specified by its
values.
Because the σ-algebra structure on tensor product spaces was defined precisely so that
the constant graph functions were all measurable, it follows that in particular the constant
graph functions $\Gamma_f \colon X \to X \otimes Y^X$ sending $x \mapsto (x, f)$ are measurable. (The graph function
symbol $\Gamma_\cdot$ is overloaded and will need to be specified directly (domain and codomain)
when the context is not clear.)
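Define the evaluation function

\[ ev_{X,Y} \colon X \otimes Y^X \to Y, \qquad (x, f) \mapsto f(x) \tag{30} \]

and observe that for every $⌜f⌝ \in Y^X$ the right hand Meas diagram in Figure 7 is commutative
as a set mapping, $f = ev_{X,Y} \circ \Gamma_f$.

[Diagram: top row $Y^X$ and $X \otimes Y^X \xrightarrow{ev_{X,Y}} Y$; bottom row $1$ and $X \cong X \otimes 1$; vertical arrows $⌜f⌝ \colon 1 \to Y^X$ and $\Gamma_f \cong Id_X \otimes ⌜f⌝ \colon X \to X \otimes Y^X$; diagonal arrow $f \colon X \to Y$.]

Figure 7: The defining characteristic property of the evaluation function ev for graphs.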

By rotating the diagram in Figure 7 and also considering the constant graph functions
Γx , the right hand side of the diagram in Figure 8 also commutes for every x ∈ X.
13
Having defined Y X to be the set of all measurable functions f ∶ X → Y it seems contradictory to then
define evx as acting on “points” ⌜f ⌝∶ 1 → Y X rather than the functions f themselves! The apparent self
contradictory definition arises because we are interspersing categorical language with set theory; when
defining a set function, like evx , it is implied that it acts on points which are defined as “global elements”
1 → Y X . A global element is a map with domain 1. This is the categorical way of defining points rather
than using the elementhood operator “∈”. Thus, to be more formal, we could have defined evx , where
x∶ 1 → X is any global element, by evx ○ ⌜f ⌝ = ⌜f (x)⌝∶ 1 → Y , where f (x) = f ○ x.
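[Diagram: $X \xrightarrow{\Gamma_f} X \otimes Y^X \xleftarrow{\Gamma_x} Y^X$ along the top, with $f \colon X \to Y$, $ev_{X,Y} \colon X \otimes Y^X \to Y$, and $ev_x \colon Y^X \to Y$ all mapping into a common vertex $Y$.]

Figure 8: The commutativity of both triangles, the measurability of $f$ and $ev_x$, and the
induced σ-algebra of $X \otimes Y^X$ imply the measurability of ev.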

Since $f$ and $\Gamma_f$ are measurable, as are $ev_x$ and $\Gamma_x$, it follows by Lemma 7 that $ev_{X,Y}$ is
measurable since the constant graph functions generate the σ-algebra of $X \otimes Y^X$. More
generally, given any measurable function $f \colon X \otimes Z \to Y$ there exists a unique measurable
map $\tilde{f} \colon Z \to Y^X$ defined by $\tilde{f}(z) = ⌜f(\cdot, z)⌝ \colon 1 \to Y^X$, where $f(\cdot, z) \colon X \to Y$ sends $x \mapsto f(x, z)$.
This map $\tilde{f}$ is measurable because the σ-algebra is generated by the point evaluation maps
$ev_x$ and the diagram

[Diagram: square with $\tilde{f} \colon Z \to Y^X$, $ev_x \colon Y^X \to Y$, $\Gamma_x \colon Z \to X \otimes Z$, and $f \colon X \otimes Z \to Y$.]

commutes so that $\tilde{f}^{-1}(ev_x^{-1}(B)) = (f \circ \Gamma_x)^{-1}(B) \in \Sigma_Z$.
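Conversely given any measurable map $g \colon Z \to Y^X$, it follows the composite

\[ ev_{X,Y} \circ (Id_X \otimes g) \]

is a measurable map. This sets up a bijective correspondence between measurable functions
denoted by

\[ \frac{Z \xrightarrow{\ \tilde{f}\ } Y^X}{X \otimes Z \xrightarrow{\ f\ } Y} \]

or the diagram in Figure 9.

[Diagram: top row $Y^X$ and $X \otimes Y^X \xrightarrow{ev_{X,Y}} Y$; bottom row $Z$ and $X \otimes Z$; vertical arrows $\tilde{f} \colon Z \to Y^X$ and $Id_X \otimes \tilde{f} \colon X \otimes Z \to X \otimes Y^X$; diagonal arrow $f \colon X \otimes Z \to Y$.]

Figure 9: The evaluation function ev sets up a bijective correspondence between the two
measurable maps $f$ and $\tilde{f}$.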

The measurable map $\tilde{f}$ is called the adjunct of $f$ and vice versa, so that $\tilde{\tilde{f}} = f$. Whether
we use the tilde notation for the map $X \otimes Z \to Y$ or the map $Z \to Y^X$ is irrelevant; it
simply indicates that it is the map uniquely determined by the other map.
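The correspondence between $f$ and its adjunct $\tilde{f}$ is the familiar currying of a two-argument function. The following is a minimal sketch at the level of plain Python functions, ignoring measurability entirely; the names `ev`, `adjunct_up`, `adjunct_down`, and the sample function `f` are illustrative assumptions and not notation from the paper.

```python
def ev(x, f):
    """Evaluation map ev_{X,Y}: (x, f) |-> f(x)."""
    return f(x)


def adjunct_up(f):
    """Given f: X x Z -> Y, return f~: Z -> Y^X with f~(z) = f(., z)."""
    return lambda z: (lambda x: f(x, z))


def adjunct_down(g):
    """Given g: Z -> Y^X, return ev o (Id_X (x) g): X x Z -> Y."""
    return lambda x, z: ev(x, g(z))


# Hypothetical example function on X x Z = R x R.
def f(x, z):
    return 3 * x + z


f_tilde = adjunct_up(f)          # Z -> Y^X
f_back = adjunct_down(f_tilde)   # X x Z -> Y, recovers f pointwise
```

Applying `adjunct_down` to `adjunct_up(f)` returns a function that agrees with `f` on every input, which is the set-level content of the bijection; the measurability arguments in the text are what lift it to Meas.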
The map $ev_{X,Y}$, which we will usually abbreviate to simply ev with the pair $(X, Y)$
obvious from context, is called a universal arrow because of this property; it mediates the
relationship between the two maps $f$ and $\tilde{f}$. In the language of category theory using
functors, for a fixed object $X$ in Meas, the collection of maps $\{ev_{X,Y}\}_{Y \in_{\mathrm{ob}} \mathrm{Meas}}$ forms the
components of a natural transformation $ev_{X,-} \colon (X \otimes -) \circ (-)^X \to Id_{\mathrm{Meas}}$. In this situation we
say the pair of functors $\{X \otimes -, (-)^X\}$ forms an adjunction denoted $X \otimes - \dashv (-)^X$. This adjunction
$X \otimes - \dashv (-)^X$ is the defining property of a closed category. We previously showed Meas was
symmetric monoidal and combined with the closed category structure we conclude that
Meas is a symmetric monoidal closed category (SMCC). Subsequently we will show that
P satisfies a weak version of SMCC, where uniqueness cannot be obtained.

The Graph Map. Given the importance of graph functions when working with tensor
spaces we define the graph map

\[ \Gamma_\cdot \colon Y^X \to (X \otimes Y)^X, \qquad ⌜f⌝ \mapsto ⌜\Gamma_f⌝. \]

Thus $\Gamma_\cdot(⌜f⌝) = ⌜\Gamma_f⌝$ gives the name of the graph $\Gamma_f \colon X \to X \otimes Y$.

The measurability of $\Gamma_\cdot$ follows in part from the commutativity of the diagram in
Figure 10, where the map $\hat{ev}_x \colon (X \otimes Y)^X \to X \otimes Y$ denotes the standard point evaluation
map sending g ↦ (x, g(x)).
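[Diagram: $\Gamma_\cdot \colon Y^X \to (X \otimes Y)^X$ along the top; $ev_x \colon Y^X \to Y$ and $\hat{ev}_x \colon (X \otimes Y)^X \to X \otimes Y$ down the sides; $\Gamma_x \colon Y \to X \otimes Y$ along the bottom; diagonal arrow $\langle x, ev_x \rangle \colon Y^X \to X \otimes Y$.]

Figure 10: The relationship between the graph map, point evaluations, and constant
graph maps.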

We have used the notation $\hat{ev}_x$ simply to distinguish this map from the map $ev_x$ which
has a different domain and codomain. The σ-algebra of $(X \otimes Y)^X$ is determined by these
point evaluation maps $\hat{ev}_x$ so that they are measurable. The maps $ev_x$ and $\Gamma_x$ are both
measurable and hence their composite $\Gamma_x \circ ev_x = \langle x, ev_x \rangle$ is also measurable.
To prove the measurability of the graph map we use the dual to Lemma 7 obtained
by reversing all the arrows in that lemma to give

Lemma 8. Let the σ-algebra of $Y$ be induced by a collection of maps $\{g_i \colon Y \to Z_i\}_{i \in I}$.
Then any map $f \colon X \to Y$ is measurable if and only if the composition $g_i \circ f$ is measurable
for each $i \in I$.

Proof. Consider the diagram

[Diagram: triangle with $f \colon X \to Y$, $g_i \colon Y \to Z_i$, and composite $g_i \circ f \colon X \to Z_i$.]

The necessary condition is obvious. Conversely if $g_i \circ f$ is measurable for each $i \in I$ then
$f^{-1}(g_i^{-1}(B)) \in \Sigma_X$ for every $B \in \Sigma_{Z_i}$. Because the σ-algebra $\Sigma_Y$ is generated by the measurable sets $g_i^{-1}(B)$
it follows that every measurable $U \in \Sigma_Y$ also satisfies $f^{-1}(U) \in \Sigma_X$ so $f$ is measurable.
Applying this lemma to the diagram in Figure 10 with the maps $g_i$ corresponding to the
point evaluation maps $\hat{ev}_x$ and the map $f$ being the graph map $\Gamma_\cdot$ proves the graph map
is indeed measurable.
The measurability of both of the maps ev and $\Gamma_\cdot$ yields corresponding P maps $\delta_{ev}$ and
$\delta_{\Gamma_\cdot}$ that play a role in the construction of sampling distributions defined on any hypothesis
space that involves function spaces.

6.1 Stochastic Processes


Having defined function spaces $Y^X$, we are now in a position to define stochastic processes
using categorical language. The elementary definition given next suffices to develop all
the basic concepts one usually associates with traditional ML and allows for relatively
elegant proofs. Subsequently, using the language of functors, a more general definition
will be given, of which the following definition can be viewed as a special instance.

Definition 9. A stochastic process is a P map

\[ P \colon 1 \to Y^X \]

representing a probability measure on the function space $Y^X$. A parameterized stochastic
process is a P map

\[ P \colon Z \to Y^X \]

representing a family of stochastic processes parameterized by $Z$.

Just as we did for the category Meas, we seek a bijective correspondence between two
P maps, a stochastic process $P$ and a corresponding conditional probability measure $\overline{P}$.
In the P case, however, the two morphisms do not uniquely determine each other, and
we are only able to obtain a symmetric monoidal weakly closed category (SMwCC).
In Section 5.2 the tensor product $1_X \otimes P$ was defined, and by replacing the space “Y”
in that definition with a function space $Y^X$ we obtain the tensor product map

\[ 1_X \otimes P \colon X \otimes Z \to X \otimes Y^X \]

given by (using the same formula as in Section 5.2)

\[ (1_X \otimes P)(U \mid (x, z)) = P(\Gamma_x^{-1}(U) \mid z). \]

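For a given parameterized stochastic process $P \colon Z \to Y^X$ we obtain the tensor product
$1_X \otimes P$, and composing this map with the deterministic P map determined by the
evaluation map we obtain the composite $\overline{P}$ in the diagram in Figure 11.

[Diagram: top row $Y^X$ and $X \otimes Y^X \xrightarrow{\delta_{ev}} Y$; bottom row $Z$ and $X \otimes Z$; vertical arrows $P \colon Z \to Y^X$ and $1_X \otimes P \colon X \otimes Z \to X \otimes Y^X$; diagonal arrow $\overline{P} \colon X \otimes Z \to Y$.]

Figure 11: The defining characteristic property of the evaluation function ev for tensor
products of conditionals in P.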

Thus

\[ \begin{aligned}
\overline{P}(B \mid (x, z)) &= \int_{(u,f) \in X \otimes Y^X} (\delta_{ev})_B(u, f) \, d(1_X \otimes P)_{(x,z)} \\
&= \int_{f \in Y^X} \delta_{ev}(B \mid \Gamma_x(f)) \, dP_z \\
&= \int_{f \in Y^X} \chi_B(ev_x(f)) \, dP_z \\
&= P(ev_x^{-1}(B) \mid z)
\end{aligned} \]

and every parameterized stochastic process determines a conditional probability
$\overline{P} \colon X \otimes Z \to Y$.
Conversely, given a conditional probability $\overline{P} \colon X \otimes Z \to Y$, we wish to define a parameterized
stochastic process $P \colon Z \to Y^X$. We might be tempted to define such a stochastic
process by letting

\[ P(ev_x^{-1}(B) \mid z) = \overline{P}(B \mid (x, z)), \tag{31} \]

but this does not give a well-defined measure for each $z \in Z$. Recall that a probability
measure cannot be unambiguously defined on an arbitrary generating set for the σ-algebra.
We can, however, uniquely define a measure on a π-system14 and then use Dynkin’s π-λ
theorem to extend to the entire σ-algebra (e.g., see [10]). This construction requires the
following definition.
Definition 10. Given a measurable space (X, ΣX ), we can define an equivalence relation
on X where x ∼ y if x ∈ A ⇔ y ∈ A for all A ∈ ΣX . We call an equivalence class of this
relation an atom of X. For an arbitrary set A ⊂ X, we say that A is
● separated if for any two points x, y ∈ A, there is some B ∈ ΣX with x ∈ B and y ∉ B
● unseparated if A is contained in some atom of X.
This notion of separation of points is important for finding a generating set on which
we can define a parameterized stochastic process. The key lemma which we state here
without proof15 is the following.
14
A π-system on X is a nonempty collection of subsets of X that is closed under finite intersections.
15
This lemma and additional work on symmetric monoidal weakly closed structures on P will appear
in a future paper.
Lemma 11. The class of subsets of $Y^X$

\[ \mathcal{E} = \emptyset \cup \Big\{ \bigcap_{i=1}^{n} ev_{x_i}^{-1}(A_i) \;\Big|\; \{x_i\}_{i=1}^{n} \text{ is separated in } X,\ A_i \in \Sigma_Y \text{ is nonempty and proper} \Big\} \]

is a π-system which generates the evaluation σ-algebra on $Y^X$.

We can now define many parameterized stochastic processes “adjoint” to $\overline{P}$, with
the only requirement being that Equation 31 is satisfied. This is not a deficiency in P,
however, but rather shows that we have ample flexibility in this category.

Remark 12. Even when such an expression does provide a well-defined measure as in
the case of finite spaces, it does not yield a unique $P$. Appendix B provides an elementary
example illustrating the failure of the bijective correspondence property in this case. Also
observe that the proposed defining Equation 31 can be extended to

\[ P\Big( \bigcap_{i=1}^{n} ev_{x_i}^{-1}(B_i) \;\Big|\; z \Big) = \prod_{i=1}^{n} \overline{P}(B_i \mid (x_i, z)) \]

which does provide a well-defined measure by Lemma 11. However, it still does not provide
a bijective correspondence, which is clear as the right hand side implies an independence
condition which a stochastic process need not satisfy. It does provide a bijective
correspondence if we impose an additional independence condition/assumption.
Alternatively, by imposing the additional condition that for each $z \in Z$, $P_z$ is a Gaussian
process we can obtain a bijective correspondence. In Section 6.3 we illustrate in detail
how a joint normal distribution on a finite dimensional space gives rise to a stochastic
process, and in particular a GP.
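The non-uniqueness described in Remark 12 can be checked by hand on the smallest possible case. The sketch below is an illustration constructed here (it is not the example of Appendix B): it assumes $X = Y = \{0, 1\}$ with discrete σ-algebras and $Z = 1$, represents a function $f \in Y^X$ as the pair $(f(0), f(1))$, and compares two different measures on $Y^X$ that have the same adjunct. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

X = (0, 1)
Y = (0, 1)
# A function f: X -> Y is stored as the tuple (f(0), f(1)).
functions = list(product(Y, repeat=len(X)))

half = Fraction(1, 2)
# Conditional probability P_bar(. | x): uniform on Y for each x.
P_bar = {x: {y: half for y in Y} for x in X}

# Process 1: the independent (product) measure from the extended Equation 31.
P_indep = {f: P_bar[0][f[0]] * P_bar[1][f[1]] for f in functions}

# Process 2: all mass on the two constant functions (perfectly correlated).
P_const = {f: (half if f[0] == f[1] else Fraction(0)) for f in functions}


def adjunct(P):
    """Compute P(ev_x^{-1}({y})) for every x and y, i.e. the adjunct of P."""
    return {
        x: {y: sum(p for f, p in P.items() if f[x] == y) for y in Y}
        for x in X
    }
```

Both `adjunct(P_indep)` and `adjunct(P_const)` equal `P_bar`, although the two measures on $Y^X$ differ, so Equation 31 alone cannot single out one process.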

Often, we are able to exploit the weak correspondence and use the conditional probability
$\overline{P} \colon X \to Y$ rather than the stochastic process $P \colon 1 \to Y^X$. While carrying less
information, the conditional probability is easier to reason with because of our familiarity
with Bayes’ rule (which uses conditional probabilities) and our unfamiliarity with
measures on function spaces.
Intuitively it is easier to work with the conditional probability $\overline{P}$ as we can represent
the graph of such functions. In Figure 12 the top diagram shows a prior probability
$P \colon 1 \to \mathbb{R}^{[0,10]}$, which is a stochastic process, depicted by representing its adjunct and
illustrating its expected value as well as its 2σ error bars on each coordinate. The bottom
diagram in the same figure illustrates a parameterized stochastic process where the
parameterization is over four measurements. Using the above notation, $Z = \prod_{i=1}^{4} (X \times Y)_i$ and
$P(\cdot \mid \{(x_i, y_i)\}_{i=1}^{4})$ is a posterior probability measure given four measurements $\{(x_i, y_i)\}_{i=1}^{4}$.
These diagrams were generated under the hypothesis that the process is a GP.
Figure 12: The top diagram shows a (prior) stochastic process represented by its adjunct
$\overline{P} \colon [0, 10] \to \mathbb{R}$ and characterized by its expected value and covariance. The bottom
diagram shows a parameterized stochastic process (the same process), also expressed by its
adjunct, where the parameterization is over four measurements.

6.2 Gaussian Processes


To further explicate the use of stochastic processes we consider the special case of a
stochastic process that has proven to be of extensive use for modeling in ML problems.
To be able to compute integrals, notably expectations, we will assume hereafter that $Y = \mathbb{R}$
and $X = \mathbb{R}^n$ for some integer $n$, or a compact subset thereof, with the standard Borel
σ-algebras. We use the bold notation x to denote a vector in X. Because ML applications
often stress scalar valued functions, we take $Y = \mathbb{R}$ and write elements in $Y$
as $y$. At any rate, the generalization to an arbitrary Euclidean space amounts to carrying
around vector notation and using vector valued integrals in the following.
For any finite subset X0 ⊂ X the set X0 can be given the subspace σ-algebra which is
the induced σ-algebra of the inclusion map ι∶ X0 ↪ X. Given any measurable f ∶ X → Y
the restriction of f to X0 is f ∣X0 = f ○ ι and “substitution” of an element x ∈ X into
f ∣X0 is precomposition by the point x∶ 1 → X giving the commutative Meas diagram in
Figure 13, where the composite f ∣X0 (x) = f ○ι○x is equivalent to the map evx (⌜f ⌝)∶ 1 → Y .

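[Diagram: $1 \xrightarrow{x} X_0 \xrightarrow{\iota} X$ along the top, with $f(x) = f \circ \iota \circ x = ev_x(⌜f⌝) \colon 1 \to Y$, $f|_{X_0} \colon X_0 \to Y$, and $f \colon X \to Y$ all mapping into a common vertex $Y$.]

Figure 13: The substitution/evaluation relation.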

Thus the inclusion map ι induces a measurable map

\[ Y^\iota \colon Y^X \to Y^{X_0}, \qquad ⌜f⌝ \mapsto ⌜f \circ \iota⌝, \]

which in turn induces the deterministic map $\delta_{Y^\iota} \colon Y^X \to Y^{X_0}$ in P. For any probability
measure $P$ on the function space $Y^X$, we have the composite of P arrows shown in the left
diagram of Figure 14. For a singleton set $X_0 = \{x\}$ this diagram reduces to the diagram
on the right in Figure 14.

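[Diagram, left: $P \colon 1 \to Y^X$ followed by $\delta_{Y^\iota} \colon Y^X \to Y^{X_0}$, with composite $P\iota^{-1} \colon 1 \to Y^{X_0}$. Right: $P \colon 1 \to Y^X$ followed by $\delta_{ev_x} \colon Y^X \to Y$, with composite $P ev_x^{-1} \colon 1 \to Y$.]

Figure 14: The defining property of a Gaussian Process is the commutativity of a P
diagram.

Given $m \in Y^X$ and $k$ a bivariate function $k \colon X \times X \to \mathbb{R}$, let $m|_{X_0} = m \circ \iota \in Y^{X_0}$ denote
the restriction of $m$ to $X_0$ and similarly let $k|_{X_0} = k \circ (\iota \times \iota)$ denote the restriction of $k$ to
$X_0 \times X_0$.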
Definition 13. A Gaussian process on $Y^X$ is a probability measure $P$ on the function
space $Y^X$, denoted $P \sim \mathcal{GP}(m, k)$, such that for all finite subsets $X_0$ of $X$ the
push forward probability measure $P\iota^{-1}$ is a (multivariate) Gaussian distribution denoted
$P\iota^{-1} \sim \mathcal{N}(m|_{X_0}, k|_{X_0})$.
A bivariate function $k$ satisfying the condition in the definition is called the covariance
function of the Gaussian process $P$ while the function $m$ is the expected value. A Gaussian
process is completely specified by its mean and covariance functions. These two functions
are defined pointwise by

\[ m(x) \triangleq E_P[ev_x] = \int_{f \in Y^X} (ev_x)(⌜f⌝) \, dP = \int_{f \in Y^X} f(x) \, dP \tag{32} \]

and by the vector valued integral

\[ \begin{aligned}
k(x, x') &\triangleq E_P[(ev_x - E_P[ev_x])(ev_{x'} - E_P[ev_{x'}])] \\
&= \int_{f \in Y^X} (f(x) - m(x)) (f(x') - m(x'))^{T} \, dP.
\end{aligned} \tag{33} \]

Abstractly, if $P$ is given, then we could determine $m$ and $k$ by these two equations.
However in practice it is the two functions $m$ and $k$ which are used to specify a GP
$P$, rather than $P$ determining $m$ and $k$. For general stochastic processes higher order
moments $E_P[ev_x^{\,j}]$, with $j > 1$, are necessary to characterize the process.
For the covariance function $k$ we make the following assumptions for all $x, z \in X$,

1. $k(x, z) \geq 0$,

2. $k(x, z) = k(z, x)$, and

3. $k(x, x)k(z, z) - k(x, z)^2 \geq 0$.
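Definition 13 says that a GP is known through its restrictions to finite subsets $X_0 \subset X$. A minimal NumPy sketch of that restriction follows, assuming $X = \mathbb{R}$, a zero mean function, and a squared exponential covariance; the kernel choice, the function names, and the sample points are assumptions made for illustration and are not prescribed by the text.

```python
import numpy as np


def mean_fn(x):
    """Hypothetical mean function m(x) = 0."""
    return 0.0 * x


def kernel(x, z):
    """Hypothetical squared exponential covariance k(x, z)."""
    return np.exp(-0.5 * (x - z) ** 2)


def restrict(x0):
    """Return (m|X0, k|X0) as a mean vector and covariance matrix."""
    x0 = np.asarray(x0, dtype=float)
    mean = mean_fn(x0)
    cov = kernel(x0[:, None], x0[None, :])
    return mean, cov


x0 = np.array([0.0, 1.0, 2.5])       # a finite subset X0 of X
mean, cov = restrict(x0)

# Draw five samples from the push forward N(m|X0, k|X0).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=5)

# The three assumptions on k listed above, checked on X0 x X0.
diag = np.diag(cov)
nonnegative = bool(np.all(cov >= 0))
symmetric = bool(np.allclose(cov, cov.T))
cauchy_schwarz = bool(np.all(np.outer(diag, diag) - cov ** 2 >= -1e-12))
```

Each row of `samples` is one draw of the vector $(f(x))_{x \in X_0}$; the function $f$ itself is never materialized, only its values on the chosen finite subset.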

6.3 GPs via Joint Normal Distributions16


A simple illustration of a GP as a probability measure on a function space can be given
by consideration of a joint normal distribution. Here we relate the familiar presentation
of multivariate normal distributions as expressed in the language of random variables
into the categorical framework and language, and illustrate that the resulting conditional
distributions correspond to a GP.
Let X and Y represent two vector valued real random variables having a joint normal
distribution

\[ J = \begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) \]

with $\Sigma_{11}$ and $\Sigma_{22}$ nonsingular.
16
This section is not required for an understanding of subsequent material but only provided for
purposes of linking familiar concepts and ideas with the less familiar categorical perspective.
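Represented categorically, these random variables X and Y determine distributions
which we represent by $P_1$ and $P_2$ on two measurable spaces $X = \mathbb{R}^m$ and $Y = \mathbb{R}^n$ for some
finite integers $m$ and $n$, and the various relationships between the P maps are given by the
diagram in Figure 15.

[Diagram: $J \colon 1 \to X \times Y$; $P_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}) \colon 1 \to X$; $P_2 \sim \mathcal{N}(\mu_2, \Sigma_{22}) \colon 1 \to Y$; projections $\delta_{\pi_X} \colon X \times Y \to X$ and $\delta_{\pi_Y} \colon X \times Y \to Y$; $\overline{S} \colon X \to Y$ and $\overline{I} \colon Y \to X$.]

Figure 15: The categorical characterization of a joint normal distribution.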

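Here $\overline{S}$ and $\overline{I}$ are the conditional distributions

\[ \overline{S}_x \sim \mathcal{N}\big(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\big) \]

\[ \overline{I}_y \sim \mathcal{N}\big(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big) \]

and the overline notation on the terms “S” and “I” is used to emphasize that the transposes
of both of these conditionals are GPs given by a bijective correspondence in Figure 16.

[Diagram: top row $Y^X$ and $X \otimes Y^X \xrightarrow{ev_{X,Y}} Y$; bottom row $1$ and $X \otimes 1$; vertical arrows $S \colon 1 \to Y^X$ and $\Gamma_S \colon X \otimes 1 \to X \otimes Y^X$; diagonal arrow $\overline{S} \colon X \otimes 1 \to Y$.]

Figure 16: The defining characteristic property of the evaluation function ev for graphs.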

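In the random variable description, these conditionals $\overline{S}_x$ and $\overline{I}_y$ are often represented
simply by

\[ \mu_{Y|X} = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1), \qquad \mu_{X|Y} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2) \]

and

\[ \Sigma_{Y|X} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}, \qquad \Sigma_{X|Y} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. \]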
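The conditional mean and covariance formulas can be evaluated directly. The sketch below is one way to do so with NumPy; the function name `condition`, its argument order, and the numerical values are assumptions chosen for illustration. Calling it with the blocks in the order shown gives the sampling distribution at $x$, and swapping the roles of the two blocks gives the inference map at $y$.

```python
import numpy as np


def condition(mu_a, mu_b, S_aa, S_ab, S_ba, S_bb, a):
    """Mean and covariance of the block b conditioned on the block a.

    mean = mu_b + S_ba S_aa^{-1} (a - mu_a)
    cov  = S_bb - S_ba S_aa^{-1} S_ab
    """
    gain = S_ba @ np.linalg.inv(S_aa)
    mean = mu_b + gain @ (a - mu_a)
    cov = S_bb - gain @ S_ab
    return mean, cov


# Hypothetical one dimensional blocks (m = n = 1).
mu1, mu2 = np.array([0.0]), np.array([1.0])
S11, S12 = np.array([[2.0]]), np.array([[1.0]])
S21, S22 = np.array([[1.0]]), np.array([[3.0]])

# Sampling distribution at x = 2: Y given X = x.
s_mean, s_cov = condition(mu1, mu2, S11, S12, S21, S22, np.array([2.0]))

# Inference map at y = 4: X given Y = y (roles of the blocks swapped).
i_mean, i_cov = condition(mu2, mu1, S22, S21, S12, S11, np.array([4.0]))
```

For these values the sampling distribution has mean 2 and variance 2.5, while the inference map has mean 1 and variance 5/3.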

It is easily verified that this pair {S, I} forms a sampling distribution/inference map
pair; i.e., the joint distribution can be expressed in terms of the prior X and sampling
distribution S or in terms of the prior Y and inference map I. It is clear from this
example that what one calls the sampling distribution and inference map depends upon
the perspective of what is being estimated.
In subsequent developments, we do not assume a joint normal distribution on the
spaces $X$ and $Y$. If such an assumption is reasonable, then the following constructions
are greatly simplified by the structure expressed in Figure 15. As noted previously, it is
knowledge of the relationship between the distributions $P_1$ and $P_2$ which characterizes the
joint, and obtaining it is the main modeling problem. Thus the two perspectives on the problem are
to find the conditionals, or equivalently, find the prior on $Y^X$ which specifies a function
$X \to Y$ along with the noise model which is “built into” the sampling distribution.

7 Bayesian Models for Function Estimation


We now have all the necessary tools to build several Bayesian models, both parametric
and nonparametric, which illustrate the model building process for ML using CT. To say
we are building Bayesian models means we are constructing the two P arrows, PH and
S, corresponding to (1) the prior probability, and (2) the sampling distribution of the
diagram in Figure 2. The sampling distribution will generally be a composite of several
simple P arrows. We start with the nonparametric models which are in a modeling sense
more basic than the parametric models involving a fixed finite number of parameters to be
determined. The inference maps I for all of the models will be constructed in Section 8.

7.1 Nonparametric Models


In estimation problems where the unknown quantity of interest is a function $f \colon X \to Y$,
our hypothesis space $H$ will be the function space $Y^X$. However, simply expressing the
hypothesis space as $Y^X$ appears untenable because, in supervised learning, we never
measure $Y^X$ directly, but only measure a finite number of sampling points $\{(x_i, y_i)\}_{i=1}^{N}$
satisfying some measurement model such as $y_i = f(x_i) + \epsilon$ where $f$ is an “ideal” function
we seek to determine.
With precise knowledge of the input state $x$ and assuming a generic stochastic process
$P \colon 1 \to Y^X$, we are led to propose either the left δ_x ◯⋉ P or right δ_x ◯⋊ P tensor product
as a prior on the hypothesis space $X \otimes Y^X$. However, when one of the components in
a left or right tensor product is a Dirac measure, then both the left and right tensors
coincide and the choice of right or left tensor is irrelevant. In this case, we denote the
common probability measure by $\delta_x \otimes P$. Moreover, a simple calculation shows the prior
$\delta_x \otimes P = \Gamma_P(\cdot \mid x)$, the graph of $P$ at $x$. Thus our proposed model, in analogy to the
generic Bayesian model, is given by the diagram in Figure 17.17

[Diagram: $\Gamma_P(\cdot \mid x) \colon 1 \to X \otimes Y^X$; $d \colon 1 \to X \otimes Y$, where $d$ is measurement data; $S \colon X \otimes Y^X \to X \otimes Y$.]

Figure 17: The generic nonparametric Bayesian model for stochastic processes.

By a nonparametric (Bayesian) model, we mean any model which fits into the scheme
of Figure 17. For all of our analysis purposes we take $P \sim \mathcal{GP}(m, k)$. A data measurement
$d$, corresponding to a collection of sample data $\{x_i, y_i\}$ is, in ML applications, generally
taken as a Dirac measure, $d = \delta_{(x,y)}$. As in all Bayesian problems, the measurement data
$\{x_i, y_i\}_{i=1}^{N}$ can be analyzed either sequentially or as a single batch of data. For analysis
purposes in Section 8, we consider the data one point at a time (sequentially).

7.1.1 Noise Free Measurement Model


In the noise free measurement model, we make the hypothesis that the data we observe,
consisting of input output pairs $(x_i, y_i) \in X \times Y$, satisfies the condition that $y_i = f(x_i)$
where $f$ is the unknown function we are seeking to estimate. While the actual measured
data will generally not satisfy this hypothesis, this model serves both as an idealization
and a building block for the subsequent noisy measurement model.
Using the fundamental maps $\Gamma_\cdot \colon Y^X \to (X \otimes Y)^X$ and $ev \colon X \otimes (X \otimes Y)^X \to X \otimes Y$ gives a
sequence of measurable maps which determine corresponding deterministic P maps. This
composite, shown in Figure 18, is our noise free sampling distribution.

[Diagram: $X \otimes Y^X \xrightarrow{1 \otimes \delta_\Gamma} X \otimes (X \otimes Y)^X \xrightarrow{\delta_{ev}} X \otimes Y$, with composite $S_{nf}$.]

Figure 18: The noise free sampling distribution $S_{nf}$.

This deterministic sampling distribution is given by the calculation of the composition,
i.e., evaluating the integral

\[ \begin{aligned}
S_{nf}(U \mid (x, f)) &= \int_{(u,g) \in X \otimes (X \otimes Y)^X} (\delta_{ev})_U(u, g) \, d(1 \otimes \delta_\Gamma)_{(x,f)} \quad \text{for } U \in \Sigma_{X \otimes Y} \\
&= (\delta_{ev})_U(x, \Gamma_f) \\
&= \delta_{\Gamma_f(x)}(U) \\
&= \delta_{(x, f(x))}(U) \\
&= \delta_{\Gamma_x(ev_x(⌜f⌝))}(U) \quad \text{because } (x, f(x)) = \Gamma_x(ev_x(f)) \\
&= \chi_U(\Gamma_x(ev_x(f))) \\
&= \chi_{ev_x^{-1}(\Gamma_x^{-1}(U))}(f).
\end{aligned} \]

Using the commutativity of Figure 10, the noise free sampling distribution can also be
written as $S_{nf}(U \mid (x, f)) = \chi_{\Gamma_\cdot^{-1}(\hat{ev}_x^{-1}(U))}(f)$.
17
It would be interesting to analyze the more general case where there is uncertainty in the input state
also and take the prior as Q ◯⋊ P or Q ◯⋉ P for some measure Q on X.
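Precomposing the sampling distribution with this prior probability measure gives the
composite

\[ \begin{aligned}
(S_{nf} \circ \Gamma_P(\cdot \mid x))(U) &= \int_{(u,f) \in X \otimes Y^X} S_{nf}(U \mid (u, f)) \, d\underbrace{(\Gamma_P(\cdot \mid x))}_{= P\Gamma_x^{-1}} \quad \text{for } U \in \Sigma_{X \otimes Y} \\
&= \int_{f \in Y^X} S_{nf}(U \mid \Gamma_x(f)) \, dP \\
&= \int_{f \in Y^X} \chi_{\Gamma_\cdot^{-1}(\hat{ev}_x^{-1}(U))}(f) \, dP \\
&= P(\Gamma_\cdot^{-1}(\hat{ev}_x^{-1}(U))).
\end{aligned} \tag{34} \]

By the relation $\Gamma_x \circ ev_x = \hat{ev}_x \circ \Gamma_\cdot$ this can also be written as

\[ (S_{nf} \circ \Gamma_P(\cdot \mid x))(U) = P(ev_x^{-1}(\Gamma_x^{-1}(U))). \]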

Given that the probability measure $P$ is specified as a Gaussian process (which is
defined in terms of how it restricts to finite subspaces $X_0 \subset X$), for computational purposes
we need to consider the push forward probability measure of $P$ on $Y^X$ to $Y^{X_0}$ as
in Figure 14. Taking the special case with $X_0 = \{x\}$, the pushforward corresponds to
composition with the deterministic projection map $\delta_{ev_x}$. Starting with the diagram of
Figure 10, precomposing with $P$ and postcomposing with the deterministic map $\delta_{\pi_Y} \circ \delta_\iota$
gives the diagram in Figure 19. Then we can use the fact that $P$ projected onto any coordinate
is a Gaussian distribution to compute the likelihood that a measurement will occur in a
measurable set $B \subset Y$.

[Diagram: $1 \xrightarrow{P \sim \mathcal{GP}(m,k)} Y^X \xrightarrow{\Gamma_\cdot} (X \otimes Y)^X \xrightarrow{\hat{ev}_x} X \otimes Y$; $\delta_{ev_x} \colon Y^X \to Y$; $\delta_{\pi_Y} \colon X \otimes Y \to Y$; composite $P ev_x^{-1} \sim \mathcal{N}(m(x), k(x, x)) \colon 1 \to Y$.]

Figure 19: The distribution $P \sim \mathcal{GP}(m, k)$ can be evaluated on rectangles $U = A \times B$ by
projecting onto the given $x$ coordinate.

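Under this assumption $P \sim \mathcal{GP}(m, k)$ the expected value of the probability measure
$(S_{nf} \circ \Gamma_P(\cdot \mid x))$ on the real vector space $X \otimes Y$ is

\[ \begin{aligned}
E_{(S_{nf} \circ \Gamma_P(\cdot \mid x))}[Id_{X \otimes Y}] &= \int_{(u,v) \in X \otimes Y} (u, v) \, d(P \Gamma_\cdot^{-1} \hat{ev}_x^{-1}) \\
&= \int_{g \in (X \otimes Y)^X} \hat{ev}_x(g) \, d(P \Gamma_\cdot^{-1}) \\
&= \int_{f \in Y^X} \hat{ev}_x(\Gamma_\cdot(f)) \, dP \\
&= \int_{f \in Y^X} (x, f(x)) \, dP \\
&= (x, m(x)),
\end{aligned} \]

where the last equation follows because on the two components of the vector valued
integral, $\int_{f \in Y^X} f(x) \, dP = m(x)$ and $\int_{f \in Y^X} x \, dP = x$ as the integrand is constant. The
variance is18

\[ E_{(S_{nf} \circ \Gamma_P(\cdot \mid x))}\big[(Id_{X \otimes Y} - E_{(S_{nf} \circ \Gamma_P(\cdot \mid x))}[Id_{X \otimes Y}])^2\big] = E_{(S_{nf} \circ \Gamma_P(\cdot \mid x))}\big[(Id_{X \otimes Y} - (x, m(x)))^2\big], \]

which when expanded gives

\[ \begin{aligned}
&= \int_{(u,v) \in X \otimes Y} (Id_{X \otimes Y} - (x, m(x)))^2 (u, v) \, d(P \Gamma_\cdot^{-1} \hat{ev}_x^{-1}) \\
&= \int_{f \in Y^X} (Id_{X \otimes Y} - (x, m(x)))^2 \big(\underbrace{\hat{ev}_x(\Gamma_\cdot(f))}_{= (x, f(x))}\big) \, dP \\
&= \int_{f \in Y^X} \big((x - x)^2, (f(x) - m(x))^2\big) \, dP \\
&= (0, k(x, x)).
\end{aligned} \]

Consequently this sampling distribution, together with the prior distribution δ_x ◯⋊ P =
$\Gamma_P(\cdot \mid x)$, provides what we expect of such a model.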

7.1.2 Gaussian Additive Measurement Noise Model


Additive noise measurement models are often expressed by the simple expression

\[ z = y + \epsilon \tag{35} \]

where $y$ represents the state while the $\epsilon$ term itself represents a normally distributed
random variable with zero mean and variance $\sigma^2$. In categorical terms this expression
corresponds to the map in Figure 20.

[Diagram: $M_y \sim \mathcal{N}(y, \sigma^2) \colon 1 \to Y$.]

Figure 20: The additive Gaussian noise measurement model.

Because the state $y$ in Equation 35 is arbitrary, this additive noise model is representative
of the P map $M \colon Y \to Y$ defined by

\[ M(B \mid y) = M_y(B) \qquad \forall y \in Y,\ \forall B \in \Sigma_Y. \]

18
The squaring operator in the variance is defined component wise on the vector space X ⊗ Y.

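Given a GP $P \sim \mathcal{GP}(f, k)$ on $Y^X$, it follows that for any $x \in X$, $P ev_x^{-1} \sim \mathcal{N}(f(x), k(x, x))$
and for any $B \in \Sigma_Y$, the composition

\[ 1 \xrightarrow{\ P ev_x^{-1} \sim \mathcal{N}(f(x), k(x,x))\ } Y \xrightarrow{\ M\ } Y \]

is

\[ \begin{aligned}
(M \circ P ev_x^{-1})(B) &= \int_{u \in Y} M_B(u) \, d(P ev_x^{-1}) \\
&= \int_{u \in Y} \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \int_{v \in B} e^{-\frac{(v-u)^2}{2\sigma^2}} \, dv \Big) \, d(P ev_x^{-1}) \\
&= \frac{1}{\sqrt{2\pi k(x,x)}} \int_{u \in Y} \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \int_{v \in B} e^{-\frac{(v-u)^2}{2\sigma^2}} \, dv \Big) e^{-\frac{(u - f(x))^2}{2 k(x,x)}} \, du \\
&= \frac{1}{2\pi \sigma \sqrt{k(x,x)}} \int_{v \in B} \int_{u \in Y} e^{-\frac{(v-u)^2}{2\sigma^2}} e^{-\frac{(u - f(x))^2}{2 k(x,x)}} \, du \, dv \\
&= \frac{1}{\sqrt{2\pi (k(x,x) + \sigma^2)}} \int_{v \in B} e^{-\frac{(v - f(x))^2}{2 (k(x,x) + \sigma^2)}} \, dv.
\end{aligned} \]

Thus this composite is the normal distribution

\[ M \circ P ev_x^{-1} \sim \mathcal{N}(f(x), k(x, x) + \sigma^2) \colon 1 \to Y. \tag{36} \]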
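Equation 36 is the statement that convolving two Gaussian densities adds their variances. A rough numerical check of the inner integral over $u$ is sketched below, assuming hypothetical values for $f(x)$, $k(x, x)$, and $\sigma^2$ and a simple Riemann sum on a wide grid; it is a sanity check under those assumptions rather than a proof.

```python
import numpy as np


def normal_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return np.exp(-(v - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


# Hypothetical values for f(x), k(x, x) and sigma^2.
fx, kxx, sigma2 = 1.0, 0.5, 0.2

# Integration grid for the state variable u.
u = np.linspace(fx - 12.0, fx + 12.0, 20001)
du = u[1] - u[0]


def composite_pdf(v):
    """Density of M o P ev_x^{-1} at v, integrating out u numerically."""
    integrand = normal_pdf(v, u, sigma2) * normal_pdf(u, fx, kxx)
    return float(np.sum(integrand) * du)


v_grid = np.linspace(-2.0, 4.0, 13)
numeric = np.array([composite_pdf(v) for v in v_grid])
closed_form = normal_pdf(v_grid, fx, kxx + sigma2)
```

On this grid `numeric` and `closed_form` agree to well within the tolerance one would expect from the discretization.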
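More generally we have the commutative P diagram given in Figure 21, where, for all
$f \in Y^X$,

\[ N_f \sim \mathcal{GP}(f, k_N), \qquad k_N(x, x') = \begin{cases} \sigma^2 & \text{iff } x = x' \\ 0 & \text{otherwise.} \end{cases} \tag{37} \]

[Diagram: $P \sim \mathcal{GP}(f, k) \colon 1 \to Y^X$; $N \colon Y^X \to Y^X$; $\delta_{ev_x} \colon Y^X \to Y$ from each copy of $Y^X$; $P ev_x^{-1} \sim \mathcal{N}(f(x), k(x, x)) \colon 1 \to Y$; $M \colon Y \to Y$.]

Figure 21: Construction of the generic Markov kernel $N$ for modeling the Gaussian additive
measurement noise.

The commutativity of the right hand square in Figure 21 follows from

\[ \begin{aligned}
(\delta_{ev_x} \circ N)(B \mid f) &= \int_{g \in Y^X} (\delta_{ev_x})_B(g) \, dN_f \\
&= N_f(ev_x^{-1}(B)) \\
&= N_{f(x)}(B) \\
&= \int_{y \in Y} M_B(y) \, d\underbrace{(\delta_{ev_x})_f}_{= \delta_{f(x)}} \\
&= (M \circ \delta_{ev_x})(B \mid f).
\end{aligned} \]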

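With this Gaussian additive noise measurement model $N$ our sampling distribution
$S_{nf}$ can easily be modified by incorporating the additional map $N$ into the sequence in
Figure 18 to yield the Gaussian additive noise sampling distribution model $S_n$ shown in
Figure 22.

[Diagram: $X \otimes Y^X \xrightarrow{1_X \otimes N} X \otimes Y^X \xrightarrow{1_X \otimes \delta_{\Gamma_\cdot}} X \otimes (X \otimes Y)^X \xrightarrow{\delta_{ev}} X \otimes Y$, with composite $S_n$.]

Figure 22: The sampling distribution model in P with additive Gaussian noise.

Here $1_X \otimes N$ is, by the definition given in Section 5.2,

\[ (1_X \otimes N)(U \mid (x, f)) = N(\Gamma_x^{-1}(U) \mid f) \]

so the nondeterministic noisy sampling distribution is given by

\[ \begin{aligned}
S_n(U \mid (x, f)) &= (S_{nf} \circ (1 \otimes N))(U \mid (x, f)) \quad \text{for } U \in \Sigma_{X \otimes Y} \\
&= \int_{(u,g) \in X \otimes Y^X} (S_{nf})_U(u, g) \, d(N(\Gamma_x^{-1}(\cdot) \mid f)) \\
&= \int_{g \in Y^X} (S_{nf})_U(\Gamma_x(g)) \, dN(\cdot \mid f) \\
&= \int_{g \in Y^X} (S_{nf})(U \mid (x, g)) \, dN(\cdot \mid f) \\
&= \int_{g \in Y^X} \chi_{\Gamma_\cdot^{-1}(\hat{ev}_x^{-1}(U))}(g) \, dN(\cdot \mid f) \\
&= N(\Gamma_\cdot^{-1}(\hat{ev}_x^{-1}(U)) \mid f) \\
&= N(ev_x^{-1}(\Gamma_x^{-1}(U)) \mid f) \\
&= N_f ev_x^{-1}(\Gamma_x^{-1}(U)).
\end{aligned} \tag{38} \]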

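Just as we did for the GP $P \colon 1 \to Y^X$ in Figure 19, each GP $N_f$ can be analyzed by
its push forward measures onto any coordinate $x \in X$ to obtain the diagram in Figure 23.

[Diagram: $1 \xrightarrow{N_f \sim \mathcal{GP}(f, k_N)} Y^X \xrightarrow{\delta_{\Gamma_\cdot}} (X \otimes Y)^X \xrightarrow{\delta_{\hat{ev}_x}} X \otimes Y$; $\delta_{ev_x} \colon Y^X \to Y$; $\delta_{\pi_Y} \colon X \otimes Y \to Y$; composite $N_f(ev_x^{-1}(\cdot)) \sim \mathcal{N}(f(x), \sigma^2) \colon 1 \to Y$.]

Figure 23: The GP $N_f$ can be evaluated on rectangles $U = A \times B$ by projecting onto the
given $x$ coordinate.

Taking $U$ as a rectangle, $U = A \times B$, with $A \in \Sigma_X$ and $B \in \Sigma_Y$, the likelihood that a
measurement will occur in the rectangle conditioned on $(x, f)$ is given by

\[ \begin{aligned}
S_n(A \times B \mid (x, f)) &= N_f ev_x^{-1}(\Gamma_x^{-1}(A \times B)) \\
&= \delta_x(A) \cdot N_f ev_x^{-1}(B) \\
&= \delta_x(A) \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \int_{y \in B} e^{-\frac{(y - f(x))^2}{2\sigma^2}} \, dy.
\end{aligned} \]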
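Using the associativity property of categories, from Figure 22 with a prior $\Gamma_P(\cdot \mid x)$
on $X \otimes Y^X$, the composite $S_n \circ \Gamma_P(\cdot \mid x)$ can be decomposed as

\[ S_n \circ \Gamma_P(\cdot \mid x) = S_{nf} \circ ((1_X \otimes N) \circ \Gamma_P(\cdot \mid x)) \]

while the term $((1_X \otimes N) \circ \Gamma_P(\cdot \mid x)) = \Gamma_{N \circ P}(\cdot \mid x)$ follows from the commutativity of the
diagram in Figure 24, where, as shown in Equation 36, $M \circ P ev_x^{-1} \sim \mathcal{N}(m(x), k(x, x) + \sigma^2)$
which implies $N \circ P \sim \mathcal{GP}(m, k + k_N)$.

[Diagram: commutative diagram relating $\delta_x \colon 1 \to X$, $\Gamma_P(\cdot \mid x) \colon 1 \to X \otimes Y^X$, and $P \colon 1 \to Y^X$ through the projections $\delta_{\pi_X}$ and $\delta_{\pi_{Y^X}}$, followed by the maps $1_X$, $1_X \otimes N$, and $N$, with composite $\Gamma_{N \circ P}(\cdot \mid x) \colon 1 \to X \otimes Y^X$.]

Figure 24: The composite of the prior and noise measurement model is the graph of a GP
at $x$.

Using the fact $S_n \circ \Gamma_P(\cdot \mid x) = S_{nf} \circ \Gamma_{N \circ P}(\cdot \mid x)$, the expected value of the composite
$S_n \circ \Gamma_P(\cdot \mid x)$ is readily shown to be

\[ E_{(S_n \circ \Gamma_P(\cdot \mid x))}[Id_{X \otimes Y}] = (x, m(x)) \]

while the variance is

\[ E_{(S_n \circ \Gamma_P(\cdot \mid x))}\big[(Id_{X \otimes Y} - E_{(S_n \circ \Gamma_P(\cdot \mid x))}[Id_{X \otimes Y}])^2\big] = (0, k(x, x) + \sigma^2). \]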

7.2 Parametric Models


A parametric model can be thought of as carving out a subset of $Y^X$ specifying the form
of functions which one wants to consider as valid hypotheses. With this in mind, let us
define a p-dimensional parametric map as a measurable function

\[ i \colon \mathbb{R}^p \to Y^X \]

where $\mathbb{R}^p$ has the product σ-algebra with respect to the canonical projection maps onto
the measurable space $\mathbb{R}$ with the Borel σ-algebra. Note that $i(a) \in Y^X$ corresponds (via
the SMwCC structure) to a function $i(a) \colon X \to Y$.19 This parametric map $i$ determines
the deterministic P arrow $\delta_i \colon \mathbb{R}^p \to Y^X$, which in turn determines the deterministic tensor
product arrow $1_X \otimes \delta_i \colon X \otimes \mathbb{R}^p \to X \otimes Y^X$. This arrow serves as a bridge connecting the
two forms of Bayesian models, the parametric and nonparametric models.


A parametric model consists of a parametric mapping combined with a nonparametric
noisy measurement model Sn with prior (1X ⊗δi )○ΓP (⋅ ∣ x) to give the diagram in Figure 25
and we define a parametric Bayesian model as any model which fits into the scheme of
Figure 25.

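[Diagram: $\Gamma_P(\cdot \mid x) \colon 1 \to X \otimes \mathbb{R}^p$; $d \colon 1 \to X \otimes Y$; $X \otimes \mathbb{R}^p \xrightarrow{1_X \otimes \delta_i} X \otimes Y^X \xrightarrow{S_n} X \otimes Y$.]

Figure 25: The generic parametric Bayesian model.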

In the ML literature, one generally assumes complete certainty with regards to the
input state x ∈ X. However, there are situations in which complete knowledge of the
input state x is itself uncertain. This occurs in object recognition problems where x is
a feature vector which may be only partially observed because of obscuration and such
data is the only training data available.
For real world modeling applications there must be a noise model component asso-
ciated with a parametric model for it to make sense. For example we could estimate
an unknown function as a constant function, and hence have the 1 parameter model
i∶ R → Y X given by i(a) = a, the constant function on X with value a. Despite how crude
this approximation may be, we can still obtain a “best” such Bayesian approximation to
the function given measurement data where “best” is defined in the Bayesian probabilis-
tic sense - given a prior and a measurement the posterior gives the best estimate under
19
Note that the function i(a) is unique by our construction of the transpose of the function i(a) ∈ Y X .
The non-uniqueness aspect of the SMwCC structure only arises in the other direction - given a conditional
probability measure there may be multiple functions satisfying the required commutativity condition.

the given modeling assumptions. Without a noise component, however, we cannot even
account for the fact our data is different than our model which, for analysis and prediction
purposes, is a worthless model.

Example 14. Affine Parametric Model. Let $X = \mathbb{R}^n$ and $p = n + 1$. The affine parametric model is given by considering the valid hypotheses to consist of affine functions

$$F_a : X \to Y, \qquad x \mapsto \sum_{j=1}^{n} a_j x_j + a_{n+1} \tag{39}$$

where $x = (x_1, x_2, \ldots, x_n) \in X$, the ordered $(n+1)$-tuple $a = (a_1, \ldots, a_n, a_{n+1}) \in \mathbb{R}^{n+1}$ are fixed parameters so $F_a \in Y^X$, and the parametric map

$$i : \mathbb{R}^{n+1} \to Y^X, \qquad a \mapsto i(a) = F_a$$

specifies the subset of all possible affine models $F_a$.


In particular, if $n = 2$ and the test data consist of two data classes, say with labels −1 and 1, which are separable, then the coefficients $\{a_1, a_2, a_3\}$ specify the hyperplane separating the data points as shown in Figure 26.

[Figure 26 plot, titled "Separating Hyperplane": two classes of points in the plane separated by a line.]

Figure 26: An affine model suffices for separable data.

In this particular example where the class labels are integer valued, the resulting
function we are estimating will not be integer valued but, as usual, approximated by real
values.

Such parametric models are useful to avoid overfitting data because the number of parameters is finite and fixed with respect to the number of measurements, in contrast to nonparametric methods in which each measurement serves as a parameter defining the updated probability measure on $Y^X$.
More generally, for any parametric map $i$ take the canonical basis vectors $e_j$, where $e_j$ is the $j$th unit vector in $\mathbb{R}^p$, and let the image of the basis elements $\{e_j\}_{j=1}^{p}$ under the parametric map $i$ be $i(e_j) = f_j \in Y^X$. Because $Y^X$ forms a real vector space under pointwise addition and scalar multiplication, $(f + g)(x) = f(x) + g(x)$ and $(\alpha f)(x) = \alpha(f(x))$ for all $f, g \in Y^X$, $x \in X$, and $\alpha \in \mathbb{R}$, we observe that the “image carved out” by the parametric map $i$ is just the span of the images $\{f_j\}_{j=1}^{p}$ of the basis elements. In the above example with $n = 2$, $f_j = \pi_j$ for $j = 1, 2$, where $\pi_j$ is the canonical projection map $\mathbb{R}^2 \to \mathbb{R}$, and $f_3 = 1$, the constant function with value 1 on all points $x \in X$. Thus the image is as specified by Equation 39.
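The span construction can be made concrete with a small sketch. The helper name `parametric_map` and the representation of hypotheses as Python callables are illustrative assumptions, not notation from the text; the sketch only shows how a linear parametric map is determined by the images $f_j$ of the basis vectors.

```python
from typing import Callable, Sequence

Point = Sequence[float]
Hypothesis = Callable[[Point], float]


def parametric_map(basis: Sequence[Hypothesis]) -> Callable[[Sequence[float]], Hypothesis]:
    """Build i : R^p -> Y^X with i(e_j) = basis[j], extended linearly (hypothetical helper)."""

    def i(a: Sequence[float]) -> Hypothesis:
        if len(a) != len(basis):
            raise ValueError("expected one coefficient per basis function")
        coeffs = tuple(a)

        def F_a(x: Point) -> float:
            # Pointwise linear combination: F_a(x) = sum_j a_j * f_j(x).
            return sum(a_j * f_j(x) for a_j, f_j in zip(coeffs, basis))

        return F_a

    return i


# Affine model of Example 14 with n = 2: f_1 = pi_1, f_2 = pi_2, f_3 = constant 1.
affine_basis = (lambda x: x[0], lambda x: x[1], lambda x: 1.0)
affine_map = parametric_map(affine_basis)
```

For instance, `affine_map([2.0, -1.0, 0.5])` is the function $x \mapsto 2x_1 - x_2 + 0.5$, an element of the span of $\{\pi_1, \pi_2, 1\}$.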
Example 15. Elliptic Parametric Model. When the data is not linearly separable as in the previous example, but rather of the form shown in Figure 27, then a higher order parametric model is required.

[Figure 27 plot, titled "Separating Ellipse": one class of points enclosed by an ellipse, the other class outside it.]

Figure 27: An elliptic parametric model suffices to separate the data.

Taking $X = \mathbb{R}^n$ and $p = n^2 + n + 1$, the elliptic parametric model is given by considering the valid hypotheses to consist of all elliptic functions

$$F_a : X \to Y, \qquad x \mapsto \sum_{j=1}^{n} a_j x_j + \sum_{j=1}^{n} \sum_{k=1}^{n} a_{n+n(j-1)+k}\, x_j x_k + a_{n^2+n+1} \tag{40}$$

where $x = (x_1, x_2, \ldots, x_n) \in X$, the ordered $(n^2+n+1)$-tuple $a = (a_1, \ldots, a_{n^2+n+1}) \in \mathbb{R}^{n^2+n+1}$ are fixed parameters so $F_a \in Y^X$, and the parametric map

$$i : \mathbb{R}^{n^2+n+1} \to Y^X, \qquad a \mapsto i(a) = F_a$$

specifies the subset of all possible elliptic models $F_a$.


With this model the linearly nonseparable data becomes separable. This is the basic
idea behind support vector machines (SVMs): simply embed the data into a higher order
space where it can be (approximately) separated by a higher order parametric model.
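As a minimal sketch of Equation 40, the function below evaluates $F_a$ using the coefficient indexing $a_{n+n(j-1)+k}$, shifted to Python's 0-based indexing. The function name `elliptic_model` and the specific circle used in the usage note are illustrative assumptions.

```python
from typing import Callable, Sequence


def elliptic_model(a: Sequence[float], n: int) -> Callable[[Sequence[float]], float]:
    """Return F_a of Equation 40 for a parameter vector a of length n^2 + n + 1."""
    if len(a) != n * n + n + 1:
        raise ValueError("a must have n^2 + n + 1 entries")

    def F_a(x: Sequence[float]) -> float:
        linear = sum(a[j] * x[j] for j in range(n))
        # The 1-based index n + n(j-1) + k becomes n + n*j + k for 0-based j, k.
        quadratic = sum(
            a[n + n * j + k] * x[j] * x[k] for j in range(n) for k in range(n)
        )
        constant = a[n * n + n]  # a_{n^2 + n + 1} in the 1-based convention
        return linear + quadratic + constant

    return F_a


# Unit circle centred at (1, 3): (x1 - 1)^2 + (x2 - 3)^2 - 1.
circle = elliptic_model([-2.0, -6.0, 1.0, 0.0, 0.0, 1.0, 9.0], n=2)
```

With this choice `circle(x)` is negative inside the circle, zero on it and positive outside, so the sign of $F_a$ separates data of the kind shown in Figure 27.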

Returning to the general construction of the Bayesian model for the parametric model, we take the Gaussian additive noise model, Equation 38, and expand the diagram in Figure 25 to the diagram in Figure 28, where the parametric model sampling distribution can be readily determined on rectangles $A \times B \in \Sigma_{X \otimes Y}$ by

$$\begin{aligned}
S(A \times B \mid (x, a)) &= N\big(\mathrm{ev}_x^{-1}(\Gamma_x^{-1}(A \times B)) \mid \underbrace{i(a)}_{=F_a}\big) \\
&= N_{F_a} \mathrm{ev}_x^{-1}\big(\Gamma_x^{-1}(A \times B)\big) \\
&= \delta_x(A) \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \int_{y \in B} e^{-\frac{(y - F_a(x))^2}{2\sigma^2}}\, dy.
\end{aligned}$$

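[Figure 28 diagram: $\Gamma_P(\cdot \mid x) : 1 \to X \otimes \mathbb{R}^p$, followed by the chain $X \otimes \mathbb{R}^p \to X \otimes Y^X \to X \otimes Y^X \to X \otimes (X \otimes Y)^X \to X \otimes Y$ with arrows $1_X \otimes \delta_i$, $1_X \otimes N$, $1_X \otimes \delta_\Gamma$ and $\delta_{\mathrm{ev}}$; the composite of the last three arrows is $S_n$.]

Figure 28: The parametric model sampling distribution as a composite of four components.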

Here we have used the fact $N_{F_a} \mathrm{ev}_x^{-1} \sim N(F_a(x), \sigma^2)$, which follows from Equation 37 and the property that a GP evaluated on any coordinate is a normal distribution with the mean and variance evaluated at that coordinate.
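The rectangle formula can be evaluated numerically when $B$ is an interval. The sketch below assumes $Y = \mathbb{R}$, represents the set $A$ by a membership predicate `in_A`, and takes the parametric map as a function `F` with `F(a)` returning $F_a$; all of these names are illustrative choices rather than constructions from the text.

```python
import math
from typing import Callable, Sequence, Tuple


def normal_cdf(t: float, mean: float, sigma: float) -> float:
    """Cumulative distribution function of N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sigma * math.sqrt(2.0))))


def sampling_probability(
    x: float,
    a: Sequence[float],
    in_A: Callable[[float], bool],
    B: Tuple[float, float],
    F: Callable[[Sequence[float]], Callable[[float], float]],
    sigma: float,
) -> float:
    """S(A x B | (x, a)) = delta_x(A) * N(F_a(x), sigma^2)(B) for an interval B = (lo, hi)."""
    lo, hi = B
    dirac = 1.0 if in_A(x) else 0.0  # delta_x(A)
    mean = F(a)(x)  # the noise is centred on the model value F_a(x)
    return dirac * (normal_cdf(hi, mean, sigma) - normal_cdf(lo, mean, sigma))


def affine_1d(a: Sequence[float]) -> Callable[[float], float]:
    """One-dimensional affine model F_a(x) = a[0] * x + a[1]."""
    return lambda x: a[0] * x + a[1]
```

With `x = 2.0`, `a = (1.0, 0.0)`, `sigma = 1.0` and `B = (1.0, 3.0)`, the result is the mass of a standard normal within one standard deviation of its mean whenever `x` lies in $A$, and zero otherwise.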

8 Constructing Inference Maps


We now proceed to construct the inference maps I for each of the models specified in the
previous section. This construction permits the updating of the GP prior distributions P
for the nonparametric models and the normal priors P on Rk for the parametric models
through the relation that the posterior measure is given by I ○ d, where d is a data
measurement. The resulting analysis produces the familiar updating rules for the mean
and covariance functions characterizing a GP.

8.1 The noise free inference map


Under a prior probability of the form $\delta_x \otimes P = \Gamma_P(\cdot \mid x)$ on the hypothesis space $X \otimes Y^X$, which is a one point measure with respect to the component $X$, the sampling distribution $S_{nf}$ in Figure 18 can be viewed as a family of deterministic $\mathcal{P}$ maps, one for each point $x \in X$.

[Figure 29 diagram: $S^x = \delta_{\mathrm{ev}_x} : Y^X \to Y$.]

Figure 29: The noise free sampling distributions $S^x$ given the prior $\delta_x \otimes P$ with the Dirac measure on the $X$ component.

Using the property that $\delta_{\mathrm{ev}_x}(B \mid f) = 1_{\mathrm{ev}_x^{-1}(B)}(f)$ for all $B \in \Sigma_Y$ and $f \in Y^X$, the resulting deterministic sampling distributions (one for each $x \in X$) are given by

$$S^x(B \mid f) = 1_{\mathrm{ev}_x^{-1}(B)}(f). \tag{41}$$

This special case of the prior δx ⊗P , which is the most important one for many ML applica-
tions and the one implicitly assumed in ML textbooks, permits a complete mathematical
analysis.
Given the probability measure $P \sim \mathcal{GP}(m, k)$ and $S^x = \delta_{\mathrm{ev}_x}$, it follows the composite is the pushforward probability measure

$$S^x \circ P = P\,\mathrm{ev}_x^{-1}, \tag{42}$$

which is the special case of Figure 14 with $X_0 = \{x\}$. Using the fact that $P$ projected onto any coordinate is a normal distribution as shown in Figure 30, it follows that the expected mean is

$$E_{P \mathrm{ev}_x^{-1}}(\mathrm{Id}_Y) = E_P(\mathrm{ev}_x) = m(x)$$

while the expected variance is

$$E_{P \mathrm{ev}_x^{-1}}\big((\mathrm{Id}_Y - E_{P \mathrm{ev}_x^{-1}}(\mathrm{Id}_Y))^2\big) = E_P\big((\mathrm{ev}_x - E_P(\mathrm{ev}_x))^2\big) = k(x, x).$$

These are precisely specified by the characterization $P\,\mathrm{ev}_x^{-1} \sim N(m(x), k(x, x))$.

[Figure 30 diagram: prior $P \sim \mathcal{GP}(m, k) : 1 \to Y^X$; sampling distribution $S^x = \delta_{\mathrm{ev}_x} : Y^X \to Y$; composite $P\,\mathrm{ev}_x^{-1} \sim N(m(x), k(x, x)) : 1 \to Y$; inference map $I^x : Y \to Y^X$.]

Figure 30: The composite of the prior distribution $P \sim \mathcal{GP}(m, k)$ and the sampling distribution $S^x$ gives the coordinate projections as priors on $Y$.

Recall that the corresponding inference map $I^x$ is any $\mathcal{P}$ map satisfying the necessary and sufficient condition of Equation 11, i.e., for all $A \in \Sigma_{Y^X}$ and $B \in \Sigma_Y$,

$$\int_{f \in A} S^x(B \mid f)\, dP = \int_{y \in B} I^x(A \mid y)\, d(P\,\mathrm{ev}_x^{-1}). \tag{43}$$

Since the σ-algebra of $Y^X$ is generated by elements $\mathrm{ev}_z^{-1}(A)$, for $z \in X$ and $A \in \Sigma_Y$, we can take $A = \mathrm{ev}_z^{-1}(A)$ in the above expression to obtain the equivalent necessary and sufficient condition on $I^x$ of

$$\int_{f \in \mathrm{ev}_z^{-1}(A)} S^x(B \mid f)\, dP = \int_{y \in B} I^x(\mathrm{ev}_z^{-1}(A) \mid y)\, d(P\,\mathrm{ev}_x^{-1}).$$

From Equation 41, $S^x(B \mid f) = 1_{\mathrm{ev}_x^{-1}(B)}(f)$, so substituting this value into the left hand side of this equation reduces that term to $P(\mathrm{ev}_x^{-1}(B) \cap \mathrm{ev}_z^{-1}(A))$. Rearranging the order of the terms it follows the condition on the inference map $I^x$ is

$$\int_{y \in B} I^x(\mathrm{ev}_z^{-1}(A) \mid y)\, d(P\,\mathrm{ev}_x^{-1}) = P(\mathrm{ev}_x^{-1}(B) \cap \mathrm{ev}_z^{-1}(A)).$$

Since the left hand side of this expression is integrated with respect to the pushforward probability measure $P\,\mathrm{ev}_x^{-1}$ it is equivalent to

$$\begin{aligned}
\int_{y \in B} I^x(\mathrm{ev}_z^{-1}(A) \mid y)\, d(P\,\mathrm{ev}_x^{-1}) &= \int_{f \in \mathrm{ev}_x^{-1}(B)} I^x(\mathrm{ev}_z^{-1}(A) \mid \mathrm{ev}_x(f))\, dP \\
&= \int_{f \in \mathrm{ev}_x^{-1}(B)} I^x \mathrm{ev}_z^{-1}(A \mid \mathrm{ev}_x(f))\, dP.
\end{aligned}$$

In summary, if $I^x$ is to be an inference map for the prior $P$ and sampling distribution $S^x$, then it is necessary and sufficient that it satisfy the condition

$$\int_{f \in \mathrm{ev}_x^{-1}(B)} I^x \mathrm{ev}_z^{-1}(A \mid \mathrm{ev}_x(f))\, dP = P(\mathrm{ev}_x^{-1}(B) \cap \mathrm{ev}_z^{-1}(A)).$$

Given a (deterministic) measurement20 at $(x, y)$, the stochastic process $I^x(\cdot \mid y) : 1 \to Y^X$ is the posterior of $P \sim \mathcal{GP}(m, k)$. This posterior, denoted $P^1_{Y^X} \triangleq I^x(\cdot \mid y)$, is generally not unique. However we can require that the posterior $P^1_{Y^X}$ be a GP specified by updated mean and covariance functions $m^1$ and $k^1$ respectively, which depend upon the conditioning value $y$, so $P^1_{Y^X} \sim \mathcal{GP}(m^1, k^1)$. To determine $P^1_{Y^X}$, and hence the desired inference map $I^x$, we make a hypothesis about the updated mean and covariance functions $m^1$ and $k^1$ characterizing $P^1_{Y^X}$ given a measurement at the pair $(x, y) \in X \times Y$. Let us assume the updated mean function is of the form

$$m^1(z) = m(z) + \frac{k(z, x)}{k(x, x)}\,(y - m(x)) \tag{44}$$

and the updated covariance function is of the form

$$k^1(w, z) = k(w, z) - \frac{k(w, x)\,k(x, z)}{k(x, x)}. \tag{45}$$

20 Meaning the arrow $d = \delta_y$ in Figure 17. In general it is unnecessary to assume deterministic measurements, in which case the composite $I^x \circ d$ represents the posterior.
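A minimal numerical sketch of Equations 44 and 45 follows. It assumes $X = Y = \mathbb{R}$ and uses a squared exponential kernel purely as an illustrative covariance function; the text does not prescribe this kernel, and the names `condition_on_point` and `squared_exponential` are hypothetical.

```python
import math
from typing import Callable, Tuple

Mean = Callable[[float], float]
Kernel = Callable[[float, float], float]


def squared_exponential(w: float, z: float, length: float = 1.0) -> float:
    """Illustrative covariance function (an assumption, not taken from the text)."""
    return math.exp(-0.5 * ((w - z) / length) ** 2)


def condition_on_point(m: Mean, k: Kernel, x: float, y: float) -> Tuple[Mean, Kernel]:
    """Return (m1, k1) of Equations 44 and 45 for a noise free measurement y at x."""
    kxx = k(x, x)
    if kxx <= 0.0:
        raise ValueError("k(x, x) must be positive to condition at x")

    def m1(z: float) -> float:
        return m(z) + k(z, x) / kxx * (y - m(x))

    def k1(w: float, z: float) -> float:
        return k(w, z) - k(w, x) * k(x, z) / kxx

    return m1, k1


m1, k1 = condition_on_point(lambda z: 0.0, squared_exponential, x=1.0, y=2.0)
```

In this sketch the updated mean interpolates the measurement, `m1(1.0) == 2.0`, and the updated variance vanishes at the measured point, `k1(1.0, 1.0) == 0.0`, as expected for a noise free observation.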

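To prove these updated functions suffice to specify the inference map $I^x(\cdot \mid y) = P^1_{Y^X} \sim \mathcal{GP}(m^1, k^1)$ satisfying the necessary and sufficient condition we simply evaluate

$$\int_{f \in \mathrm{ev}_x^{-1}(B)} I^x \mathrm{ev}_z^{-1}(A \mid \mathrm{ev}_x(f))\, dP$$

by substituting $I^x(\cdot \mid f(x)) = P^1_{Y^X} \sim \mathcal{GP}(m^1, k^1)$ and verify that it yields $P(\mathrm{ev}_x^{-1}(B) \cap \mathrm{ev}_z^{-1}(A))$. Since $I^x \mathrm{ev}_z^{-1}(\cdot \mid f(x)) = P^1_{Y^X} \mathrm{ev}_z^{-1}$ is a normal distribution of mean

$$m^1(z) = m(z) + \frac{k(z, x)}{k(x, x)} \cdot (f(x) - m(x))$$

and covariance

$$k^1(z, z) = k(z, z) - \frac{k(z, x)^2}{k(x, x)}$$

it follows that

$$\int_{f \in \mathrm{ev}_x^{-1}(B)} I^x \mathrm{ev}_z^{-1}(A \mid \mathrm{ev}_x(f))\, dP = \int_{f \in \mathrm{ev}_x^{-1}(B)} \left( \frac{1}{\sqrt{2\pi k^1(z, z)}} \int_{v \in A} e^{-\frac{(m^1(z) - v)^2}{2 k^1(z, z)}}\, dv \right) dP$$

which can be expanded to

$$\int_{f \in \mathrm{ev}_x^{-1}(B)} \left( \frac{1}{\sqrt{2\pi k^1(z, z)}} \int_{v \in A} e^{-\frac{\left(m(z) + \frac{k(z, x)}{k(x, x)}(f(x) - m(x)) - v\right)^2}{2 k^1(z, z)}}\, dv \right) dP$$

and equals

$$\int_{y \in B} \left( \frac{1}{\sqrt{2\pi k^1(z, z)}} \int_{v \in A} e^{-\frac{\left(m(z) + \frac{k(z, x)}{k(x, x)}(y - m(x)) - v\right)^2}{2 k^1(z, z)}}\, dv \right) d(P\,\mathrm{ev}_x^{-1}).$$

Using $P_{Y^X} \mathrm{ev}_x^{-1} \sim N(m(x), k(x, x))$ we can rewrite the expression as

$$\frac{1}{2\pi \sqrt{|\Omega|}} \int_{y \in B} \int_{v \in A} e^{-\frac{1}{2} (u - \bar{u})^T \Omega^{-1} (u - \bar{u})}\, dv\, dy$$

where

$$u = \begin{pmatrix} y \\ v \end{pmatrix}, \qquad \bar{u} = \begin{pmatrix} m(x) \\ m(z) \end{pmatrix}$$

and

$$\Omega = \begin{pmatrix} k(x, x) & k(x, z) \\ k(z, x) & k(z, z) \end{pmatrix},$$

which we recognize as a normal distribution $N(\bar{u}, \Omega)$.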
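On the other hand, we claim that the $\mathcal{P}$ arrow

$$P(\mathrm{ev}_x^{-1}(\cdot) \cap \mathrm{ev}_z^{-1}(\cdot)) : 1 \to Y_x \times Y_z,$$

where $Y_x$ and $Y_z$ are two copies of $Y$, is also a normal distribution of mean $\bar{u} = (m(x), m(z))$ with covariance matrix $\Omega$.21 To prove our claim consider the $\mathcal{P}$ diagram in Figure 31 where $X_0 = \{x, z\}$, $\iota : X_0 \hookrightarrow X$ is the inclusion map referenced in Section 6.2, and $\mathrm{ev}_x \times \mathrm{ev}_z$ is an isomorphism between the two different representations of the set of all measurable functions $Y^{X_0}$ alluded to in the second paragraph of Section 6.

[Figure 31 diagram: $P : 1 \to Y^X$; $\delta_{Y^\iota} : Y^X \to Y^{X_0}$; $\delta_{\mathrm{ev}_x \times \mathrm{ev}_z} : Y^{X_0} \to Y_x \times Y_z$; projections $\delta_{\pi_{Y_x}}$ and $\delta_{\pi_{Y_z}}$ to $Y_x$ and $Y_z$; arrows $\delta_{\mathrm{ev}_x}$ and $\delta_{\mathrm{ev}_z}$ into $Y_x$ and $Y_z$; marginals $N(m(x), k(x, x)) : 1 \to Y_x$ and $N(m(z), k(z, z)) : 1 \to Y_z$.]

Figure 31: Proving the joint distribution $\delta_{\mathrm{ev}_x \times \mathrm{ev}_z} \circ \delta_{Y^\iota} \circ P = P(\mathrm{ev}_x^{-1}(\cdot) \cap \mathrm{ev}_z^{-1}(\cdot))$ is a normal distribution $N(\bar{u}, \Omega)$.

21 Formally the arguments should be numbered in the given probability measure as $P(\mathrm{ev}_x^{-1}(\#1) \cap \mathrm{ev}_z^{-1}(\#2))$ because $\mathrm{ev}_x^{-1}(A) \cap \mathrm{ev}_z^{-1}(B) \neq \mathrm{ev}_x^{-1}(B) \cap \mathrm{ev}_z^{-1}(A)$. However the subscripts can be used to identify which component measurable sets are associated with each argument.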

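The diagram in Figure 31 commutes because $\delta_{\pi_{Y_x}} \circ \delta_{\mathrm{ev}_x \times \mathrm{ev}_z} = \delta_x$ and $\delta_{\pi_{Y_z}} \circ \delta_{\mathrm{ev}_x \times \mathrm{ev}_z} = \delta_z$ while, using $(\mathrm{ev}_x \times \mathrm{ev}_z) \circ Y^\iota = (\mathrm{ev}_x, \mathrm{ev}_z)$,

$$\begin{aligned}
(\delta_{\mathrm{ev}_x \times \mathrm{ev}_z} \circ \delta_{Y^\iota} \circ P)(A \times B) &= \int_{f \in Y^X} \underbrace{\delta_{(\mathrm{ev}_x, \mathrm{ev}_z)}(A \times B \mid f)}_{=(1_{A \times B})(\mathrm{ev}_x, \mathrm{ev}_z)(f)}\, dP \\
&= \int_{f \in Y^X} \big(1_{\mathrm{ev}_x^{-1}(A)} \cdot 1_{\mathrm{ev}_z^{-1}(B)}\big)(f)\, dP \\
&= \int_{f \in Y^X} 1_{\mathrm{ev}_x^{-1}(A) \cap \mathrm{ev}_z^{-1}(B)}(f)\, dP \\
&= P(\mathrm{ev}_x^{-1}(A) \cap \mathrm{ev}_z^{-1}(B)).
\end{aligned}$$

Moreover, the covariance $k$ of $P(\mathrm{ev}_x^{-1}(\cdot) \cap \mathrm{ev}_z^{-1}(\cdot))$ is represented by the matrix $\Omega$ because by definition of $P$, in terms of $m$ and $k$, its restriction to $Y^{X_0} \cong Y_x \times Y_z$ has covariance $k|_{X_0} \cong \Omega$.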
Consequently the necessary and sufficient condition for I x = PY1 X ∼ GP(m1 , k 1 ) to
be an inference map is satisfied by the projection of PY1 X onto any single coordinate z
which corresponds to the restriction of PY1 X via the deterministic map Y ι ∶ Y X → Y X0
with X0 = {z} as in Figure 14. But this procedure immediately extends to all finite
subsets X0 ⊂ X using matrix algebra and consequently we conclude that the necessary
and sufficient condition for I x to be an inference map for the prior P and the noise free
sampling distribution S x is satisfied.
Writing the prior GP as $P \sim \mathcal{GP}(m^0, k^0)$ the recursive updating equations are

$$m^{i+1}(z \mid (x_i, y_i)) = m^i(z) + \frac{k^i(z, x_i)}{k^i(x_i, x_i)}\,(y_i - m^i(x_i)) \quad \text{for } i = 0, \ldots, N-1 \tag{46}$$

and

$$k^{i+1}((w, z) \mid (x_i, y_i)) = k^i(w, z) - \frac{k^i(w, x_i)\,k^i(x_i, z)}{k^i(x_i, x_i)} \quad \text{for } i = 0, \ldots, N-1 \tag{47}$$

where the terms on the left denote the posterior mean and covariance functions of $m^i$ and $k^i$ given a new measurement $(x_i, y_i)$. These expressions coincide with the standard formulas written for $N$ arbitrary measurements $\{(x_i, y_i)\}_{i=0}^{N-1}$, with $X_0 = (x_0, \ldots, x_{N-1})$ a finite set of independent points of $X$ with corresponding measurements $y^T = (y_0, y_1, \ldots, y_{N-1})$,

$$\tilde{m}(z \mid X_0) = m(z) + K(z, X_0)\,K(X_0, X_0)^{-1}\,(y - m(X_0)) \tag{48}$$

where $m(X_0) = (m(x_0), \ldots, m(x_{N-1}))^T$, and

$$\tilde{k}((w, z) \mid X_0) = k(w, z) - K(w, X_0)\,K(X_0, X_0)^{-1}\,K(X_0, z) \tag{49}$$

where $K(w, X_0)$ is the row vector with components $k(w, x_i)$, $K(X_0, X_0)$ is the matrix with components $k(x_i, x_j)$, and $K(X_0, z)$ is a column vector with components $k(x_i, z)$.22 The notation $\tilde{m}$ and $\tilde{k}$ is used to differentiate these standard expressions from ours above. Equations 48 and 49 are a computationally efficient way to keep track of the updated mean and covariance functions. One can easily verify by induction that the recursive equations determine the standard equations.

22 When the points are not independent then one can use a perturbation method or other procedure to avoid degeneracy.
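The agreement between the recursive form (Equations 46 and 47) and the standard form (Equations 48 and 49) can be checked numerically for $N = 2$, where the matrix inverse is available in closed form. The sketch below assumes $X = Y = \mathbb{R}$ and an illustrative squared exponential kernel; the function names are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Mean = Callable[[float], float]
Kernel = Callable[[float, float], float]


def se_kernel(w: float, z: float) -> float:
    """Illustrative covariance function (an assumption, not taken from the text)."""
    return math.exp(-0.5 * (w - z) ** 2)


def update(m: Mean, k: Kernel, x: float, y: float) -> Tuple[Mean, Kernel]:
    """One step of Equations 46 and 47."""
    kxx = k(x, x)
    return (
        lambda z: m(z) + k(z, x) / kxx * (y - m(x)),
        lambda w, z: k(w, z) - k(w, x) * k(x, z) / kxx,
    )


def recursive_posterior(m: Mean, k: Kernel, data: Sequence[Tuple[float, float]]):
    """Apply the single-measurement update once per measurement."""
    for x, y in data:
        m, k = update(m, k, x, y)
    return m, k


def batch_posterior_two_points(m: Mean, k: Kernel, data: Sequence[Tuple[float, float]]):
    """Equations 48 and 49 for N = 2, using the explicit 2 x 2 inverse."""
    (x0, y0), (x1, y1) = data
    a, b, c, d = k(x0, x0), k(x0, x1), k(x1, x0), k(x1, x1)
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("K(X0, X0) is singular; the points are not independent")
    inv = ((d / det, -b / det), (-c / det, a / det))
    resid = (y0 - m(x0), y1 - m(x1))

    def quad(row, col):
        # row * K(X0, X0)^{-1} * col
        return sum(row[p] * inv[p][q] * col[q] for p in range(2) for q in range(2))

    m_tilde = lambda z: m(z) + quad((k(z, x0), k(z, x1)), resid)
    k_tilde = lambda w, z: k(w, z) - quad((k(w, x0), k(w, x1)), (k(x0, z), k(x1, z)))
    return m_tilde, k_tilde
```

Evaluating both posteriors at arbitrary test points should give the same values up to floating point error, which is the $N = 2$ case of the induction mentioned above.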
A review of the derivation of PY1 X indicates that the posterior PY1 X ∼ GP(m1 , k 1 ) is
actually parameterized by the measurement (x1 , y1 ) because the above derivation holds for
any measurement (x1 , y1 ) and this pair of values uniquely determines m1 and k 1 through
the Equations 46 and 47, or equivalently Equations 48 and 49, for a single measurement.
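By the SMwCC structure of $\mathcal{P}$ each parameterized GP $P^1_{Y^X}$ can be put into the bijective correspondence shown in Figure 32, where

$$\begin{aligned}
\overline{P^1_{Y^X}}(B \mid (z, (x, y))) &= P^1_{Y^X}(\mathrm{ev}_z^{-1}(B) \mid (x, y)) \qquad \forall B \in \Sigma_Y,\ z \in X,\ y \in Y \\
&= P^1_{Y^X} \mathrm{ev}_z^{-1}(B \mid (x, y)) \\
&= \frac{1}{\sqrt{2\pi k^1(z, z)}} \int_{v \in B} e^{-\frac{(v - m^1(z))^2}{2 k^1(z, z)}}\, dv \\
&= \frac{1}{\sqrt{2\pi\, \frac{k(x, x) k(z, z) - k(x, z)^2}{k(x, x)}}} \int_{v \in B} \exp\!\left( -\frac{\left(v - \left(m(z) + \frac{k(z, x)}{k(x, x)}(y - m(x))\right)\right)^2}{2\, \frac{k(x, x) k(z, z) - k(x, z)^2}{k(x, x)}} \right) dv
\end{aligned}$$

which is a probability measure on $Y$ conditioned on $z$ and parameterized by the pair $(x, y)$. Iterating this process we obtain the viewpoint that the parameterized process $P_{Y^X}(\mathrm{ev}_z^{-1}(B) \mid \{(x_i, y_i)\}_{i=1}^{N})$ is a posterior conditional probability parameterized over $N$ measurements.

[Figure 32 diagram: $P^1_{Y^X} : X \otimes Y \to Y^X$ corresponds to $\overline{P^1_{Y^X}} : X \otimes (X \otimes Y) \to Y$.]

Figure 32: Each GP $P^1_{Y^X}$, which is parameterized by a measurement $(x, y) \in X \otimes Y$, determines a conditional $\overline{P^1_{Y^X}}$.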

8.2 The noisy measurement inference map


When the measurement model has additive Gaussian noise which is iid on each slice
x ∈ X, the resulting inference map is easily given by observing that from Equation 36,
the composite δevx ○ N ○ P ∼ N (m(x), k(x, x) + kN (x, x)). Thus, the noisy sampling
distribution along with the prior P ∼ GP(m, k) can be viewed as a noise free distribution
P ∼ GP(m, κ) on Y X , where κ ≜ k + kN , and kN is given by equation 37. This is clear
from the composite of Figure 24 with the Dirac measure δx on the X component. Now
the noisy measurement inference map for the Bayesian model with prior P and sampling
distribution S x = δx ○ N , as shown in Figure 33, can be determined by decomposing it into
two simpler Bayesian problems whose inference maps are (1) trivial (the identity map)
and (2) already known.

[Figure 33 diagram, top: prior $P \sim N(m, k) : 1 \to Y^X$; $N : Y^X \to Y^X$; $\delta_{\mathrm{ev}_x} : Y^X \to Y$; composite sampling distribution $S^x$; $N \circ P : 1 \to Y^X$; $\delta_x \circ N \circ P \sim N(m(x), k(x, x) + k_N(x, x)) : 1 \to Y$, with $k(x, x) + k_N(x, x) = \kappa(x, x)$. Bottom, after decomposition: a left model with prior $P$, sampling distribution $N : Y^X \to Y^X$ and inference map $I_\star$, and a right model with prior $N \circ P$, sampling distribution $\delta_x : Y^X \to Y$ and inference map Inf.]

Figure 33: Splitting the Gaussian additive noise Bayesian model (top diagram) into two separate Bayesian models (bottom two diagrams) and composing the inference maps for these two simple Bayesian models gives the inference map for the original Gaussian additive Bayesian model.

Observe that the composition of the two bottom diagrams is the top diagram. The
bottom diagram on the right is a noise free Bayesian model with GP prior N ○ P and
sampling distribution δx whose inference map Inf we have already determined analytically
in Section 8.1. Given a measurement y ∈ Y at x ∈ X, the inference map is given by the
updating Equations 44 and 45 for the mean and covariance functions characterizing the GP
on Y X . The resulting posterior GP on Y X can then be viewed as a measurement on Y X for
the bottom left diagram, which is a Bayesian model with prior P and sampling distribution
N . The inference map I⋆ for this diagram is the identity map on Y X , I⋆ = δIdY X . This is
easy to verify using Bayes product rule (Equation 13), ∫a∈A N (B ∣ a) dP = ∫f ∈B δIdY X (A ∣
f ) d(N ○ P ), for any A, B ∈ ΣY X . Composition of these two inference maps, Inf and I⋆
then yields the resulting inference map for the Gaussian additive noise Bayesian model.
With this observation both of the recursive updating schemes given by Equations 46 and 47 are valid for the Gaussian additive noise model with $k$ replaced by $\kappa$. The corresponding standard expressions for the noisy model are then

$$\tilde{m}(z \mid X_0) = m(z) + K(z, X_0)\,K(X_0, X_0)^{-1}\,(y - m(X_0))$$

and

$$\tilde{\kappa}((w, z) \mid X_0) = \kappa(w, z) - K(w, X_0)\,K(X_0, X_0)^{-1}\,K(X_0, z),$$

where the quantities like $K(w, X_0)$ are as defined previously (following Equation 49) except now $k$ is replaced by $\kappa$. For $w \neq z$ and neither among the measurements $X_0$ these expressions, upon substituting in for $\kappa$, reduce to the familiar expressions

$$\tilde{m}(z \mid X_0) = m(z) + K(z, X_0)\,(K(X_0, X_0) + \sigma^2 I)^{-1}\,(y - m(X_0))$$

and

$$\tilde{k}((w, z) \mid X_0) = k(w, z) - K(w, X_0)\,(K(X_0, X_0) + \sigma^2 I)^{-1}\,K(X_0, z),$$

in which the quantities $K$ are again computed from $k$, and which provide a computationally efficient way to compute the mean and covariance of a GP given a finite number of measurements.
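The substitution of $\kappa$ for $k$ can be illustrated for a single measurement. The sketch assumes $X = Y = \mathbb{R}$, an illustrative squared exponential $k$, and a noise kernel of the form $k_N(w, z) = \sigma^2$ when $w = z$ and $0$ otherwise; this form of $k_N$ is an assumption made here to match the $\sigma^2 I$ term in the reduced expressions, and the helper names are hypothetical.

```python
import math
from typing import Callable, Tuple

Mean = Callable[[float], float]
Kernel = Callable[[float, float], float]


def se_kernel(w: float, z: float) -> float:
    """Illustrative prior covariance function (an assumption)."""
    return math.exp(-0.5 * (w - z) ** 2)


def with_iid_noise(k: Kernel, sigma: float) -> Kernel:
    """kappa = k + k_N, assuming k_N(w, z) = sigma^2 if w == z else 0."""
    return lambda w, z: k(w, z) + (sigma ** 2 if w == z else 0.0)


def condition_on_point(m: Mean, k: Kernel, x: float, y: float) -> Tuple[Mean, Kernel]:
    """Single-measurement update of Equations 44 and 45, applied to whichever kernel is passed."""
    kxx = k(x, x)
    return (
        lambda z: m(z) + k(z, x) / kxx * (y - m(x)),
        lambda w, z: k(w, z) - k(w, x) * k(x, z) / kxx,
    )


sigma = 0.5
kappa = with_iid_noise(se_kernel, sigma)
m1, kappa1 = condition_on_point(lambda z: 0.0, kappa, x=1.0, y=2.0)
```

Away from the measured point, `m1(z)` equals $m(z) + k(z, x)\,(k(x, x) + \sigma^2)^{-1}(y - m(x))$, which is the $N = 1$ case of the familiar noisy expression.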

8.3 The inference map for parametric models


Under the prior δx ⊗ P on the hypothesis space in the parametric model, Figure 28, the
parametric sampling distribution model can be viewed as a family of models, one for each
x ∈ X, given by the diagram in Figure 34.

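[Figure 34 diagram: $\delta_i : \mathbb{R}^p \to Y^X$; $N : Y^X \to Y^X$; $\delta_{\mathrm{ev}_x} : Y^X \to Y$; the composite is $S_p^x : \mathbb{R}^p \to Y$.]

Figure 34: The Gaussian additive noise parametric sampling distributions $S_p^x$ viewed as a family of sampling distributions, one for each $x \in X$.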

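The sampling distribution can be computed as

$$\begin{aligned}
S_p^x(B \mid a) &= (\delta_{\mathrm{ev}_x} \circ N \circ \delta_i)(B \mid a) \\
&= \int_{f \in Y^X} (\delta_{\mathrm{ev}_x} \circ N)(B \mid f)\, d\underbrace{(\delta_i)_a}_{=\delta_{F_a}} \\
&= (\delta_{\mathrm{ev}_x} \circ N)(B \mid F_a) \\
&= N(\mathrm{ev}_x^{-1}(B) \mid F_a).
\end{aligned}$$

Because $N_{F_a} \sim \mathcal{GP}(F_a, k_N)$, it follows that

$$N(\mathrm{ev}_x^{-1}(\cdot) \mid F_a) = N_{F_a} \mathrm{ev}_x^{-1} \sim N(F_a(x), \sigma^2)$$

and consequently

$$S_p^x(B \mid a) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{B} e^{-\frac{(y - F_a(x))^2}{2\sigma^2}}\, dy.$$

Taking the prior $P : 1 \to \mathbb{R}^p$ as a normal distribution with mean $m$ and covariance function $k$, it follows that the composite $S_p^x \circ P \sim N(F_m(x), k(x, x) + \sigma^2)$ while the inference map $I_p^x$ satisfies, for all $B \in \Sigma_Y$ and all $A \in \Sigma_{\mathbb{R}^p}$,

$$\int_{a \in A} S_p^x(B \mid a)\, dP = \int_{y \in B} I_p^x(A \mid y)\, d(S_p^x \circ P).$$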

To determine this inference map $I_p^x$ it is necessary to require the parametric map

$$i : \mathbb{R}^p \to Y^X, \qquad a \mapsto i_a$$

be an injective linear homomorphism. Under this condition, which can often be achieved simply by eliminating redundant modeling parameters, we can explicitly determine the inference map for the parameterized model, denoted $I_p^x$, by decomposing it into two inference maps as displayed in the diagram in Figure 35.

[Figure 35 diagram: $P : 1 \to \mathbb{R}^n$; $P i^{-1} : 1 \to Y^X$; $S_n^x \circ P i^{-1} : 1 \to Y$; $\delta_i : \mathbb{R}^n \to Y^X$; $S_n^x : Y^X \to Y$; $S_p^x = S_n^x \circ \delta_i$; inference maps $I_\star : Y^X \to \mathbb{R}^n$, $I_n^x : Y \to Y^X$ and $I_p^x : Y \to \mathbb{R}^n$.]

Figure 35: The inference map for the parametric model is a composite of two inference maps.

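We first show the stochastic process $P i^{-1}$ is a GP, and by taking the sampling distribution $S_n^x = \delta_{\mathrm{ev}_x} \circ N$ as the noisy measurement model we can use the result of the previous section to provide us with the inference map $I_n^x$ in Figure 35.

Lemma 16. Let $k$ be the matrix representation of the covariance function $k$. Then the push forward of $P \sim N(m, k)$ by $i$ is a GP $P i^{-1} \sim \mathcal{GP}(i_m, \hat{k})$, where $\hat{k}(u, v) = u^T k v$.

Proof. We need to show that the push forward of $P i^{-1}$ by the restriction map $Y^\iota : Y^X \to Y^{X_0}$ is a normal distribution for any finite subspace $\iota : X_0 \hookrightarrow X$. Consider the commutative diagram in Figure 36, where $Y_x$ is a copy of $Y$, $X_0 = (x_1, \ldots, x_{n'})$, and

$$\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_{n'}} : Y^{X_0} \to \prod_{x \in X_0} Y_x$$

is the canonical isomorphism.

[Figure 36 diagram: $P : 1 \to \mathbb{R}^n$; $P i^{-1} : 1 \to Y^X$; $\delta_i : \mathbb{R}^n \to Y^X$; $\delta_{Y^\iota} : Y^X \to Y^{X_0}$; $\delta_{Y^\iota \circ i} : \mathbb{R}^n \to Y^{X_0}$; $\delta_{\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_{n'}}} : Y^{X_0} \to \prod_{x \in X_0} Y_x$.]

Figure 36: The restriction of $P i^{-1}$.

The composite of the measurable maps is

$$\big((\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_{n'}}) \circ Y^\iota \circ i\big)(a) = (i_a(x_1), \ldots, i_a(x_{n'})) \tag{50}$$

from which it follows that the composite map $\delta_{\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_{n'}}} \circ \delta_{Y^\iota} \circ P i^{-1} \sim N(X_0^T m, X_0^T k X_0)$.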

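Now consider the diagram

[Diagram: $P : 1 \to \mathbb{R}^n$; $P i^{-1} : 1 \to Y^X$; $\delta_i : \mathbb{R}^n \to Y^X$; $I_\star : Y^X \to \mathbb{R}^n$.]

with the sampling distribution for this Bayesian problem given by $\delta_i$. Let $e_j^T = (0, \ldots, 0, 1, 0, \ldots, 0)$ be the $j$th unit vector of $\mathbb{R}^p$ and let $i_{e_j} = f_j \in Y^X$. The elements $\{f_j\}_{j=1}^{p}$ form the components of a basis for the image of $i$ by the assumed injective property of $i$. Let this finite basis have a dual basis $\{f_j^*\}_{j=1}^{p}$ so that $f_k^*(f_j) = \delta_k(j)$.
Consider the measurable map

$$f_1^* \times \ldots \times f_p^* : Y^X \to \mathbb{R}^p, \qquad g \mapsto (f_1^*(g), \ldots, f_p^*(g)).$$

Using the linearity of the parameter space $\mathbb{R}^p$ it follows $a = \sum_{i=1}^{p} a_i e_i$ and consequently

$$\big((f_1^* \times \ldots \times f_p^*) \circ i\big)(a) = (f_1^*(i_a), \ldots, f_p^*(i_a)) = a, \qquad \text{using } f_j^*(i_a) = f_j^*\Big(\sum_{k=1}^{p} a_k f_k\Big) = a_j,$$

and hence $(f_1^* \times \ldots \times f_p^*) \circ i = \mathrm{id}_{\mathbb{R}^p}$ in Meas. Now it follows the corresponding inference map is $I_\star = \delta_{f_1^* \times \ldots \times f_p^*}$ because the necessary and sufficient condition for $I_\star$ is given, for all $\mathrm{ev}_z^{-1}(B) \in \Sigma_{Y^X}$ (which generate $\Sigma_{Y^X}$) and all $A \in \Sigma_{\mathbb{R}^n}$, by

$$\int_{a \in A} \delta_i(\mathrm{ev}_z^{-1}(B) \mid a)\, dP = \int_{g \in \mathrm{ev}_z^{-1}(B)} I_\star(A \mid g)\, d(P i^{-1}) \tag{51}$$

with the left hand term reducing to the expression

$$\int_{a \in A} 1_{i^{-1}(\mathrm{ev}_z^{-1}(B))}(a)\, dP = P(i^{-1}(\mathrm{ev}_z^{-1}(B)) \cap A).$$

On the other hand, using $I_\star = \delta_{f_1^* \times \ldots \times f_p^*}$, the right hand term of Equation 51 also reduces to the same expression since

$$\begin{aligned}
\int_{g \in \mathrm{ev}_z^{-1}(B)} \delta_{f_1^* \times \ldots \times f_p^*}(A \mid g)\, d(P i^{-1}) &= \int_{a \in i^{-1}(\mathrm{ev}_z^{-1}(B))} 1_{((f_1^* \times \ldots \times f_p^*) \circ i)^{-1}(A)}(a)\, dP \\
&= \int_{a \in i^{-1}(\mathrm{ev}_z^{-1}(B))} 1_A(a)\, dP \\
&= P(i^{-1}(\mathrm{ev}_z^{-1}(B)) \cap A)
\end{aligned}$$

thus proving $I_\star = \delta_{f_1^* \times \ldots \times f_p^*}$.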
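For the affine model of Example 14 with $n = 2$, one possible dual family can be written down explicitly using point evaluations. The particular functionals below are an illustrative choice made for this sketch, not a construction given in the text; they satisfy $f_k^*(f_j) = \delta_k(j)$ on the basis $f_1 = \pi_1$, $f_2 = \pi_2$, $f_3 = 1$, so the composite with $i$ is the identity on $\mathbb{R}^3$.

```python
from typing import Callable, Sequence, Tuple

Hypothesis = Callable[[Sequence[float]], float]


def affine(a: Sequence[float]) -> Hypothesis:
    """Parametric map i for n = 2: a -> F_a with F_a(x) = a1*x1 + a2*x2 + a3."""
    a1, a2, a3 = a
    return lambda x: a1 * x[0] + a2 * x[1] + a3


def dual_coordinates(g: Hypothesis) -> Tuple[float, float, float]:
    """(f_1^*(g), f_2^*(g), f_3^*(g)) built from point evaluations (illustrative choice)."""
    origin = g((0.0, 0.0))  # f_3^*: evaluation at the origin picks out the constant term
    return (
        g((1.0, 0.0)) - origin,  # f_1^*: coefficient of x1
        g((0.0, 1.0)) - origin,  # f_2^*: coefficient of x2
        origin,
    )
```

Applying `dual_coordinates` to `affine(a)` returns `a`, which is the deterministic map underlying $I_\star$ in this special case.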


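Taking

$$I_p^x = I_\star \circ I_n^x,$$

it follows that for a given measurement $(x, y)$ the composite is

$$I_p^x(\cdot \mid y) = I_n^x\big((f_1^* \times \ldots \times f_p^*)^{-1}(\cdot) \mid y\big) \tag{52}$$

which is the push forward measure of the GP $I_n^x(\cdot \mid y) \sim \mathcal{GP}(i_m^1, \kappa^1)$ where (as defined previously) $\kappa = k + k_N$ and

$$i_m^1(z) = i_m(z) + \frac{\kappa(z, x)}{\kappa(x, x)}\,(y - i_m(x)) \tag{53}$$

and

$$\kappa^1(u, v) = \kappa(u, v) - \frac{\kappa(u, x)\,\kappa(x, v)}{\kappa(x, x)}. \tag{54}$$

This GP projected onto any finite subspace $\iota : X_0 \hookrightarrow X$ is a normal distribution and, for $X_0 = \{x_1, x_2, \ldots, x_n\}$, it follows that

[Diagram: $I_n^x(\cdot \mid y) \sim \mathcal{GP}(i_m^1, \kappa^1) : 1 \to Y^X$; $\delta_{Y^\iota} : Y^X \to Y^{X_0}$; $\delta_{\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_n}} : Y^X \to \prod_{i=1}^{n} Y_i \cong \mathbb{R}^n$ and its restriction $\delta_{\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_n}}|_{Y^{X_0}} : Y^{X_0} \to \prod_{i=1}^{n} Y_i$; $I_p(\cdot \mid y) \sim N((i_m^1(x_1), \ldots, i_m^1(x_n))^T, \kappa^1|_{X_0}) : 1 \to \prod_{i=1}^{n} Y_i$.]

where $Y_i$ is a copy of $Y = \mathbb{R}$ and the restriction $\delta_{\mathrm{ev}_{x_1} \times \ldots \times \mathrm{ev}_{x_n}}|_{Y^{X_0}}$ is an isomorphism. The inference map $I_p(\cdot \mid y)$ is the updated normal distribution on $\mathbb{R}^n$ given the measurement $(x, y)$, which can be rewritten as

$$I_p(\cdot \mid y) \sim N\big(m + K(X_0, x)\,\kappa(x, x)^{-1}\,(y - m^T x),\ \kappa^1|_{X_0}\big),$$

where $X_0$ is now viewed as the ordered set $X_0 = (x_1, \ldots, x_n)$ and $K(X_0, x)$ is the $n$-vector with components $\kappa(x_j, x)$.
Iterating this updating procedure for $N$ measurements $\{(x_i, y_i)\}_{i=1}^{N}$, the $N$th posterior coincides with the analogous noisy measurement inference updating Equations 48 and 49 with $\kappa$ in place of $k$.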

9 Stochastic Processes as Points


Having defined stochastic processes we would be remiss not to mention the Markov process, one of the most familiar types of processes used for modeling. Many applications can be approximated by Markov models, and a familiar example is the Kalman filter, which we describe below as it is the archetype. While Kalman filtering is not commonly viewed as a ML problem, it is useful to put it into perspective with respect to the Bayesian modeling paradigm.
By looking at Markov processes we are immediately led to a generalization of the
definition of a stochastic process which is due to Lawvere and Meng [18]. To motivate
this we start with the elementary idea first before giving the generalized definition of a
stochastic process.

9.1 Markov processes via Functor Categories


Here we assume knowledge of the definition of a functor, and refer the unfamiliar reader
to any standard text on category theory. Let T be any set with a total (linear) ordering
≤ so for every t1 , t2 ∈ T either t1 ≤ t2 or t2 ≤ t1 . (Here we have switched from our standard
“X” notation to “T ” as we wish to convey the image of a space with properties similar
to time as modeled by the real line.) We can view (T, ≤) as a category with the objects
as the elements and the set of arrows from one object to another as

$$\hom_T(t_1, t_2) = \begin{cases} \star & \text{iff } t_1 \leq t_2 \\ \emptyset & \text{otherwise.} \end{cases}$$

The functor category P T has as objects functors F ∶ (T, ≤) → P which play an im-
portant role in the theory of stochastic processes, and we formally give the following
definition.

Definition 17. A Markov transformation is a functor F ∶ (T, ≤) → P.



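From the modeling perspective we look at the image of the functor $\mathcal{F} \in_{ob} \mathcal{P}^T$ in the category $\mathcal{P}$, so given any sequence of ordered points $\{t_i\}_{i=1}^{\infty}$ in $T$ their image under $\mathcal{F}$ is shown in Figure 37, where $\mathcal{F}_{t_i, t_{i+1}} = \mathcal{F}(\leq)$ is a $\mathcal{P}$ arrow.

[Figure 37 diagram: the chain $\mathcal{F}(t_1) \to \mathcal{F}(t_2) \to \mathcal{F}(t_3) \to \ldots$ with arrows $\mathcal{F}_{t_1, t_2}$, $\mathcal{F}_{t_2, t_3}$, $\mathcal{F}_{t_3, t_4}$.]

Figure 37: A Markov transformation as the image of a $\mathcal{P}$ valued functor.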

By functoriality, these arrows satisfy the conditions

1. $\mathcal{F}_{t_i, t_i} = \mathrm{id}_{t_i}$, and

2. $\mathcal{F}_{t_i, t_{i+2}} = \mathcal{F}_{t_{i+1}, t_{i+2}} \circ \mathcal{F}_{t_i, t_{i+1}}$.

Using the definition of composition in $\mathcal{P}$ the second condition can be rewritten as

$$\mathcal{F}_{t_i, t_{i+2}}(B \mid x) = \int_{u \in \mathcal{F}(t_{i+1})} \mathcal{F}_{t_{i+1}, t_{i+2}}(B \mid u)\, d\mathcal{F}_{t_i, t_{i+1}}(\cdot \mid x)$$

for $x \in \mathcal{F}(t_i)$ (the “state” of the process at time $t_i$) and $B \in \Sigma_{\mathcal{F}(t_{i+2})}$. This equation is called the Chapman-Kolmogorov relation and can be used, in the non categorical characterization, to define a Markov process.
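On finite state spaces the integral in the Chapman-Kolmogorov relation becomes a finite sum, and composition in $\mathcal{P}$ reduces to multiplication of row-stochastic matrices. The sketch below assumes this finite setting; the representation `F[x][u]` for $\mathcal{F}(\{u\} \mid x)$ and the function name `compose` are illustrative choices.

```python
from typing import List

Kernel = List[List[float]]  # F[x][u] = probability of moving to state u given state x


def compose(G: Kernel, F: Kernel) -> Kernel:
    """Composite kernel (G o F)(z | x) = sum_u G(z | u) F(u | x) on finite state spaces."""
    if any(len(row) != len(G) for row in F):
        raise ValueError("codomain of F must match domain of G")
    return [
        [sum(F[x][u] * G[u][z] for u in range(len(G))) for z in range(len(G[0]))]
        for x in range(len(F))
    ]


# State spaces of different sizes are allowed: F maps 2 states to 2, G maps 2 states to 3.
F_12 = [[0.9, 0.1], [0.2, 0.8]]
G_23 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
F_13 = compose(G_23, F_12)
```

Each row of `F_13` is again a probability distribution, reflecting that the composite of two $\mathcal{P}$ arrows is a $\mathcal{P}$ arrow.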
The important aspect to note about this definition of a Markov model is that the
measurable spaces F(ti ) can be distinct from the other measurable spaces F(tj ), for
j ≠ i, and of course the arrows Fti ,ti+1 are in general distinct. This simple definition
of a Markov transformation as a functor captures the property of an evolving process
being “memoryless” since if we know where the process F is at ti , say x ∈ F(ti ), then its
expectation at ti+1 (as well as higher order moments) can be determined without regard
to its “state” prior to ti .
The arrows of the functor category P T are natural transformations η ∶ F → G, for
F, G ∈ob P T , and hence satisfy the commutativity relation given in Figure 38 for every
t1 , t2 ∈ T with t1 ≤ t2 .

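[Figure 38 diagram: commutative square with horizontal arrows $\eta_{t_1} : \mathcal{F}(t_1) \to \mathcal{G}(t_1)$ and $\eta_{t_2} : \mathcal{F}(t_2) \to \mathcal{G}(t_2)$, and vertical arrows $\mathcal{F}_{t_1, t_2} : \mathcal{F}(t_1) \to \mathcal{F}(t_2)$ and $\mathcal{G}_{t_1, t_2} : \mathcal{G}(t_1) \to \mathcal{G}(t_2)$.]

Figure 38: An arrow in $\mathcal{P}^T$ is a natural transformation.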

The functor category P T has a terminal object 1 mapping t ↦ 1 for every t ∈ T and
this object 1 ∈ob P T allows us to generalize the definition of a stochastic process.23

Definition 18. Let X be any category. A stochastic process is a point in the category
P X , i.e., a P X arrow η ∶ 1 → F for some F ∈ob P X .24

Different categories $X$ correspond to different types of stochastic processes. Taking the simplest possible case, let $X$ be a set considered as a discrete category: the objects are the elements $x \in X$ while there are no nonidentity arrows in $X$ viewed as a category. Definition 18 generalizes Definition 9 because, for $Y$ a fixed measurable space, we have the functor $\hat{Y} : X \to \mathcal{P}$ mapping each object $x \in_{ob} X$ to a copy $Y_x$ of $Y$, and this special case corresponds to Definition 9.
Taking $X = T$, where $T$ is a totally ordered set (and subsequently viewed as a category with one arrow between any two elements), and looking at the image of

$$t_1 \leq t_2 \leq t_3 \leq \ldots$$

under the stochastic process $\mu : 1 \to \mathcal{F}$ gives the commutative diagram in Figure 39.

[Figure 39 diagram: points $\mu_{t_1}, \mu_{t_2}, \mu_{t_3}, \mu_{t_4}$ from $1$ into the chain $\mathcal{F}(t_1) \to \mathcal{F}(t_2) \to \mathcal{F}(t_3) \to \ldots$ with arrows $\mathcal{F}_{t_1, t_2}$, $\mathcal{F}_{t_2, t_3}$, $\mathcal{F}_{t_3, t_4}$.]

Figure 39: A Markov model as the image of a stochastic process.

From this perspective a stochastic process $\mu$ can be viewed as a family of probability measures on the measurable spaces $\mathcal{F}(t_i)$, and the stochastic process $\mu$ coupled with a $\mathcal{P}^T$ arrow $\eta : \mathcal{F} \to \mathcal{G}$ maps one Markov model to another.

[Diagram: points $\mu_{t_1}, \mu_{t_2}, \mu_{t_3}, \mu_{t_4}$ from $1$ into the chain $\mathcal{F}(t_1) \to \mathcal{F}(t_2) \to \mathcal{F}(t_3) \to \ldots$ with arrows $\mathcal{F}_{t_i, t_{i+1}}$; vertical arrows given by the components of $\eta$ from each $\mathcal{F}(t_i)$ to $\mathcal{G}(t_i)$; bottom chain $\mathcal{G}(t_1) \to \mathcal{G}(t_2) \to \mathcal{G}(t_3) \to \ldots$ with arrows $\mathcal{G}_{t_i, t_{i+1}}$.]

23 The elementary definition of a stochastic process, Definition 9, as a probability measure on a function space suffices for what we might call standard ML. For more general constructions, such as Markov Models and Hierarchical Hidden Markov Models (HHMM), the generalized definition is required.

24 In any category with a terminal object 1 an arrow whose domain is 1 is called a point. So an arrow $x : 1 \to X$ is called a point of $X$ whereas $f : X \to Y$ is sometimes referred to as a generalized element to emphasize that it “varies” over the domain. It is constructive to consider what this means in the category of Sets and why the terminology is meaningful.

One can also observe that GPs can be defined using this generalized definition of a stochastic process. For $X$ a measurable space it follows for any finite subset $X_0 \subset X$ we have the inclusion map $\iota : X_0 \hookrightarrow X$ which is a measurable function, using the subspace σ-algebra for $X_0$, and we are led back to Diagram 14 with the stochastic process $P : 1 \to \hat{Y}$, where $\hat{Y}$ is as defined in the paragraph above following Definition 18, which satisfies the appropriate restriction property defining a GP.
These simple examples illustrate that different stochastic processes can be obtained by
either varying the structure of the category X and/or by placing additional requirements
on the projection maps, e.g., requiring the projections be normal distributions on finite
subspaces of the exponent category X.

9.2 Hidden Markov Models


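To bring in the Bayesian aspect of Markov models it is necessary to consider the measurement process associated with a sequence as in Figure 39. In particular, consider the standard diagram

[Diagram: prior $\mu_{t_1} : 1 \to \mathcal{F}(t_1)$; measurement model $S_{t_1} : \mathcal{F}(t_1) \to Y_{t_1}$; data $d_{t_1} : 1 \to Y_{t_1}$.]

which characterizes a Bayesian model, where $Y_{t_1}$ is a copy of $Y$, which is a data measurement space, $S_{t_1}$ is interpreted as a measurement model and $d_{t_1}$ is an actual data measurement on the “state” space $\mathcal{F}(t_1)$. This determines an inference map $I_{t_1}$ so that given a measurement $d_{t_1}$ the posterior probability on $\mathcal{F}(t_1)$ is $I_{t_1} \circ d_{t_1}$. Putting the two measurement models together with the Markov transformation model $\mathcal{F}$ we obtain the diagram in Figure 40.

[Figure 40 diagram: prior $\mu_{t_1} : 1 \to \mathcal{F}(t_1)$ and posterior $\hat{\mu}_{t_1} = I_{t_1} \circ d_{t_1} : 1 \to \mathcal{F}(t_1)$; $\mathcal{F}_{t_1, t_2} : \mathcal{F}(t_1) \to \mathcal{F}(t_2)$; $\mathcal{F}_{t_1, t_2} \circ \hat{\mu}_{t_1} : 1 \to \mathcal{F}(t_2)$; data $d_{t_1} : 1 \to Y_{t_1}$; measurement models $S_{t_1} : \mathcal{F}(t_1) \to Y_{t_1}$ and $S_{t_2} : \mathcal{F}(t_2) \to Y_{t_2}$ with inference maps $I_{t_1}$ and $I_{t_2}$ in the opposite directions.]

Figure 40: The hidden Markov model viewed in $\mathcal{P}$.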

This is the hidden Markov process in which, given a prior probability $\mu_{t_1}$ on the space $\mathcal{F}(t_1)$, we can use the measurement $d_{t_1}$ to update the prior to the posterior $\hat{\mu}_{t_1} = I_{t_1} \circ d_{t_1}$ on $\mathcal{F}(t_1)$. The posterior then composes with $\mathcal{F}_{t_1, t_2}$ to give the prior $\mathcal{F}_{t_1, t_2} \circ \hat{\mu}_{t_1}$ on $\mathcal{F}(t_2)$, and now the process can be repeated indefinitely. The Kalman filter is an example in which the Markov map $\mathcal{F}_{t_1, t_2}$ describes the linear dynamics of some system under consideration (as in tracking a satellite), while the sampling distributions $S_{t_1}$ model the noisy measurement process, which for the Kalman filter is Gaussian additive noise. Of course one can easily replace the linear dynamic by a nonlinear dynamic and the Gaussian additive noise model by any other measurement model, obtaining an extended Kalman filter, and the above form of the diagram does not change at all; only the $\mathcal{P}$ maps change.
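The update-then-propagate cycle can be sketched for the simplest case of a scalar state. The sketch assumes a state transition $u \mapsto a\,u$ plus Gaussian process noise of variance `q`, and a direct measurement of the state with additive Gaussian noise of variance `r`; these modeling choices and the function name `filter_step` are assumptions made for illustration.

```python
from typing import Tuple


def filter_step(
    mean: float, var: float, y: float, a: float, q: float, r: float
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """One cycle of a scalar Kalman filter.

    Returns (posterior at t1, prior at t2), each as a (mean, variance) pair.
    """
    # Measurement update: the posterior plays the role of I_{t1} o d_{t1}.
    gain = var / (var + r)
    post_mean = mean + gain * (y - mean)
    post_var = (1.0 - gain) * var
    # Propagation through the Markov arrow F_{t1,t2}: linear dynamics plus noise.
    next_mean = a * post_mean
    next_var = a * a * post_var + q
    return (post_mean, post_var), (next_mean, next_var)


posterior_t1, prior_t2 = filter_step(mean=0.0, var=1.0, y=2.0, a=2.0, q=0.25, r=1.0)
```

The measurement update has the same form as the mean and covariance updates with $\kappa = k + \sigma^2$, with `var` in place of $k(x, x)$ and `r` in place of $\sigma^2$. Repeating `filter_step` with `prior_t2` as the next prior gives the indefinitely repeated process described above.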

10 Final Remarks
In closing, we would like to make a few comments on the use of category theory for ML,
where the largest potential payoff lies in exploiting the abstract framework that categorical
language provides. This section assumes a basic familiarity with monads and should be
viewed as only providing conceptual directions for future research which we believe are
relevant for the mathematical development of learning systems. Further details on the
theory of monads can be found in most category theory books, while the basics as they
relate to our discussion below can be found in our previous paper [6], in which we provide
the simplest possible example of a decision rule on a discrete space.
Seemingly all aspects of ML including Dirichlet distributions and unsupervised learn-
ing (clustering) can be characterized using the category P. As an elementary example,
mixture models can be developed by consideration of the space of all (perfect) probability
measures PX on a measurable space X endowed with the coarsest σ-algebra such that
the evaluation maps evB ∶ PX → [0, 1] given by evB (P ) = P (B), for all B ∈ ΣX , are
measurable. This actually defines the object mapping of a functor P ∶ P → Meas which
sends a measurable space X to the space PX of probability measures on X. On arrows,
P sends the P-arrow f ∶ X → Y to the measurable function Pf ∶ PX → PY defined
pointwise on ΣY by
\[ Pf(P)(B) = \int_X f_B \, dP. \]

This functor is called the Giry monad, denoted G, and the Kleisli category K(G) of the
Giry monad is equivalent to P.²⁵ The reason we have chosen to present the material
from the perspective of P rather than K(G) is that the existing literature on ML uses
Markov kernels rather than the equivalent arrows in K(G). The Giry monad determines
the nondeterministic P mapping

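\[ PX \xrightarrow{\varepsilon_X} X \]

given by $\varepsilon_X(P, B) = ev_B(P) = P(B)$ for all $P \in P(X)$ and all $B \in \Sigma_X$.
Using this construction, any probability measure $P$ on $PX$ then yields a mixture of
probability measures on $X$ through the composite map

\[ 1 \xrightarrow{P} PX \xrightarrow{\varepsilon_X} X, \qquad \varepsilon_X \circ P = \text{a mixture model}. \]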
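
For a finite space X and a probability measure on PX supported on finitely many
component measures, the composite above reduces to a weighted average. The following
Python sketch illustrates this under those finiteness assumptions; the function name
mixture and the coin-flip numbers are hypothetical.

```python
def mixture(weights, components):
    """Composite of eps_X with a measure on PX: average the components by weight."""
    mixed = {}
    for name, w in weights.items():
        for x, p in components[name].items():
            mixed[x] = mixed.get(x, 0.0) + w * p
    return mixed


# Two probability measures on X = {"H", "T"} (illustrative numbers).
components = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.9, "T": 0.1}}
# A probability measure on PX, supported on the two components above.
weights = {"fair": 0.25, "biased": 0.75}
mixed = mixture(weights, components)
```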

We have briefly introduced the Kleisli category K(G) (≅ P) because it is a subcategory
of the Eilenberg–Moore category of G-algebras, which we call the category of decision
rules D,²⁶ because the objects of this category are Meas arrows r ∶ PX → X sending a
probability measure P on X to an actual element of X satisfying some basic properties
including r(δx) = x. Thus r acts as a decision rule converting a probability measure on
X to an actual element of X and, if P is deterministic, takes that measure to the point
x ∈ X of nonzero measure.²⁷ Decision theory is generally presented from the perspective
of taking probability measures on X and, usually via a family of loss functions θ ∶ X → R,
making a selection among a family of possible choices θ ∈ Θ where Θ is some measurable
space rather than X. However, it can clearly be viewed from this more basic viewpoint.

²⁵ See Giry [14] for the basic definitions and equivalence of these categories.
²⁶ Doberkat [8] has analyzed the Kleisli category under the condition that the arrows are not only
measurable but also continuous. This is an unnecessary assumption, resulting in all finite spaces having
no decision rules, though his considerable work on this category K(G) provides much useful insight as
well as applications of this category.
²⁷ Measurable spaces are defined only up to isomorphism, so that if two elements x, y ∈ X are
indistinguishable in terms of the σ-algebra, meaning there exists no measurable set A ∈ ΣX such that
x ∈ A and y ∉ A, then δx = δy and we also identify x with y.

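
As a hedged illustration of a decision rule r with r(δx) = x, one standard candidate on
an interval of real numbers is the mean. The Python sketch below restricts to
finite-support measures and checks the unit property together with compatibility with
mixtures. The choice of the mean and the names mean_rule and flatten are assumptions
made for illustration and are not taken from the paper.

```python
from fractions import Fraction


def mean_rule(measure):
    """Candidate decision rule: send a finite-support measure on the reals to its mean."""
    return sum(x * mass for x, mass in measure.items())


def dirac(x):
    """Dirac measure concentrated at x."""
    return {x: Fraction(1)}


def flatten(weighted_measures):
    """Mixture of a list of (weight, measure) pairs."""
    out = {}
    for w, m in weighted_measures:
        for x, mass in m.items():
            out[x] = out.get(x, Fraction(0)) + w * mass
    return out


# Illustrative data: two measures on the reals and a measure on the pair of them.
p1 = {Fraction(0): Fraction(1, 2), Fraction(1): Fraction(1, 2)}
p2 = {Fraction(1): Fraction(1, 4), Fraction(3): Fraction(3, 4)}
mix = [(Fraction(1, 3), p1), (Fraction(2, 3), p2)]

unit_law = mean_rule(dirac(Fraction(7, 2))) == Fraction(7, 2)
decide_after_mixing = mean_rule(flatten(mix))
mix_after_deciding = sum(w * mean_rule(m) for w, m in mix)
```

Here decide_after_mixing and mix_after_deciding agree, which is the finite shadow of the
algebra condition that deciding on a mixture equals averaging the individual decisions.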
The largest potential payoff in using category theory for ML and related applications
appears to be in integrating decision theory with probability theory, expressed in terms of
the category D, which would provide a basis for an automated reasoning system. While
the Bayesian framework presented in this paper can fruitfully be exploited to construct
estimates of unknown functions, it still lacks the ability to make decisions of any kind.
Even if we were to invoke a list of simple rules to make decisions, the category P is
too restrictive to implement these rules. By working in the larger category of decision
rules D, it is possible to implement both the Bayesian reasoning presented in this work
and decision rules as part of a larger reasoning system. Our perspective on this
problem is that Bayesian reasoning in general is inadequate, not only because it lacks
the ability to make decisions, but because it is a passive system which “waits around”
for additional measurement data. An automated reasoning system must take self-directed
action, as in commanding itself to “swivel the camera 45 degrees right to obtain necessary
additional information”, which is a (decision) command and control component that
can be integrated with Bayesian reasoning. Based upon the work of Rosen [21], in which
he employed categorical ideas, an intelligent system would in addition possess an
anticipatory component. While he did not use the language of SMCC, it is clear this
aspect was his intention and critical in his method of modeling intelligent systems, and
within the category D this additional aspect can also be modeled.

11 Appendix A: Integrals over probability measures.


The following three properties are the only ones used throughout the paper to derive
the values of integrals defined over probability measures.

1. The integral of any measurable function f ∶ X → R with respect to a Dirac measure
satisfies
\[ \int_{u \in X} f(u) \, d\delta_x = f(x). \]
This is straightforward to show using standard measure theoretic arguments.

2. Integration with respect to a push forward measure can be pulled back. Suppose
f ∶ X → Y is any measurable function, P is a probability measure on X, and
φ ∶ Y → R is any measurable function. Then
\[ \int_{y \in Y} \varphi(y) \, d(P f^{-1}) = \int_{x \in X} \varphi(f(x)) \, dP. \]
To prove this simply show that it holds for φ = 1_B, the characteristic function at
B, then extend it to any simple function, and finally use the monotone convergence
theorem to show it holds for any measurable function.

3. Suppose f ∶ X → Y is any measurable function and P is a probability measure on
X. Then
\[ \int_{x \in X} \delta_f(B \mid x) \, dP = \int_{x \in X} 1_B(f(x)) \, dP = P(f^{-1}(B)). \]
This is a special case of (2) with φ = 1_B.
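
On finite spaces the three properties can be checked directly, since every integral is a
finite sum. The following Python sketch is a sanity check under that assumption, not a
proof; the helper names (integrate, dirac, push_forward) and the particular P, f, φ and
B are hypothetical choices.

```python
from fractions import Fraction


def integrate(g, measure):
    """Integral of g against a finite measure given as {point: mass}."""
    return sum(g(x) * mass for x, mass in measure.items())


def dirac(x):
    """Dirac measure concentrated at x."""
    return {x: Fraction(1)}


def push_forward(f, measure):
    """The push forward measure P f^{-1} on the codomain of f."""
    out = {}
    for x, mass in measure.items():
        out[f(x)] = out.get(f(x), Fraction(0)) + mass
    return out


# Illustrative data: X = {0, 1, 2}, Y = {0, 1}.
P = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
f = lambda x: x % 2
phi = lambda y: 10 * y + 1
B = {0}
indicator_B = lambda y: 1 if y in B else 0

# Property 1: integrating against a Dirac measure evaluates the function.
prop1 = integrate(phi, dirac(1)) == phi(1)
# Property 2: change of variables for the push forward measure.
lhs2 = integrate(phi, push_forward(f, P))
rhs2 = integrate(lambda x: phi(f(x)), P)
# Property 3: the special case phi = 1_B gives P(f^{-1}(B)).
lhs3 = integrate(lambda x: indicator_B(f(x)), P)
rhs3 = sum(mass for x, mass in P.items() if f(x) in B)
```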

12 Appendix B: The weak closed structure in P


Here is a simple illustration of the weak closed property of P using finite spaces. Let
X = 2 = {0, 1} and Y = {a, b, c}, both with the powerset σ-algebra. This yields the
powerset σ-algebra on Y^X, and each function can be represented by an ordered pair, such
as (b, c) denoting the function f(0) = b and f(1) = c. Define two probability measures
P, Q on Y^X by
P({(b, c)}) = .5 = P({(c, b)})
Q({(b, b)}) = .5 = Q({(c, c)})
with both measures having a value of 0 on all other singleton measurable sets. Both of
these probability measures on Y^X yield the same conditional probability measure

\[ (X, \Sigma_X) \xrightarrow{\; P(\cdot \mid x) \,=\, Q(\cdot \mid x) \;} (Y, \Sigma_Y) \]

since
P({a}∣0) = 0 = Q({a}∣0)
P({b}∣0) = .5 = Q({b}∣0)
P({c}∣0) = .5 = Q({c}∣0)
and
P({a}∣1) = 0 = Q({a}∣1)
P({b}∣1) = .5 = Q({b}∣1)
P({c}∣1) = .5 = Q({c}∣1)
Since P ≠ Q as measures on Y^X, the uniqueness condition required for the closedness
property fails and only the existence condition is satisfied.
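
The computation above can be reproduced mechanically. The following Python sketch
assumes that the conditional probability induced by a measure on Y^X is obtained by
evaluating functions at each x ∈ X; the helper names measure and conditional are
hypothetical.

```python
from itertools import product

X = (0, 1)
Y = ("a", "b", "c")


def measure(support):
    """Probability measure on Y^X; functions are tuples (f(0), f(1))."""
    m = {f: 0.0 for f in product(Y, repeat=len(X))}
    m.update(support)
    return m


def conditional(m):
    """Evaluate at each x: the mass of {f : f(x) = y} for every y in Y."""
    return {x: {y: sum(p for f, p in m.items() if f[x] == y) for y in Y}
            for x in X}


P = measure({("b", "c"): 0.5, ("c", "b"): 0.5})
Q = measure({("b", "b"): 0.5, ("c", "c"): 0.5})

same_conditional = conditional(P) == conditional(Q)
different_measures = P != Q
```

Both flags are true for this data: the two measures on the function space differ but
induce the same conditional probability.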

13 References
[1] S. Abramsky, R. Blute, and P. Panangaden, Nuclear and trace ideals in tensored-∗
categories. Journal of Pure and Applied Algebra, Vol. 143, Issue 1-3, 1999, pp 3-47.

[2] David Barber, Bayesian Reasoning and Machine Learning, Cambridge University
Press, 2012.

[3] N. N. Cencov, Statistical decision rules and optimal inference, Volume 53 of Trans-
lations of Mathematical Monographs, American Mathematical Society, 1982.

[4] Bob Coecke and Robert Spekkens, Picturing classical and quantum Bayesian inference,
Synthese, June 2012, Volume 186, Issue 3, pp 651-696.
http://link.springer.com/article/10.1007/s11229-011-9917-5

[5] David Corfield, Category Theory in Machine Learning, n-category cafe blog.
http://golem.ph.utexas.edu/category/2007/09/category_theory_in_machine_lea.html

[6] Jared Culbertson and Kirk Sturtz, A Categorical Foundation for Bayesian Probability,
Applied Categorical Structures, 2013.
http://link.springer.com/article/10.1007/s10485-013-9324-9

[7] R. Davis, Gaussian Processes,
http://www.stat.columbia.edu/~rdavis/papers/VAG002.pdf

[8] E.E. Doberkat, Kleisli morphisms and randomized congruences for the Giry monad, J.
Pure and Applied Algebra, Vol. 211, pp 638-664, 2007.

[9] R.M. Dudley, Real Analysis and Probability, Cambridge Studies in Advanced Math-
ematics, no. 74, Cambridge University Press, 2002

[10] R. Durrett, Probability: Theory and Examples, 4th ed., Cambridge University Press,
New York, 2010.

[11] A.M. Faden. The Existence of Regular Conditional Probabilities: Necessary and
Sufficient Conditions. The Annals of Probability, 1985, Vol. 13, No. 1, 288-298.

[12] Brendan Fong, Causal Theories: A Categorical Perspective on Bayesian Networks.
Preprint, April 2013. http://arxiv.org/pdf/1301.6201.pdf

[13] Tobias Fritz, A presentation of the category of stochastic matrices, 2009.
http://arxiv.org/pdf/0902.2554.pdf

[14] M. Giry, A categorical approach to probability theory, in Categorical Aspects of
Topology and Analysis, Vol. 915, pp 68-85, Springer-Verlag, 1982.

[15] E.T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press,
2003.

[16] F.W. Lawvere, The category of probabilistic mappings. Unpublished seminar notes
1962.

[17] F.W. Lawvere, Bayesian Sections, private communication, 2011.

[18] X. Meng, Categories of convex sets and of metric spaces, with applications to stochas-
tic programming and related areas, Ph.D. Thesis, State University of New York at
Buffalo, 1988.

[19] Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

[20] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning.
MIT Press, 2006.

[21] Robert Rosen, Life Itself, Columbia University Press, 1991.

[22] V.A. Voevodskii, Categorical probability, Steklov Mathematical Institute Seminar,
Nov. 20, 2008.
http://www.mathnet.ru/php/seminars.phtml?option_lang=eng&presentid=259

Jared Culbertson
RYAT, Sensors Directorate
Air Force Research Laboratory, WPAFB
Dayton, OH 45433
[email protected]

Kirk Sturtz
Universal Mathematics
Vandalia, OH 45377
[email protected]
