AI 2
Dr Sean Holden
Computer Laboratory, Room FC06
Telephone extension 63725
Email: [email protected]
www.cl.cam.ac.uk/sbh11/
Copyright © Sean Holden 2002-12.
1
Syllabus part I: advanced planning
New things to be looked at include some more advanced material on planning
algorithms:
Heuristics and GraphPlan: incorporating heuristics into partial-order planning, planning graphs, the GraphPlan algorithm. [1 lecture]
Planning using propositional logic: representing planning problems using propositional logic, and generating plans using satisfiability solvers. [1 lecture]
Planning using constraint satisfaction: representing planning problems so that they can be solved using constraint satisfaction solvers. [1 lecture]
There is no warranty attached to the stated lecture timings.
2
Syllabus part II: uncertainty in AI
We then delve into some more modern material which takes account of uncer-
tainty:
Uncertainty and Bayesian networks: review of probability as applied to AI,
Bayesian networks, inference in Bayesian networks using both exact and ap-
proximate techniques, other ways of dealing with uncertainty. [4 lectures]
Utility and decision-making: maximising expected utility, decision networks,
the value of information. [1 lecture]
Please read the supplementary notes on probability handout.
3
Syllabus part III: uncertainty and time
We then look at how uncertain reasoning and learning can take place when time is to be taken into account:
Markov processes: transition and sensor models.
Inference in temporal models: filtering, prediction, smoothing and finding the most likely explanation.
Hidden Markov models. [2 lectures]
4
Syllabus part IV: learning
Finally, we apply probability to supervised learning to obtain more sophisticated models of learning:
Bayes' theorem as applied to supervised learning. [1 lecture]
The maximum likelihood and maximum a posteriori hypotheses. [1 lecture]
Applying the Bayesian approach to neural networks. [3 lectures]
We finish the course by taking a brief look at reinforcement learning:
How can we learn from rewards and punishments?
The Q-learning algorithm. [1 lecture]
Reinforcement learning can be thought of as combining many of the elements
covered in this course and in AI I, and thus provides a natural place to stop.
5
Books
Once again, the main single text book for the course is:
Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig, Prentice Hall.
There is an accompanying web site at
aima.cs.berkeley.edu
Either the second or third edition should be fine, but avoid the first edition as it does not fit this course so well.
Chapter numbers given in these notes refer to the third edition.
6
Books
For some of the new material on neural networks you might also like to take a
look at:
Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer,
2006.
For some of the new material on reinforcement learning you might like to consult:
Machine Learning. Tom Mitchell. McGraw Hill, 1997.
For further material on planning try:
Automated Planning: Theory and Practice. Malik Ghallab, Dana Nau and
Paolo Traverso. Morgan Kaufmann, 2004.
7
Dire Warning
DIRE WARNING
This course contains quite a lot of:
1. Probability
2. Matrix algebra
3. Calculus
As I am an evil and vindictive person who likes to be unkind to kittens I will
assume that you know everything on these subjects that was covered in earlier
courses.
If you don't, it is essential that you re-visit your old notes and make sure that you're at home with that material.
YOU HAVE BEEN WARNED
8
How's your maths?
To see if you're up to speed on the maths, have a go at the following:
Evaluate the integral
∫_{−∞}^{∞} exp(−x²) dx
Hint: this is a pretty standard result. Square the integral and change to polar coordinates.
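For reference, the squaring trick the hint refers to runs as follows (writing I for the integral):
\[
I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy
    = \int_{0}^{2\pi}\int_{0}^{\infty} e^{-r^2}\,r\,dr\,d\theta
    = 2\pi\left[-\tfrac{1}{2}e^{-r^2}\right]_{0}^{\infty}
    = \pi,
\]
so I = √π.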
9
How's your maths?
Following on from that, here's something a bit more challenging.
Evaluate the integral
∫ exp( −(1/2)(xᵀΣx + xᵀμ + c) ) dx₁ ··· dxₙ
where Σ is a symmetric n × n matrix with real elements, μ ∈ ℝⁿ, c ∈ ℝ and
xᵀ = (x₁ x₂ ··· xₙ) ∈ ℝⁿ
(This second one is a bit tricky. I'll show you the answer later...)
10
Planning II
We now examine:
The way in which basic heuristics might be defined for use in planning problems.
The construction of planning graphs and their use in obtaining more sensible heuristics.
Planning graphs as the basis of the GraphPlan algorithm.
Planning using propositional logic.
Planning using constraint satisfaction.
Reading: Russell and Norvig, relevant sections of chapter 11.
11
A quick review
We used the following simple example problem.
The intrepid little scamps in the Cambridge University Roof-Climbing Society wish to attach an inflatable gorilla to the spire of a famous College. To do this they need to leave home and obtain:
An inflatable gorilla: these can be purchased from all good joke shops.
Some rope: available from a hardware store.
A first-aid kit: also available from a hardware store.
They need to return home after they've finished their shopping.
How do they go about planning their jolly escapade?
12
The STRIPS language
STRIPS: Stanford Research Institute Problem Solver (1970).
States: are conjunctions of ground literals with no functions.
At(Home) ∧ ¬Have(Gorilla) ∧ ¬Have(Rope) ∧ ¬Have(Kit)
Goals: are conjunctions of literals where variables are assumed existentially quantified.
At(x) ∧ Sells(x, Gorilla)
A planner finds a sequence of actions that makes the goal true when performed.
13
An example of partial-order planning
Here is the initial plan:
[Figure: the initial plan, with a Start step whose effects are At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA), and a Finish step whose preconditions are At(Home), Have(G), Have(R), Have(FA).]
Thin arrows denote ordering.
14
An example of partial-order planning
There are two actions available:
Go(y)
Preconditions: At(x)
Effects: At(y), ¬At(x)
Buy(y)
Preconditions: At(x), Sells(x, y)
Effects: Have(y)
15
An example of partial-order planning
[Figure: the plan extended with Go(JS), Buy(G), Go(HS) and Buy(R). Buy(G) needs At(JS) and Sells(JS,G); Buy(R) needs At(HS) and Sells(HS,R); Go(JS) and Go(HS) each have an open precondition At(x), and Start supplies At(Home).]
The At(HS) precondition is easy to achieve.
But if we introduce a causal link from Start to Go(HS) then we risk invalidating the precondition for Go(JS).
16
An example of partial-order planning
The planner could backtrack and try to achieve the At(x) precondition using the existing Go(JS) step.
[Figure: the alternative plan, in which the At(x) precondition of Go(HS) is achieved using At(JS), the effect of the existing Go(JS) step.]
This involves a threat, but one that can be fixed using promotion.
17
Using heuristics in planning
We found in looking at search problems that heuristics were a helpful thing to have.
Note that now:
There is no simple representation of a state.
Consequently it is harder to measure the distance to a goal.
Defining heuristics for planning is therefore more difficult than it was for search problems.
18
Using heuristics in planning
We can quickly suggest some possibilities. For example
h = number of unsatisfied preconditions
or
h = (number of unsatisfied preconditions) − (number satisfied by the start state)
These can lead to underestimates or overestimates:
Underestimates if actions can affect one another in undesirable ways.
Overestimates if actions achieve many preconditions.
19
Using heuristics in planning
We can go a little further by learning from Constraint Satisfaction Problems and adopting the most constrained variable heuristic:
Prefer the precondition satisfiable in the smallest number of ways.
This can be computationally demanding but two special cases are helpful:
Choose preconditions for which no action will satisfy them.
Choose preconditions that can only be satisfied in one way.
20
Planning graphs
Planning graphs can be used:
To compute more sensible heuristics.
To generate entire plans.
Also, planning graphs are easy to construct.
They apply only when it is possible to work entirely using propositional represen-
tations of plans.
Luckily, STRIPS can always be propositionalized...
21
Planning graphs
For example: the triumphant return of the gorilla-purchasing roof-climbers...
The predicate action schema
Go(y)
Preconditions: At(x)
Effects: At(y), ¬At(x)
becomes a collection of propositional actions such as
Go(JS): preconditions At(Home); effects At(JS), ¬At(Home)
Go(HS): preconditions At(Home); effects At(HS), ¬At(Home)
Go(Home): preconditions At(JS); effects At(Home), ¬At(JS)
and so on...
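A minimal Python sketch of this grounding process (my own illustration; the location names follow the running example):

from itertools import product

locations = ["Home", "JS", "HS"]

def ground_go():
    # Ground Go over every ordered pair of distinct locations.
    actions = []
    for x, y in product(locations, locations):
        if x == y:
            continue
        actions.append({
            "name": f"Go({y})",
            "pre": {f"At({x})"},
            "add": {f"At({y})"},
            "del": {f"At({x})"},
        })
    return actions

for a in ground_go():
    print(a["name"], "pre:", a["pre"], "add:", a["add"], "del:", a["del"])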
22
Planning graphs
A planning graph is constructed in levels:
Level 0 corresponds to the start state.
At each level we keep approximate track of all things that could be true at the
corresponding time.
At each level we keep approximate track of what actions could be applicable
at the corresponding time.
The approximation is due to the fact that not all conflicts between actions are tracked. So:
The graph can underestimate how long it might take for a particular proposition to appear, and therefore...
...a heuristic can be extracted.
23
Planning graphs: a simple example
Our intrepid student adventurers will of course need to inflate their gorilla before attaching it to a distinguished roof. It has to be purchased before it can be inflated.
Start state: Empty.
We assume that anything not mentioned in a state is false. So the state is actually
¬Have(Gorilla) ∧ ¬Inflated(Gorilla)
Actions:
Buy(Gorilla): preconditions ¬Have(Gorilla); effects Have(Gorilla)
Inflate(Gorilla): preconditions Have(Gorilla) ∧ ¬Inflated(Gorilla); effects Inflated(Gorilla)
Goal: Have(Gorilla) ∧ Inflated(Gorilla).
24
Planning graphs
[Figure: the planning graph. S0 describes the start state: ¬H(G), ¬I(G). A0 contains all actions available in the start state. S1 contains all possibilities for what might be the case at time 1: H(G), ¬H(G), ¬I(G). A1 contains all actions that might be available at time 1: Buy(G) and Inf(G). S2 contains all possibilities for what might be the case at time 2. A small square denotes a persistence action: what happens if no action is taken.]
An action level A_i contains all actions that could happen given the propositions in S_i.
25
Mutex links
We also record, using mutual exclusion (mutex) links, which pairs of actions could not occur together.
Mutex links 1: Effects are inconsistent.
[Figure: Buy(G), whose effect is H(G), is mutex with the persistence action for ¬H(G).]
The effect of one action negates the effect of another.
26
Mutex links
Mutex links 2: The actions interfere.
[Figure: Inf(G), whose effect is I(G), is mutex with the persistence action for ¬I(G), whose precondition ¬I(G) it negates.]
The effect of an action negates the precondition of another.
27
Mutex links
Mutex links 3: Competing for preconditions.
[Figure: Buy(G), with precondition ¬H(G), is mutex with Inf(G), with precondition H(G).]
The precondition for an action is mutually exclusive with the precondition for another. (See next slide!)
28
Mutex links
A state level S_i contains all propositions that could be true, given the possible preceding actions.
We also use mutex links to record pairs of propositions that can not be true simultaneously:
Possibility 1: the pair consists of a proposition and its negation.
[Figure: H(G) and ¬H(G) are mutex in S1.]
29
Mutex links
Possibility 2: all pairs of actions that could achieve the pair of propositions are mutex.
[Figure: H(G) and I(G) form a mutex pair because the actions achieving them, Buy(G) and Inf(G), are themselves mutex.]
The construction of a planning graph is continued until two identical levels are obtained.
30
Planning graphs
[Figure: the complete planning graph for the example, showing levels S0, A0, S1, A1 and S2 together with the mutex links.]
31
Obtaining heuristics from a planning graph
To estimate the cost of reaching a single proposition:
Any proposition not appearing in the final level has infinite cost and can never be reached.
The level cost of a proposition is the level at which it first appears, but this may be inaccurate as several actions can apply at each level and this cost does not count the number of actions. (It is however admissible.)
A serial planning graph includes mutex links between all pairs of actions except persistence actions.
Level cost in serial planning graphs can be quite a good measurement.
32
Obtaining heuristics from a planning graph
How about estimating the cost to achieve a collection of propositions?
Max-level: use the maximum level in the graph of any proposition in the set.
Admissible but can be inaccurate.
Level-sum: use the sum of the levels of the propositions. Inadmissible but
sometimes quite accurate if goals tend to be decomposable.
Set-level: use the level at which all propositions appear with none being mutex.
Can be accurate if goals tend not to be decomposable.
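A short Python sketch (my own, and deliberately ignoring mutex links) of how level costs and the max-level/level-sum heuristics can be read off a planning graph for the gorilla example:

def build_levels(start, actions, max_levels=10):
    # Return the state levels S0, S1, ...; "-P" denotes the negation of P.
    levels = [set(start)]
    while len(levels) < max_levels:
        current = levels[-1]
        nxt = set(current)                      # persistence actions
        for pre, effects in actions:
            if pre <= current:                  # action applicable at this level
                nxt |= effects
        if nxt == current:                      # two identical levels: stop
            break
        levels.append(nxt)
    return levels

def level_cost(levels, prop):
    # Level at which a proposition first appears (infinite if it never does).
    for i, level in enumerate(levels):
        if prop in level:
            return i
    return float("inf")

start = {"-H(G)", "-I(G)"}
actions = [
    ({"-H(G)"}, {"H(G)"}),                      # Buy(G)
    ({"H(G)", "-I(G)"}, {"I(G)"}),              # Inf(G)
]
levels = build_levels(start, actions)
costs = {g: level_cost(levels, g) for g in ["H(G)", "I(G)"]}
print("level costs:", costs)
print("max-level:", max(costs.values()), "level-sum:", sum(costs.values()))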
33
Other points about planning graphs
A planning graph guarantees that:
1. If a proposition appears at some level, there may be a way of achieving it.
2. If a proposition does not appear, it can not be achieved.
The first point here is a loose guarantee because only pairs of items are linked by mutex links.
Looking at larger collections can strengthen the guarantee, but in practice the gains are outweighed by the increased computation.
34
Graphplan
The GraphPlan algorithm goes beyond using the planning graph as a source of
heuristics.
Start at level 0;
while(true) {
if (all goal propositions appear in the current level
AND no pair has a mutex link) {
attempt to extract a plan;
if (a solution is obtained)
return the solution;
else if (graph indicates there is no solution)
return fail;
}
else
expand the graph to the next level;
}
We extract a plan directly from the planning graph. Termination can be proved
but will not be covered here.
35
Graphplan in action
Here, at levels S0 and S1 we do not have both H(G) and I(G) available with no mutex links, and so we expand first to S1 and then to S2.
[Figure: the planning graph expanded to level S2, which contains both H(G) and I(G) with no mutex link between them.]
At S2 we try to extract a solution (plan).
36
Extracting a plan from the graph
Extraction of a plan can be formalised as a search problem.
States contain a level, and a collection of unsatisfied goal propositions.
Start state: the current final level of the graph, along with the relevant goal propositions.
Goal: a state at level S0 containing the initial propositions.
37
Extracting a plan from the graph
Actions: For a state S with level S_i, a valid action is to select any set X of actions in A_{i−1} such that:
1. no pair has a mutex link;
2. no pair of their preconditions has a mutex link;
3. the effects of the actions in X achieve the propositions in S.
The effect of such an action is a state having level S_{i−1}, and containing the preconditions for the actions in X.
Each action has a cost of 1.
38
Graphplan in action
[Figure: plan extraction viewed as search. The start state is the goal set {H(G), I(G)} at level S2; the action set {Inf(G), together with the persistence action for H(G)} leads to a state at S1, and the action set {Buy(G)} then leads back to S0.]
39
Heuristics for plan extraction
We can of course also apply heuristics to this part of the process.
For example, when dealing with a set of propositions:
Choose the proposition having maximum level cost first.
For that proposition, attempt to achieve it using the action for which the maximum/sum level cost of its preconditions is minimum.
40
Planning III: planning using propositional logic
Last year we saw that plans might be extracted from a knowledge base via theorem proving, using first order logic (FOL) and situation calculus.
BUT: this might be computationally infeasible for realistic problems.
Sophisticated techniques are available for testing satisfiability in propositional logic, and these have also been applied to planning.
The basic idea is to attempt to find a model of a sentence having the form
(description of start state) ∧ (descriptions of the possible actions) ∧ (description of goal)
41
Propositional logic for planning
We attempt to construct this sentence such that:
If M is a model of the sentence then M assigns true to a proposition if and only if it is in the plan.
Any assignment denoting an incorrect plan will not be a model, as the goal description will not be true.
The sentence is unsatisfiable if no plan exists.
42
Propositional logic for planning
Start state:
S = At^0(a, spire) ∧ At^0(b, ground) ∧ ¬At^0(a, ground) ∧ ¬At^0(b, spire)
[Figure: climber a is on the spire and climber b is on the ground.]
The two climbers want to swap places...
Remember that an expression such as At^0(a, spire) is a proposition. The superscripted number now denotes time.
43
Propositional logic for planning
Goal:
G = At^i(a, ground) ∧ At^i(b, spire) ∧ ¬At^i(a, spire) ∧ ¬At^i(b, ground)
Actions: can be introduced using the equivalent of successor-state axioms
At^1(a, ground) ⇔ (At^0(a, ground) ∧ ¬Move^0(a, ground, spire)) ∨ (At^0(a, spire) ∧ Move^0(a, spire, ground))   (1)
Denote by A the collection of all such axioms.
44
Propositional logic for planning
We will now find that S ∧ A ∧ G has a model in which Move^0(a, spire, ground) and Move^0(b, ground, spire) are true while all remaining actions are false.
In more realistic planning problems we will clearly not know in advance at what time the goal might be expected to be achieved.
We therefore:
Loop through possible final times T.
Generate a goal for time T and actions up to time T.
Try to find a model and extract a plan.
Until a plan is obtained or we hit some maximum time.
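To make the idea concrete, here is a small self-contained Python sketch (my own; it enumerates assignments rather than calling a real SAT solver) that finds a one-step plan for the two climbers by model finding:

from itertools import product

locations = ["ground", "spire"]
climbers = ["a", "b"]
moves = [(c, x, y) for c in climbers for x in locations for y in locations if x != y]

start = {("a", "spire"): True, ("b", "ground"): True,
         ("a", "ground"): False, ("b", "spire"): False}
goal = {("a", "ground"): True, ("b", "spire"): True}

def models():
    for values in product([False, True], repeat=len(moves)):
        move = dict(zip(moves, values))
        # Precondition axioms: Move^0(c, x, y) => At^0(c, x).
        if any(v and not start[(c, x)] for (c, x, y), v in move.items()):
            continue
        # Action-exclusion axioms: a climber cannot make two moves at once.
        if any(sum(v for (c, _, _), v in move.items() if c == climber) > 1
               for climber in climbers):
            continue
        # Successor-state axioms give At^1.
        at1 = {}
        for c in climbers:
            for loc in locations:
                other = [l for l in locations if l != loc][0]
                at1[(c, loc)] = ((start[(c, loc)] and not move[(c, loc, other)])
                                 or (start[(c, other)] and move[(c, other, loc)]))
        # Goal description.
        if all(at1[k] == v for k, v in goal.items()):
            yield {m: v for m, v in move.items() if v}

for plan in models():
    print("Plan found:", plan)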
45
Propositional logic for planning
Unfortunately there is a problem: we may, if considerable care is not applied, also be able to obtain less sensible plans.
In the current example
Move^0(b, ground, spire) = true
Move^0(a, spire, ground) = true
Move^0(a, ground, spire) = true
is a model, because the successor-state axiom (1) does not in fact preclude the application of Move^0(a, ground, spire).
We need a precondition axiom
Move^i(a, ground, spire) ⇒ At^i(a, ground)
and so on.
46
Propositional logic for planning
Life becomes more complicated still if a third location is added: hospital.
Move^0(a, spire, ground) ∧ Move^0(a, spire, hospital)
is perfectly valid and so we need to specify that he can't move to two places simultaneously
¬(Move^i(a, spire, ground) ∧ Move^i(a, spire, hospital))
¬(Move^i(a, ground, spire) ∧ Move^i(a, ground, hospital))
...
and so on.
These are action-exclusion axioms.
Unfortunately they will tend to produce totally-ordered rather than partially-ordered plans.
47
Propositional logic for planning
Alternatively:
1. Prevent actions occurring together if one negates the effect or precondition of the other.
2. Or, specify that something can't be in two places simultaneously
∀x, i, l1, l2  (l1 ≠ l2) ⇒ ¬(At^i(x, l1) ∧ At^i(x, l2))
This is an example of a state constraint.
Clearly this process can become very complex, but there are techniques to help deal with this.
48
Planning IV: planning using constraint satisfaction
49
Review of constraint satisfaction problems (CSPs)
We have:
A set of n variables V
1
, V
2
, . . . , V
n
.
For each V
i
a domain D
i
specifying the values that V
i
can take.
A set of m constraints C
1
, C
2
, . . . , C
m
.
Each constraint C
i
involves a set of variables and species an allowable collection
of values.
A state is an assignment of specic values to some or all of the variables.
An assignment is consistent if it violates no constraints.
An assignment is complete if it gives a value to every variable.
A solution is a consistent and complete assignment.
50
Example
We will use the problem of colouring the nodes of a graph as a running example.
[Figure: a graph with eight numbered nodes, shown uncoloured and then coloured.]
Each node corresponds to a variable. We have three colours and directly connected nodes should have different colours.
Caution required: later on, edges will have a different meaning.
51
Example
This translates easily to a CSP formulation:
The variables are the nodes
V_i = node i
The domain for each variable contains the values black, red and cyan
D_i = {B, R, C}
The constraints enforce the idea that directly connected nodes must have different colours. For example, for variables V_1 and V_2 the constraints specify
(B, R), (B, C), (R, B), (R, C), (C, B), (C, R)
Variable V_8 is unconstrained.
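As a sketch of how such a CSP can be solved by simple backtracking search (my own code; the edge list here is invented, since it is the figure that defines the real graph):

colours = ["B", "R", "C"]
nodes = range(1, 9)
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (2, 7)]

def consistent(assignment, node, colour):
    # A value is consistent if no neighbour already has the same colour.
    return all(assignment.get(m) != colour
               for (a, b) in edges
               for m in ((b,) if a == node else (a,) if b == node else ()))

def backtrack(assignment):
    if len(assignment) == len(nodes):
        return assignment
    node = next(n for n in nodes if n not in assignment)
    for colour in colours:
        if consistent(assignment, node, colour):
            result = backtrack({**assignment, node: colour})
            if result is not None:
                return result
    return None

print(backtrack({}))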
52
Different kinds of CSP
This is an example of the simplest kind of CSP: it is discrete with nite domains.
We will concentrate on these.
We will also concentrate on binary constraints; that is, constraints between pairs of variables.
Constraints on single variables, called unary constraints, can be handled by adjusting the variable's domain. For example, if we don't want V_i to be red, then we just remove that possibility from D_i.
Higher-order constraints applying to three or more variables can certainly be considered, but...
...when dealing with finite domains they can always be converted to sets of binary constraints by introducing extra auxiliary variables.
How does that work?
53
The state-variable representation
Another planning language: the state-variable representation.
Things of interest such as people, places, objects etc are divided into domains:
D_1 = {climber1, climber2}
D_2 = {home, jokeShop, hardwareStore, pavement, spire, hospital}
D_3 = {rope, inflatableGorilla}
Part of the specification of a planning problem involves stating which domain a particular item is in. For example
D_1(climber1)
and so on.
Relations and functions have arguments chosen from unions of these domains.
above(x, y) ⊆ D_1^above × D_2^above
is a relation. The D_i^above are unions of one or more D_i.
54
The state-variable representation
The relation above is in fact a rigid relation (RR), as it is unchanging: it does not depend upon state. (Remember fluents in situation calculus?)
Similarly, we have functions
at(x_1, s) : D_1^at × S → D^at.
Here, at(x, s) is a state-variable. The domain D_1^at and range D^at are unions of one or more D_i. In general these can have multiple parameters
sv(x_1, ..., x_n, s) : D_1^sv × ··· × D_n^sv × S → D^sv.
A state-variable denotes assertions such as
at(gorilla, s) = jokeShop
where s denotes a state and the set S of all states will be defined later.
The state variable allows things such as locations to change; again, much like fluents in the situation calculus.
Variables appearing in relations and functions are considered to be typed.
55
The state-variable representation
Note:
For properties such as a location a function might be considerably more suit-
able than a relation.
For locations, everything has to be somewhere and it can only be in one place
at a time.
So a function is perfect and immediately solves some of the problems seen earlier.
56
The state-variable representation
Actions, as usual, have a name, a set of preconditions and a set of effects.
Names are unique, and followed by a list of variables involved in the action.
Preconditions are expressions involving state variables and relations.
Effects are assignments to state variables.
For example:
buy(x, y, l)
Preconditions: at(x, s) = l, sells(l, y), has(y, s) = l
Effects: has(y, s) = x
57
The state-variable representation
Goals are sets of expressions involving state variables.
For example:
Goal:
at(climber, s) = home
has(rope, s) = climber
at(gorilla, s) = spire
From now on we will generally suppress the state s when writing state variables.
58
The state-variable representation
We can essentially regard a state as just a statement of what values the state variables take at a given time.
Formally:
For each state variable sv we can consider all ground instances such as sv(climber, rope) with arguments that are consistent with the rigid relations.
Define X to be the set of all such ground instances.
A state s is then just a set
s = {(v = c) | v ∈ X}
where c is in the range of v.
This allows us to define the effect of an action.
A planning problem also needs a start state s_0, which can be defined in this way.
59
The state-variable representation
Considering all the ground actions consistent with the rigid relations:
An action a is applicable in s if all expressions v = c appearing in the set of preconditions of a also appear in s.
Finally, there is a function γ that maps a state and an action to a new state
γ(s, a) = s'
Specifically, we have
γ(s, a) = {(v = c) | v ∈ X}
where either c is specified in an effect of a, or otherwise v = c is a member of s.
Note: the definition of γ implicitly solves the frame problem.
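A tiny Python sketch of these definitions (my own; the state variables and ground actions are simplified from the running example):

s0 = {
    ("at", "climber1"): "home",
    ("has", "inflatableGorilla"): "jokeShop",
}

buy = {
    "name": "buy(climber1, inflatableGorilla, jokeShop)",
    "pre": {("at", "climber1"): "jokeShop",
            ("has", "inflatableGorilla"): "jokeShop"},
    "eff": {("has", "inflatableGorilla"): "climber1"},
}
go = {
    "name": "go(climber1, home, jokeShop)",
    "pre": {("at", "climber1"): "home"},
    "eff": {("at", "climber1"): "jokeShop"},
}

def applicable(s, a):
    # All preconditions v = c must appear in the state.
    return all(s.get(v) == c for v, c in a["pre"].items())

def gamma(s, a):
    # Effects overwrite the affected state variables; everything else persists,
    # which implicitly solves the frame problem.
    return {**s, **a["eff"]}

s1 = gamma(s0, go)
print(applicable(s0, buy), applicable(s1, buy))   # False True
print(gamma(s1, buy))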
60
The state-variable representation
A solution to a planning problem is a sequence (a_0, a_1, ..., a_n) of actions such that...
a_0 is applicable in s_0 and for each i, a_i is applicable in s_i = γ(s_{i−1}, a_{i−1}).
For each goal g we have
g ∈ γ(s_n, a_n).
What we need now is a method for transforming a problem described in this language into a CSP.
We'll once again do this for a fixed upper limit T on the number of steps in the plan.
61
Converting to a CSP
Step 1: encode actions as CSP variables.
For each time step t where 0 ≤ t ≤ T − 1, the CSP has a variable
action^t
with domain
D_{action^t} = {a | a is a ground instance of an action} ∪ {none}
Example: at some point in searching for a plan we might attempt to find the solution to the corresponding CSP involving
action^5 = attach(inflatableGorilla, spire)
WARNING: be careful in what follows to distinguish between state variables, actions etc in the planning problem and variables in the CSP.
62
Converting to a CSP
Step 2: encode ground state variables as CSP variables, with a complete copy of all the state variables for each time step.
So, for each t where 0 ≤ t ≤ T we have a CSP variable
sv_i^t(c_1, ..., c_n)
with domain D^{sv_i}. (That is, the domain of the CSP variable is the range of the state variable.)
Example: at some point in searching for a plan we might attempt to find the solution to the corresponding CSP involving
location^9(climber1) = hospital.
63
Converting to a CSP
Step 3: encode the preconditions for actions in the planning problem as constraints in the CSP problem.
For each time step t and for each ground action a(c_1, ..., c_n) with arguments consistent with the rigid relations in its preconditions:
For a precondition of the form sv_i = v include constraint pairs
(action^t = a(c_1, ..., c_n), sv_i^t = v)
Example: consider the action buy(x, y, l) introduced above, having the preconditions at(x) = l, sells(l, y) and has(y) = l.
Assume sells(l, y) is only true for
l = jokeShop
and
y = inflatableGorilla
(it's a very strange town) so we only consider these values for l and y. Then for each time step t we have the constraints...
64
Converting to a CSP
action^t = buy(climber1, inflatableGorilla, jokeShop)
paired with
at^t(climber1) = jokeShop
action^t = buy(climber1, inflatableGorilla, jokeShop)
paired with
has^t(inflatableGorilla) = jokeShop
action^t = buy(climber2, inflatableGorilla, jokeShop)
paired with
at^t(climber2) = jokeShop
action^t = buy(climber2, inflatableGorilla, jokeShop)
paired with
has^t(inflatableGorilla) = jokeShop
and so on...
65
Converting to a CSP
Step 4: encode the effects of actions in the planning problem as constraints in the CSP problem.
For each time step t and for each ground action a(c_1, ..., c_n) with arguments consistent with the rigid relations in its preconditions:
For an effect of the form sv_i = v include constraint pairs
(action^t = a(c_1, ..., c_n), sv_i^{t+1} = v)
Example: continuing with the previous example, we will include constraints
action^t = buy(climber1, inflatableGorilla, jokeShop)
paired with
has^{t+1}(inflatableGorilla) = climber1
action^t = buy(climber2, inflatableGorilla, jokeShop)
paired with
has^{t+1}(inflatableGorilla) = climber2
and so on...
66
Converting to a CSP
Step 5: encode the frame axioms as constraints in the CSP problem.
An action must not change things not appearing in its effects. So:
For:
1. Each time step t.
2. Each ground action a(c_1, ..., c_n) with arguments consistent with the rigid relations in its preconditions.
3. Each sv_i that does not appear in the effects of a, and each v ∈ D^{sv_i}
include in the CSP the ternary constraint
(action^t = a(c_1, ..., c_n), sv_i^t = v, sv_i^{t+1} = v)
67
Finding a plan
Finally, having encoded a planning problem into a CSP, we solve the CSP.
The scheme has the following property:
A solution to the planning problem with at most T steps exists if and only if there is a solution to the corresponding CSP.
Assume the CSP has a solution. Then we can extract a plan simply by looking at the values assigned to the action^t variables in the solution of the CSP.
It is also the case that:
There is a solution to the planning problem with at most T steps if and only if there is a solution to the corresponding CSP from which a plan can be extracted in this way.
For a proof see:
Automated Planning: Theory and Practice
Malik Ghallab, Dana Nau and Paolo Traverso. Morgan Kaufmann 2004.
68
Uncertainty I: Probability as Degree of Belief
We now examine:
How probability theory might be used to represent and reason with knowledge
when we are uncertain about the world.
How inference in the presence of uncertainty can in principle be performed
using only basic results along with the full joint probability distribution.
How this approach fails in practice.
How the notions of independence and conditional independence may be used
to solve this problem.
Reading: Russell and Norvig, chapter 13.
69
Uncertainty in AI
The (predominantly logic-based) methods covered so far have assorted shortcomings:
Limited epistemological commitment: true/false/unknown.
Actions are possible when sufficient knowledge is available...
...but this is not generally the case.
In practice there is a need to cope with uncertainty.
For example in the Wumpus World:
We can not make observations further afield than the current locality.
Consequently inferences regarding pit/wumpus location etc will not usually be possible.
70
Uncertainty in AI
A couple of more subtle problems have also presented themselves:
The Qualification Problem: it is not generally possible to guarantee that an action will succeed, only that it will succeed if many other preconditions do/don't hold.
Rational action depends on the likelihood of achieving different goals, and their relative desirability.
71
Logic (as seen so far) has major shortcomings
An example:
∀x symptom(x, toothache) ⇒ problem(x, cavity)
This is plainly incorrect. Toothaches can be caused by things other than cavities.
∀x symptom(x, toothache) ⇒ problem(x, cavity)
∨ problem(x, abscess)
∨ problem(x, gum-disease)
BUT:
It is impossible to complete the list.
There's no clear way to take account of the relative likelihoods of different causes.
72
Logic (as seen so far) has major shortcomings
If we try to make a causal rule
∀x problem(x, abscess) ⇒ symptom(x, toothache)
it's still wrong: abscesses do not always cause pain.
We need further information in addition to
problem(x, abscess)
and it's still not possible to do this correctly.
73
Logic (as seen so far) has major shortcomings
FOL can fail for essentially three reasons:
1. Laziness: it is not feasible to assemble a set of rules that is sufficiently exhaustive. If we could, it would not be feasible to apply them.
2. Theoretical ignorance: insufficient knowledge exists to allow us to write the rules.
3. Practical ignorance: even if the rules have been obtained there may be insufficient information to apply them.
Instead of thinking in terms of the truth or falsity of a statement we want to deal with an agent's degree of belief in the statement.
Probability theory is the perfect tool for application here.
Probability theory allows us to summarise the uncertainty due to laziness and ignorance.
74
An important distinction
There is a fundamental difference between probability theory and fuzzy logic:
When dealing with probability theory, statements remain in fact either true or
false.
A probability denotes an agents degree of belief one way or another.
Fuzzy logic deals with degree of truth.
In practice the use of probability theory has proved spectacularly successful.
75
Belief and evidence
An agent's beliefs will depend on what it has perceived: probabilities are based on evidence and may be altered by the acquisition of new evidence:
Prior (unconditional) probability denotes a degree of belief in the absence of evidence.
Posterior (conditional) probability denotes a degree of belief after evidence is perceived.
As we shall see, Bayes' theorem is the fundamental concept that allows us to update one to obtain the other.
76
Making rational decisions under uncertainty
When using logic, we concentrated on finding an action sequence guaranteed to achieve a goal, and then executing it.
When dealing with uncertainty we need to define preferences among states of the world and take into account the probability of reaching those states.
Utility theory is used to assign preferences.
Decision theory combines probability theory and utility theory.
A rational agent should act in order to maximise expected utility.
77
Probability
We want to assign degrees of belief to propositions about the world.
We will need:
Random variables with associated domains: typically Boolean, discrete, or continuous.
All the usual concepts: events, atomic events, sets etc.
Probability distributions and densities.
Probability axioms (Kolmogorov).
Conditional probability and Bayes' theorem.
So if you've forgotten this stuff now is a good time to re-read it.
78
Probability
The standard axioms are:
Range
0 ≤ Pr(x) ≤ 1
Always true propositions
Pr(always true proposition) = 1
Always false propositions
Pr(always false proposition) = 0
Union
Pr(x ∨ y) = Pr(x) + Pr(y) − Pr(x ∧ y)
79
Origins of probabilities I
Historically speaking, probabilities have been regarded in a number of different
ways:
Frequentist: probabilities come from measurements.
Objectivist: probabilities are actual properties of the universe which frequentist measurements seek to uncover.
An excellent example: quantum phenomena.
A bad example: coin flipping, where the uncertainty is due to our uncertainty about the initial conditions of the coin.
Subjectivist: probabilities are an agent's degrees of belief.
This means the agent is allowed to make up the numbers!
80
Origins of probabilities II
The reference class problem: even frequentist probabilities are subjective.
Example: Say a doctor takes a frequentist approach to diagnosis. She examines
a large number of people to establish the prior probability of whether or not they
have heart disease.
To be accurate she tries to measure similar people. (She knows for example that
gender might be important.)
Taken to an extreme, all people are different and there is therefore no reference
class.
81
Origins of probabilities III
The principle of indifference (Laplace).
Give equal probability to all propositions that are syntactically symmetric with respect to the available evidence.
Refinements of this idea led to the attempted development by Carnap and others of inductive logic.
The aim was to obtain the correct probability of any proposition from an arbitrary set of observations.
It is currently thought that no unique inductive logic exists.
Any inductive logic depends on prior beliefs and the effect of these beliefs is overcome by evidence.
82
Prior probability
A prior probability denotes the probability (degree of belief) assigned to a propo-
sition in the absence of any other evidence.
For example
Pr(Cavity = true) = 0.05
denotes the degree of belief that a random person has a cavity before we make
any actual observation of that person.
To keep things compact, we will use
Pr(Cavity)
to denote the entire probability distribution of the random variable Cavity.
Instead of
Pr(Cavity = true) = 0.05
Pr(Cavity = false) = 0.95
write
Pr(Cavity) = (0.05, 0.95)
83
Notation
A similar convention will apply for joint distributions. For example, if Decay can take the values severe, moderate or low then
Pr(Cavity, Decay)
is a 2 by 3 table of numbers.
                 severe   moderate   low
Cavity = true     0.26      0.1      0.01
Cavity = false    0.01      0.02     0.6
Similarly
Pr(true, Decay)
denotes 3 numbers etc.
84
The full joint probability distribution
The full joint probability distribution is the joint distribution of all random vari-
ables that describe the state of the world.
This can be used to answer any query.
(But of course life's not really that simple!)
85
Conditional probability
We use the conditional probability
Pr(x|y)
to denote the probability that a proposition x holds given that all the evidence we have so far is contained in proposition y.
From basic probability theory
Pr(x|y) = Pr(x ∧ y) / Pr(y)
Conditional probability is not analogous to logical implication.
Pr(x|y) = 0.1 does not mean that if y is true then Pr(x) = 0.1.
Pr(x) is a prior probability.
The notation Pr(x|y) is for use when y is the entire evidence.
Pr(x|y ∧ z) might be very different.
86
Using the full joint distribution to perform inference
We can regard the full joint distribution as a knowledge base.
We want to use it to obtain answers to questions.
              CP               ¬CP
        HBP     ¬HBP     HBP     ¬HBP
 HD     0.09    0.05     0.07    0.01
 ¬HD    0.02    0.08     0.03    0.65
We'll use this medical diagnosis problem as a running example.
HD = Heart disease
CP = Chest pain
HBP = High blood pressure
87
Using the full joint distribution to perform inference
The process is nothing more than the application of basic results:
Sum atomic events:
Pr(HD ∨ CP) = Pr(HD ∧ CP ∧ HBP)
            + Pr(HD ∧ CP ∧ ¬HBP)
            + Pr(HD ∧ ¬CP ∧ HBP)
            + Pr(HD ∧ ¬CP ∧ ¬HBP)
            + Pr(¬HD ∧ CP ∧ HBP)
            + Pr(¬HD ∧ CP ∧ ¬HBP)
            = 0.09 + 0.05 + 0.07 + 0.01 + 0.02 + 0.08
            = 0.32
Marginalisation: if A and B are sets of variables then
Pr(A) = Σ_b Pr(A ∧ b) = Σ_b Pr(A|b) Pr(b)
88
Using the full joint distribution to perform inference
Usually we will want to compute the conditional probability of some variable(s) given some evidence.
For example
Pr(HD|HBP) = Pr(HD ∧ HBP) / Pr(HBP) = (0.09 + 0.07) / (0.09 + 0.07 + 0.02 + 0.03) ≈ 0.76
and
Pr(¬HD|HBP) = Pr(¬HD ∧ HBP) / Pr(HBP) = (0.02 + 0.03) / (0.09 + 0.07 + 0.02 + 0.03) ≈ 0.24
89
Using the full joint distribution to perform inference
The process can be simplified slightly by noting that
α = 1 / Pr(HBP)
is a constant and can be regarded as a normaliser making the relevant probabilities sum to 1.
So a short cut is to avoid computing Pr(HBP) as above. Instead:
Pr(HD|HBP) = α Pr(HD ∧ HBP) = α (0.09 + 0.07)
Pr(¬HD|HBP) = α Pr(¬HD ∧ HBP) = α (0.02 + 0.03)
and we need
Pr(HD|HBP) + Pr(¬HD|HBP) = 1
so
α = 1 / (0.09 + 0.07 + 0.02 + 0.03)
90
Using the full joint distribution to perform inference
The general inference procedure is as follows:
Pr(Q|e) = (1/Z) Pr(Q ∧ e) = (1/Z) Σ_u Pr(Q, e, u)
where
Q is the query variable.
e is the evidence.
u are the unobserved variables.
1/Z normalises the distribution.
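A compact Python illustration (my own) of this procedure, applied to the heart-disease table above:

# Full joint distribution over (HD, CP, HBP), taken from the table.
joint = {
    (True,  True,  True):  0.09, (True,  True,  False): 0.05,
    (True,  False, True):  0.07, (True,  False, False): 0.01,
    (False, True,  True):  0.02, (False, True,  False): 0.08,
    (False, False, True):  0.03, (False, False, False): 0.65,
}

def query(joint, q_index, evidence):
    # Pr(Q | e): sum out the unobserved variables, then normalise.
    unnormalised = {True: 0.0, False: 0.0}
    for outcome, p in joint.items():
        if all(outcome[i] == v for i, v in evidence.items()):
            unnormalised[outcome[q_index]] += p
    z = sum(unnormalised.values())
    return {value: p / z for value, p in unnormalised.items()}

# Variables are ordered (HD, CP, HBP); query HD given HBP = true.
print(query(joint, 0, {2: True}))   # {True: ~0.76, False: ~0.24}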
91
Using the full joint distribution to perform inference
Simple eh?
Well, no...
For n Boolean variables the table has 2^n entries.
Storage and processing time are both O(2^n).
You need to establish 2^n numbers to work with.
In reality we might well have n > 1000, and of course it's even worse if variables are non-Boolean.
How can we get around this?
92
Exploiting independence
If I toss a coin and roll a dice, the full joint distribution of outcomes requires 2 × 6 = 12 numbers to be specified.
         1      2      3      4      5      6
head   0.014  0.028  0.042  0.057  0.071  0.086
tail   0.033  0.067  0.1    0.133  0.167  0.2
Here Pr(Coin = head) = 0.3 and the dice has probability i/21 for the ith outcome.
BUT: if we assume the outcomes are independent then
Pr(Coin, Dice) = Pr(Coin) Pr(Dice)
where Pr(Coin) has two numbers and Pr(Dice) has six.
So instead of 12 numbers we only need 8.
93
Exploiting independence
Similarly, say instead of just considering HD, HBP and CP we also consider the outcome of the Oxford versus Cambridge tiddlywinks competition TC:
Pr(TC = Oxford) = 0.2
Pr(TC = Cambridge) = 0.7
Pr(TC = Draw) = 0.1
Now
Pr(HD, HBP, CP, TC) = Pr(TC|HD, HBP, CP) Pr(HD, HBP, CP)
Assuming that the patient is not an extraordinarily keen fan of tiddlywinks, their cardiac health has nothing to do with the outcome, so
Pr(TC|HD, HBP, CP) = Pr(TC)
and 2 × 2 × 2 × 3 = 24 numbers has been reduced to 3 + 8 = 11.
94
Exploiting independence
In general you need to identify such independence through knowledge of the prob-
lem.
BUT:
It generally does not work as clearly as this.
The independent subsets themselves can be big.
95
Bayes theorem
From first principles
Pr(x, y) = Pr(x|y) Pr(y)
Pr(x, y) = Pr(y|x) Pr(x)
so
Pr(x|y) = Pr(y|x) Pr(x) / Pr(y)
The most important equation in modern AI?
When evidence e is involved this can be written
Pr(Q|R, e) = Pr(R|Q, e) Pr(Q|e) / Pr(R|e)
96
Bayes theorem
Taking another simple medical diagnosis example: does a patient with a fever have malaria? A doctor might know that
Pr(fever|malaria) = 0.99
Pr(malaria) = 1/10000
Pr(fever) = 1/20
Consequently we can try to obtain Pr(malaria|fever) by direct application of Bayes' theorem
Pr(malaria|fever) = (0.99 × 0.0001) / 0.05 = 0.00198
or using the alternative technique
Pr(malaria|fever) = α Pr(fever|malaria) Pr(malaria)
if the relevant further quantity Pr(fever|¬malaria) is known.
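A quick numerical check (my own) of the two routes, deriving Pr(fever|¬malaria) from the numbers above by marginalisation:

p_fever_given_malaria = 0.99
p_malaria = 1 / 10000
p_fever = 1 / 20
# Pr(fever) = Pr(fever|malaria)Pr(malaria) + Pr(fever|not malaria)Pr(not malaria)
p_fever_given_not_malaria = (p_fever - p_fever_given_malaria * p_malaria) / (1 - p_malaria)

# Direct application of Bayes' theorem.
direct = p_fever_given_malaria * p_malaria / p_fever

# Alternative: compute both unnormalised terms and normalise.
um = p_fever_given_malaria * p_malaria
un = p_fever_given_not_malaria * (1 - p_malaria)
normalised = um / (um + un)

print(direct, normalised)   # both ~0.00198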
97
Bayes theorem
Sometimes the first possibility is easier, sometimes not.
Causal knowledge such as
Pr(fever|malaria)
might well be available when diagnostic knowledge such as
Pr(malaria|fever)
is not.
Say the incidence of malaria, modelled by Pr(Malaria), suddenly changes.
Bayes theorem tells us what to do.
The quantity
Pr(fever|malaria)
would not be affected by such a change.
Causal knowledge can be more robust.
98
Conditional independence
What happens if we have multiple pieces of evidence?
We have seen that to compute
Pr(HD|CP, HBP)
directly might well run into problems.
We could try using Bayes' theorem to obtain
Pr(HD|CP, HBP) = α Pr(CP, HBP|HD) Pr(HD)
However while Pr(HD) is probably manageable, a quantity such as Pr(CP, HBP|HD) might well still be problematic, especially in more realistic cases.
99
Conditional independence
However, although in this case we might not be able to exploit independence directly, we can say that
Pr(CP, HBP|HD) = Pr(CP|HD) Pr(HBP|HD)
which simplifies matters.
Conditional independence:
Pr(A, B|C) = Pr(A|C) Pr(B|C).
If we know that C is the case then A and B are independent.
Although CP and HBP are not independent, they do not directly influence one another in a patient known to have heart disease.
This is much nicer!
Pr(HD|CP, HBP) = α Pr(CP|HD) Pr(HBP|HD) Pr(HD)
100
Naive Bayes
Conditional independence is often assumed even when it does not hold.
Naive Bayes:
Pr(A, B_1, B_2, ..., B_n) = Pr(A) Π_{i=1}^{n} Pr(B_i|A)
Also known as Idiot's Bayes.
Despite this, it is often surprisingly effective.
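A tiny sketch (my own, with invented numbers) of using the naive Bayes factorisation for the heart-disease symptoms:

p_hd = 0.2                                  # Pr(HD)
p_given_hd     = {"CP": 0.7, "HBP": 0.8}    # Pr(B_i | HD)
p_given_not_hd = {"CP": 0.1, "HBP": 0.3}    # Pr(B_i | not HD)

def naive_bayes(symptoms):
    # Pr(HD | symptoms) under the conditional independence assumption.
    p_true = p_hd
    p_false = 1 - p_hd
    for s in symptoms:
        p_true *= p_given_hd[s]
        p_false *= p_given_not_hd[s]
    return p_true / (p_true + p_false)

print(naive_bayes(["CP", "HBP"]))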
101
Uncertainty II - Bayesian Networks
Having seen that in principle, if not in practice, the full joint distribution alone can be used to perform any inference of interest, we now examine a practical technique.
We introduce the Bayesian Network (BN) as a compact representation of the full joint distribution.
We examine the way in which a BN can be constructed.
We examine the semantics of BNs.
We look briefly at how inference can be performed.
Reading: Russell and Norvig, chapter 14.
102
Bayesian networks
Also called probabilistic/belief/causal networks or knowledge maps.
[Figure: a small Bayesian network over the RVs CP, HBP, HD and TW.]
Each node is a random variable (RV).
Each node N_i has a distribution
Pr(N_i|parents(N_i))
A Bayesian network is a directed acyclic graph.
Roughly speaking, an arrow from N to M means N directly affects M.
103
Bayesian networks
After a regrettable incident involving an inflatable gorilla, a famous College has decided to install an alarm for the detection of roof climbers.
The alarm is very good at detecting climbers.
Unfortunately, it is also sometimes triggered when one of the extremely fat geese that live in the College lands on the roof.
One porter's lodge is near the alarm, and inhabited by a chap with excellent hearing and a pathological hatred of roof climbers: he always reports an alarm. His hearing is so good that he sometimes thinks he hears an alarm, even when there isn't one.
Another porter's lodge is a good distance away and inhabited by an old chap with dodgy hearing who likes to listen to his collection of DEATH METAL with the sound turned up.
104
Bayesian networks
[Figure: the Bayesian network. Climber and Goose are parents of Alarm; Alarm is the parent of Lodge1 and Lodge2. The conditional probability tables are:]
Pr(Climber): Yes: 0.05, No: 0.95
Pr(Goose): Yes: 0.2, No: 0.8
Pr(A|C, G):
 C  G  : Pr(a|C, G)
 Y  Y  : 0.98
 Y  N  : 0.96
 N  Y  : 0.2
 N  N  : 0.08
Pr(L1|A): Pr(l1|a) = 0.99, Pr(l1|¬a) = 0.08
Pr(L2|A): Pr(l2|a) = 0.6, Pr(l2|¬a) = 0.001
105
Bayesian networks
Note that:
In the present example all RVs are discrete (in fact Boolean) and so in all cases Pr(N_i|parents(N_i)) can be represented as a table of numbers.
Climber and Goose have only prior probabilities.
All RVs here are Boolean, so a node with p parents requires 2^p numbers.
A BN with n nodes represents the full joint probability distribution for those nodes as
Pr(N_1 = n_1, N_2 = n_2, ..., N_n = n_n) = Π_{i=1}^{n} Pr(N_i = n_i|parents(N_i))   (2)
For example
Pr(¬c, ¬g, a, l1, l2) = Pr(l1|a) Pr(l2|a) Pr(a|¬c, ¬g) Pr(¬c) Pr(¬g)
= 0.99 × 0.6 × 0.08 × 0.95 × 0.8
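The same product written as a short Python sketch (my own), using the CPT values above:

p_c = 0.05
p_g = 0.2
p_a_given = {(True, True): 0.98, (True, False): 0.96,
             (False, True): 0.2, (False, False): 0.08}
p_l1_given = {True: 0.99, False: 0.08}
p_l2_given = {True: 0.6, False: 0.001}

def bern(p, value):
    # Pr(X = value) for a Boolean RV with Pr(X = true) = p.
    return p if value else 1 - p

def joint(c, g, a, l1, l2):
    # Pr(C=c, G=g, A=a, L1=l1, L2=l2) as a product over the nodes.
    return (bern(p_c, c) * bern(p_g, g) * bern(p_a_given[(c, g)], a)
            * bern(p_l1_given[a], l1) * bern(p_l2_given[a], l2))

print(joint(False, False, True, True, True))   # 0.99 * 0.6 * 0.08 * 0.95 * 0.8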
106
Semantics
In general Pr(A, B) = Pr(A|B) Pr(B), so abbreviating Pr(N_1 = n_1, N_2 = n_2, ..., N_n = n_n) to Pr(n_1, n_2, ..., n_n) we have
Pr(n_1, ..., n_n) = Pr(n_n|n_{n−1}, ..., n_1) Pr(n_{n−1}, ..., n_1)
Repeating this gives
Pr(n_1, ..., n_n) = Pr(n_n|n_{n−1}, ..., n_1) Pr(n_{n−1}|n_{n−2}, ..., n_1) ··· Pr(n_1)
= Π_{i=1}^{n} Pr(n_i|n_{i−1}, ..., n_1)   (3)
Now compare equations (2) and (3). We see that BNs make the assumption
Pr(N_i|N_{i−1}, ..., N_1) = Pr(N_i|parents(N_i))
for each node, assuming that parents(N_i) ⊆ {N_{i−1}, ..., N_1}.
Each N_i is conditionally independent of its predecessors given its parents.
107
Semantics
When constructing a BN we want to make sure the preceding property holds.
This means we need to take care over ordering.
In general causes should directly precede effects.
[Figure: a node N_i whose parents, parents(N_i), appear among its predecessors.]
Here, parents(N_i) contains all preceding nodes having a direct influence on N_i.
108
Semantics
Deviation from this rule can have major effects on the complexity of the network.
That's bad! We want to keep the network simple:
If each node has at most p parents and there are n Boolean nodes, we need to specify at most n2^p numbers...
...whereas the full joint distribution requires us to specify 2^n numbers.
So: there is a trade-off attached to the inclusion of tenuous although strictly-speaking correct edges.
109
Semantics
As a rule, we should include the most basic causes first, then the things they influence directly etc.
What happens if you get this wrong?
Example: add nodes in the order L2, L1, G, C, A.
[Figure: the network that results from this ordering, with many more edges.]
110
Semantics
In this example:
Increased connectivity.
Many of the probabilities here will be quite unnatural and hard to specify.
Once again: causal knowledge is preferred to diagnostic knowledge.
111
Semantics
As an alternative we can say directly what conditional independence assumptions a graph should be interpreted as expressing. There are two common ways of doing this.
[Figure: a node A with parents P_1, P_2 and non-descendants N_1, N_2.]
Any node A is conditionally independent of the N_i, its non-descendants, given the P_i, its parents.
112
Semantics
[Figure: a node A surrounded by its Markov blanket M_1, ..., M_8.]
Any node A is conditionally independent of all other nodes given the Markov blanket M_i, that is, its parents, its children and its children's parents.
113
More complex nodes
How do we represent
Pr(N_i|parents(N_i))
when nodes can denote general discrete and/or continuous RVs?
BNs containing both kinds of RV are called hybrid BNs.
Naive discretisation of continuous RVs tends to result in both a reduction in accuracy and large tables.
O(2^p) might still be large enough to be unwieldy.
We can instead attempt to use standard and well-understood distributions, such as the Gaussian.
This will typically require only a small number of parameters to be specified.
114
More complex nodes
Example: functional relationships are easy to deal with.
N_i = f(parents(N_i))
Pr(N_i = n_i|parents(N_i)) = 1 if n_i = f(parents(N_i)), and 0 otherwise.
115
More complex nodes
Example: a continuous RV with one continuous and one discrete parent.
Pr(Speed of car|Throttle position, Tuned engine)
where SC and TP are continuous and TE is Boolean.
For a specific setting of TE = true it might be the case that SC increases with TP, but that some uncertainty is involved
Pr(SC|TP, te) = N(g_te TP + c_te, σ²_te)
For an un-tuned engine we might have a similar relationship with a different behaviour
Pr(SC|TP, ¬te) = N(g_¬te TP + c_¬te, σ²_¬te)
There is a set of parameters {g, c, σ} for each possible value of the discrete RV.
116
More complex nodes
Example: a discrete RV with a continuous parent
Pr(Go roofclimbing|Size of fine)
We could for example use the probit distribution
Pr(Go roofclimbing = true|size) = Φ((t − size)/s)
where
Φ(x) = ∫_{−∞}^{x} N(y) dy
and N(x) is the Gaussian distribution with zero mean and variance 1.
117
More complex nodes
[Figure: two plots. Left: the probit distribution Φ(x) as a function of x. Right: Pr(GRC = true|size) = Φ((t − size)/s) with t = 100 and different values of s.]
118
More complex nodes
Alternatively, for this example we could use the logit distribution
Pr(Go roofclimbing = true|size) = 1 / (1 + e^{−2(t−size)/s})
which has a similar shape.
Tails are longer for the logit distribution.
The logit distribution tends to be easier to use...
...but the probit distribution is often more accurate.
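A quick numerical comparison of the two curves (my own code; t = 100 and s = 5 are chosen arbitrarily):

import math

t, s = 100.0, 5.0

def probit(size):
    # Phi((t - size) / s) computed via the error function.
    x = (t - size) / s
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def logit(size):
    return 1 / (1 + math.exp(-2 * (t - size) / s))

for size in [90, 95, 100, 105, 110]:
    print(size, round(probit(size), 3), round(logit(size), 3))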
119
Basic inference
We saw earlier that the full joint distribution can be used to perform all inference tasks:
Pr(Q|e) = (1/Z) Pr(Q ∧ e) = (1/Z) Σ_u Pr(Q, e, u)
where
Q is the query variable
e is the evidence
u are the unobserved variables
1/Z normalises the distribution.
120
Basic inference
As the BN fully describes the full joint distribution
Pr(Q, u, e) = Π_{i=1}^{n} Pr(N_i|parents(N_i))
it can be used to perform inference in the obvious way
Pr(Q|e) = (1/Z) Σ_u Π_{i=1}^{n} Pr(N_i|parents(N_i))
but as we'll see this is in practice problematic.
More sophisticated algorithms aim to achieve this more efficiently.
For complex BNs we resort to approximation techniques.
121
Other approaches to uncertainty: Default reasoning
One criticism made of probability is that it is numerical whereas human argument
seems fundamentally different in nature:
On the one hand this seems quite defensible. I certainly am not aware of doing
logical thought through direct manipulation of probabilities, but. . .
. . . on the other hand, neither am I aware of solving differential equations in
order to walk!
Default reasoning:
Does not maintain degrees of belief .
Allows something to be believed until a reason is found not to.
122
Other approaches to uncertainty: rule-based systems
Rule-based systems have some desirable properties:
Locality: if we establish the evidence X and we have a rule X → Y then Y can be concluded regardless of any other rules.
Detachment: once any Y has been established it can then be assumed. (Its justification is irrelevant.)
Truth-functionality: truth of a complex formula is a function of the truth of its components.
These are not in general shared by probabilistic systems. What happens if:
We try to attach measures of belief to rules and propositions.
We try to make a truth-functional system by, for example, making belief in X → Y a function of beliefs in X and Y?
123
Other approaches to uncertainty: rule-based systems
Problems that can arise:
1. Say I have the causal rule
Heart disease →(0.95) Chest pain
and the diagnostic rule
Chest pain →(0.7) Heart disease
Without taking very great care to keep track of the reasoning process, these can form a loop.
2. If in addition I have
Chest pain →(0.6) Recent physical exertion
then it is quite possible to form the conclusion that with some degree of certainty heart disease is explained by exertion, which may well be incorrect.
124
Other approaches to uncertainty: rule-based systems
In addition, we might argue that because heart disease is an explanation for chest
pain the belief in physical exertion should decrease.
In general when such systems have been successful it has been through very care-
ful control in setting up the rules.
125
Other approaches to uncertainty: Dempster-Shafer theory
Dempster-Shafer theory attempts to distinguish between uncertainty and ignorance.
Whereas the probabilistic approach looks at the probability of X, we instead look at the probability that the available evidence supports X.
This is denoted by the belief function Bel(X).
Example: given a coin but no information as to whether it is fair I have no reason to think one outcome should be preferred to another, so
Bel(outcome = head) = Bel(outcome = tail) = 0
These beliefs can be updated when new evidence is available. If an expert tells us there is n percent certainty that it's a fair coin then
Bel(outcome = head) = Bel(outcome = tail) = (n/100) × (1/2).
We may still have a gap in that
Bel(outcome = head) + Bel(outcome = tail) ≠ 1.
Dempster-Shafer theory provides a coherent system for dealing with belief functions.
126
Other approaches to uncertainty: Dempster-Shafer theory
Problems:
The Bayesian approach deals more effectively with the quantification of how belief changes when new evidence is available.
The Bayesian approach has a better connection to the concept of utility, whereas the latter is not well-understood for use in conjunction with Dempster-Shafer theory.
127
Uncertainty III: exact inference in Bayesian networks
We now examine:
The basic equation for inference in Bayesian networks, the latter being hard to achieve if approached in the obvious way.
The way in which matters can be improved a little by a small modification to the way in which the calculation is done.
The way in which much better improvements might be possible using a still more informed approach, although not in all cases.
Reading: Russell and Norvig, chapter 14, section 14.4.
128
Performing exact inference
We know that in principle any query Q can be answered by the calculation
Pr(Q|e) = (1/Z) Σ_u Pr(Q, e, u)
where Q denotes the query, e denotes the evidence, u denotes unobserved variables and 1/Z normalises the distribution.
The naive implementation of this approach yields the Enumerate-Joint-Ask algorithm, which unfortunately requires O(2^n) time and space for n Boolean random variables (RVs).
129
Performing exact inference
In what follows we will make use of some abbreviations.
C denotes Climber
G denotes Goose
A denotes Alarm
L1 denotes Lodge1
L2 denotes Lodge2
Instead of writing out Pr(C = true), Pr(C = false) etc we will write Pr(c), Pr(¬c) and so on.
130
Performing exact inference
Also Pr(Q, e, u) has a particular form expressing conditional independences:
[Figure: the alarm network and its conditional probability tables, repeated from earlier.]
Pr(C, G, A, L1, L2) = Pr(C) Pr(G) Pr(A|C, G) Pr(L1|A) Pr(L2|A)
131
Performing exact inference
Consider the computation of the query Pr(C|l1, l2).
We have
Pr(C|l1, l2) = (1/Z) Σ_G Σ_A Pr(C) Pr(G) Pr(A|C, G) Pr(l1|A) Pr(l2|A)
Here there are 5 multiplications for each set of values that appears for summation, and there are 4 such values.
In general this gives time complexity O(n 2^n) for n Boolean RVs.
Looking more closely we see that
Pr(C|l1, l2) = (1/Z) Σ_G Σ_A Pr(C) Pr(G) Pr(A|C, G) Pr(l1|A) Pr(l2|A)
= (1/Z) Pr(C) Σ_A Pr(l1|A) Pr(l2|A) Σ_G Pr(G) Pr(A|C, G)
= (1/Z) Pr(C) Σ_G Pr(G) Σ_A Pr(A|C, G) Pr(l1|A) Pr(l2|A)   (4)
So for example...
132
Performing exact inference
Pr(c|l1, l2) = (1/Z) Pr(c) [ Pr(g) [ Pr(a|c, g) Pr(l1|a) Pr(l2|a) + Pr(¬a|c, g) Pr(l1|¬a) Pr(l2|¬a) ]
+ Pr(¬g) [ Pr(a|c, ¬g) Pr(l1|a) Pr(l2|a) + Pr(¬a|c, ¬g) Pr(l1|¬a) Pr(l2|¬a) ] ]
with a similar calculation for Pr(¬c|l1, l2).
Basically straightforward, BUT optimisations can be made.
133
Performing exact inference
[Figure: the same calculation drawn as a tree. Pr(c) branches on Pr(g) and Pr(¬g); each of these branches on Pr(a|c, ·) and Pr(¬a|c, ·); the leaf products Pr(l1|a) Pr(l2|a) and Pr(l1|¬a) Pr(l2|¬a) are repeated.]
134
Optimisation 1: Enumeration-Ask
The enumeration-ask algorithm improves matters to O(2^n) time and O(n) space by performing the computation depth-first.
However matters can be improved further by avoiding the duplication of computations that clearly appears in the example tree.
135
Optimisation 2: variable elimination
Looking again at the fundamental equation (4)
(1/Z) Pr(C) Σ_G Pr(G) Σ_A Pr(A|C, G) Pr(l1|A) Pr(l2|A)
each term gives rise to a factor: C, G, A, L1 and L2 respectively.
The basic idea is to evaluate (4) from right to left (or in terms of the tree, bottom up), storing results as we progress and re-using them when necessary.
Pr(l1|A) depends on the value of A. We store it as a table F_L1(A). Similarly for Pr(l2|A).
F_L1(A) = (0.99, 0.08)    F_L2(A) = (0.6, 0.001)
as Pr(l1|a) = 0.99, Pr(l1|¬a) = 0.08 and so on.
136
Optimisation 2: variable elimination
Similarly for Pr(A|C, G), which is dependent on A, C and G:
F_A(A, C, G) =
 A  C  G  : F_A(A, C, G)
 a  c  g  : 0.98
 a  c  ¬g : 0.96
 a  ¬c g  : 0.2
 a  ¬c ¬g : 0.08
 ¬a c  g  : 0.02
 ¬a c  ¬g : 0.04
 ¬a ¬c g  : 0.8
 ¬a ¬c ¬g : 0.92
Can we write
Pr(A|C, G) Pr(l1|A) Pr(l2|A)   (5)
as
F_A(A, C, G) F_L1(A) F_L2(A)   (6)
in a reasonable way?
137
Optimisation 2: variable elimination
The answer is yes, provided multiplication of factors is defined correctly. Looking at (4)
(1/Z) Pr(C) Σ_G Pr(G) Σ_A Pr(A|C, G) Pr(l1|A) Pr(l2|A)
note that the values of the product (5) in the summation depend on the values of C and G external to it, and the values of A themselves. So (6) should be a table collecting values for (5) where correspondences between RVs are maintained.
This leads to a definition for multiplication of factors best given by example.
138
Optimisation 2: variable elimination
F(A, B) F(B, C) = F(A, B, C)
where
 A  B  : F(A, B)      B  C  : F(B, C)
 a  b  : 0.3          b  c  : 0.1
 a  ¬b : 0.9          b  ¬c : 0.8
 ¬a b  : 0.4          ¬b c  : 0.8
 ¬a ¬b : 0.1          ¬b ¬c : 0.3

 A  B  C  : F(A, B, C)
 a  b  c  : 0.3 × 0.1
 a  b  ¬c : 0.3 × 0.8
 a  ¬b c  : 0.9 × 0.8
 a  ¬b ¬c : 0.9 × 0.3
 ¬a b  c  : 0.4 × 0.1
 ¬a b  ¬c : 0.4 × 0.8
 ¬a ¬b c  : 0.1 × 0.8
 ¬a ¬b ¬c : 0.1 × 0.3
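A compact Python sketch (my own) of the two operations this example illustrates, factor multiplication and summing out:

from itertools import product

def multiply(f, vars_f, g, vars_g):
    # Pointwise product; rows agree on the values of shared variables.
    out_vars = vars_f + [v for v in vars_g if v not in vars_f]
    out = {}
    for row in product([True, False], repeat=len(out_vars)):
        assign = dict(zip(out_vars, row))
        out[row] = (f[tuple(assign[v] for v in vars_f)]
                    * g[tuple(assign[v] for v in vars_g)])
    return out, out_vars

def sum_out(f, vars_f, var):
    # Eliminate one variable by summing over its values.
    keep = [v for v in vars_f if v != var]
    out = {}
    for row, value in f.items():
        assign = dict(zip(vars_f, row))
        key = tuple(assign[v] for v in keep)
        out[key] = out.get(key, 0.0) + value
    return out, keep

f_ab = {(True, True): 0.3, (True, False): 0.9, (False, True): 0.4, (False, False): 0.1}
f_bc = {(True, True): 0.1, (True, False): 0.8, (False, True): 0.8, (False, False): 0.3}

f_abc, vars_abc = multiply(f_ab, ["A", "B"], f_bc, ["B", "C"])
f_ac, vars_ac = sum_out(f_abc, vars_abc, "B")
print(vars_ac, f_ac)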
139
Optimisation 2: variable elimination
This process gives us
F_A(A, C, G) F_L1(A) F_L2(A) =
 A  C  G  :
 a  c  g  : 0.98 × 0.99 × 0.6
 a  c  ¬g : 0.96 × 0.99 × 0.6
 a  ¬c g  : 0.2 × 0.99 × 0.6
 a  ¬c ¬g : 0.08 × 0.99 × 0.6
 ¬a c  g  : 0.02 × 0.08 × 0.001
 ¬a c  ¬g : 0.04 × 0.08 × 0.001
 ¬a ¬c g  : 0.8 × 0.08 × 0.001
 ¬a ¬c ¬g : 0.92 × 0.08 × 0.001
140
Optimisation 2: variable elimination
How about
F
A,L1,L2
(C, G) =
A
F
A
(A, C, G)F
L1
(A)F
L2
(A)
To denote the fact that A has been summed out we place a bar over it in the
notation.
Σ_A F_A(A, C, G) F_L1(A) F_L2(A) = F_A(a, C, G) F_L1(a) F_L2(a) + F_A(¬a, C, G) F_L1(¬a) F_L2(¬a)
where
F_A(a, C, G) is the table over (C, G) with entries 0.98, 0.96, 0.2, 0.08 (for (c, g), (c, ¬g), (¬c, g), (¬c, ¬g)), and
F_L1(a) = 0.99    F_L2(a) = 0.6
and similarly for F_A(¬a, C, G), F_L1(¬a) and F_L2(¬a).
141
Optimisation 2: variable elimination
F_A(a, C, G) F_L1(a) F_L2(a):

C   G
c   g    0.98 × 0.99 × 0.6
c   ¬g   0.96 × 0.99 × 0.6
¬c  g    0.2 × 0.99 × 0.6
¬c  ¬g   0.08 × 0.99 × 0.6

F_A(¬a, C, G) F_L1(¬a) F_L2(¬a):

C   G
c   g    0.02 × 0.08 × 0.001
c   ¬g   0.04 × 0.08 × 0.001
¬c  g    0.8 × 0.08 × 0.001
¬c  ¬g   0.92 × 0.08 × 0.001

F_Ā,L1,L2(C, G):

C   G
c   g    (0.98 × 0.99 × 0.6) + (0.02 × 0.08 × 0.001)
c   ¬g   (0.96 × 0.99 × 0.6) + (0.04 × 0.08 × 0.001)
¬c  g    (0.2 × 0.99 × 0.6) + (0.8 × 0.08 × 0.001)
¬c  ¬g   (0.08 × 0.99 × 0.6) + (0.92 × 0.08 × 0.001)
142
Optimisation 2: variable elimination
Now, say for example we have ¬c, g. Then doing the calculation explicitly would give
Σ_A Pr(A|¬c, g) Pr(l1|A) Pr(l2|A) = Pr(a|¬c, g) Pr(l1|a) Pr(l2|a) + Pr(¬a|¬c, g) Pr(l1|¬a) Pr(l2|¬a)
= (0.2 × 0.99 × 0.6) + (0.8 × 0.08 × 0.001)
which matches!
Continuing in this manner, form
F_G,Ā,L1,L2(C, G) = F_G(G) F_Ā,L1,L2(C, G)
sum out G to obtain
F_Ḡ,Ā,L1,L2(C) = Σ_G F_G(G) F_Ā,L1,L2(C, G)
then form
F_C,Ḡ,Ā,L1,L2(C) = F_C(C) F_Ḡ,Ā,L1,L2(C)
and normalise.
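The whole right-to-left evaluation of (4) fits in a few lines of Python. This is only a sketch: the Pr(A|C, G) and sensor values are those shown in the tables above, but the priors Pr(Climber) and Pr(Goose) are made-up placeholders, since they are not used elsewhere in these slides.

Pr_C = {True: 0.05, False: 0.95}          # hypothetical prior Pr(Climber)
Pr_G = {True: 0.2,  False: 0.8}           # hypothetical prior Pr(Goose)
Pr_a_given = {(True, True): 0.98, (True, False): 0.96,
              (False, True): 0.2, (False, False): 0.08}   # Pr(a | C, G)
Pr_l1 = {True: 0.99, False: 0.08}         # Pr(l1 | A)
Pr_l2 = {True: 0.6,  False: 0.001}        # Pr(l2 | A)

def pr_a(a, c, g):
    p = Pr_a_given[(c, g)]
    return p if a else 1.0 - p

unnorm = {}
for c in (True, False):
    total = 0.0
    for g in (True, False):
        # sum out A first, then G, exactly as in equation (4)
        inner = sum(pr_a(a, c, g) * Pr_l1[a] * Pr_l2[a] for a in (True, False))
        total += Pr_G[g] * inner
    unnorm[c] = Pr_C[c] * total

Z = sum(unnorm.values())
print({c: v / Z for c, v in unnorm.items()})   # Pr(C | l1, l2)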
143
Optimisation 2: variable elimination
What's the computational complexity now?
For Bayesian networks with suitable structure we can perform inference in
linear time and space.
However in the worst case it is #P-hard, which is worse than NP-hard.
Consequently, we may need to resort to approximate inference.
144
Uncertainty IV: Simple Decision-Making
We now examine:
The concept of a utility function.
The way in which such functions can be related to reasonable axioms about
preferences.
A generalization of the Bayesian network, known as a decision network.
How to measure the value of information, and how to use such measurements
to design agents that can ask questions.
Reading: Russell and Norvig, chapter 16.
145
Simple decision-making
We now look at choosing an action by maximising expected utility.
A utility function U(s) measures the desirability of a state.
If we can express a probability distribution for the states resulting from alternative
actions, then we can act in order to maximise expected utility.
For an action a, let Result(a) = {s_1, . . . , s_n} be a set of states that might be the result of performing action a. Then the expected utility of a is
EU(a|E) = Σ_{s∈Result(a)} Pr(s|a, E) U(s)
Note that this applies to individual actions. Sequences of actions will not be
covered in this course.
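As a tiny illustration (all states, probabilities and utilities below are invented), maximising expected utility over individual actions is just a weighted sum followed by an argmax:

def expected_utility(outcomes):
    """outcomes: list of (Pr(s|a,E), U(s)) pairs for one action."""
    return sum(p * u for p, u in outcomes)

actions = {
    "take_umbrella":  [(0.3, 60), (0.7, 80)],
    "leave_umbrella": [(0.3, 0),  (0.7, 100)],
}
best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))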
146
Simple decision-making: all of AI?
Much as this looks like a complete and highly attractive method for an agent to
decide how to act, it hides a great deal of complexity:
1. It may be hard to compute U(s). You generally don't know how good a state is until you know where it might lead on to: planning etc...
2. Knowing what state you're currently in involves most of AI!
3. Dealing with Pr(s|a, E) involves Bayesian networks.
147
Utility in more detail
Overall, we now want to express preferences between different things.
Let's use the following notation:
X > Y : X is preferred to Y
X = Y : we are indifferent regarding X and Y
X ≥ Y : X is preferred, or we're indifferent
X, Y and so on are lotteries. A lottery has the form
X = [p_1, O_1 | p_2, O_2 | · · · | p_n, O_n]
where the O_i are the outcomes of the lottery and the p_i their respective probabilities.
Outcomes can be other lotteries or actual states.
148
Axioms for utility theory
Given we are dealing with preferences it seems that there are some clear properties
that such things should exhibit:
Transitivity: if X > Y and Y > Z then X > Z.
Orderability: either X > Y or Y > X or X = Y .
Continuity: if X > Y > Z then there is a probability p such that
[p, X | (1 − p), Z] = Y
Substitutability: if X = Y then
[p, X | (1 − p), L] = [p, Y | (1 − p), L]
149
Axioms for utility theory
Monotonicity: if X > Y then for probabilities p_1 and p_2, p_1 ≥ p_2 if and only if
[p_1, X | (1 − p_1), Y] ≥ [p_2, X | (1 − p_2), Y]
Decomposability:
[p_1, X | (1 − p_1), [p_2, Y | (1 − p_2), Z]] = [p_1, X | (1 − p_1)p_2, Y | (1 − p_1)(1 − p_2), Z]
If an agent's preferences conform to the utility theory axioms (and note that we are only considering preferences, not numbers) then it is possible to define a utility function U(s) for states such that:
1. U(s_1) > U(s_2) ⟺ s_1 > s_2
2. U(s_1) = U(s_2) ⟺ s_1 = s_2
3. U([p_1, s_1 | p_2, s_2 | · · · | p_n, s_n]) = Σ_{i=1}^n p_i U(s_i).
We therefore have a justification for the suggested approach.
150
Designing utility functions
There is complete freedom in how a utility function is defined, but clearly it will pay to define them carefully.
Example: the utility of money (for most people) exhibits a monotonic preference.
That is, we prefer to have more of it.
But we need to talk about preferences between lotteries.
Say you've won 100,000 pounds in a quiz and you're offered a coin flip:
For heads: you win a total of 1,000,000 pounds.
For tails: you walk away with nothing!
Would you take the offer?
151
Designing utility functions
The expected monetary value (EMV) of this lottery is
(0.5 × 1,000,000) + (0.5 × 0) = 500,000
whereas the EMV of the initial amount is 100,000.
BUT: most of us would probably refuse to take the coin flip.
The story is not quite as simple as this though: our attitude probably depends on
how much money we have to start with. If I have M pounds to start with then I am
in fact choosing between expected utility of
U(M + 100,000)
and expected utility of
(0.5 × U(M)) + (0.5 × U(M + 1,000,000))
If M is 50,000,000 my attitude is much different to if it is 10,000.
152
Designing utility functions
In fact, research shows that the utility of M pounds is for most people almost
exactly proportional to log M for M > 0. . .
[Figure: a plot of the utility U(M) of M pounds, showing the roughly logarithmic shape for M > 0.]
. . . and follows a similar shape for M < 0.
153
Decision networks
Decision networks, also known as influence diagrams...

[Figure: a decision network for siting a landfill. Chance nodes: Build cost, Legal action, Road traffic, Air quality, Cost to taxpayer, Road congestion. Decision node: Site of landfill. Utility node: Utility.]
. . . allow us to work actions and utilities into the formalism of Bayesian networks.
A decision network has three types of node. . .
154
Decision networks
A decision network has three types of node:
Chance nodes: are denoted by ovals. These are random variables (RVs) repre-
sented by a distribution conditional on their parents, as in Bayesian networks.
Parents can be other chance nodes or a decision node.
Decision nodes: are denoted by squares. They describe possible outcomes of the
decision of interest. Here we deal only with single decisions: multiple decisions
require alternative techniques.
Utility nodes: are denoted by diamonds. They describe the utility function relevant to the problem, as a function of the values of the node's parents.
155
Decision networks
Sometimes such diagrams are simplified by leaving out the RVs describing the new state and converting current state and decision directly to utility:

[Figure: the simplified network, in which the chance nodes Build cost, Legal action and Road traffic and the decision node Site of landfill connect directly to the Utility node. Air quality, cost to taxpayer and road congestion describe the future state and so never appear as evidence. This gives us fewer nodes to deal with BUT potentially less flexibility in exploring alternative descriptions of the problem.]

EU(a|E) = Σ_{s∈Result(a)} Pr(s|a, E) U(s)

This is an action-utility table. The utility no longer depends on a state but is the expected utility for a given action.
156
Evaluation of decision networks
Once a specific action is selected for a decision node it acts like a chance node for which a specific value is being used as evidence.
1. Set the current state chance nodes to their evidence values.
2. For each potential action
Fix the decision node.
Compute the probabilities for the utility node's parents.
Compute the expected utility.
3. Return the action that maximised EU(a|E).
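The three steps above can be sketched in Python. This is only an illustration of the loop structure for an action-utility style network; the sites, probabilities and utilities are invented.

def evaluate(actions, parent_dist, utility):
    """actions: the possible decisions.
    parent_dist(a): dict mapping outcome values to Pr(value | a, evidence).
    utility(a, v): utility of outcome v under action a."""
    best_action, best_eu = None, float("-inf")
    for a in actions:                                   # step 2: each action
        eu = sum(p * utility(a, v) for v, p in parent_dist(a).items())
        if eu > best_eu:
            best_action, best_eu = a, eu
    return best_action, best_eu                         # step 3: best action

# Hypothetical example: where to site the landfill.
dist = {"site1": {"high_traffic": 0.7, "low_traffic": 0.3},
        "site2": {"high_traffic": 0.2, "low_traffic": 0.8}}
U = {"high_traffic": -50, "low_traffic": 10}
print(evaluate(["site1", "site2"], lambda a: dist[a], lambda a, v: U[v]))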
157
The value of information
We have been assuming that a decision is to be made with all evidence available
beforehand. This is unlikely to be the case.
Knowing what questions one should ask is a central, and important part of making
decisions. Example:
Doctors do not diagnose by first obtaining results for all possible tests on their
patients.
They ask questions to decide what tests to do.
They are informed in formulating which tests to perform by probabilities of
test outcomes, and by the manner in which knowing an outcome might im-
prove treatment.
Tests can have associated costs.
158
The value of perfect information
Information value theory provides a formal way in which we can reason about
what further information to gather using sensing actions.
Say we have evidence E, so
EU(action|E) = max_a Σ_{s∈Result(a)} Pr(s|a, E) U(s)
denotes how valuable the best action based on E must be.
How valuable would it be to learn about a further piece of evidence?
If we examined another RV E′ and found E′ = e′ then the best action would have value
EU(action′|E, E′ = e′) = max_a Σ_{s∈Result(a)} Pr(s|a, E, E′ = e′) U(s)
BUT: because E′ is an RV whose value we do not yet know, we must average over its possible values. This gives the value of perfect information (VPI):
VPI_E(E′) = ( Σ_{e′} Pr(E′ = e′|E) EU(action′|E, E′ = e′) ) − EU(action|E)
VPI has the following properties:
VPI_E(E′) ≥ 0
It is not necessarily additive, that is, it is possible that
VPI_E(E′, E′′) ≠ VPI_E(E′) + VPI_E(E′′)
It is independent of ordering
VPI_E(E′, E′′) = VPI_E(E′) + VPI_{E,E′}(E′′) = VPI_E(E′′) + VPI_{E,E′′}(E′)
160
Agents that can gather information
In constructing an agent with the ability to ask questions, we would hope that it
would:
Use a good order in which to ask the questions.
Avoid asking irrelevant questions.
Trade off the cost of obtaining information against the value of that informa-
tion.
Choose a good time to stop asking questions.
We now have the means with which to approach such a design.
161
Agents that can gather information
Assuming we can associate a cost C(E′) with obtaining the evidence E′ = e′, the agent can choose what to ask about by maximising VPI_E(E′) − C(E′), and can stop asking once VPI_E(E′) ≤ C(E′) for every candidate E′.
Pr(S_0, S_1, . . . , S_t, E_1, . . . , E_t) = Pr(S_0) Π_{i=1}^t Pr(S_i|S_{i−1}) Pr(E_i|S_i)
This follows from basic probability theory as for example
Pr(S_0, S_1, S_2, E_1, E_2) = Pr(E_2|S_{0:2}, E_1) Pr(S_2|S_{0:1}, E_1) Pr(E_1|S_{0:1}) Pr(S_1|S_0) Pr(S_0)
                            = Pr(E_2|S_2) Pr(S_2|S_1) Pr(E_1|S_1) Pr(S_1|S_0) Pr(S_0)
169
Example: two biased coins
Here's a simple example with only two states and two observations.
I have two biased coins.
I flip one and tell you the outcome.
I then either stay with the same coin, or swap them.
This continues, producing a succession of outcomes:
[Figure: a two-state transition diagram. States coin1 and coin2 each have a self-transition with probability 0.8 and a transition to the other coin with probability 0.2; coin1 gives heads with probability 0.1 and coin2 with probability 0.9.]
170
Example: two biased coins
We'll use the following numbers:
The prior Pr(S_0 = coin1) = 0.5.
The transition model
Pr(S_t = coin1|S_{t−1} = coin1) = Pr(S_t = coin2|S_{t−1} = coin2) = 0.8
Pr(S_t = coin1|S_{t−1} = coin2) = Pr(S_t = coin2|S_{t−1} = coin1) = 0.2
The sensor model
Pr(E_t = head|S_t = coin1) = 0.1
Pr(E_t = head|S_t = coin2) = 0.9
171
Example: two biased coins
This is straightforward to simulate.
Here's an example of what happens:
[C2,C2,C1,C1,C1,C1,C1,C1,C1,C1,C1,C1,C2,C1,C1,C1,C1,C1,C1,C1,C1,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2,C2]
[Hd,Tl,Tl,Tl,Hd,Tl,Hd,Tl,Tl,Tl,Hd,Tl,Hd,Tl,Tl,Tl,Tl,Tl,Hd,Tl,Tl,Hd,Hd,Hd,Hd,Hd,Hd,Hd,Hd,Tl,Hd,Hd,Hd,Hd,Hd,Hd,Hd,Hd,Tl,Hd]
As expected, we tend to see runs of a single coin, and might expect to be able to
guess which is being used as one favours heads and the other tails.
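A simulation like the one that produced these sequences takes only a few lines of Python; this is a sketch using the transition and sensor models on the next slide (the exact simulation code used for the notes is not shown, so treat this as illustrative).

import random

def simulate(steps, seed=0):
    rng = random.Random(seed)
    state = "coin1" if rng.random() < 0.5 else "coin2"    # prior 0.5
    states, observations = [], []
    for _ in range(steps):
        states.append(state)
        # Sensor model: coin1 gives heads with prob 0.1, coin2 with prob 0.9.
        p_head = 0.1 if state == "coin1" else 0.9
        observations.append("Hd" if rng.random() < p_head else "Tl")
        # Transition model: stay with the same coin with probability 0.8.
        if rng.random() >= 0.8:
            state = "coin2" if state == "coin1" else "coin1"
    return states, observations

s, o = simulate(40)
print(s)
print(o)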
172
Example: 2008, paper 9, question 5
A friend of mine likes to climb on the roofs of Cambridge. To make a good start to
the coming week, he climbs on a Sunday with probability 0.98. Being concerned
for his own safety, he is less likely to climb today if he climbed yesterday, so
Pr(climb today|climb yesterday) = 0.4
If he did not climb yesterday then he is very unlikely to climb today, so
Pr(climb today|¬climb yesterday) = 0.1
Unfortunately, he is not a very good climber, and is quite likely to injure himself
if he goes climbing, so
Pr(injury|climb today) = 0.8
whereas
Pr(injury|¬climb today) = 0.1
173
Example: 2008, paper 9, question 5
This has a similar corresponding diagram:
[Figure: the corresponding two-state diagram with states climb and ¬climb: Pr(climb|climb) = 0.4, Pr(¬climb|climb) = 0.6, Pr(climb|¬climb) = 0.1, Pr(¬climb|¬climb) = 0.9; Pr(injury|climb) = 0.8 and Pr(injury|¬climb) = 0.1.]
We'll look at the rest of this exam question later.
174
Performing inference
There are four basic inference tasks that we might want to perform.
In each of the following cases, assume that we have observed the evidence
E_{1:t} = e_{1:t}
Task 1: filtering
Deduce what state we might now be in by computing Pr(S_t|e_{1:t}).
In the coin tossing question: If you've seen all the outcomes so far, infer which
coin was used last.
In the exam question: If you observed all the injuries so far, infer whether my
friend climbed today.
175
Performing inference
Task 2: prediction
Deduce what state we might be in some time in the future by computing
Pr(S_{t+T}|e_{1:t}) for some T > 0.
In the coin tossing question: If you've seen all the outcomes so far, infer which
coin will be tossed T steps in the future.
In the exam question: If you've observed all the injuries so far, infer whether my
friend will go climbing T nights from now.
176
Performing inference
Task 3: Smoothing
Deduce what state we might have been in at some point in the past by computing
Pr(S_t|e_{1:T}) for 0 ≤ t < T.
In the coin tossing question: If you've seen all the outcomes so far, infer which
coin was tossed at time t in the past.
In the exam question: If you've observed all the injuries so far, infer whether my
friend climbed on night t in the past.
177
Performing inference
Task 4: Find the most likely explanation
Deduce the most likely sequence of states so far by computing
argmax_{s_{1:t}} Pr(s_{1:t}|e_{1:t})
In the coin tossing question: If you've seen all the outcomes so far, infer the most
probable sequence of coins used.
In the exam question: If you've observed all the injuries so far, infer the most
probable collection of nights on which my friend climbed.
178
Filtering
We want to compute Pr(S_t|e_{1:t}). This is often called the forward message and denoted
f_{1:t} = Pr(S_t|e_{1:t})
for reasons that are about to become clear.
Remember that S_t is an RV and so f_{1:t} is a probability distribution containing a probability for each possible value of S_t.
It turns out that this can be done in a simple manner with a recursive estimation.
Obtain the result at time t + 1:
1. using the result from time t and...
2. ...incorporating new evidence e_{t+1}.
f_{1:t+1} = g(e_{t+1}, f_{1:t})
for a suitable function g that we'll now derive.
179
Filtering
Step 1:
Project the current state distribution forward
Pr(S_{t+1}|e_{1:t+1}) = Pr(S_{t+1}|e_{1:t}, e_{t+1})
                     = c Pr(e_{t+1}|S_{t+1}, e_{1:t}) Pr(S_{t+1}|e_{1:t})
                     = c Pr(e_{t+1}|S_{t+1}) Pr(S_{t+1}|e_{1:t})
where in the last line the first factor is the sensor model and the second still needs more work,
where as usual c is a constant that normalises the distribution. Here,
The first line does nothing but split e_{1:t+1} into e_{t+1} and e_{1:t}.
The second line is an application of Bayes theorem.
The third line uses assumption 3 regarding sensor models.
180
Filtering
Step 2:
To obtain Pr(S_{t+1}|e_{1:t}):
Pr(S_{t+1}|e_{1:t}) = Σ_{s_t} Pr(S_{t+1}, s_t|e_{1:t})
                   = Σ_{s_t} Pr(S_{t+1}|s_t, e_{1:t}) Pr(s_t|e_{1:t})
                   = Σ_{s_t} Pr(S_{t+1}|s_t) Pr(s_t|e_{1:t})
where in the last line Pr(S_{t+1}|s_t) is the transition model and Pr(s_t|e_{1:t}) is available from the previous step.
Here,
The first line uses marginalisation.
The second line uses the basic equation Pr(A, B) = Pr(A|B)Pr(B).
The third line uses assumption 2 regarding transition models.
181
Filtering
Pulling it all together:
Pr(S_{t+1}|e_{1:t+1}) = c Pr(e_{t+1}|S_{t+1}) Σ_{s_t} Pr(S_{t+1}|s_t) Pr(s_t|e_{1:t})    (9)
where Pr(e_{t+1}|S_{t+1}) is the sensor model, Pr(S_{t+1}|s_t) is the transition model and Pr(s_t|e_{1:t}) comes from the previous step.
This will be shortened to
f_{1:t+1} = c FORWARD(e_{t+1}, f_{1:t})
Here
f_{1:t} is a shorthand for Pr(S_t|e_{1:t}).
f_{1:t} is often interpreted as a message being passed forward.
The process is started using the prior.
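A minimal sketch of this recursion in Python (not from the notes), with the two-coin model from earlier used as a check:

def forward(prior, transition, sensor, evidence):
    """prior: dict state -> Pr(S_0 = state)
    transition: dict (s, s') -> Pr(S_{t+1}=s' | S_t=s)
    sensor: dict (s, e) -> Pr(e | s)
    evidence: observed values e_1, ..., e_t."""
    f = dict(prior)
    for e in evidence:
        # predict with the transition model, weight by the sensor model, normalise
        predicted = {s1: sum(transition[(s0, s1)] * f[s0] for s0 in f) for s1 in f}
        unnorm = {s1: sensor[(s1, e)] * predicted[s1] for s1 in f}
        Z = sum(unnorm.values())
        f = {s: v / Z for s, v in unnorm.items()}
    return f

prior = {"coin1": 0.5, "coin2": 0.5}
T = {("coin1", "coin1"): 0.8, ("coin1", "coin2"): 0.2,
     ("coin2", "coin1"): 0.2, ("coin2", "coin2"): 0.8}
E = {("coin1", "Hd"): 0.1, ("coin1", "Tl"): 0.9,
     ("coin2", "Hd"): 0.9, ("coin2", "Tl"): 0.1}
print(forward(prior, T, E, ["Hd", "Hd", "Tl"]))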
182
Prediction
Prediction is somewhat simpler as
Pr(S_{t+T+1}|e_{1:t}) = Σ_{s_{t+T}} Pr(S_{t+T+1}, s_{t+T}|e_{1:t})
                     = Σ_{s_{t+T}} Pr(S_{t+T+1}|s_{t+T}, e_{1:t}) Pr(s_{t+T}|e_{1:t})
                     = Σ_{s_{t+T}} Pr(S_{t+T+1}|s_{t+T}) Pr(s_{t+T}|e_{1:t})
where the left-hand side is the prediction at t + T + 1, Pr(S_{t+T+1}|s_{t+T}) is the transition model and Pr(s_{t+T}|e_{1:t}) is the prediction at t + T.
However we do not get to make accurate predictions arbitrarily far into the future!
183
Smoothing
For smoothing, we want to calculate Pr(S_t|e_{1:T}) for 0 ≤ t < T.
Again, we can do this in two steps.
Step 1:
Pr(S_t|e_{1:T}) = Pr(S_t|e_{1:t}, e_{t+1:T})
              = c Pr(S_t|e_{1:t}) Pr(e_{t+1:T}|S_t, e_{1:t})
              = c Pr(S_t|e_{1:t}) Pr(e_{t+1:T}|S_t)
              = c f_{1:t} b_{t+1:T}
Here
f_{1:t} is the forward message defined earlier.
b_{t+1:T} is a shorthand for Pr(e_{t+1:T}|S_t), to be regarded as a message being passed backward.
184
Smoothing
Step 2:
b_{t+1:T} = Pr(e_{t+1:T}|S_t) = Σ_{s_{t+1}} Pr(e_{t+1:T}, s_{t+1}|S_t)
         = Σ_{s_{t+1}} Pr(e_{t+1:T}|s_{t+1}) Pr(s_{t+1}|S_t)
         = Σ_{s_{t+1}} Pr(e_{t+1}, e_{t+2:T}|s_{t+1}) Pr(s_{t+1}|S_t)
         = Σ_{s_{t+1}} Pr(e_{t+1}|s_{t+1}) Pr(e_{t+2:T}|s_{t+1}) Pr(s_{t+1}|S_t)
         = BACKWARD(e_{t+1:T}, b_{t+2:T})    (10)
where Pr(e_{t+1}|s_{t+1}) is the sensor model, Pr(e_{t+2:T}|s_{t+1}) is b_{t+2:T} and Pr(s_{t+1}|S_t) is the transition model.
This process is initialised with
b_{T+1:T} = Pr(e_{T+1:T}|S_T) = (1, . . . , 1)
185
The forward-backward algorithm
So: our original aim of computing Pr(S_t|e_{1:T}) can be achieved using:
A recursive process working from time 1 to time t (equation 9).
A recursive process working from time T to time t + 1 (equation 10).
This results in a process that is O(T) given the evidence e_{1:T} and smooths for a single point at time t.
To smooth at all points 1 : T we can easily repeat the process, obtaining O(T²).
Alternatively a very simple example of dynamic programming allows us to smooth
at all points in O(T) time.
186
The forward-backward algorithm
[Figure: the forward-backward computation. Starting from the prior, recursively compute all values f_{1:t} and store the results; then, working back from the end, recursively compute all values b_{t+1:T} and combine them with the stored values for f_{1:t}.]
187
Computing the most likely sequence: the Viterbi algorithm
In computing the most likely sequence the aim is to obtain
argmax_{s_{1:t}} Pr(s_{1:t}|e_{1:t})
Earlier we derived the joint distribution for all relevant variables
Pr(S_0, S_1, . . . , S_t, E_1, E_2, . . . , E_t) = Pr(S_0) Π_{i=1}^t Pr(S_i|S_{i−1}) Pr(E_i|S_i)
188
Computing the most likely sequence: the Viterbi algorithm
We therefore have
max_{s_{1:t}} Pr(s_{1:t}, S_{t+1}|e_{1:t+1})
  = c max_{s_{1:t}} Pr(e_{t+1}|S_{t+1}) Pr(S_{t+1}|s_t) Pr(s_{1:t}|e_{1:t})
  = c Pr(e_{t+1}|S_{t+1}) max_{s_t} ( Pr(S_{t+1}|s_t) max_{s_{1:t−1}} Pr(s_{1:t−1}, s_t|e_{1:t}) )
This looks a bit fierce, despite the fact that:
The second line is just Bayes theorem applied to the joint distribution.
The last line is just a re-arrangement of the second line.
189
Computing the most likely sequence: the Viterbi algorithm
There is however a way to visualise it that leads to a dynamic programming algo-
rithm called the Viterbi algorithm.
Step 1: Simplify the notation.
Assume there are n states s_1, . . . , s_n and m possible observations e_1, . . . , e_m at any given time.
Denote Pr(S_t = s_j|S_{t−1} = s_i) by p_{i,j}(t).
Denote Pr(e_t|S_t = s_i) by q_i(t).
It's important to remember in what follows that the observations are known but
that were maximising over all possible state sequences.
190
Computing the most likely sequence: the Viterbi algorithm
The equation we're interested in is now of the form
P = Π_{t=1}^T p_{i,j}(t) q_i(t)
(The prior Pr(S_0) has been dropped out for the sake of clarity, but is easy to put back in in what follows.)
The equation P will be referred to in what follows.
It is in fact a function of any given sequence of states.
191
Computing the most likely sequence: the Viterbi algorithm
Step 2: Make a grid: columns denote time and rows denote state.
[Figure: a grid whose columns are labelled 1, 2, 3, . . . , k, k + 1, . . . , t and whose rows are labelled s_1, s_2, s_3, . . . , s_{n−1}, s_n.]
192
Computing the most likely sequence: the Viterbi algorithm
Step 3: Label the nodes:
Say at time t the actual observation was e_t. Then label the node for s_i in column t with the value q_i(t).
Any sequence of states through time is now a path through the grid. So for any transition from s_i at time t − 1 to s_j at time t, label the transition with the value p_{i,j}(t).
In the following diagrams we can often just write p_{i,j} and q_i because the time is clear from the diagram.
So for instance...
193
Computing the most likely sequence: the Viterbi algorithm
[Figure: the grid with example labels such as q_2(1), q_1(2), p_{2,1}(2), p_{1,3}(3), q_3(3), q_n(k), p_{n,n−1}(k + 1) and q_{n−1}(k + 1) attached to nodes and transitions.]
194
Computing the most likely sequence: the Viterbi algorithm
The value of P = Π_{t=1}^T p_{i,j}(t) q_i(t) for any path through the grid is just the product of the corresponding labels that have been added.
But we don't want to find the maximum by looking at all the possible paths because this would be time-consuming.
The Viterbi algorithm computes the maximum by moving from one column to
the next updating as it goes.
Say you're at column k and for each node m in that column you know the highest value for the product to this point over any possible path. Call this:
W_m(k) = max_{s_{1:k}} Π_{t=1}^k p_{i,j}(t) q_i(t)
195
Computing the most likely sequence: the Viterbi algorithm
[Figure: the grid at column k, with the values W_1(k), W_2(k), W_3(k), . . . , W_{n−1}(k), W_n(k) attached to the nodes, and an example transition into column k + 1 labelled p_{n,n−1}(k + 1) and q_{n−1}(k + 1).]
196
Computing the most likely sequence: the Viterbi algorithm
Here is the key point: you only need to know
The values W_i(k) for i = 1, . . . , n at time k.
The numbers p_{i,j}(k + 1).
The numbers q_i(k + 1).
to compute the values W_i(k + 1) for the next column k + 1.
This is because
W_i(k + 1) = max_j W_j(k) p_{j,i}(k + 1) q_i(k + 1)
197
Computing the most likely sequence: the Viterbi algorithm
Once you get to the column for time t:
The node with the largest value for W_i(t) tells you the largest possible value of P.
Provided you stored the path taken to get there you can work backwards to find the corresponding sequence of states.
This is the Viterbi algorithm.
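A sketch of the recursion in Python (not from the notes), keeping back-pointers so the best sequence can be recovered; the two-coin model is reused here with a uniform prior for illustration:

def viterbi(states, prior, p, q, observations):
    """p[(i, j)]: transition probability from state i to state j.
    q[(i, e)]: probability of observation e in state i."""
    W = {i: prior[i] * q[(i, observations[0])] for i in states}
    back = []
    for e in observations[1:]:
        prev, W, column = W, {}, {}
        for i in states:
            best_j = max(states, key=lambda j: prev[j] * p[(j, i)])
            W[i] = prev[best_j] * p[(best_j, i)] * q[(i, e)]   # W_i(k+1)
            column[i] = best_j                                 # back-pointer
        back.append(column)
    last = max(W, key=W.get)                  # best final state
    path = [last]
    for column in reversed(back):             # work backwards along pointers
        path.append(column[path[-1]])
    return list(reversed(path)), W[last]

states = ["coin1", "coin2"]
prior = {"coin1": 0.5, "coin2": 0.5}
p = {("coin1", "coin1"): 0.8, ("coin1", "coin2"): 0.2,
     ("coin2", "coin1"): 0.2, ("coin2", "coin2"): 0.8}
q = {("coin1", "Hd"): 0.1, ("coin1", "Tl"): 0.9,
     ("coin2", "Hd"): 0.9, ("coin2", "Tl"): 0.1}
print(viterbi(states, prior, p, q, ["Hd", "Hd", "Tl", "Tl"]))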
198
Computing the most likely sequence: the Viterbi algorithm
[Figure: the completed grid at time t, with the node achieving the maximum value, here W_3(t), highlighted and the best path traced back through the grid.]
199
Hidden Markov models
Now for a specific case: hidden Markov models (HMMs). Here we have a single, discrete state variable S_i taking values s_1, s_2, . . . , s_n. For example, with n = 3 we might have

[Figure: a three-state transition diagram with Pr(S_{t+1}|S_t = s_1) = (0.3, 0.1, 0.6), Pr(S_{t+1}|S_t = s_2) = (0.2, 0.6, 0.2) and Pr(S_{t+1}|S_t = s_3) = (0.2, 0.3, 0.5) over (s_1, s_2, s_3).]
200
Hidden Markov models
In this simplified case the conditional probabilities Pr(S_{t+1}|S_t) can be represented using the matrix
S_{ij} = Pr(S_{t+1} = s_j|S_t = s_i)
or for the example on the previous slide
S = [ 0.3  0.1  0.6
      0.2  0.6  0.2
      0.2  0.3  0.5 ]
whose rows are Pr(S|s_1), Pr(S|s_2) and Pr(S|s_3). In general
S = [ Pr(s_1|s_1)  Pr(s_2|s_1)  · · ·  Pr(s_n|s_1)
      Pr(s_1|s_2)  Pr(s_2|s_2)  · · ·  Pr(s_n|s_2)
      ...
      Pr(s_1|s_n)  Pr(s_2|s_n)  · · ·  Pr(s_n|s_n) ]
To save space, I am abbreviating Pr(S_{t+1} = s_i|S_t = s_j) to Pr(s_i|s_j).
201
Hidden Markov models
The computations we're making are always conditional on some actual observations e_{1:T}.
For each t we can therefore use the sensor model to define a further matrix E_t:
E_t is square and diagonal (all off-diagonal elements are 0).
The ith element of the diagonal is Pr(e_t|S_t = s_i).
So in our present example with 3 states, there will be a matrix
E_t = [ Pr(e_t|s_1)  0            0
        0            Pr(e_t|s_2)  0
        0            0            Pr(e_t|s_3) ]
for each t = 1, . . . , T.
202
Hidden Markov models
In the general case the equation for filtering was
Pr(S_{t+1}|e_{1:t+1}) = c Pr(e_{t+1}|S_{t+1}) Σ_{s_t} Pr(S_{t+1}|s_t) Pr(s_t|e_{1:t})
and the message f_{1:t} was introduced as a representation of Pr(S_t|e_{1:t}).
In the present case we can define f_{1:t} to be the vector
f_{1:t} = ( Pr(s_1|e_{1:t}), Pr(s_2|e_{1:t}), . . . , Pr(s_n|e_{1:t}) )^T
Key point: the filtering equation now reduces to nothing but matrix multiplication.
203
What does matrix multiplication do?
What does matrix multiplication do? It computes weighted summations: if A is an n × m matrix and b a vector of length m, then Ab is the vector whose ith element is
(Ab)_i = Σ_{j=1}^m a_{i,j} b_j
So the point at the end of the last slide shouldn't come as a big surprise!
204
Hidden Markov models
Now, note that if we have n states
S^T f_{1:t} = [ Pr(s_1|s_1)  · · ·  Pr(s_1|s_n) ] [ Pr(s_1|e_{1:t}) ]
              [ Pr(s_2|s_1)  · · ·  Pr(s_2|s_n) ] [ Pr(s_2|e_{1:t}) ]
              [     ...                         ] [       ...       ]
              [ Pr(s_n|s_1)  · · ·  Pr(s_n|s_n) ] [ Pr(s_n|e_{1:t}) ]

            = ( Σ_s Pr(s_1|s) Pr(s|e_{1:t}),  Σ_s Pr(s_2|s) Pr(s|e_{1:t}),  . . . ,  Σ_s Pr(s_n|s) Pr(s|e_{1:t}) )^T
205
Hidden Markov models
And taking things one step further
E_{t+1} S^T f_{1:t} = diag( Pr(e_{t+1}|s_1), . . . , Pr(e_{t+1}|s_n) ) ( Σ_s Pr(s_1|s) Pr(s|e_{1:t}), . . . , Σ_s Pr(s_n|s) Pr(s|e_{1:t}) )^T
                   = ( Pr(e_{t+1}|s_1) Σ_s Pr(s_1|s) Pr(s|e_{1:t}),  . . . ,  Pr(e_{t+1}|s_n) Σ_s Pr(s_n|s) Pr(s|e_{1:t}) )^T
Compare this with the equation for filtering
Pr(S_{t+1}|e_{1:t+1}) = c Pr(e_{t+1}|S_{t+1}) Σ_{s_t} Pr(S_{t+1}|s_t) Pr(s_t|e_{1:t})
206
Hidden Markov models
Comparing the expression for E_{t+1} S^T f_{1:t} with the equation for filtering we see that
f_{1:t+1} = c E_{t+1} S^T f_{1:t}
and a similar equation can be found for b:
b_{t+1:T} = S E_{t+1} b_{t+2:T}
Exercise: derive this.
The fact that these can be expressed simply using only multiplication of vectors and matrices allows us to make an improvement to the forward-backward algorithm.
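Both equations are one line of numpy each; here is a sketch (not from the notes) using the two-coin model, with states ordered (coin1, coin2):

import numpy as np

S = np.array([[0.8, 0.2],
              [0.2, 0.8]])                  # S_ij = Pr(S_{t+1}=s_j | S_t=s_i)
def E(e):                                    # diagonal sensor-model matrix
    p_head = np.array([0.1, 0.9])
    return np.diag(p_head if e == "Hd" else 1.0 - p_head)

def filter_forward(f0, evidence):
    f = f0
    for e in evidence:
        f = E(e) @ S.T @ f                   # f_{1:t+1} = c E_{t+1} S^T f_{1:t}
        f = f / f.sum()                      # the constant c just normalises
    return f

def backward(evidence):
    b = np.ones(2)                           # b_{T+1:T} = (1, ..., 1)
    for e in reversed(evidence):
        b = S @ E(e) @ b                     # b_{t+1:T} = S E_{t+1} b_{t+2:T}
    return b

obs = ["Hd", "Hd", "Tl"]
print(filter_forward(np.array([0.5, 0.5]), obs))
print(backward(obs))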
207
Hidden Markov models
The forward-backward algorithm works by:
Moving up the sequence from 1 to T, computing and storing values for f.
Moving down the sequence from T to 1 computing values for b and combining
them with the stored values for f using the equation
Pr(S_t|e_{1:T}) = c f_{1:t} b_{t+1:T}
Now in our simplified HMM case we have
f_{1:t+1} = c E_{t+1} S^T f_{1:t}
or multiplying through by (E_{t+1} S^T)^{−1} and re-arranging
f_{1:t} = (1/c) (S^T)^{−1} (E_{t+1})^{−1} f_{1:t+1}
208
Hidden Markov models
So as long as:
We know the final value for f.
S^T has an inverse.
Every observation has non-zero probability in every state.
We don't have to store T different values for f: we just work through, discarding intermediate values, to obtain the last value and then work backward.
209
Example: 2008, paper 9, question 5
A friend of mine likes to climb on the roofs of Cambridge. To make a good start to
the coming week, he climbs on a Sunday with probability 0.98. Being concerned
for his own safety, he is less likely to climb today if he climbed yesterday, so
Pr(climb today|climb yesterday) = 0.4
If he did not climb yesterday then he is very unlikely to climb today, so
Pr(climb today|¬climb yesterday) = 0.1
Unfortunately, he is not a very good climber, and is quite likely to injure himself
if he goes climbing, so
Pr(injury|climb today) = 0.8
whereas
Pr(injury|¬climb today) = 0.1
210
Example: 2008, paper 9, question 5
You learn that on Monday and Tuesday evening he obtains an injury, but on
Wednesday evening he does not. Use the filtering algorithm to compute the prob-
ability that he climbed on Wednesday.
Initially
f_{1:0} = ( 0.98, 0.02 )^T
and
S = [ 0.4  0.6 ; 0.1  0.9 ]
E = [ 0.8  0 ; 0  0.1 ]
E′ = [ 0.2  0 ; 0  0.9 ]
where E is the diagonal observation matrix for a day on which an injury is seen and E′ for a day on which no injury is seen.
211
Example: 2008, paper 9, question 5
The update equation is
f_{1:t+1} = c E_{t+1} S^T f_{1:t}
so
f_{1:1} = (c/10,000) [ 8  0 ; 0  1 ] [ 4  1 ; 6  9 ] [ 98 ; 2 ] = ( 0.83874, 0.16126 )^T
Repeating this twice more, using E again for Tuesday's injury and E′ for Wednesday's lack of injury, gives the required probability that he climbed on Wednesday.
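A quick numerical check of the arithmetic (not part of the notes), under the assumption just stated that E is used for Monday and Tuesday and E′ for Wednesday:

import numpy as np

S  = np.array([[0.4, 0.6], [0.1, 0.9]])     # rows: climbed, did not climb
E1 = np.diag([0.8, 0.1])                     # injury observed
E2 = np.diag([0.2, 0.9])                     # no injury observed
f = np.array([0.98, 0.02])                   # f_{1:0}

for Et in (E1, E1, E2):                      # Monday, Tuesday, Wednesday
    f = Et @ S.T @ f
    f = f / f.sum()
    print(f)                                 # first entry: Pr(climb | evidence)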
b_{T−lag+1:T} = ( Π_{i=T−lag+1}^T S E_i ) (1, 1, . . . , 1)^T
Define
β_{a:b} = Π_{i=a}^b S E_i
so
b_{T−lag+1:T} = β_{T−lag+1:T} (1, 1, . . . , 1)^T
217
Online smoothing
Now when e_{T+1} arrives we have
b_{T−lag+2:T+1} = ( Π_{i=T−lag+2}^{T+1} S E_i ) (1, 1, . . . , 1)^T
             = β_{T−lag+2:T+1} (1, 1, . . . , 1)^T
             = E_{T−lag+1}^{−1} S^{−1} β_{T−lag+1:T} S E_{T+1} (1, 1, . . . , 1)^T
218
Online smoothing
This leads to an easy way to update:
β_{a+1:b+1} = E_a^{−1} S^{−1} β_{a:b} S E_{b+1}
Using this gives the required update for b.
219
Supervised learning II: the Bayesian approach
We now place supervised learning into a probabilistic setting by examining:
The application of Bayes theorem to the supervised learning problem.
Priors, the likelihood, and the posterior probability of a hypothesis.
The maximum likelihood and maximum a posteriori hypotheses, and some
examples.
Bayesian decision theory: minimising the error rate.
Application of the approach to neural networks, using approximation tech-
niques.
220
Reading
There is some relevant material to be found in Russell and Norvig chapters 18 to
20 although the intersection between that material and what I will cover is small.
Almost all of what I cover can be found in:
Machine Learning. Tom Mitchell, McGraw Hill 1997, chapter 6.
Pattern Recognition and Machine Learning. Christopher M. Bishop, Springer,
2006.
221
Supervised learning: a quick reminder
We want to design a classifier, denoted h(x):

[Figure: an attribute vector x is fed into the classifier h(x), which outputs a label.]

It should take an attribute vector
x^T = ( x_1  x_2  · · ·  x_n )
and label it.
What we mean by label depends on whether we're doing classification or regression.
222
Supervised learning: a quick reminder
In classification we're assigning x to one of a set {ω_1, . . . , ω_c} of c classes.
For example, if x contains measurements taken from a patient then there might be three classes:
ω_1 = patient has disease
ω_2 = patient doesn't have disease
ω_3 = don't ask me buddy, I'm just a computer!
We'll often specialise to the case of two classes, denoted C_1 and C_2.
223
Supervised learning: a quick reminder
In regression we're assigning x to a real number h(x) ∈ R.
For example, if x contains measurements taken regarding today's weather then we
might have
h(x) = estimate of amount of rainfall expected tomorrow
For the two-class classification problem we will also refer to a situation somewhat between the two, where
h(x) = Pr(x is in C_1)
224
Supervised learning: a quick reminder
We don't want to design h explicitly.
[Figure: a training sequence s is fed to a learner L, which outputs a classifier h = L(s); the classifier then maps an attribute vector x to a label h(x).]
So we use a learner L to infer it on the basis of a sequence s of training examples.
225
Supervised learning: a quick reminder
The training sequence s is a sequence of m labelled examples.
s = ( (x_1, y_1), (x_2, y_2), . . . , (x_m, y_m) )
That is, examples of attribute vectors x with their correct label attached.
So a learner only gets to see the labels for a (most probably small) subset of the possible inputs x.
Regardless, we aim that the hypothesis h = L(s) will usually be successful at
predicting the label of an input it hasn't seen before.
This ability is called generalization.
226
Supervised learning: a quick reminder
There is generally a set H of hypotheses from which L is allowed to select h:
L(s) = h ∈ H
H is called the hypothesis space.
The learner can output a hypothesis explicitly or, as in the case of a multilayer perceptron, it can output a vector
w = ( w_1  w_2  · · ·  w_W )
of weights which in turn specify h:
h(x) = f(w; x)
where w = L(s).
227
Supervised learning: a quick reminder
In AI I you saw the backpropagation algorithm for training multilayer perceptrons, in the case of regression.
This worked by minimising a function of the weights representing the error currently being made:
E(w) = (1/2) Σ_{i=1}^m (f(w; x_i) − y_i)²
The summation here is over the training examples. The expression in the summation grows as f's prediction for x_i diverges from the known label y_i.
Backpropagation tries to find a w that minimises E(w) by performing gradient descent:
w_{t+1} = w_t − η ∂E(w)/∂w |_{w_t}
for some small positive step size η.
228
Difculties with classical neural networks
There are some well-known difficulties associated with neural network training of this kind.

[Figure: a curve fitted to a set of data points, illustrating the kind of problem that can arise. BEWARE!!!]
229
Sources of uncertainty
So we have to be careful. But let's press on with this approach for a little while longer...
The model used above suggests two sources of uncertainty that we might treat with probabilities.
Let's assume we've selected an H to use, and it's the same one nature is using.
We don't know how nature chooses h.

[Figure: the logistic σ(z) applied to the output of a linear function of inputs x_1 and x_2, giving Pr(x is in C_1).]
232
The likelihood
So: if we're given a training sequence, what is the probability that it was generated using some h?
For an example (x, y), y can be C_1 or C_2. It's helpful here to rename the classes as just 1 and 0 respectively because this leads to a nice simple expression. Now
Pr(Y|h, x) = σ(h(x)) if Y = 1, and 1 − σ(h(x)) if Y = 0.
Consequently when y has a known value we can write
Pr(y|h, x) = [σ(h(x))]^y [1 − σ(h(x))]^{(1−y)}
If we assume that the examples are independent then the probability of seeing the
labels in a training sequence s is straightforward.
233
The likelihood
Collecting the inputs and outputs in s together into separate matrices, so
y^T = ( y_1  y_2  · · ·  y_m )
and
X = ( x_1  x_2  · · ·  x_m )
we have the likelihood of the training sequence
Pr(y|h, X) = Π_{i=1}^m Pr(y_i|h, x_i) = Π_{i=1}^m [σ(h(x_i))]^{y_i} [1 − σ(h(x_i))]^{(1−y_i)}
234
The likelihood
Another example: regression. A common likelihood in the regression case works by assuming that examples are corrupted by Gaussian noise with mean 0 and some specified variance σ²:
y = h(x) + ε,  where ε ∼ N(0, σ²)
As usual, the density for N(μ, σ²) is
p(z) = (1/√(2πσ²)) exp( −(z − μ)² / (2σ²) )
By adding h(x) to ε we just shift its mean, so
p(y|h, x) = (1/√(2πσ²)) exp( −(y − h(x))² / (2σ²) )
235
The likelihood
Consequently if the examples are independent then the likelihood of a training sequence s is
p(y|h, X) = Π_{i=1}^m p(y_i|h, x_i)
         = Π_{i=1}^m (1/√(2πσ²)) exp( −(y_i − h(x_i))² / (2σ²) )
         = (1/(2πσ²)^{m/2}) exp( −(1/(2σ²)) Σ_{i=1}^m (y_i − h(x_i))² )
where we've used the fact that
exp(a) exp(b) = exp(a + b)
236
Bayes theorem appears once more...
Right: we've taken care of the uncertainty by introducing the prior p(h) and the likelihood of the training sequence p(y|h, X).
By this point you hopefully want to apply Bayes' theorem and write
p(h|y) = p(y|h) p(h) / p(y)
where
p(y) = Σ_{h∈H} p(h, y) = Σ_{h∈H} p(y|h) p(h)
and to simplify the expression we have now dropped the mention of X as the inputs are fixed. p(h|y) is called the posterior distribution.
The denominator Z = p(y) is called the evidence and leads on to fascinating issues of its own. Unfortunately we won't have time to explore them.
237
Bayes theorem appears once more...
The boxed equation on the last slide has a very simple interpretation: what's the probability that this specific h was used to generate the training sequence I've been given?
Two natural learning algorithms now present themselves:
1. The maximum likelihood hypothesis
h_ML = argmax_{h∈H} p(y|h)
2. The maximum a posteriori hypothesis
h_MAP = argmax_{h∈H} p(h|y) = argmax_{h∈H} p(y|h) p(h)
Obviously h_ML corresponds to the case where the prior p(h) is uniform.
238
Example: maximum likelihood learning
We derived an exact expression for the likelihood in the regression case above:
p(y|h) = (1/(2πσ²)^{m/2}) exp( −(1/(2σ²)) Σ_{i=1}^m (y_i − h(x_i))² )
Proposition: under the assumptions used, any learning algorithm that works by minimising the sum of squared errors on s finds h_ML.
This is clearly of interest: the notable example is the backpropagation algorithm.
We now prove the proposition...
239
Example: maximum likelihood learning
The proposition holds because:
h_ML = argmax_{h∈H} p(y|h)
     = argmax_{h∈H} log p(y|h)
     = argmax_{h∈H} log [ (1/(2πσ²)^{m/2}) exp( −(1/(2σ²)) Σ_{i=1}^m (y_i − h(x_i))² ) ]
     = argmax_{h∈H} [ log (1/(2πσ²)^{m/2}) − (1/(2σ²)) Σ_{i=1}^m (y_i − h(x_i))² ]
     = argmax_{h∈H} [ −(1/(2σ²)) Σ_{i=1}^m (y_i − h(x_i))² ]
     = argmin_{h∈H} Σ_{i=1}^m (y_i − h(x_i))²
240
Example: maximum likelihood learning
Note:
If the distribution of the noise is not Gaussian a different result is obtained.
The use of log above to simplify a maximisation problem is a standard trick.
The Gaussian assumption is sometimes, but not always, a good choice. (Beware the Central Limit Theorem!).
241
The next step...
We have so far concentrated throughout our coverage of machine learning on
choosing a single hypothesis.
Are we asking the right question though?
Ultimately, we want to generalise.
That means being presented with a new x and asking the question: what is the
most probable classication of x?
Is it reasonable to expect a single hypothesis to provide the optimal answer?
We need to look at what the optimal solution to this kind of problem might be...
242
Bayesian decision theory
What is the optimal approach to this problem?
Put another way: how should we make decisions in such a way that the outcome
obtained is, on average, the best possible? Say we have:
Attribute vectors x ∈ R^d.
A set of classes {ω_1, . . . , ω_c}.
Several possible actions {α_1, . . . , α_a}.
The actions can be thought of as saying "assign the vector to class 1" and so on.
There is also a loss λ(α_i, ω_j) associated with taking action α_i when the class is ω_j.
The loss will sometimes be abbreviated to λ(α_i, ω_j) = λ_{ij}.
243
Bayesian decision theory
Say we can also model the world as follows:
Classes have probabilities Pr(ω) of occurring.
The probability of seeing x when the class is ω has density p(x|ω).
Think of nature choosing classes at random (although not revealing them) and showing us a vector selected at random using p(x|ω).
As usual Bayes' rule tells us that
Pr(ω|x) = p(x|ω) Pr(ω) / p(x)
and now the denominator is
p(x) = Σ_{i=1}^c p(x|ω_i) Pr(ω_i).
244
Bayesian decision theory
Say nature shows us x and we take action α_i.
If we always take action α_i when we see x then the average loss on seeing x is
R(α_i|x) = E_{p(ω|x)}[λ_{ij}|x] = Σ_{j=1}^c λ(α_i, ω_j) Pr(ω_j|x).
The quantity R(α_i|x) is called the conditional risk.
Note that this particular x is fixed.
245
Bayesian decision theory
Now say we have a decision rule δ : R^d → {α_1, . . . , α_a} telling us what action to take on seeing any x ∈ R^d.
The average loss, or risk, is
R = E_{(x,ω)∼p(x,ω)}[λ(δ(x), ω)]
  = E_{x∼p(x)}[ E_{Pr(ω|x)}[λ(δ(x), ω)|x] ]
  = E_{x∼p(x)}[R(δ(x)|x)]    (11)
  = ∫ R(δ(x)|x) p(x) dx
where we have used the standard result from probability theory that
E[E[X|Y]] = E[X].
(See the supplementary notes for a proof.)
246
Bayesian decision theory
Clearly the risk is minimised for the decision rule δ defined as follows:
δ outputs the action α_i that minimises R(α_i|x), for all x ∈ R^d.
This rule provides us with the minimum possible risk, or Bayes risk R*.
The rule specified is called the Bayes decision rule.
247
Example: minimum error rate classication
In supervised learning our aim is often to work in such a way that we minimise
the probability of error.
What loss should we consider in these circumstances? From basic probability
theory
Pr(A) = E[I(A)]
where
I(A) = 1 if A happens, and 0 otherwise.
(See the supplementary notes for a proof.)
248
Example: minimum error rate classication
So if we are addressing a supervised learning problem with c classes {ω_1, . . . , ω_c} and we interpret action α_i as meaning "the input is in class ω_i", then a loss
λ_{ij} = 1 if i ≠ j, and 0 otherwise
means that the risk R is
R = E[λ] = Pr(δ(x) is in error)
and the Bayes decision rule minimises the probability of error.
249
Example: minimum error rate classication
Now, what is the Bayes decision rule?
R(α_i|x) = Σ_{j=1}^c λ(α_i, ω_j) Pr(ω_j|x)
        = Σ_{j≠i} Pr(ω_j|x)
        = 1 − Pr(ω_i|x)
so δ(x) should be the class that maximises Pr(ω_i|x).
THE IMPORTANT SUMMARY: Given a new x to classify, choosing the class that maximises Pr(ω_i|x) is the best strategy if your aim is to obtain the minimum error rate!
250
Bayesian learning II
Bayes decision theory tells us that in this context we should consider the quantity Pr(ω_i|s, x), where the involvement of the training sequence has been made explicit.
Pr(ω_i|s, x) = Σ_{h∈H} Pr(ω_i, h|s, x)
            = Σ_{h∈H} Pr(ω_i|h, s, x) Pr(h|s, x)
            = Σ_{h∈H} Pr(ω_i|h, x) Pr(h|s).
Here we have re-introduced H using marginalisation. In moving from line 2 to line 3 we are assuming some independence properties.
251
Bayesian learning II
So our classification should be
ω = argmax_{ω∈{ω_1,...,ω_c}} Σ_{h∈H} Pr(ω|h, x) Pr(h|s)
If H is infinite the sum becomes an integral. So for example for a neural network
ω = argmax_{ω∈{ω_1,...,ω_c}} ∫_{R^W} Pr(ω|w, x) Pr(w|s) dw
where W is the number of weights in w.
252
Bayesian learning II
Why might this make any difference? (Aside from the fact that we now know it's optimal!)
Example 1: Say |H| = 3 and h(x) = Pr(x is in class C_1) for a 2 class problem.
Pr(h_1|s) = 0.4
Pr(h_2|s) = Pr(h_3|s) = 0.3
Now, say we have an x for which
h_1(x) = 1
h_2(x) = h_3(x) = 0
so h_MAP says that x is in class C_1.
253
Bayesian learning II
However,
Pr(class 1|s, x) = (1 × 0.4) + (0 × 0.3) + (0 × 0.3) = 0.4
Pr(class 2|s, x) = (0 × 0.4) + (1 × 0.3) + (1 × 0.3) = 0.6
so class C_2 is the more probable!
In this case the Bayes optimal approach in fact leads to a different answer.
254
A more in-depth example
Let's take this a step further and work through something a little more complex in detail. For a two-class classification problem with h(x) denoting Pr(C_1|h, x) and x ∈ R:
Hypotheses: We have three hypotheses
h_1(x) = exp(−(x − 1)²)
h_2(x) = exp(−(x − 2)²)
h_3(x) = exp(−(1/10)(x − 3)²)
Prior: The prior is Pr(h_1) = 0.1, Pr(h_2) = 0.05 and Pr(h_3) = 0.85.
255
A more in-depth example
We see the examples (0.5, C_1), (0.9, C_1), (3.1, C_2) and (3.4, C_1).
Likelihood: For the individual hypotheses the likelihoods are given by
Pr(s|h) = h(x_1) h(x_2) [1 − h(x_3)] h(x_4)
Which in this case tells us
Pr(s|h_1) = 0.0024001365
Pr(s|h_2) = 0.0031069836
Pr(s|h_3) = 0.0003387476
Posterior: Multiplying by the priors and normalising gives
Pr(h_1|s) = 0.3512575000
Pr(h_2|s) = 0.2273519164
Pr(h_3|s) = 0.4213905836
256
A more in-depth example
Now let's classify the point x′ = 2.5.
We need
Pr(C_1|s, x′) = Pr(C_1|h_1, x′) Pr(h_1|s) + Pr(C_1|h_2, x′) Pr(h_2|s) + Pr(C_1|h_3, x′) Pr(h_3|s)
             = 0.6250705317
So: it's most likely to be in class C_1, but not with great certainty.
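The arithmetic in this example can be checked with a few lines of Python (this is only a verification sketch; the hypotheses are written in the form that reproduces the quoted numbers):

import math

def h1(x): return math.exp(-(x - 1) ** 2)
def h2(x): return math.exp(-(x - 2) ** 2)
def h3(x): return math.exp(-0.1 * (x - 3) ** 2)

hs, priors = [h1, h2, h3], [0.1, 0.05, 0.85]
examples = [(0.5, 1), (0.9, 1), (3.1, 0), (3.4, 1)]   # class C1 -> 1, C2 -> 0

def likelihood(h):
    out = 1.0
    for x, y in examples:
        out *= h(x) if y == 1 else 1.0 - h(x)
    return out

unnorm = [likelihood(h) * p for h, p in zip(hs, priors)]
posteriors = [u / sum(unnorm) for u in unnorm]
print(posteriors)                                     # ~ [0.351, 0.227, 0.421]

x_new = 2.5
print(sum(h(x_new) * p for h, p in zip(hs, posteriors)))   # ~ 0.625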
257
The Bayesian approach to neural networks
Let's now see how this can be applied to neural networks. We have:
A neural network computing a function f(w; x).
A training sequence s = ((x_1, y_1), . . . , (x_m, y_m)), split into
y = ( y_1  y_2  · · ·  y_m )
and
X = ( x_1  x_2  · · ·  x_m )
The prior distribution p(w) is now on the weight vectors and Bayes' theorem tells us that
p(w|s) = p(w|X, y) = p(y|w, X) p(w|X) / p(y|X)
Nothing new so far...
258
The Bayesian approach to neural networks
As usual, we don't consider uncertainty in x and so X will be omitted. Consequently
p(w|y) = p(y|w) p(w) / p(y)
where
p(y) = ∫_{R^W} p(y|w) p(w) dw
p(y|w) is a model of the noise corrupting the labels and as previously is the likelihood function.
259
The Bayesian approach to neural networks
p(w) is typically a broad distribution to reflect the fact that in the absence of any data we have little idea of what w might be.
When we see some data the above equation tells us how to obtain p(w|y). This will typically be more localised.

[Figure: a broad prior p(w) and a more localised posterior p(w|y), peaked at w_MAP.]

To put this into practice we need expressions for p(w) and p(y|w).
260
Reminder: the general Gaussian density
Reminder: we're going to be making a lot of use of the general Gaussian density N(μ, Σ) in d dimensions
p(z) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2) (z − μ)^T Σ^{−1} (z − μ) )
where μ is the mean vector and Σ is the covariance matrix.

[Figure: the Gaussian density with μ = (0, 0) and Σ = I.]
261
The Gaussian prior
A common choice for p(w) is the Gaussian prior with zero mean and
Σ = σ² I
so
p(w) = (2π)^{−W/2} σ^{−W} exp( −w^T w / (2σ²) )
Note that σ controls the distribution of the other parameters.
Such parameters are called hyperparameters.
Assume for now that they are both fixed and known.
Hyperparameters can be learnt using s through the application of more advanced techniques.
262
The Bayesian approach to neural networks
Physicists like to express quantities such as p(w) in terms of a measure of "energy". The expression is therefore usually re-written as
p(w) = (1/Z_W(α)) exp( −(α/2) ||w||² ) = (1/Z_W(α)) exp( −α E_W(w) )
where
E_W(w) = (1/2) ||w||²
Z_W(α) = (2π/α)^{W/2}
α = 1/σ²
This is simply a re-arranged version of the more usual equation.
263
The Gaussian noise model for regression
We've already seen that for a regression problem with zero mean Gaussian noise having variance σ_n²
y_i = f(x_i) + ε_i,   p(ε_i) = (1/√(2πσ_n²)) exp( −ε_i² / (2σ_n²) )
where f corresponds to some unknown network, the likelihood function is
p(y|w) = (1/(2πσ_n²)^{m/2}) exp( −(1/(2σ_n²)) Σ_{i=1}^m (y_i − f(w; x_i))² )
Note that there are now two variances: σ² for the prior and σ_n² for the noise.
264
The Bayesian approach to neural networks
This expression can also be rewritten in physicist-friendly form
p(y|w) = (1/Z_y(β)) exp( −β E_y(w) )
where
β = 1/σ_n²
Z_y(β) = (2π/β)^{m/2}
E_y(w) = (1/2) Σ_{i=1}^m (y_i − f(w; x_i))²
Here, β is a second hyperparameter. Again, we assume it is fixed and known, although it can be learnt using s using more advanced techniques.
265
The Bayesian approach to neural networks
Combining the two boxed equations gives
p(w|y) = (1/Z_S(α, β)) exp(−S(w))
where
S(w) = α E_W(w) + β E_y(w)
The quantity
Z_S(α, β) = ∫_{R^W} exp(−S(w)) dw
normalises the density. Recall that this is called the evidence.
266
Example I: gradient descent revisited...
To find h_MAP (in this scenario by finding w_MAP) we therefore maximise
p(w|y) = (1/Z_S(α, β)) exp( −(α E_W(w) + β E_y(w)) )
or equivalently find
w_MAP = argmin_w [ (α/2) ||w||² + (β/2) Σ_{i=1}^m (y_i − f(w; x_i))² ]
This algorithm has also been used a lot in the neural network literature and is called the weight decay technique.
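A minimal sketch of weight decay by gradient descent (not from the notes), using the simplest possible "network" f(w; x) = w·x so that the gradient is easy to write; all the data and hyperparameter values are made up:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # made-up training inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)         # noisy made-up labels

alpha, beta = 1.0, 100.0                           # hyperparameters
eta = 1.0 / (alpha + beta * np.linalg.norm(X, 2) ** 2)   # stable step size
w = np.zeros(3)
for _ in range(5000):
    grad = alpha * w + beta * X.T @ (X @ w - y)    # gradient of S(w)
    w = w - eta * grad
print(w)                                           # close to w_true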
267
Example II: two-class classication in two dimensions
[Figure: a two-class data set in two dimensions, together with surface plots of the prior density p(w), the likelihood p(y|w) and the posterior density p(w|y) over the two weights w_1 and w_2.]
268
The Bayesian approach to neural networks
What happens as the number m of examples increases?
The first term, corresponding to the prior, remains fixed.
The second term, corresponding to the likelihood, increases.
So for small training sequences the prior dominates, but for large ones h_ML is a good approximation to h_MAP.
269
The Bayesian approach to neural networks
Where have we got to...? We have obtained
p(w|y) = (1/Z_S(α, β)) exp( −(α E_W(w) + β E_y(w)) )
Z_S(α, β) = ∫_{R^W} exp( −(α E_W(w) + β E_y(w)) ) dw
Translating the expression for the Bayes optimal solution given earlier into the current scenario, we need to compute
p(Y|y, x) = ∫_{R^W} p(y|w, x) p(w|y) dw
Easy huh? Unfortunately not...
270
The Bayesian approach to neural networks
In order to make further progress it's necessary to perform integrals of the general form
∫_{R^W} F(w) p(w|y) dw
for various functions F and this is generally not possible.
There are two ways to get around this:
1. We can use an approximate form for p(w|y).
2. We can use Monte Carlo methods.
271
Method 1: approximation to p(w|y)
The first approach introduces a Gaussian approximation to p(w|y) by using a Taylor expansion of
S(w) = α E_W(w) + β E_y(w)
at w_MAP.
This allows us to use a standard integral.
The result will be approximate but we hope it's good!
Let's recall how Taylor series work...
272
Reminder: Taylor expansion
In one dimension the Taylor expansion about a point x_0 ∈ R for a function f : R → R is
f(x) ≈ f(x_0) + (1/1!)(x − x_0) f′(x_0) + (1/2!)(x − x_0)² f′′(x_0) + · · · + (1/k!)(x − x_0)^k f^{(k)}(x_0)
What does this look like for the kinds of function we're interested in? We can try to approximate
exp(−f(x))
where
f(x) = x⁴ − (1/2)x³ − 7x² − (5/2)x + 22
This has a form similar to S(w), but in one dimension.
273
Reminder: Taylor expansion
The functions of interest look like this:

[Figure: plots of the function f(x) and of exp(−f(x)).]

By replacing f(x) with its Taylor expansion about the maximum of exp(−f(x)), which is at x_max = 2.1437, we can see what the approximation to exp(−f(x)) looks like. Note that the exp hugely emphasises peaks.
274
Reminder: Taylor expansion
Here are the approximations for k = 1, k = 2 and k = 3.
[Figure: the Taylor expansions of f(x) for k = 1, 2 and 3, together with exp(−f(x)) evaluated exactly and using the Taylor expansion for k = 2.]
The use of k = 2 looks promising...
275
Reminder: Taylor expansion
In multiple dimensions the Taylor expansion for k = 2 is
f(x) ≈ f(x_0) + (x − x_0)^T ∇f(x)|_{x_0} + (1/2!)(x − x_0)^T ∇∇f(x_0)(x − x_0)
where ∇ denotes gradient
∇f(x) = ( ∂f(x)/∂x_1  ∂f(x)/∂x_2  · · ·  ∂f(x)/∂x_n )^T
and ∇∇f(x) is the matrix with elements
M_{ij} = ∂²f(x) / (∂x_i ∂x_j)
(Although this looks complicated, it's just the obvious extension of the 1-dimensional case.)
276
Method 1: approximation to p(w|y)
Applying this to S(w) and expanding around w_MAP
S(w) ≈ S(w_MAP) + (w − w_MAP)^T ∇S(w)|_{w_MAP} + (1/2)(w − w_MAP)^T A (w − w_MAP)
notice the following:
As w_MAP minimises the function, the first derivatives are zero and the corresponding term in the Taylor expansion disappears.
The quantity A = ∇∇S(w)|_{w_MAP} can be simplified.
This is because
A = ∇∇(α E_W(w) + β E_y(w))|_{w_MAP} = α I + β ∇∇E_y(w_MAP)
277
Method 1: approximation to p(w|y)
Defining
Δw = w − w_MAP
we now have
S(w) ≈ S(w_MAP) + (1/2) Δw^T A Δw
The vector w_MAP can be obtained using any standard optimisation method (such as backpropagation).
The quantity ∇∇E_y(w) can be evaluated using an extended form of backpropagation.
278
A useful integral
Dropping for this slide only the special meanings usually given to vectors x and y, here is a useful standard integral:
If A ∈ R^{n×n} is symmetric then for b ∈ R^n and c ∈ R
∫_{R^n} exp( −(1/2)( x^T A x + x^T b + c ) ) dx = (2π)^{n/2} |A|^{−1/2} exp( −(1/2)( c − b^T A^{−1} b / 4 ) )
At the beginning of the course, two exercises were set involving the evaluation of this integral.
To make this easy to refer to, let's call it the BIG INTEGRAL.
279
Method 1: approximation to p(w|y)
We now have
p(w|y) ≈ (1/Z(α, β)) exp( −S(w_MAP) − (1/2) Δw^T A Δw )
where Δw = w − w_MAP and, using the BIG INTEGRAL,
Z(α, β) = (2π)^{W/2} |A|^{−1/2} exp(−S(w_MAP))
Our earlier discussion tells us that given a new input x we should calculate
p(Y|y, x) = ∫_{R^W} p(y|w, x) p(w|y) dw
p(y|w, x) is just the likelihood so...
280
Method 1: approximation to p(w|y)
The likelihood we're using is
p(y|w, x) = (1/√(2πσ_n²)) exp( −(y − f(w; x))² / (2σ_n²) ) ∝ exp( −(β/2)(y − f(w; x))² )
and plugging it into the integral gives
p(Y|y, x) ∝ ∫_{R^W} exp( −(β/2)(y − f(w; x))² ) exp( −(1/2) Δw^T A Δw ) dw
which has no solution!
We need another approximation...
281
Method 1: approximation to p(w|y)
If we assume that p(w|y) is narrow (this depends on A) then we can introduce a linear approximation of f(w; x) at w_MAP:
f(w; x) ≈ f(w_MAP; x) + g^T Δw
where g = ∇f(w; x)|_{w_MAP}.
By linear approximation we just mean the Taylor expansion for k = 1.
This leads to
p(Y|y, x) ∝ ∫_{R^W} exp( −(β/2)( y − f(w_MAP; x) − g^T Δw )² − (1/2) Δw^T A Δw ) dw
and this integral can be evaluated using the BIG INTEGRAL to give THE ANSWER...
282
Method 1: approximation to p(w|y)
Finally
p(Y|y, x) = (1/√(2πσ_y²)) exp( −(y − f(w_MAP; x))² / (2σ_y²) )
where
σ_y² = 1/β + g^T A^{−1} g.
Hooray! But what does it mean?
283
Method 1: approximation to p(w|y)
This is a Gaussian density, so we can now see that p(Y|y, x) peaks at f(w_MAP; x). That is, the MAP solution.
The variance σ_y² can be interpreted as a measure of certainty.
The first term of σ_y² is 1/β and corresponds to the noise.
The second term of σ_y² is g^T A^{−1} g and corresponds to the width of p(w|y).
Or interpreted graphically...
284
Method 1: approximation to p(w|y)
[Figure: typical behaviour of the Bayesian solution on a one-dimensional regression problem.]
285
Method II: Markov chain Monte Carlo (MCMC) methods
The second solution to the problem of performing integrals
I = ∫ F(w) p(w|y) dw
is to use Monte Carlo methods. The basic approach is to make the approximation
I ≈ (1/N) Σ_{i=1}^N F(w_i)
where the w_i have distribution p(w|y). Unfortunately, generating w_i with a given distribution can be non-trivial.
286
MCMC methods
A simple technique is to introduce a random walk, so
w_{i+1} = w_i + ε
where ε is zero mean spherical Gaussian and has small variance. Obviously the sequence w_i does not have the required distribution. However, we can use the Metropolis algorithm, which does not accept all the steps in the random walk:
1. If p(w_{i+1}|y) > p(w_i|y) then accept the step.
2. Else accept the step with probability p(w_{i+1}|y) / p(w_i|y).
In practice, the Metropolis algorithm has several shortcomings, and a great deal
of research exists on improved methods, see:
R. Neal, Probabilistic inference using Markov chain Monte Carlo methods,
University of Toronto, Department of Computer Science Technical Report
CRG-TR-93-1, 1993.
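A sketch of the Metropolis random walk (not from the notes). For the neural-network case the target would be exp(−S(w)); here a two-dimensional Gaussian is used so the samples are easy to sanity-check:

import math, random

def target(w):                        # unnormalised density, here exp(-||w||^2/2)
    return math.exp(-0.5 * (w[0] ** 2 + w[1] ** 2))

def metropolis(n_samples, step=0.5, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    samples = []
    for _ in range(n_samples):
        proposal = [wi + rng.gauss(0.0, step) for wi in w]
        # Accept if more probable, otherwise accept with the probability ratio.
        if rng.random() < min(1.0, target(proposal) / target(w)):
            w = proposal
        samples.append(list(w))
    return samples

samples = metropolis(20000)
print(sum(s[0] for s in samples) / len(samples))   # should be close to 0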
287
Approximate inference for Bayesian networks
MCMC methods also provide a method for performing approximate inference in
Bayesian networks.
Say a system can be in a state s and moves from state to state in discrete time steps according to a probabilistic transition
Pr(s → s′)
Let π_t(s) be the probability distribution for the state after t steps, so
π_{t+1}(s′) = Σ_s Pr(s → s′) π_t(s)
If at some point we obtain π_{t+1}(s) = π_t(s) for all s then we have reached a stationary distribution π. In this case
∀s′  π(s′) = Σ_s Pr(s → s′) π(s)
There is exactly one stationary distribution for a given Pr(s → s′) provided the latter obeys some simple conditions.
288
Approximate inference for Bayesian networks
The condition of detailed balance
∀s, s′  π(s) Pr(s → s′) = π(s′) Pr(s′ → s)
is sufficient to provide a π that is a stationary distribution. To see this simply sum:
Σ_s π(s) Pr(s → s′) = Σ_s π(s′) Pr(s′ → s) = π(s′) Σ_s Pr(s′ → s) = π(s′)
since Σ_s Pr(s′ → s) = 1.
If all this is looking a little familiar, it's because we now have an excellent application for the material in Mathematical Methods for Computer Science. That course used the alternative term local balance.
289
Approximate inference for Bayesian networks
Recalling once again the basic equation for performing probabilistic inference
Pr(Q|e) = (1/Z) Pr(Q ∧ e) = (1/Z) Σ_u Pr(Q, u, e)
where
Q is the query variable.
e is the evidence.
u are the unobserved variables.
1/Z normalises the distribution.
We are going to consider obtaining samples from the distribution Pr(Q, U|e).
290
Approximate inference for Bayesian networks
The evidence is fixed. Let the state of our system be a specific set of values for the query variable and the unobserved variables
s = (q, u_1, u_2, . . . , u_n) = (s_1, s_2, . . . , s_{n+1})
and define s_{−i} to be the state vector with s_i removed
s_{−i} = (s_1, . . . , s_{i−1}, s_{i+1}, . . . , s_{n+1})
To move from s to s′ we resample the ith component, with the new value s_i′ sampled according to
s_i′ ∼ Pr(S_i|s_{−i}, e)
This has detailed balance, and has Pr(Q, U|e) as its stationary distribution.
291
Approximate inference for Bayesian networks
To see that Pr(Q, U|e) is the stationary distribution
π(s) Pr(s → s′) = Pr(s|e) Pr(s_i′|s_{−i}, e)
               = Pr(s_i, s_{−i}|e) Pr(s_i′|s_{−i}, e)
               = Pr(s_i|s_{−i}, e) Pr(s_{−i}|e) Pr(s_i′|s_{−i}, e)
               = Pr(s_i|s_{−i}, e) Pr(s_i′, s_{−i}|e)
               = Pr(s′ → s) π(s′)
As a further simplification, sampling from Pr(S_i|s_{−i}, e) is equivalent to sampling S_i conditional on its parents, children and children's parents.
292
Approximate inference for Bayesian networks
So:
We successively sample the query variable and the unobserved variables, conditional on their parents, children and children's parents.
This gives us a sequence s_1, s_2, . . . which has been sampled according to Pr(Q, U|e).
Finally, note that as
Pr(Q|e) = Σ_u Pr(Q, u|e)
we can just ignore the values obtained for the unobserved variables. This gives us q_1, q_2, . . . with
q_i ∼ Pr(Q|e)
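A sketch of this sampler for the query Pr(C|l1, l2) in the alarm network (not from the notes). The Pr(A|C, G) and sensor values follow the earlier slides; the priors are made-up placeholders. Sampling each variable in proportion to the full joint with l1, l2 fixed is equivalent to conditioning on its Markov blanket.

import random

Pr_C, Pr_G = 0.05, 0.2                               # hypothetical priors
Pr_A = {(True, True): 0.98, (True, False): 0.96,
        (False, True): 0.2, (False, False): 0.08}    # Pr(a | C, G)
Pr_L1 = {True: 0.99, False: 0.08}                    # Pr(l1 | A)
Pr_L2 = {True: 0.6, False: 0.001}                    # Pr(l2 | A)

def joint(c, g, a, l1=True, l2=True):
    p = (Pr_C if c else 1 - Pr_C) * (Pr_G if g else 1 - Pr_G)
    pa = Pr_A[(c, g)]
    p *= pa if a else 1 - pa
    p *= Pr_L1[a] if l1 else 1 - Pr_L1[a]
    p *= Pr_L2[a] if l2 else 1 - Pr_L2[a]
    return p

def gibbs(n, seed=0):
    rng = random.Random(seed)
    c, g, a = True, True, True
    count_c = 0
    for _ in range(n):
        pt, pf = joint(True, g, a), joint(False, g, a)   # resample C
        c = rng.random() < pt / (pt + pf)
        pt, pf = joint(c, True, a), joint(c, False, a)   # resample G
        g = rng.random() < pt / (pt + pf)
        pt, pf = joint(c, g, True), joint(c, g, False)   # resample A
        a = rng.random() < pt / (pt + pf)
        count_c += c
    return count_c / n

print(gibbs(50000))        # estimate of Pr(c | l1, l2)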
293
Approximate inference for Bayesian networks
To see that the final step works, consider what happens when we estimate the expected value of some function of Q.
E[f(Q)] = Σ_q f(q) Pr(q|e)
       = Σ_q f(q) Σ_u Pr(q, u|e)
       = Σ_q Σ_u f(q) Pr(q, u|e)
so sampling using Pr(q, u|e) and ignoring the values for u obtained works exactly as required.
294
A (very) brief introduction into how to learn hyperparameters
So far in our coverage of the Bayesian approach to neural networks, the hyperparameters α and β were assumed to be known and fixed.
But this is not a good assumption because...
...α corresponds to the width of the prior and β to the noise variance.
So we really want to learn these from the data as well.
How can this be done?
We now take a look at one of several ways of addressing this problem.
295
The Bayesian approach to neural networks
Earlier we looked at the Bayesian approach to neural networks using the following notation. We have:
A neural network computing a function f(w; x).
A training sequence s = ((x_1, y_1), . . . , (x_m, y_m)), split into
y = ( y_1  y_2  · · ·  y_m )
and
X = ( x_1  x_2  · · ·  x_m )
The prior distribution p(w) is now on the weight vectors and Bayes' theorem tells us that
p(w|y) = p(y|w) p(w) / p(y)
In addition we have a Gaussian prior and a likelihood assuming Gaussian noise.
296
The Bayesian approach to neural networks
The prior and likelihood depend on α and β respectively so we now make this clear and write
p(w|y, α, β) = p(y|w, β) p(w|α) / p(y|α, β)
(Don't worry about recalling the actual expressions for the prior and likelihood just yet, they appear in a few slides' time.)
In the earlier slides we found that the Bayes classifier should in fact compute
p(Y|y, x, α, β) = ∫_{R^W} p(y|w, x, β) p(w|y, α, β) dw
and we found an approximation to this integral. (Again, the necessary parts of the result are repeated later.)
297
Hierarchical Bayes and the evidence
Let's write down directly something that might be useful to know:
p(α, β|y) = p(y|α, β) p(α, β) / p(y)
If we know p(α, β|y) then a straightforward approach is to use the values for α and β that maximise it.
Here is a standard trick: assume that the prior p(α, β) is flat, so that we can just maximise
p(y|α, β)
This is called type II maximum likelihood and is one common way of doing the job.
As usual there are other ways of handling α and β, some of which are regarded as more correct.
298
Hierarchical Bayes and the evidence
The quantity

p(y|α, β)

is called the evidence.

When we re-wrote our earlier equation for the posterior density of the weights, making α and β explicit, we found

p(w|y, α, β) = \frac{p(y|w, α, β) p(w|α, β)}{p(y|α, β)}

So the evidence is the denominator in this equation.

This is the common pattern and leads to the idea of hierarchical Bayes: the evidence for the hyperparameters at one level is the denominator in the relevant application of Bayes' theorem.
299
An expression for the evidence
We have already derived everything necessary to write an explicit equation for the evidence for the case of regression that we've been following.

First, as we already know a lot about expressions involving w, we can introduce it by the standard trick of marginalising:

p(y|α, β) = \int p(y, w|α, β) \, dw
          = \int p(y|w, α, β) p(w|α, β) \, dw
          = \int p(y|w, β) p(w|α) \, dw

where we've made the obvious independence simplifications.

The two densities in this integral are just the likelihood and prior we've already studied.

We've just conditioned on α and β, which previously were constants but are now being treated as random variables.
300
An expression for the evidence
Here are the actual expressions for the prior and likelihood.

The prior is

p(w|α) = \frac{1}{Z_W(α)} \exp(-α E_W(w))

where

Z_W(α) = \left(\frac{2π}{α}\right)^{W/2}   and   E_W(w) = \frac{1}{2} ||w||^2

and the likelihood is

p(y|w, β) = \frac{1}{Z_y(β)} \exp(-β E_y(w))

where

Z_y(β) = \left(\frac{2π}{β}\right)^{m/2}   and   E_y(w) = \frac{1}{2} \sum_{i=1}^{m} (y_i - h(w; x_i))^2

Both of these equations have been copied directly from earlier slides: there is nothing to add.
301
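As a concrete illustration of these two error terms, and of the quantity S(w) = α E_W(w) + β E_y(w) used on the next slides, here is a small NumPy sketch; the linear stand-in for the network h and all the variable names are assumptions made only to keep the example self-contained.

```python
import numpy as np

def E_W(w):
    # E_W(w) = (1/2) ||w||^2
    return 0.5 * float(np.dot(w, w))

def E_y(w, X, y, h):
    # E_y(w) = (1/2) * sum_i (y_i - h(w; x_i))^2
    predictions = np.array([h(w, x) for x in X])
    return 0.5 * float(np.sum((y - predictions) ** 2))

def S(w, X, y, h, alpha, beta):
    # S(w) = alpha * E_W(w) + beta * E_y(w); w_MAP minimises this.
    return alpha * E_W(w) + beta * E_y(w, X, y, h)

# Illustrative stand-in for the network: a linear model h(w; x) = w . x.
h = lambda w, x: float(np.dot(w, x))
```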
An expression for the evidence
That gives us

p(y|α, β) = \left(\frac{α}{2π}\right)^{W/2} \left(\frac{β}{2π}\right)^{m/2} \int \exp(-S(w)) \, dw

where

S(w) = α E_W(w) + β E_y(w)

This is exactly the integral we first derived an approximation for. Specifically

\int \exp(-S(w)) \, dw ≈ (2π)^{W/2} |A|^{-1/2} \exp(-S(w_{MAP}))

where

A = α I + β ∇∇E_y(w_{MAP})

and w_{MAP} is the maximum a posteriori solution.
302
An expression for the evidence
Putting all that together we get an expression for the logarithm of the evidence:

\log p(y|α, β) ≈ \frac{W}{2} \log α - \frac{m}{2} \log 2π + \frac{m}{2} \log β - \frac{1}{2} \log |A| - α E_W(w_{MAP}) - β E_y(w_{MAP})

Again, we're using the fact that we want to maximise the evidence, and this is equivalent to maximising its logarithm, which turns a product into a more friendly sum.
303
Maximising the evidence
We want to maximise this, so let's differentiate it with respect to α and β.

For α:

\frac{∂}{∂α} \log p(y|α, β) = \frac{W}{2α} - E_W(w_{MAP}) - \frac{1}{2} \frac{∂}{∂α} \log |A|

How do we handle the final term? This is straightforward if we can compute the eigenvalues of A.

Recall that the n eigenvalues λ_i and n eigenvectors v_i of an n × n matrix M are defined such that

M v_i = λ_i v_i   for i = 1, \ldots, n

and the eigenvectors are orthonormal:

v_i^T v_j = 1 if i = j, and 0 otherwise.

One standard result is that the determinant of a matrix is the product of its eigenvalues:

|M| = \prod_{i=1}^{n} λ_i
304
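The identity used on the next slide, that the derivative of log|A| with respect to α equals Trace(A^{-1}) when A = αI + M with M symmetric, is easy to check numerically; the random matrix and the numbers below are arbitrary choices made only for the test.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
M = B @ B.T                            # random symmetric matrix standing in for beta * Hessian of E_y
alpha, eps = 0.7, 1e-6

def log_det_A(a):
    # log |A| with A = a * I + M
    return np.linalg.slogdet(a * np.eye(5) + M)[1]

numerical = (log_det_A(alpha + eps) - log_det_A(alpha - eps)) / (2 * eps)
analytic = np.trace(np.linalg.inv(alpha * np.eye(5) + M))
print(numerical, analytic)             # the two values agree to numerical precision
```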
Maximising the evidence
We have

A = α I + β ∇∇E_y(w_{MAP})

Say the eigenvalues of β ∇∇E_y(w_{MAP}) are λ_i. (These can be computed using standard numerical algorithms.)

Then the eigenvalues of A are α + λ_i and

\frac{∂}{∂α} \log |A| = \frac{∂}{∂α} \log \prod_{i=1}^{W} (α + λ_i)
                      = \frac{∂}{∂α} \sum_{i=1}^{W} \log(α + λ_i)
                      = \sum_{i=1}^{W} \frac{1}{α + λ_i} \frac{∂}{∂α} (α + λ_i)
                      = \sum_{i=1}^{W} \frac{1}{α + λ_i}
                      = \mathrm{Trace}(A^{-1})

because M^{-1} has eigenvalues 1/λ_i and the trace of a matrix is equal to the sum of its eigenvalues.

Finally, equating the derivative to zero gives:

\frac{W}{2α} - E_W(w_{MAP}) - \frac{1}{2} \mathrm{Trace}(A^{-1}) = 0

or

α = \frac{1}{2 E_W(w_{MAP})} \left( W - \sum_{i=1}^{W} \frac{α}{α + λ_i} \right)

which can be used to update the value for α.
306
Maximising the evidence
We can now repeat the process to obtain an update for β:

\frac{∂}{∂β} \log p(y|α, β) = \frac{m}{2β} - E_y(w_{MAP}) - \frac{1}{2} \frac{∂}{∂β} \log |A|

In this case

\frac{∂}{∂β} \log |A| = \frac{∂}{∂β} \sum_{i=1}^{W} \log(α + λ_i)
                      = \sum_{i=1}^{W} \frac{1}{α + λ_i} \frac{∂λ_i}{∂β}
                      = \frac{1}{β} \sum_{i=1}^{W} \frac{λ_i}{α + λ_i}

Equating the derivative to zero gives

β = \frac{1}{2 E_y(w_{MAP})} \left( m - \sum_{i=1}^{W} \frac{λ_i}{α + λ_i} \right)

which can be used to update the value for β.
308
Maximising the evidence
Here's why the derivative ∂λ_i/∂β works.

Say

M = ∇∇E_y(w_{MAP})

so we're interested in ∂λ_i/∂β when the λ_i are the eigenvalues of βM. Thus

(βM) v_i = λ_i v_i

and using the fact that the eigenvectors are orthonormal

v_i^T (βM) v_i = λ_i v_i^T v_i = λ_i

So

λ_i = β v_i^T M v_i

and, since the eigenvectors do not depend on β,

\frac{∂λ_i}{∂β} = v_i^T M v_i = \frac{λ_i}{β}
309
Maximising the evidence
Summary:

Define

γ_t = \sum_{i=1}^{W} \frac{λ_i}{α_t + λ_i}

where the subscript denotes the fact that we're using the following equations to periodically update our estimates of α and β.

Collecting the two update equations together we have

α_{t+1} = \frac{γ_t}{2 E_W(w_{MAP})}

and

β_{t+1} = \frac{m - γ_t}{2 E_y(w_{MAP})}
310
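A minimal sketch of these re-estimation equations in NumPy, assuming we already have w_MAP together with E_W(w_MAP), E_y(w_MAP) and the Hessian ∇∇E_y(w_MAP); the function and argument names are illustrative only.

```python
import numpy as np

def update_hyperparameters(alpha, beta, E_W_map, E_y_map, hessian_E_y, m):
    """One re-estimation step for alpha and beta (illustrative names)."""
    # lambda_i are the eigenvalues of beta * Hessian(E_y) evaluated at w_MAP.
    lam = beta * np.linalg.eigvalsh(hessian_E_y)
    gamma = np.sum(lam / (alpha + lam))        # gamma_t = sum_i lambda_i / (alpha_t + lambda_i)
    new_alpha = gamma / (2.0 * E_W_map)        # alpha_{t+1} = gamma_t / (2 E_W(w_MAP))
    new_beta = (m - gamma) / (2.0 * E_y_map)   # beta_{t+1} = (m - gamma_t) / (2 E_y(w_MAP))
    return new_alpha, new_beta
```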
Maximising the evidence
This suggests a method for the overall learning process:

1. Choose the initial values α_0 and β_0 at random.

2. Choose an initial weight vector w according to the prior.

3. Use a standard optimisation algorithm to iteratively estimate w_{MAP}.

4. While the optimisation progresses, periodically use the equations above to re-estimate α and β.

Step 4 requires that we compute an eigendecomposition, which might well be time-consuming. If necessary we can make a simplification.

When m ≫ W it is reasonable to expect that γ_t ≈ W and so we can use

α_{t+1} = \frac{W}{2 E_W(w_{MAP})}

and

β_{t+1} = \frac{m}{2 E_y(w_{MAP})}
311
An alternative: integrate the hyperparameters out
While choosing α and β by maximising the evidence leads to an effective algorithm, it might be argued that a more correct way to deal with these parameters would be to integrate them out:

p(w|y) = \int\int p(w, α, β|y) \, dα \, dβ

(Recall the general equation for probabilistic inference where we integrate out unobserved random variables.)

Re-arranging this we have

\int\int p(w, α, β|y) \, dα \, dβ = \frac{1}{p(y)} \int\int p(y|w, α, β) p(w, α, β) \, dα \, dβ
                                  = \frac{1}{p(y)} \int\int p(y|w, α, β) p(w|α, β) p(α, β) \, dα \, dβ
                                  = \frac{1}{p(y)} \int\int p(y|w, β) p(w|α) p(α) p(β) \, dα \, dβ

where we're assuming α and β are independent.
312
An alternative: integrate the hyperparameters out
In order to continue we need to specify priors on α and β.

On this occasion we have a good reason to choose particular priors, as α and β are scale parameters.

In general, a scale parameter σ is one that appears in a density of the form

p(x|σ) = \frac{1}{σ} f\left(\frac{x}{σ}\right)

The standard deviation of a Gaussian density is an example.

What happens to this density if we scale x such that x' = cx?
313
Standard result number 1
We need to recall how to deal with transformations of continuous random variables.

Say we have a random variable x with probability density p_x(x).

We then transform x to y = f(x) where f is strictly increasing.

What is the probability density function of y? There is a standard method for computing this. (See NST maths, or the 1A Probability course.)

p_y(y) = \frac{p_x(f^{-1}(y))}{f'(f^{-1}(y))}
314
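This standard result is easy to check numerically; the sketch below pushes samples of a standard Gaussian through f(x) = e^x (an arbitrary choice) and compares an empirical density estimate with the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)           # x has the standard Gaussian density p_x
y = np.exp(x)                            # y = f(x) with f(x) = e^x, strictly increasing

def p_y(y_val):
    # p_y(y) = p_x(f^{-1}(y)) / f'(f^{-1}(y)), with f^{-1}(y) = log y and f'(x) = e^x.
    x_val = np.log(y_val)
    p_x = np.exp(-0.5 * x_val ** 2) / np.sqrt(2.0 * np.pi)
    return p_x / np.exp(x_val)

# Empirical density of y near y = 1, compared with the formula.
lo, hi = 0.95, 1.05
empirical = np.mean((y > lo) & (y < hi)) / (hi - lo)
print(empirical, p_y(1.0))               # the two numbers should be close
```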
An alternative: integrate the hyperparameters out
Applying this when x' = cx we have

f(x) = cx,   f^{-1}(x') = \frac{x'}{c},   f'(x) = c

and so

p_{x'}(x') = \frac{1}{cσ} f\left(\frac{x'}{cσ}\right) = \frac{1}{σ'} f\left(\frac{x'}{σ'}\right)

where σ' = cσ. Thus the transformation leaves the density essentially unchanged, and in particular we want the densities p(σ) and p(σ') to be identical.

It turns out that this forces the choice

p(σ) = \frac{c'}{σ}

for some constant c'. This is an improper prior and it is conventional to take c' = 1.
315
Standard result number 2
Returning to the integral of interest

\frac{1}{p(y)} \int\int p(y|w, β) p(w|α) p(α) p(β) \, dα \, dβ

Taking the integral for α first, and using the prior p(α) = 1/α, we have

\int p(w|α) p(α) \, dα = \int \frac{1}{Z_W(α)} \exp(-α E_W(w)) \frac{1}{α} \, dα
                       = \int \left(\frac{α}{2π}\right)^{W/2} \frac{1}{α} \exp\left(-\frac{α}{2} ||w||^2\right) dα

and to evaluate this we use the following standard result:

\int_0^∞ x^n \exp(-ax) \, dx = \frac{Γ(n + 1)}{a^{n+1}}

where n > -1 and a > 0. So the integral becomes

(2π)^{-W/2} \frac{Γ(W/2)}{E_W(w)^{W/2}}
316
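The standard Gamma-function result can itself be checked numerically, for example with SciPy; the values of n and a below are arbitrary.

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

n, a = 1.5, 2.3     # e.g. W = 5 gives n = W/2 - 1 = 1.5; a plays the role of E_W(w)
numerical, _ = quad(lambda x: x ** n * np.exp(-a * x), 0, np.inf)
analytic = gamma(n + 1) / a ** (n + 1)
print(numerical, analytic)      # these agree to quadrature accuracy
```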
An alternative: integrate the hyperparameters out
Repeating the process for β and using the same standard result we have

\int p(y|w, β) p(β) \, dβ = \int \left(\frac{β}{2π}\right)^{m/2} \frac{1}{β} \exp(-β E_y(w)) \, dβ
                          = (2π)^{-m/2} \frac{Γ(m/2)}{E_y(w)^{m/2}}

Combining the two expressions we obtain

-\log p(w|y) = -\log \left[ \frac{1}{p(y)} (2π)^{-W/2} \frac{Γ(W/2)}{E_W(w)^{W/2}} (2π)^{-m/2} \frac{Γ(m/2)}{E_y(w)^{m/2}} \right]
             = \frac{W}{2} \log E_W(w) + \frac{m}{2} \log E_y(w) + \text{constant}

and we want to minimise this, so we need

\frac{W}{2} \frac{1}{E_W(w)} \frac{∂E_W(w)}{∂w} + \frac{m}{2} \frac{1}{E_y(w)} \frac{∂E_y(w)}{∂w} = 0
317
An alternative: integrate the hyperparameters out
For comparison, the corresponding quantity when α and β are fixed and we use the evidence approach is

-\log p(w|y, α, β) = -\log \left[ \frac{1}{p(y|α, β)} \frac{1}{Z_W(α) Z_y(β)} \exp(-(α E_W(w) + β E_y(w))) \right]
                   = α E_W(w) + β E_y(w) + \text{constant}

and we want to minimise this, so we need

α \frac{∂E_W(w)}{∂w} + β \frac{∂E_y(w)}{∂w} = 0

This should make us VERY VERY HAPPY because if we equate the two conditions we get

α = \frac{W}{2 E_W(w)}   and   β = \frac{m}{2 E_y(w)}
and so the result for integrating out the hyperparameters agrees with the result for
optimising the evidence.
318
Reinforcement Learning
We now examine:
Some potential shortcomings of hidden Markov models, and of supervised
learning.
An extension known as the Markov Decision Process (MDP).
The way in which we might learn from rewards gained as a result of acting
within an environment.
Specific, simple algorithms for performing such learning, and their convergence properties.
Reading: Russell and Norvig, chapter 21. Mitchell chapter 13.
319
Reinforcement learning and HMMs
Hidden Markov Models (HMMs) are appropriate when our agent models the
world as follows
[Figure: the HMM as a graphical model, a chain of hidden states S_0, S_1, S_2, S_3, ... with prior Pr(S_0) and transition model Pr(S_t|S_{t-1}), each state S_t generating an evidence variable E_t via the sensor model Pr(E_t|S_t).]
and only wants to infer information about the state of the world on the basis of
observing the available evidence.
This might be criticised as un-necessarily restricted, although it is very effective
for the right kind of problem.
320
Reinforcement learning and supervised learning
Supervised learners learn from specifically labelled chunks of information:

[Figure: a set of labelled training examples (x_1, 1), (x_2, 1), (x_3, 0), ... together with a new input x whose label is unknown.]
This might also be criticised as un-necessarily restricted: there are other ways to
learn.
321
Reinforcement learning: the basic case
We now begin to model the world in a more realistic way as follows:
[Figure: a sequence of states S_0 → S_1 → S_2 → S_3 → ..., where each transition results from the agent performing an action and yields a reward.]
In any state:
Perform an action a to move to a new state. (There may be many possibilities.)
Receive a reward r depending on the start state and action.
The agent can perform actions in order to change the world's state.
If the agent performs an action in a particular state, then it gains a corresponding
reward.
322
Deterministic Markov Decision Processes
Formally, we have a set of states
S = {s_1, s_2, \ldots, s_n}

and in each state we can perform one of a set of actions

A = {a_1, a_2, \ldots, a_m}.

We also have a function

S : S × A → S

such that S(s, a) is the new state resulting from performing action a in state s, and a function

R : S × A → ℝ

such that R(s, a) is the reward obtained by executing action a in state s.
323
Deterministic Markov Decision Processes
From the point of view of the agent, there is a matter of considerable importance:
The agent does not have access to the functions S and R.

It therefore has to learn a policy, which is a function

p : S → A
What might the agent use as its criterion for learning a policy?
324
Measuring the quality of a policy
Say we start in a state at time t, denoted s_t, and we follow a policy p. At each future step in time we get a reward. Denote the rewards r_t, r_{t+1}, r_{t+2}, \ldots and so on.

A common measure of the quality of a policy p is the discounted cumulative reward

V^p(s_t) = \sum_{i=0}^{∞} γ^i r_{t+i} = r_t + γ r_{t+1} + γ^2 r_{t+2} + \cdots

where 0 ≤ γ < 1 is a constant, which defines a trade-off for how much we value immediate rewards against future rewards.

The intuition for this measure is that, on the whole, we should like our agent to prefer rewards gained quickly.
325
Measuring the quality of a policy
Other common measures are the average reward

\lim_{T → ∞} \frac{1}{T} \sum_{i=0}^{T} r_{t+i}

and the finite horizon reward

\sum_{i=0}^{T} r_{t+i}

In these notes we will only address the discounted cumulative reward.
326
Two important issues
Note that in this kind of problem we need to address two particularly relevant
issues:
The temporal credit assignment problem: that is, how do we decide which specific actions are important in obtaining a reward?
The exploration/exploitation problem. How do we decide between exploiting
the knowledge we already have, and exploring the environment in order to
possibly obtain new (and more useful) knowledge?
We will see later how to deal with these.
327
The optimal policy
Ultimately, our learner's aim is to learn the optimal policy

p_{opt} = \operatorname{argmax}_p V^p(s)

for all s. We will denote the optimal discounted cumulative reward as

V_{opt}(s) = V^{p_{opt}}(s).
How might we go about learning the optimal policy?
328
Learning the optimal policy
The only information we have during learning is the individual rewards obtained
from the environment.
We could try to learn V_{opt}(s) directly, so that states can be compared:

Consider s as better than s' if V_{opt}(s) > V_{opt}(s').

However we actually want to compare actions, not states. Learning V_{opt}(s) might help as

p_{opt}(s) = \operatorname{argmax}_a [R(s, a) + γ V_{opt}(S(s, a))]

but only if we know S and R.

As we are interested in the case where these functions are not known, we need something slightly different.
329
The Q function
The trick is to define the following function:

Q(s, a) = R(s, a) + γ V_{opt}(S(s, a))

This function specifies the discounted cumulative reward obtained if you do action a in state s and then follow the optimal policy.

As

p_{opt}(s) = \operatorname{argmax}_a Q(s, a)

then provided one can learn Q it is not necessary to have knowledge of S and R to obtain the optimal policy.
330
The Q function
Note also that

V_{opt}(s) = \max_{a'} Q(s, a')

and so

Q(s, a) = R(s, a) + γ \max_{a'} Q(S(s, a), a')

which suggests a simple learning algorithm: maintain a table of estimates Q', and whenever action a is performed in state s, update the corresponding entry using

Q'(s, a) ← R(s, a) + γ \max_{a'} Q'(S(s, a), a')

Note that this can be done in episodes. For example, in learning to play games, we can play multiple games, each being a single episode.
332
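Here is a minimal Python sketch of the resulting Q-learning algorithm on an invented deterministic MDP (a four-state chain with a single rewarding transition); the environment, the purely random action choice and the episode lengths are illustrative assumptions, not part of the notes.

```python
import random

# Illustrative deterministic MDP: states 0..3 in a chain, actions 'left'/'right',
# a reward of 10 for stepping into state 3 and 0 otherwise.
STATES, ACTIONS, GAMMA = range(4), ('left', 'right'), 0.9

def S(s, a):                         # transition function S(s, a)
    return max(s - 1, 0) if a == 'left' else min(s + 1, 3)

def R(s, a):                         # reward function R(s, a)
    return 10 if S(s, a) == 3 and s != 3 else 0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # the table of estimates Q'

for episode in range(500):
    s = 0
    for step in range(20):
        a = random.choice(ACTIONS)   # purely exploratory action choice
        s_next = S(s, a)
        # The Q-learning update: Q'(s,a) <- R(s,a) + gamma * max_a' Q'(S(s,a), a')
        Q[(s, a)] = R(s, a) + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        s = s_next

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)                        # should prefer 'right' in states 0, 1 and 2
```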
Convergence of Q-learning
This looks as though it might converge!
Note that, if the rewards are at least 0 and we initialise Q' to 0, then

∀ n, s, a: Q'_{n+1}(s, a) ≥ Q'_n(s, a)

and

∀ n, s, a: Q(s, a) ≥ Q'_n(s, a) ≥ 0
However, we need to be a bit more rigorous than this...
333
Convergence of Q-learning
If:
1. The agent is operating in an environment that is a deterministic MDP.
2. Rewards are bounded, in the sense that there is a constant c > 0 such that

∀ s, a: |R(s, a)| < c

3. All possible pairs s and a are visited infinitely often.

Then the Q-learning algorithm converges, in the sense that

∀ a, s: Q'_n(s, a) → Q(s, a)

as n → ∞.
334
Convergence of Q-learning
This is straightforward to demonstrate.
Using condition 3, take successive stretches of time, in each of which every pair s, a occurs at least once.

Define

Δ(n) = \max_{s,a} |Q'_n(s, a) - Q(s, a)|

the maximum error in Q' at step n.

What happens when Q'_n(s, a) is updated to Q'_{n+1}(s, a)?
335
Convergence of Q-learning
We have

|Q'_{n+1}(s, a) - Q(s, a)|
= |(R(s, a) + γ \max_{a'} Q'_n(S(s, a), a')) - (R(s, a) + γ \max_{a'} Q(S(s, a), a'))|
= γ |\max_{a'} Q'_n(S(s, a), a') - \max_{a'} Q(S(s, a), a')|
≤ γ \max_{a'} |Q'_n(S(s, a), a') - Q(S(s, a), a')|
≤ γ \max_{s',a'} |Q'_n(s', a') - Q(s', a')|
= γ Δ(n).

Since 0 ≤ γ < 1, the maximum error therefore shrinks by a factor of at least γ over each stretch of time in which every pair s, a is visited, and convergence as described follows.
336
Choosing actions to perform
We have not yet answered the question of how to choose actions to perform during
learning.
One approach is to choose actions based on our current estimate Q'. For instance

action chosen in current state s = \operatorname{argmax}_a Q'(s, a).
However we have already noted the trade-off between exploration and exploita-
tion. It makes more sense to:
Explore during the early stages of training.
Exploit during the later stages of training.
This seems particularly important in the light of condition 3 of the convergence
proof.
337
Choosing actions to perform
One way in which to choose actions that incorporates these requirements is to introduce a constant k > 0 and choose actions probabilistically according to

Pr(action a | state s) = \frac{k^{Q'(s, a)}}{\sum_{a'} k^{Q'(s, a')}}

Note that:

If k is small this promotes exploration.

If k is large this promotes exploitation.

We can vary k as training progresses.
338
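A short sketch of this action-selection rule in Python; the name k for the constant follows Mitchell, and the dictionary representation of Q' is just an assumption made for illustration.

```python
import random

def choose_action(Q, s, actions, k):
    # Pr(a | s) proportional to k ** Q'(s, a); larger k concentrates on the greedy action.
    weights = [k ** Q[(s, a)] for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```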
Improving the training process
There are two simple ways in which the process can be improved:
1. If training is episodic, we can store the rewards obtained during an episode
and update backwards at the end.
This allows better updating at the expense of requiring more memory.
2. We can remember information about rewards and occasionally re-use it by
re-training.
339
Nondeterministic MDPs
The Q-learning algorithm generalises easily to a more realistic situation, where
the outcomes of actions are probabilistic.
Instead of the functions S and R we have probability distributions

Pr(new state | current state, action)

and

Pr(reward | current state, action)

and we now use S(s, a) and R(s, a) to denote the corresponding random variables.

We now have

V^p(s_t) = E\left[\sum_{i=0}^{∞} γ^i r_{t+i}\right]

and the best policy p_{opt} maximises V^p(s_t).
340
Q-learning for nondeterministic MDPs
We now have

Q(s, a) = E[R(s, a)] + γ \sum_{s'} Pr(s'|s, a) V_{opt}(s')
        = E[R(s, a)] + γ \sum_{s'} Pr(s'|s, a) \max_{a'} Q(s', a')

and the rule for learning becomes

Q'_{n+1}(s, a) = (1 - α_{n+1}) Q'_n(s, a) + α_{n+1} \left[ R(s, a) + γ \max_{a'} Q'_n(S(s, a), a') \right]

with

α_{n+1} = \frac{1}{1 + v_{n+1}(s, a)}

where v_{n+1}(s, a) is the number of times the pair s and a has been visited so far.
341
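A sketch of this update rule for the nondeterministic case; the dictionaries, the visit counter and the function signature are illustrative assumptions, and the sampled reward and next state are presumed to come from whatever environment or simulator is being used.

```python
from collections import defaultdict

GAMMA = 0.9
Q = defaultdict(float)          # Q'(s, a), default 0
visits = defaultdict(int)       # v(s, a): number of times (s, a) has been updated so far

def q_update(s, a, reward, s_next, actions):
    # One nondeterministic Q-learning step:
    # Q'_{n+1}(s,a) = (1 - alpha) Q'_n(s,a) + alpha [r + gamma max_a' Q'_n(s_next, a')]
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```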
Convergence of Q-learning for nondeterministic MDPs
If:
1. The agent is operating in an environment that is a nondeterministic MDP.
2. Rewards are bounded, in the sense that there is a constant c > 0 such that

∀ s, a: |R(s, a)| < c

3. All possible pairs s and a are visited infinitely often.

4. n_i(s, a) is the ith time that we do action a in state s.
and also...
342
Convergence of Q-learning for nondeterministic MDPs
...we have
0 ≤ α_n < 1

\sum_{i=1}^{∞} α_{n_i(s,a)} = ∞

\sum_{i=1}^{∞} α^2_{n_i(s,a)} < ∞

then with probability 1 the Q-learning algorithm converges, in the sense that

∀ a, s: Q'_n(s, a) → Q(s, a)

as n → ∞.
343
Alternative representation for the Q' table

But there's always a catch...

We have to store the table for Q':

Even for quite straightforward problems it is HUGE!!! - certainly big enough that it can't be stored.

A standard approach to this problem is, for example, to represent it as a neural network.

One way might be to make s and a the inputs to the network and train it to produce Q'(s, a) as its output.