AI - Unit 4 - Part 4 - Full_Inference_Bayesian Networks

Unit 4

Part 4
Bayes' rule and its use
Representing knowledge in an uncertain domain
The semantics of Bayesian networks
Exact and approximate inference in Bayesian networks

Hemamalini S
Conditional probability
Conditional or posterior probabilities
e.g., P (cavity|toothache) = 0.8
i.e., given that toothache is all I know, "if
toothache then 80% chance of cavity"

(Notation for conditional distributions:
P(Cavity|Toothache) is a 2-element vector of 2-element vectors)

If we know more, e.g., cavity is also given, then we have


P (cavity|toothache, cavity) = 1
Note: the less specific belief remains valid after more evidence arrives,
but is not always useful

New evidence may be irrelevant, allowing simplification, e.g.,


P(cavity|toothache, 49ersWin) = P(cavity|toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial
Conditional probability
Definition of conditional probability:
P(a|b) = P(a ∧ b) / P(b)   if P(b) ≠ 0
Product rule gives an alternative formulation:
P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

A general version holds for whole distributions, e.g.,


P(W eather, Cavity) = P(W eather|Cavity)P(Cavity)

Chain rule is derived by successive application of product rule:


P(X1, . . . , Xn) = P(X1, . . . , Xn−1) P(Xn|X1, . . . , Xn−1)
                = P(X1, . . . , Xn−2) P(Xn−1|X1, . . . , Xn−2) P(Xn|X1, . . . , Xn−1)
                = . . .
                = Π_{i=1}^{n} P(Xi|X1, . . . , Xi−1)

Inference by enumeration
Start with the joint distribution:

toothache ¬ toothache
catch ¬ catch catch ¬ catch
cavity .108 .012 .072 .008
¬ cavity .016 .064 .144 .576
For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
Inference by enumeration
Start with the joint distribution:
toothache ¬ toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬ cavity .016 .064 .144 .576

For any proposition φ, sum the atomic events where it is true:


P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
Inference by enumeration
Start with the joint distribution:

toothache ¬ toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬cavity .016 .064 .144 .576
For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Inference by enumeration
Start with the joint distribution:
toothache ¬ toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬ cavity .016 .064 .144 .576

Can also compute conditional probabilities:


P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                     = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
P(cavity|toothache) = P(cavity ∧ toothache) / P(toothache)
                    = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064)
                    = 0.6
From the example, we can extract a general inference
procedure. We begin with the case in which the query
involves a single variable, X (Cavity in the example).
Let E be the list of evidence variables (just Toothache in the
example), let e be the list of observed values for them, and
let Y be the remaining unobserved variables (just Catch in
the example).
The query is P(X | e) and can be evaluated as
P(X | e) = α P(X, e) = α Σ y P(X, e, y)
where the summation is over all possible ys (i.e., all possible
combinations of values of the unobserved variables Y).
Notice that together the variables X, E, and Y constitute the
complete set of variables for the domain, so P(X, e, y) is
simply a subset of probabilities from the full joint
distribution.
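
The procedure above can be written directly as a short program. The following is a minimal Python sketch (not part of the original notes) of inference by enumeration over the dentist joint distribution; the variable names and dictionary layout are illustrative choices.

```python
# Inference by enumeration over the full joint distribution P(Cavity, Toothache, Catch).
joint = {
    # (cavity, toothache, catch): probability, from the table above
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
VARS = ["Cavity", "Toothache", "Catch"]

def enumerate_query(query_var, evidence):
    """Return P(query_var | evidence) by summing out the hidden variables."""
    dist = {}
    for qval in (True, False):
        total = 0.0
        for world, p in joint.items():
            assignment = dict(zip(VARS, world))
            if assignment[query_var] != qval:
                continue
            if all(assignment[var] == val for var, val in evidence.items()):
                total += p
        dist[qval] = total
    alpha = 1.0 / sum(dist.values())          # normalization constant
    return {v: alpha * p for v, p in dist.items()}

print(enumerate_query("Cavity", {"Toothache": True}))   # ~{True: 0.6, False: 0.4}
```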
Normalization

toothache ¬ toothache
catch ¬catch catch ¬catch
cavity .108 .012 .072 .008
¬ cavity .016 .064 .144 .576

Denominator can be viewed as a normalization constant α


P(Cavity|toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
= α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩

General idea: compute distribution on query variable
by fixing evidence variables and summing over hidden variables
Independence
A and B are independent iff
P (A|B) = P (A) or P (B|A) = P (B) or P(A, B) = P(A)P(B)

P(Toothache, Catch, Cavity, Weather) decomposes into P(Toothache, Catch, Cavity) and P(Weather):

(Figure: Cavity with children Toothache and Catch, plus Weather as a separate, unconnected node.)

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
Conditional independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn’t depend
on whether I have a toothache:
(1) P (catch|toothache, cavity) = P (catch|cavity)

The same independence holds if I haven’t got a cavity:


(2) P (catch|toothache, ¬cavity) = P (catch|¬cavity)

Catch is conditionally independent of Toothache given Cavity:


P(Catch|Toothache, Cavity) = P(Catch|Cavity)

Equivalent statements:
P(Toothache|Catch, Cavity) = P(Toothache|Cavity)
P(Toothache, Catch|Cavity) = P(Toothache|Cavity) P(Catch|Cavity)
Conditional independence contd.
Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache|Catch, Cavity) P(Catch, Cavity)
= P(Toothache|Catch, Cavity) P(Catch|Cavity) P(Cavity)
= P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)

I.e., 2 + 2 + 1 = 5 independent numbers.


In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
Bayes’ Rule
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

⇒ Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)

or in distribution form:
P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)

Useful for assessing diagnostic probability from causal probability:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
For example, a doctor knows that the disease meningitis
causes the patient to have a stiff neck, say, 70% of the
time.
The doctor also knows some unconditional facts: the prior
probability that a patient has meningitis is 1/50,000, and
the prior probability that any patient has a stiff neck is 1%.
Letting s be the proposition that the patient has a stiff neck
and m be the proposition that the patient has meningitis,
we have
P(s|m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P(m|s) =(P(s|m)P(m))/P(s)
= (0.7 × 1/50000)/ 0.01
= 0.0014 .
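As a quick numeric check, the meningitis calculation can be reproduced in a couple of lines of Python (an illustrative sketch, not part of the original notes):

```python
p_s_given_m = 0.7        # P(stiff neck | meningitis)
p_m = 1 / 50000          # prior P(meningitis)
p_s = 0.01               # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)                       # 0.0014
```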
The notion of independence in the previous section
provides a clue, but needs refining.
It would be nice if Toothache and Catch were
independent, but they are not:
if the probe catches in the tooth, then it is likely that the
tooth has a cavity and that the cavity causes a toothache.
These variables are independent, however, given the
presence or the absence of a cavity.
Each is directly caused by the cavity, but neither has a
direct effect on the other: toothache depends on the state
of the nerves in the tooth, whereas the probe’s accuracy
depends on the dentist’s skill, to which the toothache is
irrelevant. Mathematically, this property is written as
P(toothache ∧ catch | Cavity) = P(toothache |
Cavity)P(catch | Cavity)
This equation expresses the conditional independence
of toothache and catch given Cavity.
P(Cavity | toothache ∧ catch)
= α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

The general definition of conditional independence


of two variables X and Y , given a third variable Z, is
P(X, Y | Z) = P(X | Z)P(Y | Z) .

In the dentist domain, for example, it seems


reasonable to assert conditional independence of the
variables Toothache and Catch, given Cavity:
P(Toothache, Catch | Cavity) = P(Toothache |
Cavity)P(Catch | Cavity)
As with absolute independence, the equivalent forms
P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z) can also be used.
Bayes’ Rule and conditional independence

P(Cavity|toothache ∧ catch)
= α P(toothache ∧ catch|Cavity) P(Cavity)
= α P(toothache|Cavity) P(catch|Cavity) P(Cavity)

This is an example of a naive Bayes model:

P(Cause, Effect1, . . . , Effectn) = P(Cause) Πi P(Effecti|Cause)

(Figure: Cavity with children Toothache and Catch; generic Cause with children Effect1, . . . , Effectn.)

Total number of parameters is linear in n

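A minimal Python sketch of this naive Bayes combination (not part of the original notes). The conditional probabilities are derived from the dentist joint table earlier in these notes: P(cavity) = 0.2, P(toothache|cavity) = 0.6, P(catch|cavity) = 0.9, P(toothache|¬cavity) = 0.1, P(catch|¬cavity) = 0.2.

```python
p_cavity = 0.2
p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity = c)
p_catch = {True: 0.9, False: 0.2}       # P(catch | Cavity = c)

# P(Cavity | toothache ∧ catch) ∝ P(Cavity) P(toothache|Cavity) P(catch|Cavity)
unnormalized = {c: (p_cavity if c else 1 - p_cavity) * p_toothache[c] * p_catch[c]
                for c in (True, False)}
alpha = 1.0 / sum(unnormalized.values())
print({c: alpha * p for c, p in unnormalized.items()})   # ~{True: 0.871, False: 0.129}
```

The same answer (0.108 / (0.108 + 0.016) ≈ 0.871) falls out of the full joint table directly, which confirms the factorization.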
Wumpus World
(Figure: the 4×4 wumpus grid. Squares [1,1], [1,2], and [2,1] have been visited and are OK; breezes (B) have been observed in [1,2] and [2,1].)
Pij = true iff [i, j] contains a pit

Bij = true iff [i, j] is breezy
Include only B1,1, B1,2, B2,1 in the probability model
Specifying the probability model
The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1)
Apply product rule: P(B1,1, B1,2, B2,1 |P1,1, . . . , P4,4)P(P1,1, . . . , P4,4)
(Do it this way to get P (Effect|Cause).)

First term: 1 if pits are adjacent to breezes, 0 otherwise


Second term: pits are placed randomly, probability 0.2 per square:

P(P1,1, . . . , P4,4) = Π_{i,j = 1,1}^{4,4} P(Pi,j) = 0.2^n × 0.8^(16−n)
for n pits.
Observations and query
We know the following facts:
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)

Define Unknown = the Pij's other than P1,3 and Known


For inference by enumeration, we have
P(P1,3|known, b) = α Σ_unknown P(P1,3, unknown, known, b)

Grows exponentially with number of squares!
Using conditional independence
Basic insight: observations are conditionally independent of other hidden
squares given neighbouring hidden squares
(Figure: the 4×4 grid partitioned into the KNOWN squares [1,1], [1,2], [2,1], the QUERY square [1,3], the FRINGE squares adjacent to the known squares, and the remaining OTHER squares.)

Define Unknown = Fringe ∪Other


P(b|P1,3, Known, Unknown) = P(b|P1,3, Known, Fringe)

Manipulate query into a form where we can use this!
Using conditional independence contd.

P(P1,3|known, b)
= α Σ_unknown P(P1,3, unknown, known, b)
= α Σ_unknown P(b|P1,3, known, unknown) P(P1,3, known, unknown)
= α Σ_fringe Σ_other P(b|known, P1,3, fringe, other) P(P1,3, known, fringe, other)
= α Σ_fringe Σ_other P(b|known, P1,3, fringe) P(P1,3, known, fringe, other)
= α Σ_fringe P(b|known, P1,3, fringe) Σ_other P(P1,3, known, fringe, other)
= α Σ_fringe P(b|known, P1,3, fringe) Σ_other P(P1,3) P(known) P(fringe) P(other)
= α P(known) P(P1,3) Σ_fringe P(b|known, P1,3, fringe) P(fringe) Σ_other P(other)
= α′ P(P1,3) Σ_fringe P(b|known, P1,3, fringe) P(fringe)
Using conditional independence contd.
(Figure: the five fringe models consistent with the observations. The three models with a pit in [1,3] have probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, and 0.8 × 0.2 = 0.16; the two models without a pit in [1,3] have probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16.)

P(P1,3|known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩
                ≈ ⟨0.31, 0.69⟩

P(P2,2|known, b) ≈ ⟨0.86, 0.14⟩

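The fringe calculation above is easy to check with a small Python sketch (not part of the original notes). It enumerates the possible pit assignments for the query square [1,3] and the two fringe squares [2,2] and [3,1], keeps only the assignments consistent with the observed breezes, and weights each by the 0.2 pit prior.

```python
p_pit = 0.2

def posterior_p13():
    totals = {True: 0.0, False: 0.0}
    for p13 in (True, False):
        for p22 in (True, False):
            for p31 in (True, False):
                # breeze in [1,2] requires a pit in [1,3] or [2,2];
                # breeze in [2,1] requires a pit in [2,2] or [3,1]
                if not (p13 or p22) or not (p22 or p31):
                    continue
                totals[p13] += ((p_pit if p22 else 1 - p_pit) *
                                (p_pit if p31 else 1 - p_pit))
        totals[p13] *= p_pit if p13 else 1 - p_pit   # prior for the query square
    alpha = 1.0 / sum(totals.values())
    return {v: alpha * t for v, t in totals.items()}

print(posterior_p13())   # ~{True: 0.31, False: 0.69}
```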
Summary
Probability is a rigorous formalism for uncertain knowledge
Joint probability distribution specifies probability of every atomic event
Queries can be answered by summing over atomic events
For nontrivial domains, we must find a way to reduce the joint size
Independence and conditional independence provide the tools
Bayesian networks

Chapter 14.1–2
Bayesian networks
A simple, graphical notation for conditional independence
assertions and hence for compact specification of full joint
distributions.
Definition:
A Bayesian network is a directed graph in which each
node is annotated with quantitative probability
information. The full specification is as follows:
1. Each node corresponds to a random variable, which
may be discrete or continuous.
2. A set of directed links or arrows connects pairs of
nodes. If there is an arrow from node X to node Y , X is
said to be a parent of Y. The graph has no directed cycles
(and hence is a directed acyclic graph, or DAG).
3. Each node Xi has a conditional probability
distribution P(Xi | Parents(Xi)) that quantifies the effect
of the parents on the node.
Example
Topology of network encodes conditional independence assertions:

(Figure: network with Weather as an isolated node and Cavity as the parent of Toothache and Catch.)

Weather is independent of the other variables

Toothache and Catch are conditionally independent given Cavity

Intuitively, the network represents the fact that Cavity is a direct cause of Toothache and Catch, whereas no direct causal relationship exists between Toothache and Catch.
You have a new burglar alarm installed at home.

It is fairly reliable at detecting a burglary, but also responds on


occasion to minor earthquakes.
(This example is due to Judea Pearl, a resident of Los Angeles—
hence the acute interest in earthquakes.)

You also have two neighbors, John and Mary, who have promised
to call you at work when they hear the alarm.

John nearly always calls when he hears the alarm, but sometimes
confuses the telephone ringing with the alarm and calls then, too.

Mary, on the other hand, likes rather loud music and often misses
the alarm altogether. Given the evidence of who has or has not
called, we would like to estimate the probability of a burglary.
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a
burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls


Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call

Example contd.

Burglary:   P(B) = .001        Earthquake:   P(E) = .002

Alarm:      B E   P(A|B,E)
            T T   .95
            T F   .94
            F T   .29
            F F   .001

JohnCalls:  A   P(J|A)          MaryCalls:   A   P(M|A)
            T   .90                          T   .70
            F   .05                          F   .01
The network structure shows that burglary and
earthquakes directly affect the probability of the
alarm’s going off, but whether John and Mary call
depends only on the alarm.

The network thus represents our assumptions


that they do not perceive burglaries directly, they
do not notice minor earthquakes,
and they do not confer before calling.
Compactness
A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values.

Each row requires one number p for Xi = true
(the number for Xi = false is just 1 − p).

If each variable has no more than k parents,
the complete network requires O(n · 2^k) numbers,
i.e., grows linearly with n, vs. O(2^n) for the full joint distribution.

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
A generic entry in the joint distribution is the probability of a
conjunction of particular assignments to each variable, such
as P(X1 = x1 ∧ ... ∧ Xn = xn).

We use the notation P(x1,...,xn) as an abbreviation for this.

Thus, each entry in the joint distribution is represented by


the product of the appropriate elements of the conditional
probability tables (CPTs) in the Bayesian network.

Global semantics
“Global” semantics defines the full joint distribution
as the product of the local conditional distributions:

P(x1, . . . , xn) = Π_{i=1}^{n} P(xi|parents(Xi))

e.g., to illustrate this, we can calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both John and Mary call:

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063

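A short Python check of this product (not part of the original notes); the CPT numbers are taken from the burglary network figure above.

```python
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(j | A)
p_m = {True: 0.70, False: 0.01}                      # P(m | A)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
prob = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(prob)   # ≈ 0.00063
```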
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of
conditional independence guarantees the required global semantics

1. Choose an ordering of variables X1, . . . , Xn

2. For i = 1 to n
   add Xi to the network
   select parents from X1, . . . , Xi−1 such that
   P(Xi|Parents(Xi)) = P(Xi|X1, . . . , Xi−1)

This choice of parents guarantees the global semantics:

P(X1, . . . , Xn) = Π_{i=1}^{n} P(Xi|X1, . . . , Xi−1)   (chain rule)
                = Π_{i=1}^{n} P(Xi|Parents(Xi))          (by construction)
The topological semantics specifies that each variable is
conditionally independent of its non-descendants, given its
parents.
MARKOV BLANKET
A node is conditionally independent of all other nodes in the
network, given its parents, children, and children’s parents—
that is, given its Markov blanket.
Intuitively, the parents of node Xi should contain all those
nodes in X1, ..., Xi−1 that directly influence Xi.

For example, suppose we have completed the network in


Figure 14.2 except for the choice of parents for
MaryCalls.

MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly.

Intuitively, our knowledge of the domain tells us that


these events influence Mary’s calling behavior only
through their effect on the alarm.

Also, given the state of the alarm, whether John calls has
no influence on Mary’s calling.
Formally speaking, we believe that the following conditional
independence statement holds:

P(MaryCalls | JohnCalls , Alarm, Earthquake, Burglary) =


P(MaryCalls | Alarm) .

Thus, Alarm will be the only parent node for MaryCalls.

Because each node is connected only to earlier nodes, this


construction method guarantees that the network is acyclic.

Another important property of Bayesian networks is that they


contain no redundant probability values. If there is no redundancy,
then there is no chance for inconsistency:

it is impossible for the knowledge engineer or domain expert to


create a Bayesian network that violates the axioms of probability
Example
Suppose we choose the ordering M , J , A, B, E

MaryCalls

JohnCalls

Alarm

Burglary

Earthquake

The process goes as follows:
• Adding MaryCalls: No parents.
• Adding JohnCalls: If Mary calls, that probably means the alarm
has gone off, which of course would make it more likely that John
calls. Therefore, JohnCalls needs MaryCalls as a parent.
• Adding Alarm: Clearly, if both call, it is more likely that the alarm
has gone off than if just one or neither calls, so we need both
MaryCalls and JohnCalls as parents.
• Adding Burglary: If we know the alarm state, then the call from
John or Mary might give us information about our phone ringing or
Mary’s music, but not about burglary:
P(Burglary | Alarm, JohnCalls , MaryCalls) = P(Burglary | Alarm) .
Hence we need just Alarm as parent.
• Adding Earthquake: If the alarm is on, it is more likely that there
has been an earthquake. (The alarm is an earthquake detector of
sorts.) But if we know that there has been a burglary, then that
explains the alarm, and the probability of an earthquake would be
only slightly above normal. Hence, we need both Alarm and
Burglary as parents.
Example contd.

MaryCalls

JohnCalls

Alarm

Burglary

Earthquake

Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed

Example: Car diagnosis
Initial evidence: car won’t start (red)
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters

(Figure: car-diagnosis network with nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, battery meter, lights, oil light, gas gauge, dipstick, and car won't start.)
Inference in Bayesian Networks
Outline

♦ Exact inference by enumeration


♦ Approximate inference by stochastic simulation
♦ Exact inference by variable elimination
♦ Approximate inference by Markov chain Monte Carlo

Inference tasks
Simple queries: compute posterior marginal P(Xi|E = e)
e.g., P(NoGas|Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(Xi, Xj|E = e) = P(Xi|E = e) P(Xj|Xi, E = e)
Optimal decisions: decision networks include utility information;
probabilistic inference required for P(outcome|action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?
Exact Inference in Bayesian Networks

The basic task for any probabilistic inference system is to compute the
posterior probability distribution for a set of query variables, given
some observed event—that is, some assignment of values to a set of
evidence variables.

To simplify the presentation, we will consider only one query variable


at a time; the algorithms can easily be extended to queries with
multiple variables.
We will use the following notation:
X denotes the query variable; E denotes the set of evidence variables
E1, . . . , Em, and e is a particular observed event; Y denotes the
nonevidence, nonquery variables Y1, . . . , Yl (called the hidden
variables).

Thus, the complete set of variables is X={X}∪E ∪Y.

A typical query asks for the posterior probability distribution P(X | e).
A query P(X | e) can be answered using
P(X | e) = α P(X, e) = α Σ_y P(X, e, y).

P(x, e, y) in the joint distribution can be written as products


of conditional probabilities from the network.

Therefore, a query can be answered using a Bayesian


network by computing sums of products of conditional
probabilities from the network.

Consider the query P(Burglary | JohnCalls =true,


MaryCalls =true).
The hidden variables for this query are Earthquake and
Alarm.
Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually
constructing its explicit representation

Simple query on the burglary network:


P(B|j, m)
= P(B, j, m)/P(j, m)
= α P(B, j, m)
= α Σe Σa P(B, e, a, j, m)

Rewrite full joint entries using product of CPT entries:

P(B|j, m)
= α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)
= α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time
Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, observed values for variables E
           bn, a Bayesian network with variables {X} ∪ E ∪ Y
   Q(X) ← a distribution over X, initially empty
   for each value xi of X do
       extend e with value xi for X
       Q(xi) ← Enumerate-All(Vars[bn], e)
   return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
   if Empty?(vars) then return 1.0
   Y ← First(vars)
   if Y has value y in e
       then return P(y | Pa(Y)) × Enumerate-All(Rest(vars), e)
       else return Σy P(y | Pa(Y)) × Enumerate-All(Rest(vars), e_y)
            where e_y is e extended with Y = y

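The following is a compact Python sketch of Enumeration-Ask / Enumerate-All for Boolean networks (not part of the original notes); the dictionary-based network encoding and helper names are illustrative assumptions.

```python
burglary_net = {
    # var: (parents, CPT mapping parent-value tuples to P(var = True))
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}

def prob(net, var, value, event):
    parents, cpt = net[var]
    p_true = cpt[tuple(event[p] for p in parents)]
    return p_true if value else 1 - p_true

def enumerate_all(net, variables, event):
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in event:                       # variable already assigned: just multiply
        return prob(net, first, event[first], event) * enumerate_all(net, rest, event)
    return sum(prob(net, first, v, event) *  # otherwise sum over its values
               enumerate_all(net, rest, {**event, first: v})
               for v in (True, False))

def enumeration_ask(net, query, evidence):
    variables = list(net)                    # assumes the dict is in topological order
    dist = {v: enumerate_all(net, variables, {**evidence, query: v})
            for v in (True, False)}
    alpha = 1.0 / sum(dist.values())
    return {v: alpha * p for v, p in dist.items()}

print(enumeration_ask(burglary_net, "B", {"J": True, "M": True}))
# ≈ {True: 0.284, False: 0.716}
```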
Example contd.

The burglary network CPTs, repeated for reference:

Burglary:   P(B) = .001        Earthquake:   P(E) = .002

Alarm:      B E   P(A|B,E)
            T T   .95
            T F   .94
            F T   .29
            F F   .001

JohnCalls:  A   P(J|A)          MaryCalls:   A   P(M|A)
            T   .90                          T   .70
            F   .05                          F   .01
Evaluation tree
(Figure: the evaluation tree for P(b, j, m). The root P(b) = .001 branches on P(e) = .002 / P(¬e) = .998, then on P(a|·) / P(¬a|·), with leaves P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01; intermediate values label the subtrees.)

The evaluation proceeds top down, multiplying values along each path and summing at the "+" nodes.

Enumeration is inefficient: repeated computation, e.g., it computes P(j|a) P(m|a) for each value of e.


Using the numbers from Figure 14.2,
we obtain P(b | j, m) = α×0.00059224.
The corresponding computation for ¬b yields
α×0.0014919;
Hence, P(B | j, m) = α< 0.00059224, 0.0014919> ≈
<0.284, 0.716> .
That is, the chance of a burglary, given calls from both
neighbors, is about 28%.
Inference by variable elimination
Variable elimination: carry out summations right-to-left,
storing intermediate results (factors) to avoid recomputation
P(B|j, m)
= α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)
  (the five factors correspond to B, E, A, J, and M)
= α P(B) Σe P(e) Σa P(a|B, e) P(j|a) fM(a)
= α P(B) Σe P(e) Σa P(a|B, e) fJ(a) fM(a)
= α P(B) Σe P(e) Σa fA(a, b, e) fJ(a) fM(a)
= α P(B) Σe P(e) fĀJM(b, e)   (sum out A)
= α P(B) fĒĀJM(b)             (sum out E)
= α fB(b) × fĒĀJM(b)
Variable elimination: basic operations
Summing out a variable from a product of factors:
– move any constant factors outside the summation
– add up submatrices in the pointwise product of the remaining factors
Σx f1 × · · · × fk = f1 × · · · × fi Σx fi+1 × · · · × fk = f1 × · · · × fi × fX̄
assuming f1, . . . , fi do not depend on X

Pointwise product of factors f1 and f2:
f1(x1, . . . , xj, y1, . . . , yk) × f2(y1, . . . , yk, z1, . . . , zl)
= f(x1, . . . , xj, y1, . . . , yk, z1, . . . , zl)
E.g., f1(a, b) × f2(b, c) = f(a, b, c)

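These two factor operations are small enough to sketch in Python (not part of the original notes); here a factor is a dict from tuples of Boolean values (ordered by its variable list) to numbers, an illustrative representation.

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    all_vars = vars1 + [v for v in vars2 if v not in vars1]
    result = {}
    for values in product((True, False), repeat=len(all_vars)):
        assign = dict(zip(all_vars, values))
        key1 = tuple(assign[v] for v in vars1)
        key2 = tuple(assign[v] for v in vars2)
        result[values] = f1[key1] * f2[key2]
    return result, all_vars

def sum_out(var, f, variables):
    idx = variables.index(var)
    new_vars = variables[:idx] + variables[idx + 1:]
    result = {}
    for values, p in f.items():
        key = values[:idx] + values[idx + 1:]
        result[key] = result.get(key, 0.0) + p   # add up entries that differ only in var
    return result, new_vars

# f1(A, B) × f2(B, C) = f(A, B, C), then sum out B
f1 = {k: 0.1 for k in product((True, False), repeat=2)}
f2 = {k: 0.2 for k in product((True, False), repeat=2)}
f3, vars3 = pointwise_product(f1, ["A", "B"], f2, ["B", "C"])
f4, vars4 = sum_out("B", f3, vars3)
print(vars4, f4)
```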
Variable elimination algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, evidence specified as an event
           bn, a belief network specifying joint distribution P(X1, . . . , Xn)
   factors ← []; vars ← Reverse(Vars[bn])
   for each var in vars do
       factors ← [Make-Factor(var, e) | factors]
       if var is a hidden variable then factors ← Sum-Out(var, factors)
   return Normalize(Pointwise-Product(factors))
Irrelevant variables
Consider the query P(JohnCalls|Burglary = true)

P(J|b) = α P(b) Σe P(e) Σa P(a|b, e) P(J|a) Σm P(m|a)

Sum over m is identically 1; M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and
Ancestors({X} ∪ E) = {Alarm, Earthquake},
so MaryCalls is irrelevant

(Compare this to backward chaining from the query in Horn clause KBs)
Complexity of exact inference
Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)

Multiply connected networks:
⇒ NP-hard
⇒ #P-complete

(Figure: a 3-CNF formula encoded as a network — variable nodes A, B, C, D, each with prior 0.5, one node per clause, and an AND node combining the clauses.)
Inference by stochastic simulation
Basic idea:
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process
  whose stationary distribution is the true posterior
Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from bn
   inputs: bn, a belief network specifying joint distribution P(X1, . . . , Xn)
   x ← an event with n elements
   for i = 1 to n do
       xi ← a random sample from P(Xi | parents(Xi))
            given the values of Parents(Xi) in x
   return x
We can illustrate its operation on the network in Figure
14.12(a), assuming an ordering [Cloudy, Sprinkler , Rain,
WetGrass ]:
1. Sample from P(Cloudy) = <0.5, 0.5>, value is true.
2. Sample from P(Sprinkler |Cloudy =true) =<0.1, 0.9>,
value is false.
3. Sample from P(Rain |Cloudy =true) =< 0.8, 0.2>,
value is true.
4. Sample from P(WetGrass | Sprinkler =false, Rain
=true) = <0.9, 0.1>, value is true.
In this case, PRIOR-SAMPLE returns the event [true, false,
true, true].
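A Python sketch of Prior-Sample for the sprinkler network (not part of the original notes); the (parents, CPT) encoding is an illustrative choice, with CPT values matching the figure.

```python
import random

SPRINKLER_NET = {
    # var: (parents, CPT mapping parent-value tuples to P(var = True))
    "Cloudy":    ((), {(): 0.5}),
    "Sprinkler": (("Cloudy",), {(True,): 0.1, (False,): 0.5}),
    "Rain":      (("Cloudy",), {(True,): 0.8, (False,): 0.2}),
    "WetGrass":  (("Sprinkler", "Rain"),
                  {(True, True): 0.99, (True, False): 0.90,
                   (False, True): 0.90, (False, False): 0.01}),
}

def prior_sample(net):
    event = {}
    for var, (parents, cpt) in net.items():   # dict insertion order = topological order
        p_true = cpt[tuple(event[p] for p in parents)]
        event[var] = random.random() < p_true
    return event

print(prior_sample(SPRINKLER_NET))   # e.g. {'Cloudy': True, 'Sprinkler': False, ...}
```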
Example

(Figure: the sprinkler network — Cloudy is the parent of Sprinkler and Rain, which are the parents of WetGrass.)

Cloudy:       P(C) = .50

Sprinkler:    C   P(S|C)        Rain:    C   P(R|C)
              T   .10                    T   .80
              F   .50                    F   .20

WetGrass:     S R   P(W|S,R)
              T T   .99
              T F   .90
              F T   .90
              F F   .01
Sampling from an empty network contd.
Probability that Prior-Sample generates a particular event:
S_PS(x1 . . . xn) = Π_{i=1}^{n} P(xi|parents(Xi)) = P(x1 . . . xn)
i.e., the true prior probability
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x1 . . . xn) be the number of samples generated for event x1, . . . , xn

Then we have
lim_{N→∞} P̂(x1, . . . , xn) = lim_{N→∞} N_PS(x1, . . . , xn)/N
                           = S_PS(x1, . . . , xn)
                           = P(x1 . . . xn)
That is, estimates derived from Prior-Sample are consistent
Shorthand: P̂(x1, . . . , xn) ≈ P(x1 . . . xn)
Rejection sampling

P̂(X|e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N, a vector of counts over X, initially zero
   for j = 1 to N do
       x ← Prior-Sample(bn)
       if x is consistent with e then
           N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N[X])

E.g., estimate P(Rain|Sprinkler = true) using 100 samples:
27 samples have Sprinkler = true.
Of these, 8 have Rain = true and 19 have Rain = false.
P̂(Rain|Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure

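Rejection sampling is a small extension of the previous sketch (not part of the original notes); it assumes the SPRINKLER_NET dictionary and prior_sample function defined above are in scope.

```python
def rejection_sampling(net, query, evidence, n):
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample(net)
        # keep only samples that agree with the evidence
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[query]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else counts

print(rejection_sampling(SPRINKLER_NET, "Rain", {"Sprinkler": True}, 10000))
# roughly {True: 0.3, False: 0.7}
```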

Analysis of rejection sampling

P̂(X|e) = α N_PS(X, e)            (algorithm definition)
        = N_PS(X, e)/N_PS(e)      (property of Prior-Sample; normalized by N_PS(e))
        ≈ P(X, e)/P(e)
        = P(X|e)                  (definition of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P(e) is small

P(e) drops off exponentially with number of evidence variables!
Likelihood weighting
Idea: fix evidence variables, sample only nonevidence variables,
and weight each sample by the likelihood it accords the evidence

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
   local variables: W, a vector of weighted counts over X, initially zero
   for j = 1 to N do
       x, w ← Weighted-Sample(bn, e)
       W[x] ← W[x] + w where x is the value of X in x
   return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
   x ← an event with n elements; w ← 1
   for i = 1 to n do
       if Xi has a value xi in e
           then w ← w × P(Xi = xi | parents(Xi))
           else xi ← a random sample from P(Xi | parents(Xi))
   return x, w
Let us apply the algorithm to the network shown in Figure
14.12(a), with the query P(Rain |Cloudy =true, WetGrass
=true) and the ordering Cloudy, Sprinkler, Rain, WetGrass.
(Any topological ordering will do.)

The process goes as follows:

First, the weight w is set to 1.0.

Then an event is generated:


1. Cloudy is an evidence variable with value true.
Therefore, we set
w ← w×P(Cloudy =true) = 0.5 .

2. Sprinkler is not an evidence variable, so sample from


P(Sprinkler |Cloudy =true) = <0.1, 0.9>; suppose this
returns false.
3. Similarly, sample from P(Rain |Cloudy =true) =
< 0.8, 0.2>; suppose this returns true.

4. WetGrass is an evidence variable with value true.


Therefore, we set w ← w×P(WetGrass =true | Sprinkler
=false, Rain =true) = 0.5 * 0.90 = 0.45

Here WEIGHTED-SAMPLE returns the event [true, false,


true, true] with weight 0.45, and this is tallied under Rain
=true.
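The same trace can be automated with a short likelihood-weighting sketch in Python (not part of the original notes); it again assumes the SPRINKLER_NET dictionary from the Prior-Sample sketch is in scope.

```python
import random

def weighted_sample(net, evidence):
    event, weight = {}, 1.0
    for var, (parents, cpt) in net.items():
        p_true = cpt[tuple(event[p] for p in parents)]
        if var in evidence:
            event[var] = evidence[var]                         # fix evidence variables...
            weight *= p_true if evidence[var] else 1 - p_true  # ...and weight by their likelihood
        else:
            event[var] = random.random() < p_true              # sample nonevidence variables
    return event, weight

def likelihood_weighting(net, query, evidence, n):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        event, w = weighted_sample(net, evidence)
        totals[event[query]] += w
    alpha = 1.0 / sum(totals.values())
    return {v: alpha * t for v, t in totals.items()}

print(likelihood_weighting(SPRINKLER_NET, "Rain", {"Cloudy": True, "WetGrass": True}, 10000))
# roughly {True: 0.97, False: 0.03}
```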
Likelihood weighting analysis
Sampling probability for Weighted-Sample is
S_WS(z, e) = Π_{i=1}^{l} P(zi|parents(Zi))
Note: pays attention to evidence in ancestors only
⇒ somewhere "in between" prior and posterior distribution

Weight for a given sample z, e is
w(z, e) = Π_{i=1}^{m} P(ei|parents(Ei))

Weighted sampling probability is
S_WS(z, e) w(z, e)
= Π_{i=1}^{l} P(zi|parents(Zi)) Π_{i=1}^{m} P(ei|parents(Ei))
= P(z, e)   (by standard global semantics of network)

Hence likelihood weighting returns consistent estimates,
but performance still degrades with many evidence variables
because a few samples have nearly all the total weight
Approximate inference using MCMC
“State” of network = current assignment to all variables.
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed

function MCMC-Ask(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N[X], a vector of counts over X, initially zero
                    Z, the nonevidence variables in bn
                    x, the current state of the network, initially copied from e
   initialize x with random values for the variables in Z
   for j = 1 to N do
       for each Zi in Z do
           sample the value of Zi in x from P(Zi | mb(Zi)),
               given the values of MB(Zi) in x
       N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N[X])

Can also choose a variable to sample at random each time


The Markov chain
With Sprinkler = true, WetGrass = true, there are four states:

(Figure: the four states of the chain, one for each combination of Cloudy and Rain, with Sprinkler and WetGrass fixed to true.)

Wander about for a while, average what you see
M C M C example contd.
Estimate P(Rain|Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.

E.g., visit 100 states


31 have Rain = true, 69 have Rain = false
P̂(Rain|Sprinkler = true, WetGrass = true)
= Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: chain approaches stationary distribution:


long-run fraction of time spent in each state is exactly
proportional to its posterior probability

Markov blanket sampling
Markov blanket of Cloudy is Sprinkler and Rain.
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.

(Figure: the sprinkler network.)

Probability given the Markov blanket is calculated as follows:

P(x′i|mb(Xi)) = P(x′i|parents(Xi)) Π_{Zj ∈ Children(Xi)} P(zj|parents(Zj))
Easily implemented in message-passing parallel systems, brains
Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if Markov blanket is large:
P (Xi|mb(Xi)) won’t change much (law of large numbers)

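A Gibbs-sampling sketch of MCMC-Ask using the Markov-blanket formula above (not part of the original notes); it assumes the SPRINKLER_NET dictionary from the earlier sketches is in scope.

```python
import random

def markov_blanket_prob(net, var, state):
    """Return P(var = true | mb(var)) in the given state."""
    def score(v):
        st = {**state, var: v}
        parents, cpt = net[var]
        p = cpt[tuple(st[q] for q in parents)]
        prob = p if v else 1 - p
        for child, (cparents, ccpt) in net.items():
            if var in cparents:                       # multiply in each child's CPT entry
                cp = ccpt[tuple(st[q] for q in cparents)]
                prob *= cp if st[child] else 1 - cp
        return prob
    s_true, s_false = score(True), score(False)
    return s_true / (s_true + s_false)

def mcmc_ask(net, query, evidence, n):
    nonevidence = [v for v in net if v not in evidence]
    state = {**evidence, **{v: random.choice((True, False)) for v in nonevidence}}
    counts = {True: 0, False: 0}
    for _ in range(n):
        for z in nonevidence:                         # resample each nonevidence variable
            state[z] = random.random() < markov_blanket_prob(net, z, state)
        counts[state[query]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

print(mcmc_ask(SPRINKLER_NET, "Rain", {"Sprinkler": True, "WetGrass": True}, 5000))
# roughly {True: 0.32, False: 0.68}; the notes' 100-state run gave ⟨0.31, 0.69⟩
```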
Summary
Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
Approximate inference by LW, MCMC:
– LW does poorly when there is lots of (downstream) evidence
– LW, MCMC generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables
