AI - Unit 4 - Part 4 - Full Inference in Bayesian Networks
Part 4
Bayes' rule and its use
Representing knowledge in an uncertain domain
The semantics of Bayesian networks
Exact and approximate inference in Bayesian networks
Hemamalini S
Conditional probability
Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know, "if toothache then 80% chance of cavity".
Conditional probability
Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b), if P(b) ≠ 0
Product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
The chain rule follows by successive application of the product rule:
P(X1, . . . , Xn) = P(X1, . . . , Xn−1) P(Xn | X1, . . . , Xn−1)
= . . .
= Π_{i=1}^{n} P(Xi | X1, . . . , Xi−1)
Inference by enumeration
Start with the joint distribution:
              toothache            ¬toothache
              catch    ¬catch      catch    ¬catch
  cavity      0.108    0.012       0.072    0.008
  ¬cavity     0.016    0.064       0.144    0.576
For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
For example, summing the six atomic events in which cavity ∨ toothache holds:
P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Similarly, conditional probabilities can be computed by summing and dividing:
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064)
= 0.12 / 0.2 = 0.6
From the example, we can extract a general inference
procedure. We begin with the case in which the query
involves a single variable, X (Cavity in the example).
Let E be the list of evidence variables (just Toothache in the
example), let e be the list of observed values for them, and
let Y be the remaining unobserved variables (just Catch in
the example).
The query is P(X | e) and can be evaluated as
P(X | e) = α P(X, e) = α Σ_y P(X, e, y),
where the summation is over all possible values y of the unobserved variables Y.
Notice that together the variables X, E, and Y constitute the
complete set of variables for the domain, so P(X, e, y) is
simply a subset of probabilities from the full joint
distribution.
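A minimal sketch (not from the slides) of this enumerate-and-normalize procedure in Python, run against the joint table above; the dictionary encoding and the prob helper are names introduced here for illustration.

# Joint distribution P(Cavity, Toothache, Catch) from the table above,
# keyed by (cavity, toothache, catch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(pred):
    """P(phi): sum the atomic events where the proposition pred holds."""
    return sum(p for event, p in joint.items() if pred(*event))

# P(cavity or toothache) = 0.28
print(prob(lambda cav, tooth, catch: cav or tooth))

# P(Cavity | toothache) = alpha * P(Cavity, toothache): enumerate, then normalize.
unnorm = [prob(lambda cav, tooth, catch, v=v: cav == v and tooth) for v in (True, False)]
alpha = 1.0 / sum(unnorm)
print([alpha * x for x in unnorm])   # [0.6, 0.4]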
Normalization
The denominator can be viewed as a normalization constant α:
P(Cavity | toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
Independence
Adding a fourth variable, Weather, which is independent of the dental variables, the full joint decomposes into smaller pieces:
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
Conditional independence
P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn’t depend
on whether I have a toothache:
(1) P (catch|toothache, cavity) = P (catch|cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence contd.
Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers instead of 7
Bayes’ Rule
Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
⇒ Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
or in distribution form:
P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
For example, a doctor knows that the disease meningitis
causes the patient to have a stiff neck, say, 70% of the
time.
The doctor also knows some unconditional facts: the prior
probability that a patient has meningitis is 1/50,000, and
the prior probability that any patient has a stiff neck is 1%.
Letting s be the proposition that the patient has a stiff neck
and m be the proposition that the patient has meningitis,
we have
P(s|m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P(m | s) = P(s | m) P(m) / P(s)
= (0.7 × 1/50000) / 0.01
= 0.0014
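As a quick check of the arithmetic, a minimal Python sketch (the variable names are mine):

p_s_given_m = 0.7        # P(s | m): stiff neck given meningitis
p_m = 1 / 50000          # P(m): prior probability of meningitis
p_s = 0.01               # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0014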
The notion of independence in the previous section
provides a clue, but needs refining.
It would be nice if Toothache and Catch were
independent, but they are not:
if the probe catches in the tooth, then it is likely that the
tooth has a cavity and that the cavity causes a toothache.
These variables are independent, however, given the
presence or the absence of a cavity.
Each is directly caused by the cavity, but neither has a
direct effect on the other: toothache depends on the state
of the nerves in the tooth, whereas the probe’s accuracy
depends on the dentist’s skill, to which the toothache is
irrelevant. Mathematically, this property is written as
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)
This equation expresses the conditional independence
of toothache and catch given Cavity.
Applying Bayes' rule and then conditional independence:
P(Cavity | toothache ∧ catch)
= α P(toothache ∧ catch | Cavity) P(Cavity)
= α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naive Bayes model, with Cavity as the cause and Toothache and Catch as its effects; in general,
P(Cause, Effect1, . . . , Effectn) = P(Cause) Π_i P(Effecti | Cause)
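A minimal sketch, assuming the same joint table as before, that checks the conditional-independence claim numerically and then reproduces P(Cavity | toothache, catch) from the naive Bayes factorization; the helper names are hypothetical.

# Joint entries (cavity, toothache, catch) from the table used earlier.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
def prob(pred):
    return sum(p for e, p in joint.items() if pred(*e))

p_cavity = prob(lambda c, t, k: c)                                   # 0.2
p_catch_given_cavity = prob(lambda c, t, k: c and k) / p_cavity      # 0.9
p_catch_given_tooth_cavity = (prob(lambda c, t, k: c and t and k)
                              / prob(lambda c, t, k: c and t))       # 0.9
# Conditional independence: P(catch | toothache, cavity) == P(catch | cavity)
print(p_catch_given_tooth_cavity, p_catch_given_cavity)

# Naive Bayes: P(Cavity | toothache, catch) = alpha P(t|C) P(k|C) P(C)
unnorm = []
for c in (True, False):
    pc = prob(lambda cv, t, k, c=c: cv == c)
    pt = prob(lambda cv, t, k, c=c: cv == c and t) / pc
    pk = prob(lambda cv, t, k, c=c: cv == c and k) / pc
    unnorm.append(pt * pk * pc)
alpha = 1.0 / sum(unnorm)
print([alpha * x for x in unnorm])   # approx [0.871, 0.129]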
Wumpus World
Figure: the 4×4 wumpus-world grid. Squares [1,1], [1,2], and [2,1] have been visited; breezes are observed in [1,2] and [2,1] but not in [1,1]; every square other than [1,1] contains a pit with probability 0.2.
Specifying the probability model
The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1).
Apply the product rule: P(B1,1, B1,2, B2,1 | P1,1, . . . , P4,4) P(P1,1, . . . , P4,4)
(Do it this way to get P(Effect | Cause).)
The first term is 1 if the observed breezes are adjacent to the pits and 0 otherwise; the second term is a product of 0.2s and 0.8s, since each pit is placed independently with probability 0.2.
Observations and query
We know the following facts:
b = ¬b1,1 ∧ b1,2 ∧ b2,1
known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)
Using conditional independence
Basic insight: observations are conditionally independent of other hidden
squares given neighbouring hidden squares
Figure: the unknown squares are split into the fringe (unknown squares adjacent to visited squares) and the rest (other).
Using conditional independence contd.
P(P1,3 | known, b)
= α Σ_unknown P(P1,3, unknown, known, b)
= α Σ_unknown P(b | P1,3, known, unknown) P(P1,3, known, unknown)
= α Σ_fringe Σ_other P(b | known, P1,3, fringe, other) P(P1,3, known, fringe, other)
= α Σ_fringe Σ_other P(b | known, P1,3, fringe) P(P1,3, known, fringe, other)
= α Σ_fringe P(b | known, P1,3, fringe) Σ_other P(P1,3, known, fringe, other)
= α Σ_fringe P(b | known, P1,3, fringe) Σ_other P(P1,3) P(known) P(fringe) P(other)
= α P(known) P(P1,3) Σ_fringe P(b | known, P1,3, fringe) P(fringe) Σ_other P(other)
= α′ P(P1,3) Σ_fringe P(b | known, P1,3, fringe) P(fringe)
Using conditional independence contd.
Figure: the consistent assignments to the fringe squares [2,2] and [3,1]. With P1,3 = true there are three consistent fringe configurations, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, and 0.8 × 0.2 = 0.16; with P1,3 = false there are two, with probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16.
P(P1,3 | known, b) = α′ ⟨0.2 × (0.04 + 0.16 + 0.16), 0.8 × (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
Summary
Probability is a rigorous formalism for uncertain knowledge
Joint probability distribution specifies probability of every atomic event
Queries can be answered by summing over atomic events
For nontrivial domains, we must find a way to reduce the joint size
Independence and conditional independence provide the tools
Bayesian Networks (Chapter 14.1–2)
Bayesian networks
A simple, graphical notation for conditional independence
assertions and hence for compact specification of full joint
distributions.
Definition:
A Bayesian network is a directed graph in which each
node is annotated with quantitative probability
information. The full specification is as follows:
1. Each node corresponds to a random variable, which
may be discrete or continuous.
2. A set of directed links or arrows connects pairs of
nodes. If there is an arrow from node X to node Y , X is
said to be a parent of Y. The graph has no directed cycles
(and hence is a directed acyclic graph, or DAG).
3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
Example
Topology of network encodes conditional independence assertions:
Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity (Cavity → Toothache, Cavity → Catch, with Weather unconnected).
Consider a burglar alarm at home that is fairly reliable at detecting a burglary but occasionally responds to minor earthquakes. You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm.
John nearly always calls when he hears the alarm, but sometimes
confuses the telephone ringing with the alarm and calls then, too.
Mary, on the other hand, likes rather loud music and often misses
the alarm altogether. Given the evidence of who has or has not
called, we would like to estimate the probability of a burglary.
Example
I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a
burglar?
Example contd.
P(B) = .001          P(E) = .002          (Burglary and Earthquake have no parents)

Alarm:                       JohnCalls:          MaryCalls:
B  E  P(A|B,E)               A  P(J|A)           A  P(M|A)
T  T  .95                    T  .90              T  .70
T  F  .94                    F  .05              F  .01
F  T  .29
F  F  .001
The network structure shows that burglary and
earthquakes directly affect the probability of the
alarm’s going off, but whether John and Mary call
depends only on the alarm.
I.e., the number of parameters grows linearly with n (if each variable has at most k parents, the network needs O(n · 2^k) numbers), vs. O(2^n) for the full joint distribution. For the burglary network: 1 + 1 + 4 + 2 + 2 = 10 numbers vs. 2^5 − 1 = 31.
A generic entry in the joint distribution is the probability of a
conjunction of particular assignments to each variable, such
as P(X1 = x1 ∧ ... ∧ Xn = xn).
Global semantics
“Global” semantics defines the full joint distribution as the product of the local conditional distributions:
P(x1, . . . , xn) = Π_{i=1}^{n} P(xi | parents(Xi))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
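A minimal Python sketch of this global semantics, using the CPT numbers above; the dictionary encoding of the CPTs is my own.

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) as a product of local conditional probabilities."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pj * pm * pa * pb * pe

print(joint(True, True, True, False, False))   # approx 0.00063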
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of
conditional independence guarantees the required global semantics
The topological semantics specifies that each variable is
conditionally independent of its non-descendants, given its
parents.
MARKOV BLANKET
A node is conditionally independent of all other nodes in the
network, given its parents, children, and children’s parents—
that is, given its Markov blanket.
Intuitively, the parents of node Xi should contain all those
nodes in X1, ..., Xi−1 that directly influence Xi.
Also, given the state of the alarm, whether John calls has
no influence on Mary’s calling.
Formally speaking, we believe that the following conditional
independence statement holds:
P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)
Node ordering matters. Suppose instead we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake. The process goes as follows:
• Adding MaryCalls: No parents.
• Adding JohnCalls: If Mary calls, that probably means the alarm
has gone off, which of course would make it more likely that John
calls. Therefore, JohnCalls needs MaryCalls as a parent.
• Adding Alarm: Clearly, if both call, it is more likely that the alarm
has gone off than if just one or neither calls, so we need both
MaryCalls and JohnCalls as parents.
• Adding Burglary: If we know the alarm state, then the call from
John or Mary might give us information about our phone ringing or
Mary’s music, but not about burglary:
P(Burglary | Alarm, JohnCalls , MaryCalls) = P(Burglary | Alarm) .
Hence we need just Alarm as parent.
• Adding Earthquake: If the alarm is on, it is more likely that there
has been an earthquake. (The alarm is an earthquake detector of
sorts.) But if we know that there has been a burglary, then that
explains the alarm, and the probability of an earthquake would be
only slightly above normal. Hence, we need both Alarm and
Burglary as parents.
Example contd.
Figure: the network obtained with the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake. It is correct but less compact than the original: deciding conditional independence is harder in the non-causal direction, and the network needs more links and more CPT entries.
Example: Car diagnosis
Initial evidence: car won’t start (red)
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters
Figure: car-diagnosis network with nodes such as battery dead, no charging, lights, oil light, gas gauge, dipstick, and car won't start.
Inference in Bayesian Networks
Inference tasks
Simple queries: compute posterior marginal P(Xi|E = e)
e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(Xi, Xj | E = e) = P(Xi | E = e) P(Xj | Xi, E = e)
Optimal decisions: decision networks include utility information;
probabilistic inference required for P (outcome|action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?
Exact Inference in Bayesian Networks
The basic task for any probabilistic inference system is to compute the
posterior probability distribution for a set of query variables, given
some observed event—that is, some assignment of values to a set of
evidence variables.
A typical query asks for the posterior probability distribution P(X | e).
A query P(X | e) can be answered using
P(X | e) = α P(X, e) = α Σ_y P(X, e, y)
Enumeration algorithm
function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
    else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
      where e_y is e extended with Y = y
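A hedged Python sketch of the enumeration algorithm for the burglary network, together with the usual Enumeration-Ask wrapper that normalizes over the query variable; the network encoding (VARS, PARENTS, CPT) is introduced here for illustration and is not from the slides.

VARS = ['B', 'E', 'A', 'J', 'M']                       # topological order
PARENTS = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
CPT = {
    'B': {(): 0.001}, 'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001},
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def p(var, value, e):
    """P(var = value | parents(var)) read off the CPT, given assignment e."""
    pt = CPT[var][tuple(e[pa] for pa in PARENTS[var])]
    return pt if value else 1 - pt

def enumerate_all(vars, e):
    if not vars:
        return 1.0
    y, rest = vars[0], vars[1:]
    if y in e:
        return p(y, e[y], e) * enumerate_all(rest, e)
    return sum(p(y, v, {**e, y: v}) * enumerate_all(rest, {**e, y: v})
               for v in (True, False))

def enumeration_ask(X, e):
    dist = [enumerate_all(VARS, {**e, X: v}) for v in (True, False)]
    alpha = 1.0 / sum(dist)
    return [alpha * d for d in dist]

print(enumeration_ask('B', {'J': True, 'M': True}))    # approx [0.284, 0.716]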
Evaluation tree
Figure: the evaluation tree for the enumeration of P(b | j, m) in the burglary network. Starting from P(b) = .001, the tree branches on E (P(e) = .002, P(¬e) = .998) and then on A (P(a | b, e) = .95, P(¬a | b, e) = .05, P(a | b, ¬e) = .94, P(¬a | b, ¬e) = .06), multiplying P(j | a) and P(m | a) along each path and summing the leaves. Enumeration is depth-first, so space is linear in n, but the subtrees for P(j | a) P(m | a) are recomputed for every value of E: wasteful repeated computation.
Variable elimination: basic operations
Summing out a variable from a product of factors:
– move any constant factors outside the summation
– add up submatrices in the pointwise product of the remaining factors
Σ_x f1 × · · · × fk = f1 × · · · × fi Σ_x fi+1 × · · · × fk = f1 × · · · × fi × fX̄
assuming f1, . . . , fi do not depend on X
Pointwise product of factors f1 and f2:
f1(x1, . . . , xj, y1, . . . , yk) × f2(y1, . . . , yk, z1, . . . , zl) = f(x1, . . . , xj, y1, . . . , yk, z1, . . . , zl)
E.g., f1(a, b) × f2(b, c) = f(a, b, c)
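A minimal sketch of the two factor operations, with factors represented as (variable list, table) pairs over boolean variables; the representation and the toy factor values are assumptions made for illustration.

from itertools import product

def pointwise_product(f1, f2):
    """f1(X, Y) x f2(Y, Z) = f(X, Y, Z): multiply matching entries over the union of variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product((True, False), repeat=len(out_vars)):
        asg = dict(zip(out_vars, values))
        table[values] = (t1[tuple(asg[v] for v in vars1)]
                         * t2[tuple(asg[v] for v in vars2)])
    return out_vars, table

def sum_out(var, f):
    """Sum a variable out of a factor by adding the entries that agree on the other variables."""
    vars_, t = f
    out_vars = [v for v in vars_ if v != var]
    table = {}
    for values, p in t.items():
        key = tuple(v for v, name in zip(values, vars_) if name != var)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# e.g. f1(A, B) x f2(B, C) = f(A, B, C), then sum out B (arbitrary toy numbers)
f1 = (['A', 'B'], {(a, b): 0.1 + 0.2 * a + 0.3 * b for a in (True, False) for b in (True, False)})
f2 = (['B', 'C'], {(b, c): 0.4 + 0.1 * b + 0.2 * c for b in (True, False) for c in (True, False)})
print(sum_out('B', pointwise_product(f1, f2)))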
Variable elimination algorithm
Variable elimination evaluates expressions such as P(B | j, m) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a) right-to-left, storing each intermediate result as a factor and summing out each hidden variable only once.
Irrelevant variables
Consider the query P (JohnCalls|Burglary = true)
P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a). The sum over m is identically 1, so MaryCalls is irrelevant to the query; in general, any variable that is not an ancestor of a query or evidence variable is irrelevant and can be removed.
(Compare this to backward chaining from the query in Horn clause KBs)
Complexity of exact inference
Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)
Multiply connected networks: exact inference is NP-hard in general. A 3-SAT problem over variables such as A, B, C, D (with one node per clause feeding an AND node) can be encoded as a Bayesian network, so answering queries is at least as hard as satisfiability; in fact it is equivalent to counting satisfying assignments, i.e., #P-hard.
Inference by stochastic simulation
Basic idea:
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P
Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
Sampling from an empty network
We can illustrate its operation on the network in Figure
14.12(a), assuming an ordering [Cloudy, Sprinkler , Rain,
WetGrass ]:
1. Sample from P(Cloudy) = <0.5, 0.5>, value is true.
2. Sample from P(Sprinkler |Cloudy =true) =<0.1, 0.9>,
value is false.
3. Sample from P(Rain |Cloudy =true) =< 0.8, 0.2>,
value is true.
4. Sample from P(WetGrass | Sprinkler =false, Rain
=true) = <0.9, 0.1>, value is true.
In this case, PRIOR-SAMPLE returns the event [true, false,
true, true].
Example
P(C) = .50

Sprinkler:          Rain:               WetGrass:
C  P(S|C)           C  P(R|C)           S  R  P(W|S,R)
T  .10              T  .80              T  T  .99
F  .50              F  .20              T  F  .90
                                        F  T  .90
                                        F  F  .01

(Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler and Rain → WetGrass.)
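A hedged Python sketch of prior sampling for this network; the function names and encoding are mine, the CPT numbers are those in the figure.

import random

def P_cloudy():      return 0.5
def P_sprinkler(c):  return 0.10 if c else 0.50
def P_rain(c):       return 0.80 if c else 0.20
def P_wet(s, r):     return {(True, True): 0.99, (True, False): 0.90,
                             (False, True): 0.90, (False, False): 0.01}[(s, r)]

def bernoulli(p):
    return random.random() < p

def prior_sample():
    """Sample each variable in topological order from P(Xi | parents(Xi))."""
    c = bernoulli(P_cloudy())
    s = bernoulli(P_sprinkler(c))
    r = bernoulli(P_rain(c))
    w = bernoulli(P_wet(s, r))
    return c, s, r, w

# Estimate, say, P(Rain = true) from N samples.
N = 100000
samples = [prior_sample() for _ in range(N)]
print(sum(r for _, _, r, _ in samples) / N)    # approx 0.5 (= 0.5*0.8 + 0.5*0.2)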
Sampling from an empty network contd.
Probability that PriorSample generates a particular event:
S_PS(x1, . . . , xn) = Π_{i=1}^{n} P(xi | parents(Xi)) = P(x1, . . . , xn)
i.e., the true prior probability.
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Let N_PS(x1, . . . , xn) be the number of samples generated for the event x1, . . . , xn. Then
lim_{N→∞} P̂(x1, . . . , xn) = lim_{N→∞} N_PS(x1, . . . , xn) / N = S_PS(x1, . . . , xn) = P(x1, . . . , xn)
That is, estimates derived from PriorSample are consistent.
Shorthand: P̂(x1, . . . , xn) ≈ P(x1, . . . , xn)
Rejection sampling
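Only the heading survives here; as a minimal sketch of the idea listed in the outline (reject samples that disagree with the evidence), the following estimates P(Rain | Sprinkler = true) in the sprinkler network by discarding prior samples with Sprinkler = false. The encoding is an assumption for illustration.

import random

def prior_sample():
    c = random.random() < 0.5
    s = random.random() < (0.10 if c else 0.50)
    r = random.random() < (0.80 if c else 0.20)
    w = random.random() < {(True, True): 0.99, (True, False): 0.90,
                           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return c, s, r, w

def rejection_sample_rain_given_sprinkler(n):
    # Keep only the samples consistent with the evidence Sprinkler = true.
    kept = [r for c, s, r, w in (prior_sample() for _ in range(n)) if s]
    return sum(kept) / len(kept)      # fraction of kept samples with Rain = true

print(rejection_sample_rain_given_sprinkler(100000))   # approx 0.3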
Likelihood weighting
Idea: fix evidence variables, sample only nonevidence variables,
and weight each sample by the likelihood it accords the evidence
function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
    x, w ← Weighted-Sample(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])
Let us apply the algorithm to the network shown in Figure
14.12(a), with the query P(Rain |Cloudy =true, WetGrass
=true) and the ordering Cloudy, Sprinkler, Rain, Wet-
Grass. (Any topological ordering will do.)
Weight for a given sample z, e is
w(z, e) = Π_i P(e_i | parents(E_i)),
i.e., the product of the conditional probabilities of the evidence variables given their parents in the sample. Here Cloudy = true contributes a factor P(Cloudy = true) = 0.5 and WetGrass = true contributes P(WetGrass = true | Sprinkler, Rain); e.g., a sample with Sprinkler = false and Rain = true gets weight 0.5 × 0.9 = 0.45.
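A hedged Python sketch of likelihood weighting for this query: evidence variables are held fixed and each sample is weighted by the probability the evidence has given its sampled parents. Names and encoding are mine.

import random

P_WET = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.01}

def weighted_sample():
    w = 1.0
    c = True                                   # evidence: Cloudy = true
    w *= 0.5                                   # P(Cloudy = true)
    s = random.random() < (0.10 if c else 0.50)
    r = random.random() < (0.80 if c else 0.20)
    w *= P_WET[(s, r)]                         # evidence: WetGrass = true
    return r, w

def likelihood_weighting(n):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        r, w = weighted_sample()
        totals[r] += w
    z = totals[True] + totals[False]
    return totals[True] / z, totals[False] / z

print(likelihood_weighting(100000))            # approx (0.973, 0.027)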
Approximate inference using MCMC
“State” of network = current assignment to all variables.
Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
Figure: a sequence of network states visited by MCMC. The evidence variables Sprinkler and WetGrass stay fixed; the nonevidence variables Cloudy and Rain are resampled in turn, each conditioned on its current Markov blanket.
MCMC example contd.
Estimate P(Rain|Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.
Markov blanket sampling
Markov blanket of Cloudy is Sprinkler and Rain.
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass.
Probability of a variable given its Markov blanket:
P(x′_i | mb(Xi)) = α P(x′_i | parents(Xi)) × Π_{Yj ∈ Children(Xi)} P(yj | parents(Yj))
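A hedged Python sketch of Gibbs sampling for P(Rain | Sprinkler = true, WetGrass = true), resampling Cloudy and Rain from their Markov-blanket distributions as given above; the encoding is an assumption for illustration.

import random

P_S_GIVEN_C = {True: 0.10, False: 0.50}   # P(Sprinkler = true | Cloudy)
P_R_GIVEN_C = {True: 0.80, False: 0.20}   # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.01}

def cond(p_true, value):
    """P(value) for a boolean variable whose probability of being true is p_true."""
    return p_true if value else 1 - p_true

def sample_cloudy(s, r):
    # P(C | mb(C)) ∝ P(C) · P(s | C) · P(r | C)
    scores = {c: 0.5 * cond(P_S_GIVEN_C[c], s) * cond(P_R_GIVEN_C[c], r)
              for c in (True, False)}
    return random.random() < scores[True] / (scores[True] + scores[False])

def sample_rain(c, s, w):
    # P(R | mb(R)) ∝ P(R | c) · P(w | s, R)
    scores = {r: cond(P_R_GIVEN_C[c], r) * cond(P_W[(s, r)], w)
              for r in (True, False)}
    return random.random() < scores[True] / (scores[True] + scores[False])

def gibbs_rain(n):
    s, w = True, True                 # evidence stays fixed
    c, r = True, True                 # arbitrary initial state for nonevidence variables
    rain_true = 0
    for _ in range(n):
        c = sample_cloudy(s, r)
        r = sample_rain(c, s, w)
        rain_true += r
    return rain_true / n

print(gibbs_rain(200000))             # approx 0.32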
Summary
Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
Approximate inference by LW, MCMC:
– LW does poorly when there is lots of (downstream) evidence
– LW, MCMC generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables