cs188-su24-lec08

Uploaded by

Parv Sharma

Bayes Nets: Independence

Instructor: Evgeny Pobachienko — UC Berkeley


[Slides credit: Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, Satish Rao, and many others]
Bayesian Networks: Recall…
o A directed acyclic graph (DAG), one node per random variable
o A conditional probability table (CPT) for each node
o Probability of X, given a combination of values for its parents.
o Bayes nets implicitly encode joint distributions as a product of local conditional distributions
o To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together:
$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(X_i))$$
Independence Assumptions so far…
o Each node, given its parents, is conditionally independent of all its non-descendants in the graph
o Each node, given its MarkovBlanket, is conditionally independent of all other nodes in the graph
o MarkovBlanket refers to the parents, children, and children's other parents.
Example: Alarm Network
Graph: B -> A, E -> A, A -> J, A -> M

B    P(B)
+b   0.001
-b   0.999

E    P(E)
+e   0.002
-e   0.998

B    E    A    P(A|B,E)
+b   +e   +a   0.95
+b   +e   -a   0.05
+b   -e   +a   0.94
+b   -e   -a   0.06
-b   +e   +a   0.29
-b   +e   -a   0.71
-b   -e   +a   0.001
-b   -e   -a   0.999

A    J    P(J|A)
+a   +j   0.9
+a   -j   0.1
-a   +j   0.05
-a   -j   0.95

A    M    P(M|A)
+a   +m   0.7
+a   -m   0.3
-a   +m   0.01
-a   -m   0.99
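
As a small illustration of "multiply all the relevant conditionals together", the sketch below scores one full assignment of the alarm network. The CPT values are copied from the tables above; the dictionary layout and the function name `joint` are illustrative choices, not part of the slides.

```python
# CPTs of the alarm network (values from the tables above).
P_B = {"+b": 0.001, "-b": 0.999}
P_E = {"+e": 0.002, "-e": 0.998}
P_A = {  # P(A | B, E), keyed by (b, e, a)
    ("+b", "+e", "+a"): 0.95, ("+b", "+e", "-a"): 0.05,
    ("+b", "-e", "+a"): 0.94, ("+b", "-e", "-a"): 0.06,
    ("-b", "+e", "+a"): 0.29, ("-b", "+e", "-a"): 0.71,
    ("-b", "-e", "+a"): 0.001, ("-b", "-e", "-a"): 0.999,
}
P_J = {("+a", "+j"): 0.9, ("+a", "-j"): 0.1,   # P(J | A), keyed by (a, j)
       ("-a", "+j"): 0.05, ("-a", "-j"): 0.95}
P_M = {("+a", "+m"): 0.7, ("+a", "-m"): 0.3,   # P(M | A), keyed by (a, m)
       ("-a", "+m"): 0.01, ("-a", "-m"): 0.99}


def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = P(b) P(e) P(a | b, e) P(j | a) P(m | a)."""
    return P_B[b] * P_E[e] * P_A[(b, e, a)] * P_J[(a, j)] * P_M[(a, m)]


# 0.001 * 0.998 * 0.94 * 0.1 * 0.7
print(joint("+b", "-e", "+a", "-j", "+m"))
```

Summing `joint` over all 32 assignments gives 1, which is a quick sanity check that the product of local conditionals is a proper joint distribution.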
Conditional Independence
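o X and Y are independent iff
$$\forall x, y: \quad P(x, y) = P(x) P(y)$$
o Given Z, we say X and Y are conditionally independent iff
$$\forall x, y, z: \quad P(x, y \mid z) = P(x \mid z) P(y \mid z)$$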
o (Conditional) independence is a property of a distribution

o Example:
Bayes Nets: Assumptions
o Assumptions we are required to make to define the Bayes net when given the graph:
$$P(x_i \mid x_1, \ldots, x_{i-1}) = P(x_i \mid \mathrm{parents}(X_i))$$
o Important for modeling: understand assumptions made when choosing a Bayes net graph
Example

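X -> Y -> Z -> W

o Conditional independence assumptions directly from simplifications in chain rule:
o Chain rule: $P(x, y, z, w) = P(x) P(y \mid x) P(z \mid x, y) P(w \mid x, y, z)$
o Bayes net: $P(x, y, z, w) = P(x) P(y \mid x) P(z \mid y) P(w \mid z)$
o Additional implied conditional independence assumptions?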
Independence in a BN

o Important question about a BN:
o Are two nodes independent given certain evidence?
o Example: X -> Y -> Z
o Question: are X and Z guaranteed to be independent?
o Answer: no. Example: low pressure causes rain, which causes traffic.
o X can influence Z, Z can influence X (via Y)
o Addendum: they could be independent: how?
D-separation: Outline
o Study independence properties for triples
o Why triples?

o Analyze complex cases in terms of member triples

o D-separation: a condition / algorithm for answering such queries
Causal Chains
o This configuration is a “causal chain”: X -> Y -> Z
o X: Low pressure, Y: Rain, Z: Traffic
o Is X guaranteed to be independent of Z? No!
o One example set of CPTs for which X is not independent of Z is sufficient to show this independence is not guaranteed.
o Example: Low pressure causes rain causes traffic, high pressure causes no rain causes no traffic
o In numbers: P( +y | +x ) = 1, P( -y | -x ) = 1, P( +z | +y ) = 1, P( -z | -y ) = 1
Causal Chains
o This configuration is a “causal chain”: X -> Y -> Z
o X: Low pressure, Y: Rain, Z: Traffic
o Given Y, is X guaranteed to be independent of Z? Yes!
o Evidence along the chain “blocks” the influence
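
Both claims about the causal chain can be checked numerically by brute-force enumeration. The sketch below is only an illustration: the CPT numbers are hypothetical (they are not the deterministic values from the slide, so that every conditional is well defined), and the helper names `bern`, `joint` and `prob` are made up for this example.

```python
from itertools import product

VALS = (True, False)

# Hypothetical CPTs for the chain X -> Y -> Z (not taken from the slides).
p_x = 0.3                                # P(+x)
p_y_given_x = {True: 0.9, False: 0.2}    # P(+y | x)
p_z_given_y = {True: 0.8, False: 0.1}    # P(+z | y)


def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(x, y, z):
    """P(x, y, z) = P(x) P(y | x) P(z | y)."""
    return bern(p_x, x) * bern(p_y_given_x[x], y) * bern(p_z_given_y[y], z)


def prob(**fixed):
    """Sum the joint over all assignments that agree with the fixed values."""
    total = 0.0
    for x, y, z in product(VALS, repeat=3):
        assignment = {"x": x, "y": y, "z": z}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(x, y, z)
    return total


# Without evidence, X and Z are dependent: P(+x, +z) differs from P(+x) P(+z).
marginal_gap = abs(prob(x=True, z=True) - prob(x=True) * prob(z=True))

# Given Y, P(x, z | y) = P(x | y) P(z | y) for every combination of values.
conditional_gap = max(
    abs(prob(x=x, y=y, z=z) / prob(y=y)
        - (prob(x=x, y=y) / prob(y=y)) * (prob(y=y, z=z) / prob(y=y)))
    for x, y, z in product(VALS, repeat=3)
)
print(marginal_gap, conditional_gap)
```

The first gap is clearly non-zero while the second is zero up to floating-point error, which matches the "No!" and "Yes!" answers for the two chain slides.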
Common Causes
o This configuration is a “common cause”: X <- Y -> Z
o Y: Project due, X: Forums busy, Z: Lab full
o Guaranteed X independent of Z? No!
o One example set of CPTs for which X is not independent of Z is sufficient to show this independence is not guaranteed.
o Example: Project due causes both forums busy and lab full
o In numbers: P( +x | +y ) = 1, P( -x | -y ) = 1, P( +z | +y ) = 1, P( -z | -y ) = 1
Common Cause
o This configuration is a “common cause”: X <- Y -> Z
o Y: Project due, X: Forums busy, Z: Lab full
o Guaranteed X and Z independent given Y? Yes!
o Observing the cause blocks influence between effects.
Common Effect
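o Last configuration: two causes of one effect (v-structures): X -> Z <- Y
o X: Raining, Y: Ballgame, Z: Traffic
o Are X and Y independent?
o Yes: the ballgame and the rain cause traffic, but they are not correlated
o Proof:
$$P(x, y) = \sum_z P(x) P(y) P(z \mid x, y) = P(x) P(y) \sum_z P(z \mid x, y) = P(x) P(y)$$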
Common Effect
o Last configuration: two causes of one effect (v-structures): X -> Z <- Y
o X: Raining, Y: Ballgame, Z: Traffic
o Are X and Y independent?
o Yes: the ballgame and the rain cause traffic, but they are not correlated (Proved previously)
o Are X and Y independent given Z?
o No: seeing traffic puts the rain and the ballgame in competition as explanation.
o This is backwards from the other cases
o Observing an effect activates influence between possible causes.
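
The "competition as explanation" effect is easy to see with numbers. The following sketch uses hypothetical CPTs for Raining (X), Ballgame (Y) and Traffic (Z); none of the values come from the slides, and the helper names are illustrative.

```python
from itertools import product

VALS = (True, False)

# Hypothetical CPTs for the v-structure X -> Z <- Y (not from the slides).
p_x = 0.3   # P(+x): raining
p_y = 0.2   # P(+y): ballgame
p_z_given_xy = {(True, True): 0.95, (True, False): 0.8,
                (False, True): 0.7, (False, False): 0.1}   # P(+z | x, y)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(x, y, z):
    """P(x, y, z) = P(x) P(y) P(z | x, y)."""
    return bern(p_x, x) * bern(p_y, y) * bern(p_z_given_xy[(x, y)], z)


def prob(**fixed):
    """Sum the joint over all assignments that agree with the fixed values."""
    total = 0.0
    for x, y, z in product(VALS, repeat=3):
        assignment = {"x": x, "y": y, "z": z}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(x, y, z)
    return total


# No evidence: X and Y are independent.
independent_gap = abs(prob(x=True, y=True) - prob(x=True) * prob(y=True))

# Traffic observed: learning about the ballgame changes the belief in rain.
p_rain_given_traffic = prob(x=True, z=True) / prob(z=True)
p_rain_given_traffic_and_game = prob(x=True, y=True, z=True) / prob(y=True, z=True)
print(independent_gap, p_rain_given_traffic, p_rain_given_traffic_and_game)
```

With these numbers, the probability of rain drops once the ballgame is also known, because the ballgame already explains the traffic.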
The General Case

o General question: in a given BN, are two variables independent (given evidence)?
o Solution: analyze the graph
o Any complex example can be broken into repetitions of the three canonical cases
Active / Inactive Paths
o Question: Are X and Y conditionally independent given evidence variables {Z}?
o Yes, if X and Y “d-separated” by Z
o Consider all (undirected) paths from X to Y
o No active paths = independence!
o A path is active if each triple is active:
o Causal chain A -> B -> C where B is unobserved (either direction)
o Common cause A <- B -> C where B is unobserved
o Common effect (aka v-structure) A -> B <- C where B or one of its descendants is observed
o All it takes to block a path is a single inactive segment
[Figure: active triples and inactive triples]
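
The rules above translate fairly directly into a path search. The sketch below is one possible way to code it, assuming the graph is given as a collection of `(parent, child)` pairs and that the query nodes are not themselves observed; the function names and the small test graph are hypothetical. It enumerates simple undirected paths, so it is meant for small graphs only.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def triple_active(edges, children, a, b, c, observed):
    """Apply the three canonical cases to the consecutive triple a - b - c."""
    if (a, b) in edges and (c, b) in edges:
        # Common effect a -> b <- c: active if b or a descendant is observed.
        return b in observed or bool(descendants(children, b) & observed)
    # Causal chain (either direction) or common cause: active if b is unobserved.
    return b not in observed


def d_separated(edges, x, y, observed):
    """True if every undirected path from x to y contains an inactive triple."""
    edges, observed = set(edges), set(observed)
    children, neighbors = {}, {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    def active_path_exists(path):
        last = path[-1]
        if last == y:
            return True
        for nxt in neighbors.get(last, ()):
            if nxt in path:
                continue  # only simple paths are considered
            if len(path) >= 2 and not triple_active(
                    edges, children, path[-2], last, nxt, observed):
                continue  # one inactive triple blocks the whole path
            if active_path_exists(path + [nxt]):
                return True
        return False

    return not active_path_exists([x])


# Hypothetical graph: R -> T <- B, plus T -> T2.
EDGES = [("R", "T"), ("B", "T"), ("T", "T2")]
print(d_separated(EDGES, "R", "B", set()))    # v-structure, nothing observed
print(d_separated(EDGES, "R", "B", {"T"}))    # common effect observed
print(d_separated(EDGES, "R", "B", {"T2"}))   # descendant of the effect observed
```

A return value of `True` means independence is guaranteed by the graph; `False` only means it is not guaranteed.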
D-Separation
Query: $X_i \perp X_j \mid \{X_{k_1}, \ldots, X_{k_n}\}$ ?
Check all (undirected!) paths between $X_i$ and $X_j$
If one or more active paths, then independence not guaranteed
Otherwise (i.e. if all paths are inactive), then independence is guaranteed
Example

[Figure: Bayes net over R, B and T with independence queries; surviving answer: Yes]
Example

[Figure: Bayes net over R, B, D and T with independence queries; surviving answers: Yes, Yes, Yes]
Example
o Variables:
o R: Raining
o T: Traffic
o D: Roof drips
o S: I’m sad
o Questions:
[Figure: Bayes net over R, T, D and S with independence queries; surviving answer: Yes]
Another Perspective: Bayes Ball
Structure Implications
o Given a Bayes net structure, can run d-separation algorithm to build a complete list of conditional independences that are necessarily true of the form
$$X_i \perp X_j \mid \{X_{k_1}, \ldots, X_{k_n}\}$$
o This list determines the set of probability distributions that can be represented
Topology Limits Distributions
o Given some graph topology G, only certain joint distributions can be encoded
o The graph structure guarantees certain (conditional) independences
o (There might be more independence)
o Adding arcs increases the set of distributions, but has several costs
o Full conditioning can encode any distribution
[Figure: the possible graph topologies over X, Y and Z]
Bayes Nets Representation Summary

o Bayes nets compactly encode joint distributions (by making use of conditional independences!)
o Guaranteed independencies of distributions can be deduced from BN graph structure
o D-separation gives precise conditional independence guarantees from graph alone
o A Bayes net’s joint distribution may have further (conditional) independence that is not detectable until you inspect its specific distribution
Bayesian Networks: Sampling

Instructor: Evgeny Pobachienko – UC Berkeley


[Slides credit: Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, Ketrina Yim, and many others]
Approximate Inference: Sampling
Sampling
o Sampling is a lot like repeated simulation
o Predicting the weather, basketball games, …
o Basic idea
o Draw N samples from a sampling distribution S
o Compute an approximate posterior probability
o Show this converges to the true probability
o Why sample?
o Learning: get samples from a distribution you don’t know
o Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)
Sampling
o Sampling from given distribution
o Step 1: Get sample u from uniform distribution over [0, 1)
o E.g. random() in python
o Step 2: Convert this sample u into an outcome for the given distribution by having each target outcome associated with a sub-interval of [0,1) with sub-interval size equal to probability of the outcome
o Example:

C      P(C)
red    0.6
green  0.1
blue   0.3

o Taking the outcomes in table order, the sub-intervals are [0, 0.6) for red, [0.6, 0.7) for green and [0.7, 1) for blue
o If random() returns u = 0.83, then our sample is C = blue
o E.g., after sampling 8 times: [Figure: eight sampled outcomes]
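
The two steps can be sketched in a few lines of Python. The function name `sample_from` and the optional `u` argument (handy for reproducing the u = 0.83 case) are assumptions made for this illustration; the sub-intervals follow the insertion order of the dictionary.

```python
import random


def sample_from(distribution, u=None):
    """Map a uniform draw u in [0, 1) to an outcome via cumulative sub-intervals."""
    if u is None:
        u = random.random()          # Step 1: uniform sample over [0, 1)
    cumulative = 0.0
    for outcome, p in distribution.items():
        cumulative += p              # Step 2: walk through the sub-intervals
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point round-off in the last interval


P_C = {"red": 0.6, "green": 0.1, "blue": 0.3}

print(sample_from(P_C, u=0.83))                # 0.83 lies in blue's sub-interval
print([sample_from(P_C) for _ in range(8)])    # eight random draws
```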
Sampling in Bayes’ Nets

o Prior Sampling

o Rejection Sampling

o Likelihood Weighting

o Gibbs Sampling
Prior Sampling
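Network: Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> WetGrass, Rain -> WetGrass

C    P(C)
+c   0.5
-c   0.5

C    S    P(S|C)
+c   +s   0.1
+c   -s   0.9
-c   +s   0.5
-c   -s   0.5

C    R    P(R|C)
+c   +r   0.8
+c   -r   0.2
-c   +r   0.2
-c   -r   0.8

S    R    W    P(W|S,R)
+s   +r   +w   0.99
+s   +r   -w   0.01
+s   -r   +w   0.90
+s   -r   -w   0.10
-s   +r   +w   0.90
-s   +r   -w   0.10
-s   -r   +w   0.01
-s   -r   -w   0.99

Samples:
+c, -s, +r, +w
-c, +s, -r, +w
…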
Prior Sampling

o For i = 1, 2, …, n in topological order

o Sample xi from P(Xi | Parents(Xi))

o Return (x1, x2, …, xn)
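
A minimal sketch of this loop for the Cloudy / Sprinkler / Rain / WetGrass network, assuming the CPT values from the Prior Sampling tables and encoding +x / -x as `True` / `False`. The variable and function names are illustrative.

```python
import random

# Each entry is P(+var | parents); values from the CPTs of the sprinkler network.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(+s | c)
P_R = {True: 0.8, False: 0.2}                        # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)


def prior_sample(rng):
    """Sample (c, s, r, w) in topological order, each from P(X_i | Parents(X_i))."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return c, s, r, w


rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [prior_sample(rng) for _ in range(100_000)]
p_wet = sum(w for _, _, _, w in samples) / len(samples)
print(p_wet)             # estimate of P(+w)
```

Because parents are always sampled before their children, every CPT lookup has the values it needs at the time it is made.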


Prior Sampling
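o This process generates samples with probability:
$$S_{PS}(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i)) = P(x_1, \ldots, x_n)$$
…i.e. the BN’s joint probability
o Let the number of samples of an event be $N_{PS}(x_1, \ldots, x_n)$
o Then
$$\lim_{N \to \infty} \hat{P}(x_1, \ldots, x_n) = \lim_{N \to \infty} \frac{N_{PS}(x_1, \ldots, x_n)}{N} = S_{PS}(x_1, \ldots, x_n) = P(x_1, \ldots, x_n)$$
o I.e., the sampling procedure is consistent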

Example

o We’ll get a bunch of samples from the BN:
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w
[Figure: network C -> S, C -> R, S -> W, R -> W]
o If we want to know P(W)
o We have counts <+w:4, -w:1>
o Normalize to get P(W) = <+w:0.8, -w:0.2>
o This will get closer to the true distribution with more samples
o What about P(C | +r, +w)?
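
The counting and normalizing step can be sketched as follows, using the five samples listed in this example. The helper name `normalize` is an illustrative choice. The conditional query is answered by counting only the samples that agree with the evidence.

```python
from collections import Counter

samples = [  # (C, S, R, W), the five samples from the example
    ("+c", "-s", "+r", "+w"),
    ("+c", "+s", "+r", "+w"),
    ("-c", "+s", "+r", "-w"),
    ("+c", "-s", "+r", "+w"),
    ("-c", "-s", "-r", "+w"),
]


def normalize(counts):
    """Turn raw counts into a distribution that sums to 1."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# P(W): tally W over all samples, then normalize.
p_w = normalize(Counter(w for _, _, _, w in samples))

# P(C | +r, +w): keep only the samples that agree with the evidence.
p_c_given_rw = normalize(
    Counter(c for c, _, r, w in samples if (r, w) == ("+r", "+w"))
)
print(p_w, p_c_given_rw)
```

Only three of the five samples match +r, +w, so the conditional estimate rests on very little data; that is the motivation for the sampling variants that follow.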
Rejection Sampling

o Let’s say we want P(C)
o Just tally counts of C as we go
o Let’s say we want P(C | +s)
o Same thing: tally C outcomes, but ignore (reject) samples which don’t have S=+s
o We can toss out samples early!
o It is also consistent for conditional probabilities (i.e., correct in the limit)
[Figure: network C -> S, C -> R, S -> W, R -> W]
Rejection Sampling
o Input: evidence instantiation
o For i = 1, 2, …, n in topological order
o Sample xi from P(Xi | Parents(Xi))
o If xi not consistent with evidence
o Reject: return – no sample is generated in this cycle
o Return (x1, x2, …, xn)
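
One way to write this procedure in Python is sketched below for the sprinkler network, estimating P(+c | +s). The `NETWORK` list of (variable, probability function) pairs, the `True` / `False` encoding and the function names are assumptions of this illustration.

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)

# (variable, function giving P(+variable | parents) from the partial sample),
# listed in topological order.
NETWORK = [
    ("C", lambda s: 0.5),
    ("S", lambda s: 0.1 if s["C"] else 0.5),
    ("R", lambda s: 0.8 if s["C"] else 0.2),
    ("W", lambda s: P_W[(s["S"], s["R"])]),
]


def rejection_sample(evidence, rng):
    """Return a full sample, or None as soon as a value contradicts the evidence."""
    sample = {}
    for var, p_true in NETWORK:
        sample[var] = rng.random() < p_true(sample)
        if var in evidence and sample[var] != evidence[var]:
            return None   # reject early: no sample is generated in this cycle
    return sample


def estimate(query, evidence, n, rng):
    """Fraction of accepted samples in which the query variable is true."""
    accepted = [s for s in (rejection_sample(evidence, rng) for _ in range(n))
                if s is not None]
    return sum(s[query] for s in accepted) / len(accepted)


rng = random.Random(0)
p_c_given_s = estimate("C", {"S": True}, 100_000, rng)
print(p_c_given_s)   # estimate of P(+c | +s)
```

Roughly 70% of the cycles are rejected here, since P(+s) = 0.3 under these CPTs.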
Likelihood Weighting
o Problem with rejection sampling:
o If evidence is unlikely, rejects lots of samples
o Consider P( Shape | blue )
o Samples (Shape, Color): pyramid, green; pyramid, red; sphere, blue; cube, red; sphere, green
o Idea: fix evidence variables and sample the rest
o Samples (Shape, Color): pyramid, blue; pyramid, blue; sphere, blue; cube, blue; sphere, blue
o Problem: sample distribution not consistent!
o Solution: weight by probability of evidence given parents
Likelihood Weighting
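Network: Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> WetGrass, Rain -> WetGrass

C    P(C)
+c   0.5
-c   0.5

C    S    P(S|C)
+c   +s   0.1
+c   -s   0.9
-c   +s   0.5
-c   -s   0.5

C    R    P(R|C)
+c   +r   0.8
+c   -r   0.2
-c   +r   0.2
-c   -r   0.8

S    R    W    P(W|S,R)
+s   +r   +w   0.99
+s   +r   -w   0.01
+s   -r   +w   0.90
+s   -r   -w   0.10
-s   +r   +w   0.90
-s   +r   -w   0.10
-s   -r   +w   0.01
-s   -r   -w   0.99

Samples (evidence +s, +w):
+c, +s, +r, +w    w = 1.0 x 0.1 x 0.99
-c, +s, -r, +w    w = 1.0 x 0.5 x 0.90
…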
Likelihood Weighting
o Input: evidence instantiation
o w = 1.0
o for i = 1, 2, …, n in topological order
o if Xi is an evidence variable
o Xi = observation xi for Xi
o Set w = w * P(xi | Parents(Xi))
o else
o Sample xi from P(Xi | Parents(Xi))
o return (x1, x2, …, xn), w
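
The pseudocode above can be sketched in Python as follows, again for the sprinkler network with evidence +s, +w. The `NETWORK` representation, the `True` / `False` encoding and the function names are assumptions of this illustration.

```python
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)

# (variable, function giving P(+variable | parents) from the partial sample),
# listed in topological order.
NETWORK = [
    ("C", lambda s: 0.5),
    ("S", lambda s: 0.1 if s["C"] else 0.5),
    ("R", lambda s: 0.8 if s["C"] else 0.2),
    ("W", lambda s: P_W[(s["S"], s["R"])]),
]


def weighted_sample(evidence, rng):
    """Fix evidence variables, sample the rest, and accumulate the weight."""
    sample, weight = {}, 1.0
    for var, p_true in NETWORK:
        p = p_true(sample)
        if var in evidence:
            sample[var] = evidence[var]
            weight *= p if evidence[var] else 1.0 - p   # P(x_i | Parents(X_i))
        else:
            sample[var] = rng.random() < p
    return sample, weight


def estimate(query, evidence, n, rng):
    """Weighted fraction of samples in which the query variable is true."""
    numerator = denominator = 0.0
    for _ in range(n):
        sample, weight = weighted_sample(evidence, rng)
        denominator += weight
        if sample[query]:
            numerator += weight
    return numerator / denominator


rng = random.Random(0)
p_c = estimate("C", {"S": True, "W": True}, 100_000, rng)
print(p_c)   # estimate of P(+c | +s, +w)
```

No sample is thrown away, but samples with +c carry a much smaller weight than samples with -c because P(+s | +c) = 0.1.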
Likelihood Weighting
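o Sampling distribution if z sampled and e fixed evidence
$$S_{WS}(z, e) = \prod_{i} P(z_i \mid \mathrm{Parents}(Z_i))$$
o Now, samples have weights
$$w(z, e) = \prod_{j} P(e_j \mid \mathrm{Parents}(E_j))$$
o Together, weighted sampling distribution is consistent
$$S_{WS}(z, e) \cdot w(z, e) = \prod_{i} P(z_i \mid \mathrm{Parents}(Z_i)) \prod_{j} P(e_j \mid \mathrm{Parents}(E_j)) = P(z, e)$$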
Likelihood Weighting
o Likelihood weighting is good
o All samples are used
o More of our samples will reflect the state of the world suggested by the evidence
o Values of downstream variables are influenced by upstream evidence
o Likelihood weighting doesn’t solve all our problems
o The values of upstream variables are unaffected by downstream evidence
o With evidence in k leaf nodes, weights will be $O(2^{-k})$
o With high probability, one lucky sample will have much larger weight than the others, dominating the result
o We would like to consider evidence when we sample every variable (leads to Gibbs sampling)
[Figure: network C -> S, C -> R, S -> W, R -> W]
Example: Car Insurance: P(PropertyCost|e)
Gibbs Sampling
Markov Chain Monte Carlo
o Gibbs sampling is an MCMC technique (a special case of Metropolis-Hastings)
o MCMC (Markov chain Monte Carlo) is a family of randomized algorithms for approximating some quantity of interest over a very large state space
o Markov chain = a sequence of randomly chosen states (“random walk”), where each state is chosen conditioned on the previous state
o Monte Carlo = a very expensive city in Monaco with a famous casino
o Monte Carlo = an algorithm (usually based on sampling) that has some probability of producing an incorrect answer
o MCMC = wander around for a bit, average what you see
Gibbs sampling
o A particular kind of MCMC
o States are complete assignments to all variables
o (local search: closely related to simulated annealing!)
o Evidence variables remain fixed, other variables change
o To generate the next state, pick a variable and sample a value for it conditioned on all the other variables: $X_i' \sim P(X_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$
o Will tend to move towards states of higher probability, but can go down too
o In a Bayes net, $P(X_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(X_i \mid \mathrm{markovblanket}(X_i))$
o Theorem: Gibbs sampling is consistent*
o Provided all Gibbs distributions are bounded away from 0 and 1 and variable selection is fair
Gibbs Sampling Example: P( S | +r)
o Step 1: Fix evidence
o R = +r
o Step 2: Initialize other variables
o Randomly
o Step 3: Repeat:
o Choose a non-evidence variable X
o Resample X from P( X | MarkovBlanket(X))
[Figure: the network C, S, +r, W shown after each resampling step]
Resampling of One Variable
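o Sample from P(S | +c, +r, -w)
$$P(S \mid +c, +r, -w) = \frac{P(S, +c, +r, -w)}{P(+c, +r, -w)} = \frac{P(+c) P(S \mid +c) P(+r \mid +c) P(-w \mid S, +r)}{\sum_s P(+c) P(s \mid +c) P(+r \mid +c) P(-w \mid s, +r)} = \frac{P(S \mid +c) P(-w \mid S, +r)}{\sum_s P(s \mid +c) P(-w \mid s, +r)}$$
o Many things cancel out – only CPTs with S remain!
o More generally: only CPTs that have resampled variable need to be considered, and joined together
[Figure: network C, S, +r, W]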
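
A compact sketch of Gibbs sampling for P(S | +r) on the sprinkler network follows. For simplicity it resamples a variable by evaluating the full joint for both of its values and normalizing; the factors that do not mention the variable cancel in that ratio, so this is equivalent to using only the CPTs that contain it. The names `gibbs_estimate` and `burn_in`, the `True` / `False` encoding and the step counts are assumptions of this illustration.

```python
import random

VARIABLES = ("C", "S", "R", "W")
P_S = {True: 0.1, False: 0.5}                        # P(+s | c)
P_R = {True: 0.8, False: 0.2}                        # P(+r | c)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}    # P(+w | s, r)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(a):
    """P(c, s, r, w) for a complete assignment given as a dict."""
    return (bern(0.5, a["C"]) * bern(P_S[a["C"]], a["S"])
            * bern(P_R[a["C"]], a["R"]) * bern(P_W[(a["S"], a["R"])], a["W"]))


def gibbs_estimate(query, evidence, n_steps, rng, burn_in=1_000):
    """Estimate P(+query | evidence) by resampling one hidden variable per step."""
    # Fix the evidence and initialize the other variables randomly.
    state = {v: evidence[v] if v in evidence else rng.random() < 0.5
             for v in VARIABLES}
    hidden = [v for v in VARIABLES if v not in evidence]
    hits = 0
    for step in range(burn_in + n_steps):
        var = rng.choice(hidden)
        p_true = joint({**state, var: True})
        p_false = joint({**state, var: False})
        state[var] = rng.random() < p_true / (p_true + p_false)
        if step >= burn_in and state[query]:
            hits += 1
    return hits / n_steps


rng = random.Random(0)
p_s_given_r = gibbs_estimate("S", {"R": True}, 100_000, rng)
print(p_s_given_r)   # estimate of P(+s | +r)
```

For this small network the exact answer can be computed by hand as 0.09 / 0.5 = 0.18, which gives a reference point for the estimate.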
Why would anyone do this?
Samples soon begin to reflect all the evidence in the network
Eventually they are being drawn from the true posterior!
Car Insurance: P(PropertyCost | e)
Car Insurance: P(Age | costs)
Why does it work? (see AIMA 13.4.2 for details)
o Suppose we run it for a long time and predict the probability of reaching any given state at time t: $\pi_t(x_1, \ldots, x_n)$ or $\pi_t(x)$
o Each Gibbs sampling step (pick a variable, resample its value) applied to a state x has a probability $k(x' \mid x)$ of reaching a next state x’
o So $\pi_{t+1}(x') = \sum_x k(x' \mid x) \pi_t(x)$ or, in matrix/vector form, $\pi_{t+1} = K \pi_t$
o When the process is in equilibrium $\pi_{t+1} = \pi_t = \pi$ so $K \pi = \pi$
o This has a unique* solution $\pi = P(x_1, \ldots, x_n \mid e_1, \ldots, e_k)$
o So for large enough t the next sample will be drawn from the true posterior
o “Large enough” depends on CPTs in the Bayes net; takes longer if nearly deterministic
Bayes’ Net Sampling Summary
o Prior Sampling P( Q )
o Rejection Sampling P( Q | e )
o Likelihood Weighting P( Q | e )
o Gibbs Sampling P( Q | e )
CS 188: Artificial Intelligence
Hidden Markov Models

Instructor: Evgeny Pobachienko — UC Berkeley


[Slides Credit: Dan Klein, Pieter Abbeel, Anca Dragan, Stuart Russell, and many others]
Reasoning over Time or Space

o Often, we want to reason about a sequence of observations
o Speech recognition
o Robot localization
o User attention
o Medical monitoring

o Need to introduce time (or space) into our models


Example Markov Chain: Weather

o States: X = {rain, sun}
o Initial distribution: P(X0)

X0     P(X0)
sun    1.0
rain   0.0

o CPT P(Xt | Xt-1):

Xt-1   Xt     P(Xt|Xt-1)
sun    sun    0.9
sun    rain   0.1
rain   sun    0.3
rain   rain   0.7

o Two new ways of representing the same CPT
[Figure: state transition diagram and trellis, both labeled with the probabilities 0.9, 0.1, 0.3 and 0.7]
Markov Chains
o Value of X at a given time is called the state
o X1 -> X2 -> X3 -> X4
o P(Xt) = ?
o Transition probabilities (dynamics): P(Xt | Xt–1) specify how the state evolves over time
Markovian Assumption

o Basic conditional independence:


o Given the present, the future is independent of the past!
o Each time step only depends on the previous
o This is called the (first order) Markov property
Example Markov Chain: Weather
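o Initial distribution: 1.0 sun
[Figure: state transition diagram: sun -> sun 0.9, sun -> rain 0.1, rain -> sun 0.3, rain -> rain 0.7]
o What is the probability distribution after one step?
$$P(X_2 = \text{sun}) = \sum_{x_1} P(x_1, X_2 = \text{sun}) = \sum_{x_1} P(X_2 = \text{sun} \mid x_1) P(x_1)$$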