Bayesian Belief Networks: CS 2740 Knowledge Representation
Lecture 19
Milos Hauskrecht
[email protected]
5329 Sennott Square
Probabilistic inference
Various inference tasks:
• Diagnostic task (from effect to cause):
  P(Pneumonia | Fever = T)
• Prediction task (from cause to effect):
  P(Fever | Pneumonia = T)
• Other probabilistic queries (queries on joint distributions):
  P(Fever)
  P(Fever, ChestPain)
Inference
Any query can be computed from the full joint distribution !!!
• Joint over a subset of variables is obtained through
marginalization
P(A = a, C = c) = Σ_i Σ_j P(A = a, B = b_i, C = c, D = d_j)
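As a small illustration, marginalization over a stored full joint table might look like this in Python (a sketch; the joint values below are made up just for the example, and all names are illustrative):

from itertools import product
import random

# A made-up full joint P(A, B, C, D) over four binary variables, stored as one big table.
random.seed(0)
full_joint = {xs: random.random() for xs in product([0, 1], repeat=4)}
norm = sum(full_joint.values())
full_joint = {xs: v / norm for xs, v in full_joint.items()}   # normalize to sum to 1

def marginal(a, c):
    """P(A = a, C = c), obtained by summing out B and D."""
    return sum(full_joint[(a, b, c, d)] for b in (0, 1) for d in (0, 1))

print(marginal(1, 0))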
Inference
Any query can be computed from the full joint distribution !!!
• Any joint probability can be expressed as a product of
conditionals via the chain rule.
P(X_1, X_2, ..., X_n) = P(X_n | X_1, ..., X_{n-1}) P(X_1, ..., X_{n-1})
 = P(X_n | X_1, ..., X_{n-1}) P(X_{n-1} | X_1, ..., X_{n-2}) P(X_1, ..., X_{n-2})
 = ... = ∏_{i=1..n} P(X_i | X_1, ..., X_{i-1})
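A quick numerical check of the chain rule on an arbitrary joint over three binary variables (a sketch; the joint is randomly generated just for illustration, all names are made up):

from itertools import product
import random

random.seed(0)
joint = {xs: random.random() for xs in product([0, 1], repeat=3)}
z = sum(joint.values())
joint = {xs: v / z for xs, v in joint.items()}           # normalize to a distribution

def marg(assign):
    """P(X_1 = assign[0], ..., X_k = assign[k-1]) by summing out the remaining variables."""
    return sum(v for xs, v in joint.items() if xs[:len(assign)] == assign)

def cond(assign):
    """P(X_k = assign[-1] | X_1..X_{k-1} = assign[:-1]) as a ratio of marginals."""
    return marg(assign) / marg(assign[:-1])

x = (1, 0, 1)
chain = cond(x[:1]) * cond(x[:2]) * cond(x[:3])          # P(x1) P(x2|x1) P(x3|x1,x2)
print(abs(chain - joint[x]) < 1e-12)                     # True: the chain rule holds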
Modeling uncertainty with probabilities
• Defining the full joint distribution makes it possible to
represent and reason with uncertainty in a uniform way
• We are able to handle an arbitrary inference problem
Problems:
– Space complexity. To store a full joint distribution we
need to remember O(d^n) numbers.
n – number of random variables, d – number of values
– Inference (time) complexity. To compute some queries
requires O(d^n) steps.
– Acquisition problem. Who is going to define all of the
probability entries?
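For a sense of scale: with n = 20 binary variables (d = 2) the full joint already has 2^20 = 1,048,576 entries, and answering a query by direct summation touches a comparable number of terms.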
Modeling uncertainty with probabilities
• Knowledge-based system era (70s – early 80s)
– Extensional, non-probabilistic models
– Intended to solve the space, time and acquisition bottlenecks of probability-based models
– Their limitations froze the development and advancement of KB systems and contributed to the slow-down of AI in the 80s in general
• The key to overcoming these bottlenecks in probability-based models: conditional independences among random variables
P(A, B | C) = P(A | C) P(B | C)
P(A | C, B) = P(A | C)
Alarm system example.
• Assume your house has an alarm system against burglary.
You live in a seismically active area, so the alarm can
occasionally be set off by an earthquake. You have two
neighbors, Mary and John, who do not know each other. If
they hear the alarm they call you, but this is not guaranteed.
• We want to represent the probability distribution of events:
– Burglary, Earthquake, Alarm, Mary calls and John calls
[Network structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls. Local conditionals: P(A | B, E), P(J | A), P(M | A).]
Bayesian belief network.
2. Local conditional distributions
• relate variables and their parents
P(B):                          P(E):
  B = T    B = F                 E = T    E = F
  0.001    0.999                 0.002    0.998

P(A | B, E):
  B   E    A = T    A = F
  T   T    0.95     0.05
  T   F    0.94     0.06
  F   T    0.29     0.71
  F   F    0.001    0.999

P(J | A):                      P(M | A):
  A    J = T    J = F            A    M = T    M = F
  T    0.90     0.10             T    0.70     0.30
  F    0.05     0.95             F    0.01     0.99
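As a concrete illustration, the tables above could be stored directly as Python dictionaries keyed by the parent values (a minimal sketch; the names P_B, P_E, etc. are illustrative and not from the lecture):

# Alarm-network CPTs, transcribed from the tables above.
P_B = {True: 0.001, False: 0.999}                        # P(Burglary)
P_E = {True: 0.002, False: 0.998}                        # P(Earthquake)
P_A = {(True, True):   {True: 0.95,  False: 0.05},       # P(Alarm | Burglary, Earthquake)
       (True, False):  {True: 0.94,  False: 0.06},
       (False, True):  {True: 0.29,  False: 0.71},
       (False, False): {True: 0.001, False: 0.999}}
P_J = {True:  {True: 0.90, False: 0.10},                 # P(JohnCalls | Alarm)
       False: {True: 0.05, False: 0.95}}
P_M = {True:  {True: 0.70, False: 0.30},                 # P(MaryCalls | Alarm)
       False: {True: 0.01, False: 0.99}}

# Sanity check: every (conditional) distribution sums to 1 over its values.
for prior in (P_B, P_E):
    assert abs(sum(prior.values()) - 1.0) < 1e-9
for cpt in (P_A, P_J, P_M):
    for row in cpt.values():
        assert abs(sum(row.values()) - 1.0) < 1e-9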
Bayesian belief networks (general)
Two components: B = (S, Θ_S)
• Directed acyclic graph S
– Nodes correspond to random variables
– (Missing) links encode independences
• Parameters Θ_S
– Local conditional probability distributions, one for every variable-parent configuration:
P(X_i | pa(X_i))
where pa(X_i) stands for the parents of X_i
[Figure: the alarm network B, E → A → J, M, with the table for P(A | B, E) attached to the Alarm node.]
The full joint distribution is then:
P(X_1, X_2, ..., X_n) = ∏_{i=1..n} P(X_i | pa(X_i))
Example:
Assume the following assignment of values to the random variables:
B = T, E = T, A = T, J = T, M = F
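A minimal computational sketch of this product of local conditionals, using the CPT values from the tables above (function and variable names are illustrative, not part of the lecture):

# Factored joint for the alarm network: P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).
# Only the "True" entry of each conditional is stored; the "False" entry is 1 minus it.
pB, pE = 0.001, 0.002
pA = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
pJ = {True: 0.90, False: 0.05}      # P(JohnCalls = T | Alarm)
pM = {True: 0.70, False: 0.01}      # P(MaryCalls = T | Alarm)

def val(p_true, v):
    """P(X = v) for a binary variable X with P(X = True) = p_true."""
    return p_true if v else 1.0 - p_true

def joint(b, e, a, j, m):
    return (val(pB, b) * val(pE, e) * val(pA[(b, e)], a)
            * val(pJ[a], j) * val(pM[a], m))

# The example assignment B = T, E = T, A = T, J = T, M = F:
print(joint(True, True, True, True, False))              # ≈ 5.13e-07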
Bayesian belief networks (BBNs)
• Represent the full joint distribution over the variables more
compactly using the product of local conditionals.
• But how did we get to local parameterizations?
Answer:
• Graphical structure encodes conditional and marginal
independences among random variables
• A and B are independent:  P(A, B) = P(A) P(B)
• A and B are conditionally independent given C:
P(A | C, B) = P(A | C)
P(A, B | C) = P(A | C) P(B | C)
• The graph structure implies the decomposition !!!
Independences in BBNs
3 basic independence structures:
1. Chain:  Burglary → Alarm → JohnCalls
– Burglary and JohnCalls are conditionally independent given Alarm
2. Common cause:  JohnCalls ← Alarm → MaryCalls
– JohnCalls and MaryCalls are conditionally independent given Alarm
3. Common effect (v-structure):  Burglary → Alarm ← Earthquake
– Burglary and Earthquake are marginally independent, but become dependent once Alarm is observed
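These three structures can be checked numerically by enumerating the full joint from the alarm-network CPTs given earlier (a sketch; all names are illustrative):

from itertools import product

pB, pE = 0.001, 0.002
pA = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
pJ = {True: 0.90, False: 0.05}      # P(JohnCalls = T | Alarm)
pM = {True: 0.70, False: 0.01}      # P(MaryCalls = T | Alarm)

def val(p_true, v):                  # P(X = v) for a binary X with P(X = True) = p_true
    return p_true if v else 1.0 - p_true

joint = {(b, e, a, j, m):
         val(pB, b) * val(pE, e) * val(pA[(b, e)], a) * val(pJ[a], j) * val(pM[a], m)
         for b, e, a, j, m in product([True, False], repeat=5)}

IDX = {"B": 0, "E": 1, "A": 2, "J": 3, "M": 4}

def P(**fixed):
    """Marginal probability of a partial assignment, by summing the full joint."""
    return sum(p for xs, p in joint.items()
               if all(xs[IDX[k]] == v for k, v in fixed.items()))

# 1. Chain B -> A -> J: P(J=T | A=T, B=T) equals P(J=T | A=T)
print(round(P(J=True, A=True, B=True) / P(A=True, B=True), 6),
      round(P(J=True, A=True) / P(A=True), 6))
# 2. Common cause J <- A -> M: P(J=T, M=T | A=T) equals P(J=T | A=T) * P(M=T | A=T)
print(round(P(J=True, M=True, A=True) / P(A=True), 6),
      round(P(J=True, A=True) / P(A=True) * P(M=True, A=True) / P(A=True), 6))
# 3. Common effect B -> A <- E: independent a priori, dependent once Alarm is observed
print(round(P(B=True, E=True), 12), round(P(B=True) * P(E=True), 12))            # equal
print(round(P(B=True, E=True, A=True) / P(A=True), 6),
      round(P(B=True, A=True) / P(A=True) * P(E=True, A=True) / P(A=True), 6))   # differ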
Independences in BBNs
• BBN distribution models many conditional independence
relations among distant variables and sets of variables
• These are defined in terms of the graphical criterion called d-
separation
• D-separation and independence
– Let X,Y and Z be three sets of nodes
– If X and Y are d-separated by Z, then X and Y are
conditionally independent given Z
• D-separation:
– A is d-separated from B given C if every undirected path
between them is blocked given C
• Path blocking
– 3 cases that expand on three basic independence structures
Undirected path blocking
A is d-separated from B given C if every undirected path
between them is blocked. An undirected path is blocked if it
contains a node Z for which one of the following holds:
1. Chain:  X → Z → Y, with X in A, Y in B, and Z in C
2. Common cause:  X ← Z → Y, with X in A, Y in B, and Z in C
3. Common effect (v-structure):  X → Z ← Y, with X in A, Y in B, and neither Z nor any of its descendants in C
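A minimal d-separation checker for the alarm network, directly applying the three blocking cases above (an illustrative sketch, not the lecture's algorithm; all names are assumptions of this example):

# The graph as a parent list: child -> list of parents.
PARENTS = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

def children(node):
    return [c for c, ps in PARENTS.items() if node in ps]

def descendants(node):
    out, stack = set(), [node]
    while stack:
        for c in children(stack.pop()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def undirected_paths(x, y):
    """All simple paths between x and y, ignoring edge directions."""
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        if path[-1] == y:
            paths.append(path)
            continue
        for n in set(PARENTS[path[-1]]) | set(children(path[-1])):
            if n not in path:
                stack.append(path + [n])
    return paths

def blocked(path, given):
    """True if some interior node Z blocks the path according to the 3 cases."""
    for i in range(1, len(path) - 1):
        a, z, b = path[i - 1], path[i], path[i + 1]
        if a in PARENTS[z] and b in PARENTS[z]:        # case 3: v-structure a -> z <- b
            if z not in given and not (descendants(z) & set(given)):
                return True                            # collider unobserved: path blocked
        elif z in given:                               # cases 1 and 2: chain / common cause
            return True                                # observed middle node: path blocked
    return False

def d_separated(x, y, given):
    return all(blocked(p, given) for p in undirected_paths(x, y))

print(d_separated("JohnCalls", "MaryCalls", {"Alarm"}))   # True
print(d_separated("Burglary", "Earthquake", set()))       # True
print(d_separated("Burglary", "Earthquake", {"Alarm"}))   # False (explaining away)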
Independences in BBNs
Example d-separation queries on an extended alarm network that adds a RadioReport node.
[Figure: the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls) extended with RadioReport; the slides step through several d-separation queries on this network.]
Bayesian belief networks (BBNs)
• Represent the full joint distribution over the variables more
compactly using the product of local conditionals:
P(X_1, X_2, ..., X_n) = ∏_{i=1..n} P(X_i | pa(X_i))
• So how did we get to the local parameterizations? Take the
example assignment and rewrite its joint probability:
P(B = T, E = T, A = T, J = T, M = F) = ?
Full joint distribution in BBNs
Rewrite the full joint probability using the product rule and the
conditional independences encoded in the graph:

P(B = T, E = T, A = T, J = T, M = F)
 = P(J = T | B = T, E = T, A = T, M = F) P(B = T, E = T, A = T, M = F)
 = P(J = T | A = T) P(B = T, E = T, A = T, M = F)
 = P(J = T | A = T) P(M = F | B = T, E = T, A = T) P(B = T, E = T, A = T)
 = P(J = T | A = T) P(M = F | A = T) P(B = T, E = T, A = T)
 = P(J = T | A = T) P(M = F | A = T) P(A = T | B = T, E = T) P(B = T, E = T)
 = P(J = T | A = T) P(M = F | A = T) P(A = T | B = T, E = T) P(B = T) P(E = T)
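Plugging in the numbers from the CPTs shown earlier: P(J = T | A = T) = 0.90, P(M = F | A = T) = 0.30, P(A = T | B = T, E = T) = 0.95, P(B = T) = 0.001, P(E = T) = 0.002, so
P(B = T, E = T, A = T, J = T, M = F) = 0.90 × 0.30 × 0.95 × 0.001 × 0.002 ≈ 5.13 × 10^-7.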
Parameter complexity problem
• In the BBN the full joint distribution is defined as:
P(X_1, X_2, ..., X_n) = ∏_{i=1..n} P(X_i | pa(X_i))
• What did we save?
Alarm example: 5 binary (True, False) variables
# of parameters of the full joint:  2^5 = 32
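To complete the comparison, a direct count of the entries in the local tables shown earlier gives:
# of parameters of the BBN:  2 + 2 + 8 + 4 + 4 = 20
and since every row of a (conditional) probability table sums to 1, only 1 + 1 + 4 + 2 + 2 = 10 of these are free, versus 2^5 - 1 = 31 free parameters of the full joint.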