Data Mining - Utrecht University - 10. Slides
Ad Feelders
Universiteit Utrecht
Do you like noodles?

                  Answer
Race    Gender    Yes    No
Black   Male       10    40
        Female     30    20
White   Male      100   100
        Female    120    80

[Figure: graph on the nodes G (Gender), R (Race), A (Answer)]

G ⊥⊥ R | A
Strange: Gender and Race are prior to Answer, but this model says they
are independent given Answer!
Ad Feelders ( Universiteit Utrecht ) Data Mining 3 / 49
Do you like noodles?
           Race
Gender    Black   White
Male        50     200
Female      50     200

From this table we conclude that Race and Gender are independent in the
data.

[Figure: graph on the nodes G, R, A]

G ⊥⊥ R,  but  G ⊥̸⊥ R | A
Answer = Yes:
           Race
Gender    Black   White
Male        10     100
Female      30     120

Answer = No:
           Race
Gender    Black   White
Male        40     100
Female      20      80
From these tables we conclude that Race and Gender are dependent given
Answer.
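The independence claims can be checked numerically on the counts above with odds ratios (a quick sketch; the odds_ratio helper is my own illustration, not part of the slides):

```python
# Check (conditional) independence in the noodle-survey counts from the tables.

def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]; 1.0 means independence."""
    return (a * d) / (b * c)

# Marginal Gender x Race table: Male (50, 200), Female (50, 200).
print(odds_ratio(50, 200, 50, 200))   # 1.0 -> G and R independent in the data

# Conditional on Answer = Yes: Male (10, 100), Female (30, 120).
print(odds_ratio(10, 100, 30, 120))   # 0.4 -> dependent given A

# Conditional on Answer = No: Male (40, 100), Female (20, 80).
print(odds_ratio(40, 100, 20, 80))    # 1.6 -> dependent given A
```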
Corresponding to this ordering we can use the product rule to factorize the
joint distribution of X1, X2, …, Xk as

P(X1, …, Xk) = ∏_{j=1}^{k} P(Xj | X1, …, X_{j−1}).

In this factorization, Xi can be dropped from the conditioning set of Xj when

j ⊥⊥ i | {1, …, j} \ {i, j}

More loosely:

j ⊥⊥ i | prior variables

Compare this to pairwise independence:

j ⊥⊥ i | rest
[Figures: seven graphs on the nodes 1, 2, 3, 4]
P(X1, …, Xk) = ∏_{i=1}^{k} P(Xi | Xpa(i))
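The factorization over parent sets can be made concrete in a few lines of code. This is a minimal sketch: the tiny DAG, the CPT values, and the dict encoding are all made up for illustration, not taken from the slides.

```python
from itertools import product

# parents[i] lists the parent indices of node i (0-based):
# X1 -> X3 <- X2, X3 -> X4 (an invented example DAG).
parents = {0: [], 1: [], 2: [0, 1], 3: [2]}

# cpt[i] maps (x_i, x_pa(i)) -> P(x_i | x_pa(i)); all variables binary here.
cpt = {
    0: {(0, ()): 0.5, (1, ()): 0.5},
    1: {(0, ()): 0.6, (1, ()): 0.4},
    2: {(1, (0, 0)): 0.9, (0, (0, 0)): 0.1,
        (1, (0, 1)): 0.7, (0, (0, 1)): 0.3,
        (1, (1, 0)): 0.8, (0, (1, 0)): 0.2,
        (1, (1, 1)): 0.2, (0, (1, 1)): 0.8},
    3: {(1, (0,)): 0.3, (0, (0,)): 0.7,
        (1, (1,)): 0.6, (0, (1,)): 0.4},
}

def joint(x):
    """P(X1=x[0], ..., Xk=x[k-1]) as the product of the local CPTs."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpt[i][(x[i], tuple(x[j] for j in pa))]
    return p

# Because every CPT is normalized, the joint sums to 1 over all configurations.
print(sum(joint(x) for x in product([0, 1], repeat=4)))
```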
[Figures: two graphs on the nodes X1, X2, X3, X4, X5]
To verify

i ⊥⊥ j | S

construct the moral graph of the smallest ancestral set containing {i, j} ∪ S,
and check whether S separates i from j in this undirected graph.

[Figure: graphs on the nodes X1, X2, X3, X4, X5]
[Figures: three graphs on the nodes 1, 2, 3]
Learning Bayesian Networks
L = p(1)^{n(1)} (1 − p(1))^{n − n(1)}

Take the derivative of the log-likelihood with respect to p(1), equate it to
zero, and solve for p(1):

d log L / d p(1) = n(1)/p(1) − (n − n(1))/(1 − p(1)) = 0,

since d log x / dx = 1/x (where log is the natural logarithm). Solving gives
p̂(1) = n(1)/n. More generally, for J categories,

p̂(j) = n(j)/n,   j = 1, 2, …, J

This relative frequency is also the maximum likelihood estimate.
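The estimate p̂(j) = n(j)/n is just a relative frequency, which is easy to compute directly. A minimal sketch (the helper name and the sample data are mine, not from the slides):

```python
from collections import Counter

def mle_categorical(data):
    """ML estimate of a categorical distribution: p_hat(j) = n(j) / n."""
    n = len(data)
    return {j: c / n for j, c in Counter(data).items()}

# Made-up sample with n = 8 observations over categories {1, 2, 3}.
sample = [1, 1, 2, 1, 3, 2, 1, 1]
print(mle_categorical(sample))   # {1: 0.625, 2: 0.25, 3: 0.125}
```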
For a Bayesian network we must estimate the conditional probabilities

p(Xi | Xpa(i)),   i = 1, 2, …, k

The log-likelihood is

L = ∑_{i=1}^{k} ∑_{xi, xpa(i)} n(xi, xpa(i)) log p(xi | xpa(i)),

which is maximized by

p̂(xi | xpa(i)) = n(xi, xpa(i)) / n(xpa(i)),
where
n(xi , xpa(i) ) is the number of records in the data with
Xi = xi and Xpa(i) = xpa(i) , and
n(xpa(i) ) is the number of records in the data with Xpa(i) = xpa(i) .
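These counts-and-divide estimates can be computed generically. A sketch, assuming records are encoded as dicts mapping variable name to value (the encoding and the helper name mle_cpt are my own, not from the slides):

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """ML estimate of a CPT: p_hat(x_i | x_pa) = n(x_i, x_pa) / n(x_pa)."""
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    return {(xc, pa): c / marg[pa] for (xc, pa), c in joint.items()}

# Tiny made-up example: estimate P(B | A) from three records.
rows = [{"A": 1, "B": 1}, {"A": 1, "B": 2}, {"A": 2, "B": 1}]
print(mle_cpt(rows, "B", ["A"]))   # {(1, (1,)): 0.5, (2, (1,)): 0.5, (1, (2,)): 1.0}
```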
[Figure: DAG with arcs X1 → X3, X2 → X3, X3 → X4]
P(X1, X2, X3, X4) = p1(X1) p2(X2) p3|12(X3 | X1, X2) p4|3(X4 | X3)
Now we have to estimate the following parameters (X4 ternary, the rest binary):

p4|3(1|1)   p4|3(2|1)   p4|3(3|1) = 1 − p4|3(1|1) − p4|3(2|1)
p4|3(1|2)   p4|3(2|2)   p4|3(3|2) = 1 − p4|3(1|2) − p4|3(2|2)
obs X1 X2 X3 X4
1 1 1 1 1
2 1 1 1 1
3 1 1 2 1
4 1 2 2 1
5 1 2 2 2
6 2 1 1 2
7 2 1 2 3
8 2 1 2 3
9 2 2 2 3
10 2 2 1 3
p̂1(1) = n(x1 = 1) / n = 5/10 = 1/2
Maximum Likelihood Estimation
p̂2(1) = n(x2 = 1) / n = 6/10
p̂3|1,2(1|1, 1) = n(x1 = 1, x2 = 1, x3 = 1) / n(x1 = 1, x2 = 1) = 2/3
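The estimates in this worked example can be reproduced directly from the ten observations in the data table (the variable names below are mine):

```python
# The 10 observations from the slides, as tuples (X1, X2, X3, X4).
data = [(1, 1, 1, 1), (1, 1, 1, 1), (1, 1, 2, 1), (1, 2, 2, 1), (1, 2, 2, 2),
        (2, 1, 1, 2), (2, 1, 2, 3), (2, 1, 2, 3), (2, 2, 2, 3), (2, 2, 1, 3)]

n = len(data)
p1_1 = sum(1 for r in data if r[0] == 1) / n          # p̂1(1) = 5/10
p2_1 = sum(1 for r in data if r[1] == 1) / n          # p̂2(1) = 6/10
num = sum(1 for r in data if r[:3] == (1, 1, 1))      # n(x1=1, x2=1, x3=1)
den = sum(1 for r in data if r[:2] == (1, 1))         # n(x1=1, x2=1)
print(p1_1, p2_1, num / den)                          # -> 0.5, 0.6, 2/3
```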