ST Flour Notes
Lauritzen
September 4, 2011
Springer
Contents
1 Introduction
2 Markov Properties
2.1 Conditional Independence
2.1.1 Basic properties
2.1.2 General conditional independence
2.2 Markov Properties for Undirected Graphs
2.3 Markov Properties for Directed Acyclic Graphs
2.4 Summary
6 Estimation of Structure
6.1 Estimation of Structure and Bayes Factors
6.2 Estimating Trees and Forests
6.3 Learning Bayesian networks
6.3.1 Model search methods
6.3.2 Constraint-based search
6.4 Summary
References
Chapter 1
Introduction
Conditional independence
Graphical models
[Figure: the undirected graph on vertices 1–7 used as a running example; see Fig. 2.1.]
For several variables, complex systems of conditional independence can be described by undirected graphs. A set of variables A is then conditionally independent of a set of variables B, given the values of a set of variables C, whenever C separates A from B in the graph.
[Figure: a directed graph with nodes “Visit to Asia?”, “Has tuberculosis”, “Smoker?”, “Has bronchitis”, “Tuberculosis or cancer”, and “Positive X-ray?”.]
Fig. 1.1 An example of a directed graphical model describing the relationship between risk factors, lung diseases and symptoms. This model was used by Lauritzen and Spiegelhalter (1988) to illustrate important concepts in probabilistic expert systems.
A pedigree
Graphical model for a pedigree from a study of Werner’s syndrome. Each node is itself a graphical model.
Chapter 2
Markov Properties
2.1 Conditional Independence
One can show that for random variables X, Y, Z, and W the following properties hold:
(C1) if X ⊥⊥ Y | Z then Y ⊥⊥ X | Z;
(C2) if X ⊥⊥ Y | Z and U = g(Y ), then X ⊥⊥ U | Z;
(C3) if X ⊥⊥ Y | Z and U = g(Y ), then X ⊥⊥ Y | (Z,U);
(C4) if X ⊥⊥ Y | Z and X ⊥⊥ W | (Y, Z), then X ⊥⊥ (Y,W ) | Z;
If the joint distribution of the random variables has a density w.r.t. a product measure which is strictly positive, it further holds that
(C5) if X ⊥⊥ Y | (Z,W ) and X ⊥⊥ Z | (Y,W ) then X ⊥⊥ (Y, Z) |W .
Without additional conditions on the joint distribution, (C5) does not hold, but pos-
itivity of the density is not necessary for (C5). For example, in the case where W is
constant it is enough that f (y, z) > 0 for all (y, z) or f (x, z) > 0 for all (x, z). In the
discrete and finite case it is sufficient that the bipartite graphs G+ = (Y ∪ Z , E+ )
defined by
y ∼+ z ⇐⇒ f (y, z) > 0,
are all connected, or alternatively that the same condition is satisfied with X replacing Y.
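These positivity issues are easy to explore numerically for small discrete distributions. The following sketch (plain numpy; the helper cond_indep and the chosen example are mine, not from the notes) uses the degenerate distribution X = Y = Z which reappears in Example 2.1 below: both X ⊥⊥ Y | Z and X ⊥⊥ Z | Y hold, yet X ⊥⊥ (Y, Z) fails, so the conclusion of (C5) (with W trivial) breaks down when the density is not strictly positive.

```python
import numpy as np

def cond_indep(p, X, Y, Z, tol=1e-12):
    """Check X indep Y given Z for a discrete joint distribution p.

    p: numpy array with one axis per variable; X, Y, Z: disjoint tuples of
    axis indices.  The criterion is p(x,y,z) p(z) = p(x,z) p(y,z) for all
    cells, with marginals computed by summing out the remaining axes
    (keepdims=True so that broadcasting lines the cells up)."""
    axes = set(range(p.ndim))
    def marg(keep):
        drop = tuple(sorted(axes - set(keep)))
        return p.sum(axis=drop, keepdims=True)
    pxyz, pz = marg(X + Y + Z), marg(Z)
    pxz, pyz = marg(X + Z), marg(Y + Z)
    return np.allclose(pxyz * pz, pxz * pyz, atol=tol)

# X = Y = Z uniform on {0,1}: all mass on (0,0,0) and (1,1,1).
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5

print(cond_indep(p, (0,), (1,), (2,)))      # X indep Y | Z : True
print(cond_indep(p, (0,), (2,), (1,)))      # X indep Z | Y : True
print(cond_indep(p, (0,), (1, 2), ()))      # X indep (Y,Z) : False
```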
Conditional independence can be seen as encoding irrelevance in a fundamental
way. If we give A ⊥⊥ B |C the interpretation: Knowing C, A is irrelevant for learning
B, the properties (C1)–(C4) translate to:
(I1) If, knowing C, learning A is irrelevant for learning B, then B is irrelevant for
learning A;
(I2) If, knowing C, learning A is irrelevant for learning B, then A is irrelevant for
learning any part D of B;
(I3) If, knowing C, learning A is irrelevant for learning B, it remains irrelevant
having learnt any part D of B;
(I4) If, knowing C, learning A is irrelevant for learning B and, having also learnt
A, D remains irrelevant for learning B, then both of A and D are irrelevant for
learning B.
The property (C5) does not have immediate intuitive appeal for general irrelevance.
Also the symmetry (C1) is a special property of probabilistic conditional indepen-
dence, rather than of general irrelevance, so (I1) does not have the same immediate
appeal as the others.
Probabilistic conditional independence defines a relation ⊥σ on subsets of V by A ⊥σ B | S ⇐⇒ XA ⊥⊥ XB | XS; by (C1)–(C4) this relation satisfies the semigraphoid axioms (S1)–(S4), and by (C5) also the graphoid axiom (S5) when the density is strictly positive.
Sets of random variables A and B are partially uncorrelated for fixed C if their
residuals after linear regression on XC are uncorrelated:
in other words, if the partial correlations ρAB·C are equal to zero. If this holds we
write A ⊥2 B |C. The relation ⊥2 satisfies the semigraphoid axioms (S1)–(S4), and
the graphoid axioms if there is no non-trivial linear relation between the variables
in V .
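Partial uncorrelatedness can be checked directly from a covariance matrix via Σ_AB·C = Σ_AB − Σ_AC Σ_CC⁻¹ Σ_CB. Below is a minimal numpy sketch (the helper and the made-up three-variable covariance, in which X3 = X1 + X2 + noise, are my own) illustrating that ⊥2 can hold marginally but fail after conditioning:

```python
import numpy as np

def partial_cov(Sigma, A, B, C):
    """Covariance of X_A and X_B after linear regression on X_C:
    Sigma_AB.C = Sigma_AB - Sigma_AC Sigma_CC^{-1} Sigma_CB."""
    S = np.asarray(Sigma)
    SAB = S[np.ix_(A, B)]
    if not C:
        return SAB
    SAC, SCB, SCC = S[np.ix_(A, C)], S[np.ix_(C, B)], S[np.ix_(C, C)]
    return SAB - SAC @ np.linalg.solve(SCC, SCB)

# Hypothetical covariance: X1, X2 uncorrelated, X3 = X1 + X2 + noise.
Sigma = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 3.0]])
print(partial_cov(Sigma, [0], [1], []))    # 0: X1 and X2 are uncorrelated
print(partial_cov(Sigma, [0], [1], [2]))   # -1/3: not partially uncorrelated given X3
```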
Let G = (V, E) be a finite and simple undirected graph (no self-loops, no multiple
edges). For subsets A, B, S of V , let A ⊥G B | S denote that S separates A from B in
G , i.e. that all paths from A to B intersect S. It then holds that the relation ⊥G on
subsets of V is a graphoid. Indeed, this is the reason for choosing this name for such
separation relations.
Geometric orthogonality
L ⊥ M | N ⇐⇒ (L ⊖ N) ⊥ (M ⊖ N),
L ⊥ M | N and L ⊥ N | M,
Variation independence
Let U ⊆ X = ×v∈V Xv and define for S ⊆ V and u∗S ∈ XS the S-section U^{u∗S} of U as
U^{u∗S} = {uV\S : uS = u∗S , u ∈ U }.
Define further the conditional independence relation ‡U as
A ‡U B | S ⇐⇒ ∀u∗S : U^{u∗S} = {U^{u∗S}}A × {U^{u∗S}}B ,
i.e. if and only if the S-sections all have the form of a product space. The relation
‡U satisfies the semigraphoid axioms. Note in particular that A ‡U B | S holds if U
is the support of a probability measure satisfying A ⊥⊥ B | S.
2.2 Markov Properties for Undirected Graphs
[Figure: undirected graph on the vertices 1–7 with cliques {1,2}, {1,3}, {2,4}, {2,5}, {3,5,6}, {4,7}, and {5,6,7}; cf. Example 2.3.]
Fig. 2.1 Undirected graph used to illustrate the different Markov properties
The semigraphoid relation ⊥σ satisfies the pairwise Markov property w.r.t. G if any two non-adjacent vertices are conditionally independent given all remaining variables:
α ≁ β ⇒ α ⊥σ β | V \ {α, β}.
For example, in Fig. 2.1 the pairwise Markov property states that 1 ⊥σ 7 | {2, 3, 4, 5, 6}, since 1 and 7 are not adjacent. If the relation ⊥σ satisfies the pairwise Markov property, we also write that ⊥σ satisfies (P).
The semigraphoid relation ⊥σ satisfies the local Markov property w.r.t. G if every variable is conditionally independent of the remaining variables, given its neighbours:
∀α ∈ V : α ⊥σ V \ cl(α) | bd(α).
For example, if ⊥σ satisfies the local Markov property w.r.t. the graph in Fig. 2.1
it holds that 5 ⊥σ {1, 4} | {2, 3, 6, 7} and 7 ⊥σ {1, 2, 3} | {4, 5, 6}. If the relation ⊥σ
satisfies the local Markov property, we also write that ⊥σ satisfies (L).
The semigraphoid relation ⊥σ satisfies the global Markov property w.r.t. G if any
two sets which are separated by a third are conditionally independent given the
separating set
A ⊥G B | S ⇒ A ⊥σ B | S.
To identify conditional independence relations in the graph of Fig. 2.1 one should
look for separating sets, such as {2, 3}, {4, 5, 6}, or {2, 5, 6}. For example, it follows
that 1 ⊥σ 7 | {4, 5, 6} and 2 ⊥σ 6 | {3, 4, 5}. If the relation ⊥σ satisfies the global
Markov property, we also write that ⊥σ satisfies (G).
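Separation statements like these can be verified mechanically. The sketch below (Python with networkx; the helper separates is mine) encodes the graph of Fig. 2.1 through its cliques and checks the global and local Markov examples above by deleting the separating set and looking for remaining paths.

```python
import networkx as nx

# Edges of the graph in Fig. 2.1, read off from its cliques
# {1,2}, {1,3}, {2,4}, {2,5}, {3,5,6}, {4,7}, {5,6,7} (Example 2.3).
G = nx.Graph([(1, 2), (1, 3), (2, 4), (2, 5), (3, 5), (3, 6),
              (5, 6), (4, 7), (5, 7), (6, 7)])

def separates(G, A, B, S):
    """True iff S separates A from B, i.e. every path from A to B meets S."""
    H = G.copy()
    H.remove_nodes_from(S)
    return not any(nx.has_path(H, a, b) for a in A for b in B
                   if a in H and b in H)

print(separates(G, {1}, {7}, {4, 5, 6}))        # True: 1 separated from 7 by {4,5,6}
print(separates(G, {2}, {6}, {3, 4, 5}))        # True: 2 separated from 6 by {3,4,5}
print(separates(G, {5}, {1, 4}, {2, 3, 6, 7}))  # True: local Markov statement for 5
print(separates(G, {1}, {7}, {5, 6}))           # False: the path 1-2-4-7 avoids {5,6}
```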
If the semigraphoid relation ⊥σ also satisfies the graphoid axiom (S5), the three Markov properties are equivalent; in particular (P) ⇒ (G).
The latter holds in particular for ⊥⊥ , when f (x) > 0, so that for probability distri-
butions with positive densities, all the Markov properties coincide.
Proof. Since this result is so fundamental and the proof illustrates the use of
graphoid axioms very well, we give the full argument here, following Lauritzen
(1996).
α ⊥σ (V \ cl(α)) |V \ {α, β }.
The proof uses reverse induction to establish this for a general undirected graph.
Before we proceed to give this proof, due to Pearl and Paz (1987), it is helpful to
note that the graphoid condition (S5):
A ⊥σ B | (C ∪ D) and A ⊥σ C | (B ∪ D) ⇒ A ⊥σ (B ∪C) | D
exactly expresses that the pairwise Markov property (P) implies the global Markov
property (G) on the graph in Fig. 2.2.
[Figure: undirected graph on the four vertices W, X, Y, Z.]
Fig. 2.2 The graphoid condition (S5) expresses that the pairwise Markov property (P) implies the
global Markov property (G) on this particular graph.
Example 2.1 (Pairwise Markov but not local Markov). Let X = Y = Z with P{X =
1} = P{X = 0} = 1/2. This distribution satisfies (P) but not (L) with respect to the
graph below.
[Figure: the empty graph with three isolated vertices X, Y, Z.]
The pairwise Markov property says that X ⊥⊥ Y | Z and X ⊥⊥ Z | Y, which are both satisfied. However, we have that bd(X) = ∅, so (L) would imply X ⊥⊥ (Y, Z), which is false.
It can be shown that (L) ⇐⇒ (P) if and only if the dual graph Ǧ has no induced subgraph ǦA = (A, ĚA) with |A| = 3 and |ĚA| ∈ {2, 3} (Matúš 1992).
Example 2.2 (Local Markov but not global Markov). Let U and Z be independent
with
P(U = 1) = P(Z = 1) = P(U = 0) = P(Z = 0) = 1/2,
W = U, Y = Z, and X = WY . This satisfies (L) but not (G) w.r.t. the graph below.
[Figure: the chain U — W — X — Y — Z.]
The local Markov property follows because all variables depend deterministically
on their neighbours. But the global Markov property fails; for example it is false
that W ⊥⊥ Y | X.
It can be shown that (G) ⇐⇒ (L) if and only if the dual graph Ǧ does not have the 4-cycle as an induced subgraph (Matúš 1992).
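The failure of (G) in Example 2.2 can also be confirmed by direct computation; below is a small numpy check (my own, following the example's construction) of W ⊥⊥ Y | X on the marginal distribution of (W, X, Y):

```python
import numpy as np

# Joint distribution of (W, X, Y) in Example 2.2: W and Y are independent
# fair coins (copies of U and Z) and X = W * Y.
p = np.zeros((2, 2, 2))            # axes: W, X, Y
for w in (0, 1):
    for y in (0, 1):
        p[w, w * y, y] += 0.25

# Check W indep Y given X, i.e. p(w,x,y) p(x) = p(w,x) p(x,y) for all cells.
px  = p.sum(axis=(0, 2), keepdims=True)
pwx = p.sum(axis=2, keepdims=True)
pxy = p.sum(axis=0, keepdims=True)
print(np.allclose(p * px, pwx * pxy))   # False: the global property fails
print(p[:, 0, :] / p[:, 0, :].sum())    # conditional of (W, Y) given X = 0 is not a product
```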
The distribution of X factorizes w.r.t. G, written (F), if its density f has the form
f(x) = ∏a∈A ψa(x),
where A is a collection of complete subsets of G and each factor ψa depends on x through xa only, i.e. xa = ya ⇒ ψa(x) = ψa(y).
Example 2.3. The cliques of the graph in Fig. 2.1 are the maximal complete subsets {1, 2}, {1, 3}, {2, 4}, {2, 5}, {3, 5, 6}, {4, 7}, and {5, 6, 7}, and a complete set is any subset of these sets, for example {2} or {5, 7}. The graph corresponds to a factorization as
f(x) = ψ12(x) ψ13(x) ψ24(x) ψ25(x) ψ356(x) ψ47(x) ψ567(x).
Consider a distribution with density w.r.t. a product measure and let (G), (L) and
(P) denote Markov properties w.r.t. the semigraphoid relation ⊥⊥ .
Without the positivity restriction (G) and (F) are genuinely different, as illustrated
in the example below, due to Moussouris (1974).
[Figure: the eight configurations of four binary variables on a 4-cycle which carry positive probability.]
Fig. 2.3 The distribution which is uniform on these 8 configurations satisfies (G) w.r.t. the 4-cycle.
Yet it does not factorize with respect to this graph.
Example 2.4 (Global but not factorizing). Consider the uniform distribution on the
8 configurations displayed in Fig. 2.3. Conditioning on opposite corners renders one
corner deterministic and therefore the global Markov property is satisfied.
However, the density does not factorize. To see this we assume the density fac-
torizes. Then e.g.
0 ≠ 1/8 = f(0, 0, 0, 0) = ψ12(0, 0) ψ23(0, 0) ψ34(0, 0) ψ41(0, 0),
so these factors are all positive. Continuing for all possible 8 configurations yields
that all factors ψa (x) are strictly positive, since all four possible configurations are
possible for every clique.
But this contradicts the fact that only 8 out of the 16 possible configurations have positive probability. ⊓⊔
In fact, we shall see later that (F) ⇐⇒ (G) if and only if G is chordal, i.e. does not
have an n-cycle as an induced subgraph with n ≥ 4.
so for all n it holds that 1 ⊥⊥Pn 3 | 2. The critical feature is that Kn does not converge,
hence the densities do not converge.
and this relation is clearly stable under pointwise limits. Hence (G), (L) and (P) are
closed under pointwise limits in the discrete case.
In general, conditional independence is preserved if Pn → P in total variation (A.
Klenke, personal communication, St Flour 2006).
Example 2.5 (Instability of factorization under limits). Even in the discrete case,
(F) is not closed under pointwise limits in general. Consider four binary variables
X1 , X2 , X3 , X4 with joint distribution
[Figure: the 4-cycle with vertices 1, 2, 3, 4.]
It holds that fn(x) = n/(8 + 8n) for each of the eight configurations in Fig. 2.3, whereas fn(x) = 1/(8 + 8n) for the remaining 8 configurations. Thus, when n → ∞
the density fn converges to f (x) = 1/8 for each of the configurations above and
f (x) = 0 otherwise, i.e. to the distribution in Example 2.4 which is globally Markov
but does not factorize.
Markov faithfulness
A distribution P is said to be faithful to the graph G if conditional independence holds precisely when the corresponding separation does:
A ⊥G B | S ⇐⇒ A ⊥⊥P B | S.
2.3 Markov Properties for Directed Acyclic Graphs
A directed acyclic graph D over a finite set V is a simple graph with all edges directed and no directed cycles, in the sense that, following the arrows in the graph, it is impossible to return to a vertex already visited.
Graphical models based on DAGs have proved fundamental and useful in a
wealth of interesting applications, including expert systems, genetics, complex
biomedical statistics, causal analysis, and machine learning, see for example Fig. 1.1
and other examples in Chapter 1.
A semigraphoid relation ⊥σ satisfies the local Markov property (L) w.r.t. a directed acyclic graph D if every variable is conditionally independent of its non-descendants, given its parents.
The local Markov property for the DAG in Fig. 2.4 yields, for example, that
4 ⊥σ {1, 3, 5, 6} | 2, 5 ⊥σ {1, 4} | {2, 3}, and 3 ⊥σ {2, 4} | 1.
Suppose the vertices V of a DAG D are well-ordered in the sense that they are
linearly ordered in a way which is compatible with D, i.e. so that
α ∈ pa(β ) ⇒ α < β .
We then say that the semigraphoid relation ⊥σ satisfies the ordered Markov property (O) w.r.t. a well-ordered DAG D if
α ⊥σ pr(α) \ pa(α) | pa(α) for all α ∈ V.
Here pr(α) denotes the predecessors of α, i.e. the vertices which come before α in the well-ordering.
The numbering in Fig. 2.4 corresponds to a well-ordering. The ordered Markov
property says for example that 4 ⊥σ {1, 3} | 2, 5 ⊥σ {1, 4} | {2, 3}, and 3 ⊥σ {2} | 1.
Separation in DAGs
The global Markov property for directed acyclic graphs is expressed in terms of a
type of separation which is somewhat involved compared to the undirected case.
A trail τ from α to β is a sequence v1, v2, . . . , vn of vertices with α = v1, β = vn, and all consecutive vertices adjacent. A trail τ in D is blocked by a set S if it
contains a vertex γ ∈ τ such that
• either γ ∈ S and edges of τ do not meet head-to-head at γ, or
• γ and all its descendants are not in S, and edges of τ meet head-to-head at γ.
A trail that is not blocked is active. Two subsets A and B of vertices are d-separated
by S if all trails from A to B are blocked by S. We write A ⊥D B | S.
In the DAG of Fig. 2.4 we have, for example, that for S = {5}, the trail
(4, 2, 5, 3, 6) is active, whereas the trails (4, 2, 5, 6) and (4, 7, 6) are blocked. For
S = {3, 5} all these trails are blocked. Hence it holds that 4 ⊥D 6 | 3, 5, but it is not
true that 4 ⊥D 6 | 5 nor that 4 ⊥D 6.
A semigraphoid relation ⊥σ satisfies the global Markov property (G) w.r.t. a di-
rected acyclic graph D if
A ⊥D B | S ⇒ A ⊥σ B | S.
In Fig. 2.4 the global Markov property thus entails that 4 ⊥⊥ 6 | 3, 5 and 2 ⊥⊥ 3 | 1.
In the directed case the relationship between the alternative Markov properties is
much simpler than in the undirected case.
Proposition 2.1. It holds for any directed acyclic graph D and any semigraphoid relation ⊥σ that all directed Markov properties are equivalent:
(G) ⇐⇒ (L) ⇐⇒ (O).
We omit the proof of this fact and refer to Lauritzen et al (1990) for details.
There is also a pairwise property (P), but it is less natural than in the undirected
case and it is weaker than the others, see Lauritzen (1996, page 51).
i.e. it follows from (F) that kv in fact are conditional densities. The graph in Fig. 2.4
thus corresponds to the factorization
Assume that the probability distribution P has a density w.r.t. some product measure
on X . It is then always true that (F) holds if and only if ⊥⊥P satisfies (G), so all
directed Markov properties are equivalent to the factorization property!
Ancestral marginals
The directed Markov properties are closed under marginalization to ancestral sub-
sets, i.e. sets A which contain the parents of all their vertices:
α ∈ A ⇒ pa(α) ⊆ A.
[Figure: illustration of moralization on a seven-vertex DAG: the DAG, the graph with added edges between parents of a common child, and the resulting undirected graph.]
Fig. 2.5 Illustration of the moralization process. Undirected edges are added to parents with a
common child. Directions on edges are subsequently dropped.
Perfect DAGs
The skeleton σ (D) of a DAG is the undirected graph obtained from D by ignoring
directions.
A DAG D is perfect if all parents are married, or, in other words, if σ(D) = D^m.
It follows directly from Proposition 2.3 that the directed and undirected properties
are identical for a perfect DAG D:
Corollary 2.1. P factorizes w.r.t a perfect DAG D if and only if it factorizes w.r.t. its
skeleton σ (D).
Note that a rooted tree with arrows pointing away from the root is a perfect DAG. Thus for such a rooted tree the directed and undirected Markov properties are the same.
In particular this yields the well-known fact that any Markov chain is also a
Markov field.
We shall later see that an undirected graph G can be oriented to form a perfect
DAG if and only if G is chordal.
The criterion of d-separation can be difficult to verify in some cases, although efficient algorithms to settle d-separation queries exist. For example, Geiger et al
(1990) describe an algorithm with worst case complexity O(|E|) for finding all ver-
tices α which satisfy α ⊥D B | S for fixed sets B and S.
Algorithms for settling such queries can also be based on the following alterna-
tive separation criterion given by Lauritzen et al (1990) which is based on Proposi-
tions 2.2 and 2.3. For a query involving three sets A, B, S we perform the following operations (a small code sketch follows the list below):
[Figure: the ancestral subgraph of the DAG, its moral graph, and the resulting undirected graph.]
Fig. 2.6 To settle the query “4 ⊥m 6 | 3, 5?” we first form the subgraph induced by all ancestors of the vertices involved. The moralization adds an undirected edge between 2 and 3 with common child 5 and drops directions. Since {3, 5} separates 4 from 6 in the resulting graph, we conclude that 4 ⊥m 6 | 3, 5.
1. Reduce to subgraph induced by ancestral set DAn(A∪B∪S) of A ∪ B ∪ S;
2. Moralize to form (DAn(A∪B∪S) )m ;
3. Say that S m-separates A from B and write A ⊥m B | S if and only if S separates
A from B in this undirected graph.
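The sketch below implements these three steps in Python/networkx. The DAG used is my reconstruction of Fig. 2.4 from the parent sets and trails quoted in the text (an assumption, since the figure itself is not reproduced here); the two queries reproduce the conclusions of Fig. 2.6.

```python
import networkx as nx

# Hypothetical reconstruction of the DAG in Fig. 2.4: a directed version of
# the graph in Fig. 2.1 (assumed edge directions).
D = nx.DiGraph([(1, 2), (1, 3), (2, 4), (2, 5), (3, 5), (3, 6),
                (5, 6), (4, 7), (5, 7), (6, 7)])

def m_separated(D, A, B, S):
    """Settle A perp_m B | S: restrict to the ancestral set of A, B, S,
    moralize, and test ordinary separation in the undirected result."""
    relevant = set(A) | set(B) | set(S)
    ancestral = set(relevant)
    for v in relevant:
        ancestral |= nx.ancestors(D, v)
    DA = D.subgraph(ancestral)
    # Moralize: marry parents of a common child, then drop directions.
    M = nx.Graph(DA.edges())
    for v in DA.nodes():
        parents = list(DA.predecessors(v))
        M.add_edges_from((p, q) for i, p in enumerate(parents)
                         for q in parents[i + 1:])
    M.remove_nodes_from(S)
    return not any(M.has_node(a) and M.has_node(b) and nx.has_path(M, a, b)
                   for a in A for b in B)

print(m_separated(D, {4}, {6}, {3, 5}))   # True,  as concluded in Fig. 2.6
print(m_separated(D, {4}, {6}, {5}))      # False: 4 and 6 are not m-separated by 5
```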
The procedure is illustrated in Fig. 2.6. It now follows directly from Propositions
2.2 and 2.3 that
Corollary 2.2. If P factorizes w.r.t. D it holds that
A ⊥m B | S ⇒ A ⊥⊥ B | S.
Note however that Richardson (2003) has pointed out that the proof given in Lauritzen et al (1990) and Lauritzen (1996) needs to allow self-intersecting paths to be correct.
It holds for any DAG D that ⊥D (and hence ⊥m ) satisfies graphoid axioms
(Verma and Pearl 1990).
To show this is true, it is sometimes easy to use ⊥m , sometimes ⊥D . For exam-
ple, (S2) is trivial for ⊥D , whereas (S5) is trivial for ⊥m . So, equivalence of ⊥D
and ⊥m can be very useful.
Faithfulness
Markov equivalence
Two DAGS D and D 0 are said to be Markov equivalent if the separation relations
⊥D and ⊥D 0 are identical. Markov equivalence between DAGs is easy to identify,
as shown by Frydenberg (1990a) and Verma and Pearl (1990).
[Figure: four small DAGs; the two on the left are Markov equivalent, the two on the right are not.]
Fig. 2.7 The two DAGs to the left are Markov equivalent whereas those to the right are not. Al-
though those to the right have the same skeleton they do not share the same unmarried parents.
Proposition 2.5. Two directed acyclic graphs D and D 0 are Markov equivalent if
and only if D and D 0 have the same skeleton and the same unmarried parents.
The use of this result is illustrated in Fig. 2.7.
A DAG D is Markov equivalent to an undirected graph G if the separation relations ⊥D and ⊥G are identical.
This happens if and only if D is perfect and G = σ (D). So the graphs below are
all equivalent
[Figure: a collection of Markov equivalent directed and undirected graphs.]
2.4 Summary
We conclude by a summary of the most important definitions and facts given in the
present chapter.
Chapter 3
Graph Decompositions and Algorithms
[Figure: a prime graph on seven vertices.]
Fig. 3.1 An example of a prime graph. This graph has no complete separators.
Every graph can be decomposed into its uniquely defined prime components (Wagner 1937; Tarjan 1985; Diestel 1987, 1990), as illustrated in Fig. 3.3.
[Figure: a graph and the two components obtained by decomposing it along the complete separator {2, 5}.]
Fig. 3.2 Decomposition with A = {1, 3}, B = {4, 6, 7} and S = {2, 5}.
[Figure: recursive decomposition of a seven-vertex graph into its prime components.]
Fig. 3.3 Recursive decomposition of a graph into its unique prime components.
Combinatorial consequences
If in (3.1) we let Xv = {0, 1} and f be uniform, i.e. f (x) = 2−|V | , this yields
Similarly the right and left hand sides of (3.1) must have the same number of factors
as every decomposition yields an extra factor on both sides of the equation and
hence it holds that
|Q| = ∑S∈S ν(S) + 1,
implying in particular that |V| = |E| + 1 when G is a tree.
There are several algorithms for identifying chordal graphs. Here is a greedy algo-
rithm for checking chordality based on the fact that chordal graphs are those that
admit perfect numberings:
Algorithm 3.1. Greedy algorithm for checking chordality of a
graph and identifying a perfect numbering:
1. Look for a vertex v∗ with bd(v∗ ) complete.
If no such vertex exists, the graph is not chordal.
2. Form the subgraph GV \v∗ and let v∗ = |V |;
3. Repeat the process under 1;
4. If the algorithm continues until only one vertex is left,
the graph is chordal and the numbering is perfect.
The worst-case complexity of this algorithm is O(|V|²), as up to |V| − k vertices must be queried to find the vertex to be numbered |V| − k. The algorithm is illustrated in Fig. 3.4 and Fig. 3.5.
[Figure: the greedy algorithm applied to a seven-vertex graph; vertices 7, 6 and 5 are numbered before the algorithm gets stuck.]
Fig. 3.4 The greedy algorithm at work. This graph is not chordal, as there is no candidate for
number 4.
[Figure: four stages of the greedy algorithm on a chordal seven-vertex graph, producing the numbering 1–7.]
Fig. 3.5 The greedy algorithm at work. Initially the algorithm proceeds as in Fig. 3.4. This graph
is chordal and the numbering obtained is a perfect numbering.
This simple algorithm is due to Tarjan and Yannakakis (1984) and has complexity
O(|V | + |E|). It checks chordality of the graph and generates a perfect numbering
if the graph is chordal. In addition, as we shall see in a moment, the cliques of the
chordal graph can be identified as the algorithm runs.
Algorithm 3.2 (Maximum Cardinality Search). Checking chordality of a graph and identifying a perfect numbering:
1. Choose v0 ∈ V arbitrarily and let v0 = 1;
2. When vertices {1, 2, . . . , j} have been numbered, choose as vertex number j + 1 a vertex in V \ {1, 2, . . . , j} with the highest number of numbered neighbours;
3. If bd(j + 1) ∩ {1, 2, . . . , j} is not complete, G is not chordal;
4. Repeat from 2;
5. If the algorithm continues until all vertices are numbered, the graph is chordal and the numbering is perfect.
The algorithm is illustrated in Fig. 3.7 and Fig. 3.6.
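A minimal Python sketch of Maximum Cardinality Search, following Algorithm 3.2 (the two test graphs are hypothetical examples, not the graphs of Figs. 3.4–3.7):

```python
import networkx as nx

def mcs(G):
    """Maximum Cardinality Search: repeatedly number an unnumbered vertex
    with the most numbered neighbours, checking that its numbered boundary
    is complete.  Returns the numbering (list of vertices) and a chordality flag."""
    numbered = []
    weight = {v: 0 for v in G}          # number of already numbered neighbours
    chordal = True
    for _ in range(len(G)):
        v = max((u for u in G if u not in numbered), key=lambda u: weight[u])
        earlier = set(G[v]) & set(numbered)
        if not all(G.has_edge(a, b) for a in earlier for b in earlier if a != b):
            chordal = False             # numbered boundary of v is not complete
        numbered.append(v)
        for u in G[v]:
            weight[u] += 1
    return numbered, chordal

# Hypothetical examples:
chordal_graph = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)])  # two glued triangles
cycle4 = nx.cycle_graph(4)                                          # 4-cycle, not chordal
print(mcs(chordal_graph))   # (..., True)
print(mcs(cycle4))          # (..., False)
```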
[Figure: Maximum Cardinality Search applied step by step to a seven-vertex graph, showing counters and assigned numbers.]
Fig. 3.6 Maximum Cardinality Search at work. When a vertex is numbered, a counter for each of its unnumbered neighbours is increased by one, marked here with the symbol ∗. The counters keep track of the numbered neighbours of any vertex and are used to identify the next vertex to be numbered. This graph is not chordal, as is discovered at the last step because 7 does not have a complete boundary.
[Figure: an MCS numbering of a chordal seven-vertex graph.]
Fig. 3.7 MCS numbering for a chordal graph. The algorithm runs essentially as in the non-chordal
case.
Finding the cliques of a general graph is an NP-complete problem, but the cliques of a chordal graph can be found in a simple fashion from an MCS numbering V = {1, . . . , |V|}. More precisely, we let
Sλ = bd(λ) ∩ {1, . . . , λ − 1}
Example 3.2. For the MCS ordering in Fig. 3.7 we find πλ = (0, 1, 2, 2, 2, 1, 1) yield-
ing the ladder nodes {3, 4, 5, 6, 7} and the corresponding cliques
C = {{1, 2, 3}, {1, 3, 4}, {3, 4, 5}, {2, 6}, {6, 7}}.
Junction tree
for some λ ∗ < λ . A junction tree is now easily constructed by attaching Cλ to any
Cλ ∗ satisfying the above. Although λ ∗ may not be uniquely determined, Sλ is. In-
deed, the sets Sλ are the minimal complete separators and the numbers ν(S) are
ν(S) = |{λ ∈ Λ : Sλ = S}|. Junction trees can be constructed in many other ways as
well (Jensen and Jensen 1994).
An abstract perspective
and distributivity:
(φ ⊗ φC )↓B = φ ↓B ⊗ φC if C ⊆ B. (3.5)
The conditions (3.3), (3.4) and (3.5) are known as the Shenoy–Shafer axioms after
Shenoy and Shafer (1990) who first studied local computation in an abstract per-
spective. The specific algorithms described here only work when the semigroup of
valuations is also separative, i.e. satisfies
φA ⊗ φB = φA ⊗ φA = φB ⊗ φB ⇒ φA = φB ,
which implies that division of valuations can be partially defined (Lauritzen and
Jensen 1997).
Computational challenge
φ = ⊗C∈C φC
p(x) = ∏C∈C φC(x).
The potentials φC(x) depend on xC = (xv, v ∈ C) only. The basic task is to calculate a marginal (likelihood) p(x∗E) for E ⊆ V and fixed x∗E, but the defining sum has too many terms to be computed directly. A second purpose is to calculate the predictive probabilities p(xv | x∗E) = p(xv, x∗E)/p(x∗E) for v ∈ V.
Example 3.4 (Sparse linear equations). Here valuations φC are equation systems in-
volving variables with labels C. The combination operation φA ⊗ φB concatenates
equation systems. The marginal φB↓A eliminates variables in B \ A, resulting in an
equation system involving only variables in A. The marginal φ ↓A of the joint valu-
ation thus reduces the system of equations to a smaller one. A second computation
finds a solution of the equation system.
Computational structure
Basic computation
p(x | XE = x∗E) = ∏C∈C φC(xC) / p(x∗E).
2. Marginals p(xE∗ ) and p(xC | xE∗ ) are then calculated by a local message passing
algorithm, to be described in further detail below.
Assigning potentials
Between any two cliques C and D which are neighbours in the junction tree their
intersection S = C ∩ D is one of the minimal separators appearing in the decomposi-
tion sequence. We now explicitly represent these separators in the junction tree and
also assign potentials to them, initially φS ≡ 1 for all S ∈ S , where S is the set of
separators. We also let
κ(x) = ∏C∈C φC(xC) / ∏S∈S φS(xS),    (3.6)
and now it holds that p(x | xE∗ ) = κ(x)/p(xE∗ ). The expression (3.6) will be invariant
under the message passing.
Marginalization
Messages
When a clique C sends a message to a neighbouring clique D over the separator S = C ∩ D, the potentials are updated as φS ← φC↓S and φD ← φD φC↓S /φS, while φC is left unchanged.
Note that this computation is local, involving only variables within the pair of cliques. The expression in (3.6) is invariant under the message passing since φC φD /φS is:
φC (φD φC↓S /φS) / φC↓S = φC φD /φS.
After the message has been sent, D contains the D-marginal of φC φD /φS. To see this, we calculate
(φC φD /φS)↓D = (φD /φS) φC↓D = (φD /φS) φC↓S,
where we have used distributivity and consonance.
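A tiny numerical illustration of a single message between two cliques C = {1, 2} and D = {2, 3} (numpy; the potential tables are made up): the ratio (3.6) is unchanged and the receiving clique ends up holding the D-marginal.

```python
import numpy as np

# Two cliques C = {1,2}, D = {2,3} with separator S = {2}; binary variables.
phi_C = np.array([[1.0, 2.0],
                  [3.0, 4.0]])          # indexed by (x1, x2)
phi_D = np.array([[0.5, 1.5],
                  [2.0, 1.0]])          # indexed by (x2, x3)
phi_S = np.ones(2)                      # separator potential, initially 1

joint = phi_C[:, :, None] * phi_D[None, :, :] / phi_S[None, :, None]

# Message C -> D over S: marginalize phi_C onto S and rescale phi_D.
phi_S_new = phi_C.sum(axis=0)                      # phi_C marginalized onto S
phi_D_new = phi_D * (phi_S_new / phi_S)[:, None]
phi_S = phi_S_new

# (3.6) is invariant, and D now holds the D-marginal of the joint.
joint_after = phi_C[:, :, None] * phi_D_new[None, :, :] / phi_S[None, :, None]
print(np.allclose(joint, joint_after))             # True
print(np.allclose(joint.sum(axis=0), phi_D_new))   # True
```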
Second message
Before we proceed to discuss the case of a general junction tree, we shall investigate what happens when D returns a message to C.
After the first message, the clique potentials are φC and φD φC↓S /φS, with separator potential φC↓S.
After the second message, the clique potentials are φC φD↓S /φS and φD φC↓S /φS, with separator potential φ↓S.
Now all sets contain the relevant marginal of φ = φC φD /φS, including the separator. This is seen as follows. The separator contains
φ↓S = ((φC φD /φS)↓D)↓S = (φD φC↓S /φS)↓S = φC↓S φD↓S /φS.
To describe the message passing algorithm fully we need to arrange for a schedul-
ing of messages to be delivered. As we have seen above, it never harms to send a
message, since the expression (3.6) is invariant under the operation. However, for
computational efficiency it is desirable to send messages in such a way that redun-
dant messages are avoided. The schedule to be described here is used in HUGIN and
has two phases:
CollInfo:
In this first phase, messages are sent from the leaves towards an arbitrarily chosen root R. It then holds that after CollInfo, the root potential satisfies φR(xR) = p(xR, x∗E).
DistInfo:
In the second phase messages are sent from the root R towards the leaves of the junction tree. After CollInfo and subsequent DistInfo, it holds that
φB(xB) = p(xB, x∗E) for all B ∈ C ∪ S.    (3.7)
Hence p(xE∗ ) = ∑xS φS (xS ) for any S ∈ S and p(xv | xE∗ ) can readily be computed
from any φS with v ∈ S.
Another efficient way of scheduling the messages is via local control. We then allow a clique to send a message to a neighbour if and only if it has already received messages from all its other neighbours. Such messages are live. Using this protocol, there will be one clique which first receives messages from all its neighbours; this clique is effectively the root R in CollInfo and DistInfo. Exactly two live messages along every branch are needed to ensure that (3.7) holds.
Maximization
Another interesting task is to find the configuration with maximum probability, also
known as the MAP. To solve this, we simply replace the standard sum-marginal with
max-marginal:
φB↓A(x) = max{φB(y) : yB with yA = xA}.
This marginalization also satisfies consonance and distributivity, and hence the same message passing schemes as above will apply. After CollInfo and subsequent DistInfo, the potentials satisfy
φB(xB) = max over yV\(B∪E) of p(xB, x∗E, yV\(B∪E)) = p(xB, x∗E, x̂V\(B∪E)) for all B ∈ C ∪ S,
and the most probable configuration can now readily be identified (Cowell et al
1999, page 98). Viterbi’s decoding algorithm for Hidden Markov Models (Viterbi
1967) is effectively a special instance of max-propagation.
It is also possible to find the k most probable configurations by a local computa-
tion algorithm (Nilsson 1998).
Since (3.6) remains invariant, one can switch freely between max- and sum-
propagation without reloading original potentials.
Random propagation
Another variant of the message passing scheme picks a random configuration with
distribution p(x | x∗E). Recall that after CollInfo, the root potential is φR(xR) ∝ p(xR | x∗E). We then modify DistInfo as follows:
1. Pick random configuration x̌R from φR ;
2. Send message to neighbours C as x̌R∩C = x̌S where S = C ∩ R is the separator;
3. Continue by picking x̌C according to φC (xC\S , x̌S ) and send message further
away from root.
When the sampling stops at the leaves of the junction tree, a configuration x̌ has
been generated from p(x | xE∗ ).
There is an abundance of variants of the basic propagation algorithm; see Cowell
et al (1999) for many of these.
3.5 Summary
Graph decompositions
Chordal graphs
A graph is chordal if it has no induced cycles of length greater than three. The
following are equivalent for any undirected graph G .
(i) G is chordal;
(ii) G is decomposable;
(iii) All prime components of G are cliques;
(iv) G admits a perfect numbering;
(v) Every minimal (α, β)-separator is complete.
Trees are chordal graphs and thus decomposable. The prime components are the
branches.
Maximum Cardinality Search (MCS) (Tarjan and Yannakakis 1984) identifies
whether a graph is chordal or not. If a graph G is chordal, MCS yields a perfect
numbering of the vertices. In addition it finds the cliques of G :
Junction tree
Message passing
Initially the junction tree has potentials φB, B ∈ C ∪ S, so that the joint distribution of interest satisfies
p(x | x∗E) ∝ ∏C∈C φC(xC) / ∏S∈S φS(xS).
The expression on the right-hand side is invariant under message passing. A message sent from a clique which has already received messages from all other of its neighbours is live. When exactly two live messages have been sent along every branch of the junction tree it holds that
φB(xB) = p(xB, x∗E) for all B ∈ C ∪ S.
Chapter 4
Specific Graphical Models
4.1 Log-linear Models
A log–linear model is specified by a generating class A of subsets of V and consists of the distributions whose densities factorize as
f(x) = ∏a∈A ψa(x).
log mijk = αi + βj + γk    (4.1)
or
log mijk = αij + βjk    (4.2)
or
log mijk = αij + βjk + γik,    (4.3)
or (with redundancy)
To make the connection between this notation and the one used here, we assume
that we have observations X 1 = x1 , . . . , X n = xn and V = {I, J, K}. We then write
Thus if we let
we have
log mi jk = αi j + β jk .
The main difference is the assumption of positivity needed for the logarithm to be
well defined. This is not necessary when using the multiplicative definition above. It
is typically an advantage to relax the restriction of positivity although it also creates
technical difficulties.
The logarithms of the factors, φa = log ψa, are known as interaction terms of order |a| − 1, or |a|-factor interactions. Interaction terms of 0th order are called main effects. In the following we also refer to the factors themselves as interactions and main effects, rather than to their logarithms.
The dependence graph G(P) of a distribution P is the graph with an edge between α and β unless
α ⊥⊥P β | V \ {α, β}.
X will then satisfy the pairwise Markov property w.r.t. G(P), and G(P) is the smallest graph with this property, i.e. P is pairwise Markov w.r.t. G iff
G(P) ⊆ G .
The dependence graph G(P) for a family P of probability measures is the smallest
graph G so that all P ∈ P are pairwise Markov w.r.t. G :
For any generating class A we construct the dependence graph G(A ) = G(PA )
of the log–linear model PA . This is determined by the relation
α ∼ β ⇐⇒ ∃a ∈ A : α, β ∈ a.
As a generating class defines a dependence graph G(A ), the reverse is also true.
The set C (G ) of cliques of G is a generating class for the log–linear model of
distributions which factorize w.r.t. G .
If the dependence graph completely summarizes the restrictions imposed by A ,
i.e. if A = C (G(A )), we say that A is conformal. The generating classes for the
models given by (4.1) and (4.2) are conformal, whereas this is not the case for (4.3).
Factor graphs
The factor graph of A is the bipartite graph with vertices V ∪ A and edges defined by
α ∼ a ⇐⇒ α ∈ a.
Using this graph, even non-conformal log–linear models admit a simple visual representation, as illustrated in Fig. 4.1, which displays the factor graph of the non-conformal model in Example 4.3 with no second-order interaction.
[Figure: factor graph with variable nodes I, J, K and factor nodes φIJ, φJK, φIK.]
Fig. 4.1 The factor graph of the model in Example 4.3 with no second-order interaction.
If F = F(A ) is the factor graph for A and G = G(A ) the corresponding de-
pendence graph, it is not difficult to see that for A, B, S being subsets of V
A ⊥G B | S ⇐⇒ A ⊥F B | S
and hence conditional independence properties can be read directly off the factor
graph also. In that sense, the factor graph is more informative than the dependence
graph.
Contingency Table
Likelihood function
Assume now p ∈ PA but otherwise unknown. The likelihood function can be ex-
pressed as
L(p) = ∏ν=1…n p(xν) = ∏x∈X p(x)^{n(x)},
but this only affects the likelihood function by a constant factor. The likelihood function is clearly continuous as a function of the unknown probability distribution p (an |X|-dimensional vector). Since the closure P̄A of PA is compact (bounded and closed), L attains its maximum on P̄A (not necessarily on PA itself).
Indeed, it is also true that L has a unique maximum over P̄A, essentially because the likelihood function is log-concave. The proof is indirect: Assume p1, p2 ∈ P̄A with p1 ≠ p2 and
L(p1) = L(p2) = sup{L(p) : p ∈ P̄A}.    (4.5)
Define
p12(x) = c √(p1(x) p2(x)),
where c⁻¹ = ∑x √(p1(x) p2(x)) is a normalizing constant. Then p12 ∈ P̄A because
p12(x) = c √(p1(x) p2(x)) = limn→∞ c ∏a∈A √(ψ¹an(x) ψ²an(x)) = limn→∞ ∏a∈A ψ¹²an(x),
where e.g. ψ¹²an = c^{1/|A|} √(ψ¹an(x) ψ²an(x)). The Cauchy–Schwarz inequality yields
c⁻¹ = ∑x √(p1(x) p2(x)) < √(∑x p1(x)) √(∑x p2(x)) = 1.
Hence
L(p12) = ∏x p12(x)^{n(x)} = ∏x {c √(p1(x) p2(x))}^{n(x)} = c^n ∏x √(p1(x))^{n(x)} ∏x √(p2(x))^{n(x)} = c^n √(L(p1) L(p2)) > √(L(p1) L(p2)) = L(p1) = L(p2),
contradicting (4.5).
Likelihood equations
To show that the equations (4.6) indeed have a solution, we simply describe a convergent algorithm which solves them. This cycles (repeatedly) through all the a-marginals in A and fits them one by one. For a ∈ A define the following scaling operation on p:
(Ta p)(x) ← p(x) n(xa)/(n p(xa)), x ∈ X,
where 0/0 = 0 and b/0 is undefined if b ≠ 0.
Fitting the marginals
The operation Ta fits the a-marginal if p(xa) > 0 when n(xa) > 0:
n (Ta p)(xa) = n ∑y: ya = xa p(y) n(ya)/(n p(ya)) = n {n(xa)/(n p(xa))} ∑y: ya = xa p(y) = n {n(xa)/(n p(xa))} p(xa) = n(xa).
Letting p0 be uniform and cycling repeatedly through the scaling operations Ta, a ∈ A, produces a sequence pn with
limn→∞ pn = p̂,
the maximum likelihood estimate of p ∈ PA.
These data are concerned with student admissions from Berkeley (Bickel et al 1973)
and adapted by Edwards (2000). We consider the model with A ⊥⊥ S, correspond-
ing to A = {{A}, {S}}. We should then fit the A-marginal and the S-marginal. For
illustration we shall do so iteratively. The initial values are uniform:
Initially all entries are equal to 4526/4 = 1131.5, giving the initial values of n p0:

                 Admitted?
Sex          Yes       No        S-marginal
Male         1131.5    1131.5    2691
Female       1131.5    1131.5    1835
A-marginal   1755      2771      4526

Next, we fit the S-marginal:

                 Admitted?
Sex          Yes       No        S-marginal
Male         1345.5    1345.5    2691
Female       917.5     917.5     1835
A-marginal   1755      2771      4526

Then we fit the A-marginal:

                 Admitted?
Sex          Yes       No        S-marginal
Male         1043.46   1647.54   2691
Female       711.54    1123.46   1835
A-marginal   1755      2771      4526
For example,
711.54 = 917.5 × 1755/(917.5 + 1345.5),
and so on. The algorithm has now converged, so there is no need to use more steps.
If we wish, we can normalize to obtain probabilities. Dividing everything by 4526
yields p̂.
                 Admitted?
Sex          Yes      No       S-marginal
Male         0.231    0.364    0.595
Female       0.157    0.248    0.405
A-marginal   0.388    0.612    1
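The iteration is easy to reproduce; here is a short numpy sketch (mine) that starts from the uniform table and cycles through the two marginals, recovering the fitted values 1043.46, 1647.54, 711.54 and 1123.46 above:

```python
import numpy as np

# One-dimensional margins of the admissions table used above:
# Sex margin (Male, Female) and Admission margin (Yes, No); n = 4526.
n_S = np.array([2691.0, 1835.0])
n_A = np.array([1755.0, 2771.0])
n = n_S.sum()

# IPS for the model with A = {{S}, {A}}, i.e. Admission independent of Sex:
# start from the uniform table and cycle through the two marginals.
table = np.full((2, 2), n / 4)          # rows: Sex, columns: Admitted
for _ in range(5):                      # a handful of cycles is ample here
    table *= (n_S / table.sum(axis=1))[:, None]   # fit the S-marginal
    table *= (n_A / table.sum(axis=0))[None, :]   # fit the A-marginal

print(np.round(table, 2))
# [[1043.46 1647.54]
#  [ 711.54 1123.46]]
print(np.round(table / n, 3))           # fitted probabilities p-hat
```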
p(x) ← p(x) n(xa)/(n p(xa)), x ∈ X.    (4.7)
This moves through all possible values of x ∈ X, which in general can be far too many for the update to be feasible.
Jiroušek and Přeučil (1995) realized that the algorithm could be implemented using probability propagation as follows. A chordal graph G with cliques C such that every a ∈ A is a complete subset of G is called a chordal cover of A. The steps of the efficient implementation are:
1. Find a chordal cover G of A;
2. Arrange the cliques C of G in a junction tree;
3. Represent p implicitly as
p(x) = ∏C∈C ψC(x) / ∏S∈S ψS(x);
4. Replace the step (4.7) with
ψC(xC) ← ψC(xC) n(xa)/(n p(xa)), xC ∈ XC,
for a clique C containing a, where the marginals p(xa) are obtained by propagation in the junction tree.
In some cases the IPS algorithm converges after a finite number of cycles. An ex-
plicit formula is then available for the MLE of p ∈ PA .
p̂(x) = ∏C∈C n(xC) / {n ∏S∈S n(xS)^ν(S)},    (4.8)
corresponding to the general identity for a distribution which factorizes w.r.t. a chordal graph,
p(x) = ∏C∈C p(xC) / ∏S∈S p(xS)^ν(S).
For the specific case where G is a tree, (4.8) reduces to
p̂(x) = ∏{α,β}∈E n(xα, xβ) / {n ∏α∈V n(xα)^{deg(α)−1}},
where we have used that the degree of a vertex is exactly equal to the number of times this vertex occurs as an endpoint of an edge.
4.2 Gaussian Graphical Models
The definition (4.10) makes sense if and only if λᵀΣλ ≥ 0, i.e. if Σ is positive semidefinite. If Σ is positive definite, i.e. if λᵀΣλ > 0 for λ ≠ 0, the multivariate distribution has density w.r.t. Lebesgue measure on R^d
f(x | ξ, Σ) = (2π)^{−d/2} (det K)^{1/2} exp{−(x − ξ)ᵀK(x − ξ)/2},    (4.11)
where K = Σ⁻¹ is the concentration matrix of the distribution. We then also say that Σ is regular.
X2 ∼ Ns(ξ2, Σ22)
and
X1 | X2 = x2 ∼ Nr(ξ1|2, Σ1|2),
where
ξ1|2 = ξ1 + Σ12 Σ22⁻ (x2 − ξ2) and Σ1|2 = Σ11 − Σ12 Σ22⁻ Σ21.
Here Σ22⁻ is an arbitrary generalized inverse of Σ22, i.e. any symmetric matrix which satisfies Σ22 Σ22⁻ Σ22 = Σ22.
In the regular case it also holds that
K11⁻¹ = Σ11 − Σ12 Σ22⁻¹ Σ21    (4.12)
and
K11⁻¹ K12 = −Σ12 Σ22⁻¹,    (4.13)
so then
ξ1|2 = ξ1 − K11⁻¹ K12 (x2 − ξ2) and Σ1|2 = K11⁻¹.
In particular, if Σ12 = 0, X1 and X2 are independent.
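Relations (4.12) and (4.13) are easily verified numerically; here is a quick sanity check (mine) with a made-up positive definite covariance matrix, partitioned with r = 1 and s = 2:

```python
import numpy as np

# Hypothetical positive definite covariance, partitioned as 1 + 2 variables.
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
K = np.linalg.inv(Sigma)
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
K11, K12 = K[:1, :1], K[:1, 1:]

# (4.12): K11^{-1} = Sigma_{1|2};  (4.13): K11^{-1} K12 = -Sigma12 Sigma22^{-1}
print(np.allclose(np.linalg.inv(K11), S11 - S12 @ np.linalg.solve(S22, S21)))  # True
print(np.allclose(np.linalg.solve(K11, K12), -S12 @ np.linalg.inv(S22)))       # True
```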
α ≁ β ⇒ kαβ = 0,
i.e. if the concentration matrix has zero entries for non-adjacent vertices.
where
W = ∑ν=1…n xν (xν)ᵀ.
The Wishart distribution is the sampling distribution of the matrix of sums of squares and products. More precisely, a random d × d matrix W has a d-dimensional Wishart distribution with parameter Σ and n degrees of freedom if it has the same distribution as
∑ν=1…n Xν (Xν)ᵀ,
where X1, . . . , Xn are independent and Nd(0, Σ) distributed. We then write
W ∼ Wd(n, Σ).
W1 (n, σ 2 ) = σ 2 χ 2 (n).
Wishart density
fd(w | n, Σ) = c(d, n)⁻¹ (det Σ)^{−n/2} (det w)^{(n−d−1)/2} exp{−tr(Σ⁻¹ w)/2}
w.r.t. Lebesgue measure on the set of positive definite matrices. The Wishart constant c(d, n) is
c(d, n) = 2^{nd/2} (2π)^{d(d−1)/4} ∏i=1…d Γ{(n + 1 − i)/2}.
Hence
α ⊥⊥ β |V \ {α, β } ⇐⇒ kαβ = 0.
Thus the dependence graph G (K) of a regular Gaussian distribution is given by
α ≁ β ⇐⇒ kαβ = 0.
Graphical models
log f(x) = constant − (1/2) ∑α∈V kαα xα² − ∑{α,β}∈E kαβ xα xβ,
hence no interaction terms involve more than pairs. This is different from the dis-
crete case and generally makes things easier.
Likelihood function
where W is the Wishart matrix of sums of squares and products, W ∼ W|V | (n, Σ )
with Σ⁻¹ = K ∈ S+(G). For any matrix A we let A(G) = {a(G)αβ}, where a(G)αβ = aαβ if α = β or α ∼ β, and a(G)αβ = 0 otherwise.
Using this fact for A = W we can identify the family as a (regular and canonical)
exponential family with elements of W (G ) as canonical sufficient statistics and the
maximum likelihood estimate is therefore given as the unique solution to the system
of likelihood equations
with the model restriction Σ −1 ∈ S + (G ). This ‘fits variances and covariances along
nodes and edges in G ’ so we can write the equations as
hence making the equations analogous to the discrete case. From (4.15) it follows that for K̂ we have
This operation is clearly well defined if wcc is positive definite. Exploiting that it
holds in general that
(K⁻¹)cc = Σcc = {Kcc − Kca (Kaa)⁻¹ Kac}⁻¹ = wcc /n,
Chordal graphs
If the graph G is chordal, we say that the graphical model is decomposable. We then
have the familiar factorization of densities
f(x | Σ) = ∏C∈C f(xC | ΣC) / ∏S∈S f(xS | ΣS)^ν(S)    (4.18)
Using the factorization (4.18) we can match the expressions for the trace and deter-
minant to obtain that for a chordal graph G it holds that
and further
det Σ = {det(K)}⁻¹ = ∏C∈C det{(K⁻¹)C} / ∏S∈S [det{(K⁻¹)S}]^ν(S) = ∏C∈C det(ΣC) / ∏S∈S {det(ΣS)}^ν(S).
If we let K = W = I in the first of these equations we obtain the identity
|V| = ∑C∈C |C| − ∑S∈S ν(S)|S|,
For a |d| × |e| matrix A = {aγµ}γ∈d,µ∈e we let [A]^V denote the matrix obtained from A by filling up with zero entries to obtain full dimension |V| × |V|, i.e.
[A]^V_{γµ} = aγµ if γ ∈ d and µ ∈ e, and [A]^V_{γµ} = 0 otherwise.
For a chordal graph it holds that the maximum likelihood estimate exists if and only if n ≥ |C| for all C ∈ C. As in the discrete case, the IPS algorithm then converges in a finite number of steps.
The following simple formula then holds for the maximum likelihood estimate of K:
K̂ = n { ∑C∈C [(wC)⁻¹]^V − ∑S∈S ν(S) [(wS)⁻¹]^V }.    (4.19)
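A direct transcription of (4.19) is sketched below; the three-variable path graph, its clique/separator lists and the simulated data are my own example, used only to check the defining property that Σ̂ = K̂⁻¹ agrees with W/n on the cliques while K̂ has a zero entry for the missing edge.

```python
import numpy as np

def chordal_mle_K(W, n, cliques, separators):
    """MLE of the concentration matrix via (4.19) for a decomposable model:
    K = n * ( sum_C [(W_C)^{-1}]^V - sum_S nu(S) [(W_S)^{-1}]^V ).
    `separators` lists each separator with its multiplicity nu(S)."""
    K = np.zeros_like(W)
    for C in cliques:
        idx = np.ix_(C, C)
        K[idx] += np.linalg.inv(W[idx])
    for S in separators:
        idx = np.ix_(S, S)
        K[idx] -= np.linalg.inv(W[idx])
    return n * K

# Hypothetical example: the path 0-1-2, cliques {0,1} and {1,2}, separator {1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
W = X.T @ X                                  # SSP matrix, n = 10
K_hat = chordal_mle_K(W, 10, cliques=[[0, 1], [1, 2]], separators=[[1]])
Sigma_hat = np.linalg.inv(K_hat)
print(np.allclose(Sigma_hat[np.ix_([0, 1], [0, 1])],
                  W[np.ix_([0, 1], [0, 1])] / 10))   # True: clique marginals fitted
print(np.isclose(K_hat[0, 2], 0.0))                  # True: no edge between 0 and 2
```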
1 = ∑C∈C χC − ∑S∈S ν(S) χS,    (4.21)
4.3 Summary
Log–linear models
f(x) = ∏a∈A ψa(x).
Dependence graph
The dependence graph G (P) for a family of distributions P is the smallest graph
G so that
α ⊥⊥P β |V \ {α, β } for all P ∈ P.
The dependence graph of a log-linear model PA is determined by
α ∼ β ⇐⇒ ∃a ∈ A : α, β ∈ a.
The set C (G ) of cliques of G is a generating class for the log–linear model of dis-
tributions which factorize w.r.t. G . If the dependence graph completely summarizes
the restrictions imposed by A , i.e. if A = C (G (A )), A is conformal.
Likelihood equations
For any generating class A it holds that the maximum likelihood estimate p̂ of p is
the unique element of PA which satisfies the system of equations
n p̂(xa ) = n(xa ), ∀a ∈ A , xa ∈ Xa .
(Ta p)(x) ← p(x) n(xa)/(n p(xa)), x ∈ X,
and define S by
Sp = Tak · · · Ta2 Ta1 p.
Let p0 (x) ← 1/|X |, pn = Spn−1 , n = 1, . . . . It then holds that limn→∞ pn = p̂
where p̂ is the unique maximum likelihood estimate of p ∈ PA .
p̂(x) = ∏C∈C n(xC) / {n ∏S∈S n(xS)^ν(S)},
where W is the Wishart matrix of sums of squares and products, W ∼ W|V | (n, Σ )
with Σ −1 = K ∈ S + (G ), where S + (G ) are the positive definite matrices with
α ≁ β ⇒ kαβ = 0.
The MLE K̂ is the unique element of S+(G) satisfying
Chapter 5
Further Statistical Theory
The formula for the maximum likelihood estimate (4.19) derived in the previous chapter specifies Σ̂ as a random matrix. As we shall see, the sampling distribution of this random Wishart-type matrix partly reflects the Markov properties of the graph G. Before we delve further into this, we shall need some more terminology.
We consider a parametrized statistical model P = {Pθ, θ ∈ Θ} with parametrization θ → Pθ.
so that, for example, Σ̂1235 ⊥⊥ Σ̂24567 | Σ̂25 when {2, 5} separates {1, 3} from {4, 6, 7}.
Example 5.1. This little example is a special case where we can directly demonstrate
the hyper Markov property of the law of the maximum likelihood estimate. Consider
the conditional independence model with graph
[Figure: the chain graph I — J — K.]
Here the MLE based on data X(n) = (X1, . . . , Xn) is
p̂ijk = Nij+ N+jk / (n N+j+)
and
p̂ij+ = Nij+/n,  p̂+jk = N+jk/n,  p̂+j+ = N+j+/n.
Clearly, it holds that p̂ is Markov on G and
{Nij+} ⊥⊥ {N+jk} | {X(n)j}.
we have
{Nij+} ⊥⊥ {X(n)j} | {N+j+}
and hence
{Nij+} ⊥⊥ {N+jk} | {N+j+},
which yields the hyper Markov property of p̂. The law does not satisfy the strong
hyper Markov property as the range of, say, {Ni j+ } is constrained by the value of
{N+ j+ }.
Chordal graphs
For chordal graphs the hyper Markov and ordinary Markov property are less differ-
ent. For example, it is true for chordal graphs that the Markov property is preserved
when (chordal) supergraphs are formed.
Proposition 5.1. If G = (V, E) and G ∗ = (V, E ∗ ) are both chordal graphs and E ⊆
E ∗ , then any hyper Markov law L over G is hyper Markov over G ∗ .
Proof. This result is Theorem 3.10 of Dawid and Lauritzen (1993) but we shall give
a direct argument here. Firstly, as any Markov distribution over G is Markov over
the supergraph G ∗ , we only have to show the second condition for the law to be
hyper Markov.
Lemma 3.2 implies that it is sufficient to consider the case where E and E ∗ differ
by a single edge with endpoints {α, β } then contained in a single clique C∗ of G ∗
according to Lemma 3.1. The clique C∗ is the only complete separator in G ∗ which
is not a complete separator in G . So we have to show that for any hyper Markov law
L on G it holds that
Aα ⊥G (Aᾱ ∪ B ∪ β ) | α ∪C,
Aᾱ ⊥G (Aα ∪ B ∪ α) | β ∪C
Bα ⊥G (Bᾱ ∪ A ∪ β ) | α ∪C
Bᾱ ⊥G (Bα ∪ A ∪ α) | β ∪C
In summary this means that the entire joint distribution θ can be represented as
and also that its constituents satisfy the Markov property w.r.t. the graph in Fig. 5.1.
Using this Markov property in combination with the fact that
θA|C∗ = θAα |α∪C θAᾱ |β ∪C , θB|C∗ = θBα |α∪C θBᾱ | β ∪C , θC∗ = θα|C θβ |C θC ,
Fig. 5.1 The Markov structure of the joint law of the constituents of θ .
A consequence of this result is the following corollary, stating that for chordal
graphs it is not necessary to demand that S is a complete separator to obtain the
relevant conditional independence.
Proposition 5.2. If G is chordal and θ is hyper Markov on G , it holds that
A ⊥G B | S ⇒ A ⊥⊥L B | S.
Proof. Again, this is Theorem 2.8 of Dawid and Lauritzen (1993). It follows by
forming the graph G [S] connecting all pairs of vertices in S and connecting any other
pair α, β if and only if ¬(α ⊥G β | S). Then G [S] is a chordal graph with G [S] ≥ G
so that A ⊥G [S] B | S, and Proposition 5.1 applies.
We have similar notions and results in the directed case. Say that L = L (θ ) is
directed hyper Markov w.r.t. a DAG D if θ is directed Markov on D for all θ ∈ Θ
and
θv∪pa(v) ⊥⊥L θnd(v) | θpa(v) ,
or equivalently θv | pa(v) ⊥⊥L θnd(v) | θpa(v) , or equivalently for a well-ordering
Meta independence
i.e. any joint distribution of XA∪B is identified with a pair of further marginal and conditional distributions. Define for S ⊆ V the S-section Θ^{θ∗S} of Θ as
Θ^{θ∗S} = {θ ∈ Θ : θS = θ∗S}.
In words, A and B are meta independent w.r.t. P given S, if the pair of conditional
distributions (θA | S , θB | S ) vary in a product space when θS is fixed. Equivalently,
fixing the values of θB | S and θS places the same restriction on θA | S as just fixing θS .
The relation ‡P satisfies the semigraphoid axioms as it is a special instance of
variation independence.
A ⊥G ∗ B | S ⇒ A ‡P B | S,
A ⊥G B | S ⇒ A ‡P B | S.
Note that for any triple (A, B, S) and any law L on Θ it holds that
A ⊥⊥L B | S ⇒ A ‡P B | S
Example 5.2. The log–linear model whose generating class A consists of all two-element subsets of {a, b, c, d, e} except {a, e} has a dependence graph with cliques C = {abcd, bcde}, displayed in Fig. 5.2. Since the complete separator bcd is not in A, this model is not meta Markov.
[Figure: dependence graph on {a, b, c, d, e} and factor graph with a factor node for each two-element subset except {a, e}.]
Fig. 5.2 Dependence and factor graph of the generating class A in Example 5.2.
has the same dependence graph G(A′) = G(A), but even though A′ is not conformal, P_{A′} is meta Markov on G(A′).
has a different dependence graph G(A″), see Fig. 5.4. The separator bcd is not in A″, but P_{A″} is meta Markov on G(A″), as both minimal separators bc and cd are in A″.
Fig. 5.3 Dependence and factor graph of the generating class A′ in Example 5.3.
Fig. 5.4 Factor graph of the generating class A″ in Example 5.4. The dependence graph looks identical to the factor graph when edge labels are removed.
If θ is globally Markov w.r.t. the graph G, it is also Markov w.r.t. any supergraph G′ = (V, E′) with E ⊆ E′. The analogous statement is not true for meta Markov models. For example, the Gaussian graphical model for the 4-cycle G with adjacencies 1 ∼ 2 ∼ 3 ∼ 4 ∼ 1 is meta Markov on G, because G has no complete separators.
But the same model is not meta Markov w.r.t. the larger graph G′ with cliques {124, 234}, since for any K ∈ S⁺(G) the missing edge between 2 and 4 forces the partial covariance σ_{24·13} to vanish, so that

σ_{24} = (σ_{12}, σ_{23}) [σ_{11} σ_{13}; σ_{13} σ_{33}]^{−1} (σ_{14}, σ_{34})^⊤.

So fixing the value of σ_{24} restricts the remaining parameters in a complex way.
Under certain conditions, the MLE θ̂ of the unknown distribution θ will follow a hyper Markov law over Θ under P_θ. These are:
(i) Θ is meta Markov w.r.t. G;
(ii) for any prime component Q of G, the MLE θ̂_Q for θ_Q based on X_Q^{(n)} is sufficient for Θ_Q and boundedly complete.
A sufficient condition for (ii) is that ΘQ is a full and regular exponential family in
the sense of Barndorff-Nielsen (1978). In particular, these conditions are satisfied
for any Gaussian graphical model and any meta Markov log-linear model.
In some cases it is of interest to consider a stronger version of the hyper and meta
Markov properties.
A meta Markov model is strongly meta Markov if θA | S ‡P θS for all complete
separators S.
Similarly, a hyper Markov model is strongly hyper Markov if θA | S ⊥⊥L θS for all
complete separators S.
A directed hyper Markov model is strongly directed hyper Markov if θv | pa(v) ⊥⊥L θpa(v)
for all v ∈ V .
Gaussian graphical models and log-linear meta Markov models are strong meta
Markov models.
so the likelihood function is equal to the density of the posterior w.r.t. the prior
modulo a constant.
Example 5.5 (Bernoulli experiments). Data X1 = x1 , . . . , Xn = xn independent and
Bernoulli distributed with parameter θ , i.e.
P(Xi = 1 | θ ) = 1 − P(Xi = 0) = θ .
A family P of priors is said to be closed under sampling if the posterior π∗ obtained from any prior π ∈ P again belongs to P, i.e. π ∈ P ⇒ π∗ ∈ P.
The family of beta laws is closed under Bernoulli sampling. If the family of priors
is parametrised:
P = {Pα , α ∈ A }
we sometimes say that α is a hyperparameter. Then, Bayesian inference can be
made by just updating hyperparameters. The terminology of hyperparameter breaks
down in more complex models, corresponding to large directed graphs, where all
parent variables can be seen as ‘parameters’ for their children. Thus the division
into three levels, with data, parameters, and hyperparameters is not helpful.
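As a small illustration of updating hyperparameters in the Bernoulli example, a Beta(a, b) prior is updated by adding the observed counts of ones and zeros; a minimal sketch, with the parametrisation and names chosen here purely for illustration.

def update_beta(a, b, data):
    # Posterior hyperparameters of a Beta(a, b) prior after Bernoulli observations.
    successes = sum(data)
    failures = len(data) - successes
    return a + successes, b + failures

# Uniform prior Beta(1, 1); ten hypothetical Bernoulli observations.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
a_post, b_post = update_beta(1, 1, data)
print(a_post, b_post)               # 8 4
print(a_post / (a_post + b_post))   # posterior mean of theta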
For a k-dimensional exponential family, the densities have the form

p(x | θ) = b(x) e^{θ^⊤ t(x) − ψ(θ)}.
The hyper Markov property is in wide generality closed under sampling. For L a prior law over Θ and X = x an observation from θ, let L∗ = L(θ | X = x) denote the posterior law over Θ. It then holds that if L is hyper Markov w.r.t. G, so is L∗.
Further, if L is strongly hyper Markov w.r.t. G, so is L∗; thus the strong hyper Markov property is also preserved. In the latter case, the update of L is even local to prime components.
Gaussian graphical models are canonical exponential families. The standard family of conjugate priors has densities proportional to (det K)^{δ/2} e^{−tr(KΦ)}. These laws are termed hyper inverse Wishart laws, as Σ follows an inverse Wishart law for complete graphs.
For chordal graphs, each marginal law LC of ΣC is inverse Wishart.
For any meta Markov model where Θ and ΘQ are full and regular exponential
families for all prime components Q, it follows directly from Barndorff-Nielsen
(1978), page 149, that the standard conjugate prior law is strongly hyper Markov
w.r.t. G .
This is in particular true for the hyper inverse Wishart laws.
The analogous prior distributions for log-linear meta Markov models are likewise termed hyper Dirichlet laws.
They are also strongly hyper Markov and if G is chordal, each induced marginal
law LC is a standard Dirichlet law.
If Θ is meta Markov and ΘQ are full and regular exponential families for all prime
components Q, the standard conjugate prior law is strongly hyper Markov w.r.t. G .
This is in particular true for the hyper inverse Wishart laws and the hyper Dirich-
let laws.
Thus, for the hyper inverse Wishart and hyper Dirichlet laws we have simple local updating based on conjugate priors for Bayesian inference.
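As an illustration of such local updating, the sketch below adds observed clique marginal counts to the Dirichlet parameters of a hyper Dirichlet prior on a chordal graph; a minimal sketch under the strong hyper Markov assumption, with the clique list and data layout chosen only for illustration.

from collections import defaultdict

def update_hyper_dirichlet(cliques, prior_counts, data):
    # Clique-wise conjugate update: alpha_C(x_C) -> alpha_C(x_C) + n_C(x_C).
    posterior = {C: defaultdict(float, prior_counts[C]) for C in cliques}
    for case in data:
        for C in cliques:
            x_C = tuple(case[v] for v in C)
            posterior[C][x_C] += 1.0
    return posterior

# Chordal graph with cliques {A,B} and {B,C}; binary variables, Dirichlet(1/2, ...) priors.
cliques = [("A", "B"), ("B", "C")]
prior = {C: {cfg: 0.5 for cfg in [(0, 0), (0, 1), (1, 0), (1, 1)]} for C in cliques}
data = [{"A": 0, "B": 1, "C": 1}, {"A": 1, "B": 1, "C": 0}]
posterior = update_hyper_dirichlet(cliques, prior, data)
print(dict(posterior[("A", "B")]))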
5.3 Summary
P = {Pθ , θ ∈ Θ }.
For A, B ⊆ V identify
Similarly, a hyper Markov law is strongly hyper Markov if θA | S ⊥⊥L θS for all com-
plete separators S.
A directed hyper Markov law is strongly directed hyper Markov if θ_{v|pa(v)} ⊥⊥_L θ_{pa(v)} for all v ∈ V.
A meta Markov model is strongly meta Markov if θA | S ‡P θS for all complete
separators S.
Gaussian graphical models and log-linear meta Markov models are strong meta
Markov models.
Gaussian graphical models are canonical exponential families. The standard family of conjugate priors has densities proportional to (det K)^{δ/2} e^{−tr(KΦ)}. These laws are termed hyper inverse Wishart laws, as Σ follows an inverse Wishart law for complete graphs. For chordal graphs, each marginal law L_C of Σ_C, C ∈ C, is inverse Wishart.
The standard conjugate prior laws for log-linear meta Markov models are termed hyper Dirichlet laws. If G is chordal, each induced marginal law L_C, C ∈ C, is a standard Dirichlet law.
Chapter 6
Estimation of Structure
Previous chapters have considered the situation where the graph G defining the
model has been known and the inference problems were concerned with an un-
known Pθ with θ ∈ Θ . This chapter discusses inference concerning the graph G ,
specifying only a family Γ of possible graphs.
It is important that any methods used scale well, since we typically need to consider many structures as well as huge collections of high-dimensional data.
What we here choose to term structure estimation is also known under other names such as model selection (mainstream statistics), system identification (engineering), or structural learning (AI and machine learning). Different situations occur depending on the type of assumptions made concerning Γ. Common assumptions include that Γ is the set of undirected graphs over V; the set of chordal graphs over V; the set of forests over V; the set of trees over V; the set of directed acyclic graphs over V; or potentially other types of conditional independence structure.
Example 6.1 (Markov mesh model). Figure 6.1 shows the graph of a so-called Markov mesh model with 36 variables. All variables are binary, and the only variable without parents, in the upper left-hand corner, is uniformly distributed. The remaining variables on the upper and left sides of the 6 × 6 square have a single parent, and the conditional probability of a given state is 3/4 if that state equals the parent's. The remaining nodes have two parents; if these are identical, the child will have that state with probability 3/4, whereas it will otherwise follow the upper parent with probability 2/3.
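The generative mechanism just described can be simulated directly; a minimal sketch, assuming the two parents of an interior node are its neighbours above and to the left, and that the single parent of a border node is the preceding node along that border (the function name and grid indexing are illustrative).

import random

def simulate_markov_mesh(size=6, rng=random):
    # Simulate one case from the Markov mesh model described above.
    x = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == 0 and j == 0:                 # corner: uniform
                x[i][j] = rng.randint(0, 1)
            elif i == 0 or j == 0:                # border: single parent
                parent = x[i][j - 1] if i == 0 else x[i - 1][j]
                x[i][j] = parent if rng.random() < 0.75 else 1 - parent
            else:                                 # interior: two parents
                up, left = x[i - 1][j], x[i][j - 1]
                if up == left:
                    x[i][j] = up if rng.random() < 0.75 else 1 - up
                else:
                    x[i][j] = up if rng.random() < 2 / 3 else left
    return x

cases = [simulate_markov_mesh() for _ in range(5)]
print(cases[0])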
Figure 6.2 shows two different attempts at estimating the structure based on the same 10,000 simulated cases. The two methods are to be described in more detail later, but it is apparent that the estimated structures in both cases have a strong similarity to the true one. In fact, one of the methods reconstructs the Markov mesh model perfectly. Both methods search for a DAG structure which is compatible with the data.
Fig. 6.2 Structure estimates of the Markov mesh model from 10,000 simulated cases. The left-hand side shows the estimate using the crudest algorithm (PC) as implemented in HUGIN; the right-hand side shows the Bayesian estimate using greedy equivalence search (GES) as implemented in WinMine.
Example 6.2 (Tree model). The graph of this example has a particularly simple structure, namely that of a rooted tree. Since a rooted tree with arrows pointing away from the root is a perfect DAG, the associated structure is equivalent to the corresponding undirected tree. The state at the root is uniformly distributed and any other node reproduces the state of its parent node with probability 3/4.
Figure 6.3 shows the structure estimates of the tree based on 10,000 simulated cases, using the same methods as for the Markov mesh model. In both cases, the method has attempted to estimate the structure under the assumption that the structure is a DAG. Note that in this case it is the first method which reconstructs the tree correctly, whereas there are too many links in the second estimate.
Fig. 6.3 Estimates of a tree model with 30 variables based on 10,000 observations. The graph to the left represents the estimate using the PC algorithm and yields a 100% correct reconstruction. The graph to the right represents the Bayesian estimate using GES.
Example 6.3 (Chest clinic). The next example is taken from Lauritzen and Spiegelhalter (1988) and reflects the structure involving risk factors and symptoms for lung disease. The (fictitious) description given by the authors of the associated medical knowledge is as follows:
“Shortness-of-breath (dyspnoea) may be due to tuberculosis, lung cancer or bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of tuberculosis, while smoking is known to be a risk factor for both lung cancer and bronchitis. The results of a single chest X-ray do not discriminate between lung cancer and tuberculosis, as neither does the presence or absence of dyspnoea.”
The actual probabilities involved in this example are given in the original reference
and we abstain from repeating them here.
Figure 6.4 displays the network structure reflecting the knowledge as given above
and three different structure estimates. Note that this problem is obviously more
difficult than the previous examples, in particular because some of the diseases are
rare and larger data sets as well as more refined structure estimators are needed to
even get close to the original structure.
Fig. 6.4 A Bayesian network model for lung disease and estimates of the model based on simulated
cases. The structure generating the data is in the upper left corner. Then, clockwise, estimates
using the same data but different estimation algorithms: the PC algorithm, Bayesian GES, the NPC
algorithm. In the latter case 100,000 cases were used.
Types of approach
• Methods for judging adequacy of structure, such as
– Tests of significance
– Penalised likelihood scores

Iκ(G) = log L̂ − κ dim(G),

with κ = 1 for AIC (Akaike, 1974), or κ = ½ log N for BIC (Schwarz, 1978);
– Bayesian posterior probabilities.
The search strategies are more or less based on heuristics, which all attempt to over-
come the fundamental problem that a crude global search among all potential struc-
tures is not feasible as the number of structures is astronomical.
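As an illustration of scoring and comparing a handful of candidate structures with such a penalised likelihood, here is a minimal sketch; the maximized log-likelihoods, dimensions and sample size are hypothetical, and the fitting step producing them is assumed to happen elsewhere.

import math

def penalised_score(loglik, dim, n, criterion="BIC"):
    # I_kappa(G) = log L_hat - kappa * dim(G), with kappa = 1 (AIC) or (1/2) log n (BIC).
    kappa = 1.0 if criterion == "AIC" else 0.5 * math.log(n)
    return loglik - kappa * dim

# Hypothetical fitted candidates: (name, maximized log-likelihood, number of parameters).
candidates = [("sparse graph", -1520.3, 12), ("dense graph", -1498.7, 30)]
n = 1000
best = max(candidates, key=lambda c: penalised_score(c[1], c[2], n))
print(best[0])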
Bayes factors
BF(G_1 : G_2) = f(x^{(n)} | G_1) / f(x^{(n)} | G_2),

where

f(x^{(n)} | G) = ∫_{Θ_G} f(x^{(n)} | G, θ) L_G(dθ).
If π(G) is a prior probability distribution over a given set of graphs Γ, the posterior distribution is determined as

π∗(G) = π(G | x^{(n)}) ∝ f(x^{(n)} | G) π(G),

or equivalently

π∗(G_1)/π∗(G_2) = BF(G_1 : G_2) π(G_1)/π(G_2).
Bayesian analysis looks for the MAP estimate G ∗ maximizing π ∗ (G ) over Γ , or
attempts to sample from the posterior using e.g. Monte-Carlo methods.
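For a small candidate set Γ the posterior can simply be computed by normalisation; a minimal sketch with hypothetical log marginal likelihoods and a uniform prior.

import math

def graph_posterior(log_marglik, prior):
    # pi*(G) proportional to f(x | G) * pi(G), normalised over the candidate set.
    logs = {g: log_marglik[g] + math.log(prior[g]) for g in log_marglik}
    m = max(logs.values())
    unnorm = {g: math.exp(v - m) for g, v in logs.items()}  # shift to avoid overflow
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}

# Hypothetical log marginal likelihoods for three graphs and a uniform prior.
log_ml = {"G1": -1510.2, "G2": -1508.9, "G3": -1515.4}
prior = {g: 1 / 3 for g in log_ml}
post = graph_posterior(log_ml, prior)
print(max(post, key=post.get), post)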
Estimating trees
This result is easily extended to Gaussian graphical models, just with the weight λ_n(e) of an edge in a tree determined as any strictly increasing function of the empirical cross-entropy along the edge

H_n(e) = −½ log(1 − r_e²),

where r_e² is the squared empirical correlation coefficient along the edge e = {u, v}.
To see this, use the expression (4.20) for the determinant of the MLE which in
the case of a tree reduces to
From (4.16) we know that the maximized likelihood function for a fixed tree is pro-
portional to a power of this determinant and hence is maximized when the logarithm
of the determinant is maximized. But since we then have
with κn = 2 for AIC, κn = log n for BIC and dfe the degrees of freedom for indepen-
dence along e.
Fast algorithms (Kruskal Jr., 1956) compute a maximal weight spanning tree (or forest) from the weights W = (w_{uv}, u, v ∈ V).
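Putting the last two observations together for the Gaussian case: compute edge weights from the empirical correlations and grow a maximum weight spanning tree with Kruskal's algorithm; a minimal sketch in which the pairwise correlations are hypothetical.

import math

def max_weight_spanning_tree(vertices, weights):
    # Kruskal's algorithm with union-find; returns the edges of a maximum weight tree.
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Edge weights w_n(e) = -0.5 * log(1 - r_e^2) from hypothetical empirical correlations.
corr = {("1", "2"): 0.6, ("1", "3"): 0.1, ("1", "4"): 0.5,
        ("2", "3"): 0.55, ("2", "4"): 0.05, ("3", "4"): 0.45}
weights = {e: -0.5 * math.log(1 - r ** 2) for e, r in corr.items()}
print(max_weight_spanning_tree(["1", "2", "3", "4"], weights))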
Chow and Wagner (1978) show a.s. consistency in total variation of P̂: If P
factorises w.r.t. τ, then
For strong hyper Markov prior laws, X^{(n)} is itself marginally Markov, so

f(x^{(n)} | G) = ∏_{Q∈Q} f(x_Q^{(n)} | G) / ∏_{S∈S} f(x_S^{(n)} | G)^{ν_G(S)},   (6.1)

where Q are the prime components and S the minimal complete separators of G. In particular, the normalisation constant h(δ, Φ; G) of the hyper inverse Wishart density factorises in the same way:

h(δ, Φ; G) = ∏_{Q∈Q} h(δ, Φ_Q; G_Q) / ∏_{S∈S} h(δ, Φ_S; S)^{ν_G(S)}.
For chordal graphs all terms in this expression reduce to known Wishart constants,
and we can thus calculate the normalization constant explicitly.
In general, Monte-Carlo simulation or similar methods must be used (Atay-Kayis and Massam, 2005).
The marginal distribution of W (n) is (weak) hyper Markov w.r.t. G . It was termed
the hyper matrix F law by Dawid and Lauritzen (1993).
For a forest φ, (6.1) specialises to

f(x^{(n)} | φ) = ∏_{e∈E(φ)} f(x_e^{(n)}) / ∏_{v∈V} f(x_v^{(n)})^{d_φ(v)−1},

since all minimal complete separators are singletons and ν_φ({v}) = d_φ(v) − 1. Rewriting the right-hand side in terms of ∏_{v∈V} f(x_v^{(n)}) yields

∏_{e∈E(φ)} f(x_e^{(n)}) / ∏_{v∈V} f(x_v^{(n)})^{d_φ(v)−1} = ∏_{v∈V} f(x_v^{(n)}) ∏_{e∈E(φ)} BF(e),
where BF(e) is the Bayes factor for independence along the edge e:

BF(e) = f(x_u^{(n)}, x_v^{(n)}) / ( f(x_u^{(n)}) f(x_v^{(n)}) ).
π∗(φ) ∝ ∏_{e∈E(φ)} BF(e).
In the case where φ is restricted to contain a single tree, the normalization constant
for this distribution can be explicitly obtained via the Matrix Tree Theorem, see e.g.
Bollobás (1998).
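The weighted Matrix Tree Theorem gives this normalization constant as any cofactor of the weighted Laplacian; a minimal sketch, with hypothetical BF(e) values and numpy assumed available.

import numpy as np

def tree_normalising_constant(vertices, bf):
    # Sum over spanning trees of prod_{e in tree} BF(e), via the Matrix Tree Theorem.
    idx = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    L = np.zeros((n, n))
    for (u, v), w in bf.items():
        i, j = idx[u], idx[v]
        L[i, j] -= w
        L[j, i] -= w
        L[i, i] += w
        L[j, j] += w
    # Any cofactor of the weighted Laplacian equals the weighted count of spanning trees.
    return np.linalg.det(L[1:, 1:])

bf = {("a", "b"): 2.0, ("b", "c"): 3.0, ("a", "c"): 1.0}
print(tree_normalising_constant(["a", "b", "c"], bf))   # 2*3 + 2*1 + 3*1 = 11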
Bayesian analysis
MAP estimates of forests can thus be computed with an MWSF algorithm, using w(e) = log BF(e) as weights.
Algorithms exist for generating random spanning trees (Aldous, 1990), so full posterior analysis is in principle possible for trees. These work less well for the weights occurring with typical Bayes factors, as most of these are essentially zero, so methods based on the Matrix Tree Theorem currently seem more useful.
For structures other than trees, only heuristics are available for MAP estimation or for maximizing penalized likelihoods such as AIC or BIC.
• Find a feasible algorithm for (perfect) simulation from a distribution over chordal graphs of the form

p(G) ∝ ∏_{C∈C} w(C) / ∏_{S∈S} w(S)^{ν_G(S)},

where w(A), A ⊆ V, are a prescribed set of positive weights.
• Find a feasible algorithm for obtaining the MAP estimate in the decomposable case. This may not be universally possible, as the problem most likely is NP-complete.
Exploiting independence and taking expectations over θ yields that also marginally,

f(x | D) = ∫_{Θ_D} f(x | θ) L_D(dθ) = ∏_{v∈V} f(x_v | x_{pa(v)}).
If L is strongly directed hyper Markov, it holds that the posterior law L∗ is also strongly directed hyper Markov.
Markov equivalence
In the discrete case, the obvious conjugate prior is, for fixed v, to let

{θ_{v|pa_D(v)}(x_v | x∗_{pa_D(v)}), x_v ∈ X_v}

be Dirichlet distributed and independent for v ∈ V and x∗_{pa_D(v)} ∈ X_{pa_D(v)} (Spiegelhalter and Lauritzen, 1990).
We can derive these Dirichlet distributions from a fixed master Dirichlet distribution D(α), where α = α(x), x ∈ X, by letting

{θ_{v|pa_D(v)}(x_v | x∗_{pa_D(v)})} ∼ D(α(x_v, x∗_{pa_D(v)})).

With priors derived from a common master Dirichlet distribution in this way, Markov equivalent DAGs receive the same marginal likelihood:

D ≡ D′ ⇒ f(x^{(n)} | D) = f(x^{(n)} | D′).
Marginal likelihood Bayes factors derived from these strongly directed hyper Dirichlet priors have a simple form:

f(x^{(n)} | D) = ∏_{v∈V} ∏_{x_{pa_D(v)}} [ Γ(α(x_{pa_D(v)})) / Γ(α(x_{pa_D(v)}) + n(x_{pa_D(v)})) ] ∏_{x_v} [ Γ(α(x_v, x_{pa_D(v)}) + n(x_v, x_{pa_D(v)})) / Γ(α(x_v, x_{pa_D(v)})) ],

where n(·) are the corresponding counts in x^{(n)}.
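A direct implementation of this product of Gamma-function ratios for a single node and its parent configurations, working on the log scale for numerical stability, might look as follows; a minimal sketch, with hypothetical counts and Dirichlet parameters.

from math import lgamma

def family_log_marginal(alpha, counts):
    # Log marginal likelihood term for one node: product over parent configurations
    # of Gamma ratios for a Dirichlet prior with parameters alpha[x_pa][x_v].
    total = 0.0
    for x_pa, a_row in alpha.items():
        n_row = counts.get(x_pa, {})
        a_sum = sum(a_row.values())
        n_sum = sum(n_row.values())
        total += lgamma(a_sum) - lgamma(a_sum + n_sum)
        for x_v, a in a_row.items():
            n = n_row.get(x_v, 0)
            total += lgamma(a + n) - lgamma(a)
    return total

# Binary node with one binary parent; symmetric Dirichlet(1/2, 1/2) per parent state.
alpha = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
counts = {0: {0: 12, 1: 3}, 1: {0: 4, 1: 11}}
print(family_log_marginal(alpha, counts))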
u ≁ v ⇐⇒ ∃ S ⊆ V \ {u, v} : X_u ⊥⊥ X_v | X_S.
Begin with the complete graph, check for S = ∅ and remove edges when independence holds. Then continue for increasing |S|.
The PC algorithm (same reference) exploits that only sets S with S ⊆ bd(u) \ {v} or S ⊆ bd(v) \ {u} need checking, where bd refers to the boundary in the current skeleton.
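The skeleton phase just described can be sketched as follows; the conditional independence test indep(u, v, S) is assumed to be supplied by the user (here a toy oracle), and the orientation phase of Step 2 is omitted.

from itertools import combinations

def estimate_skeleton(vertices, indep, max_order=3):
    # Start from the complete graph; remove u-v when some S with X_u _||_ X_v | X_S is found.
    adj = {v: set(vertices) - {v} for v in vertices}
    for k in range(max_order + 1):
        for u in vertices:
            for v in list(adj[u]):
                if u >= v:                   # test each unordered pair once
                    continue
                candidates = adj[u] - {v}    # PC restriction: S within the current boundary
                if len(candidates) < k:
                    continue
                for S in combinations(sorted(candidates), k):
                    if indep(u, v, set(S)):
                        adj[u].discard(v)
                        adj[v].discard(u)
                        break
    return adj

# Toy oracle encoding a chain a - b - c: a and c are independent given b.
def indep(u, v, S):
    return {u, v} == {"a", "c"} and "b" in S

print(estimate_skeleton(["a", "b", "c"], indep))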
Step 2: Identify directions to be consistent with independence relations found in
Step 1.
N ≤ 2\binom{|V|}{2} ∑_{i=0}^{d} \binom{|V|−1}{i} ≤ |V|^{d+1}/(d−1)!.
6.4 Summary
Types of approach
• Methods for judging adequacy of structure, such as
– Tests of significance
– Penalised likelihood scores

Iκ(G) = log L̂ − κ dim(G),

with κ = 1 for AIC (Akaike, 1974), or κ = ½ log n for BIC (Schwarz, 1978).
– Bayesian posterior probabilities.
BF(G_1 : G_2) = f(x^{(n)} | G_1) / f(x^{(n)} | G_2),

where

f(x^{(n)} | G) = ∫_{Θ_G} f(x^{(n)} | G, θ) L_G(dθ),
or equivalently

π∗(G_1)/π∗(G_2) = BF(G_1 : G_2) π(G_1)/π(G_2).
The BIC is an O(1)-approximation to log BF, obtained by applying Laplace's method to the marginal likelihood integral.
Bayesian analysis looks for the MAP estimate G∗ maximizing π∗(G) over Γ, or attempts to sample from the posterior using e.g. Monte-Carlo methods.
Estimating trees Assume P factorizes w.r.t. an unknown tree T. The MLE τ̂ of T has maximal weight, where the weight of an edge e = {u, v} of τ is

w_n(e) = −½ log(1 − r_e²),

with r_e² the squared empirical correlation coefficient along the edge.
The highest AIC or BIC scoring forest is also available as an MWSF, with modified weights

w_n^{pen}(e) = n w_n(e) − κ_n df_e,

with κ_n = 1 for AIC, κ_n = ½ log n for BIC, and df_e the degrees of freedom for independence along e.
Use a maximal weight spanning tree (or forest) algorithm with the weights W = (w_{uv}, u, v ∈ V).
Hyper inverse Wishart laws Denote the normalisation constant of the hyper inverse Wishart density by

h(δ, Φ; G) = ∫_{S⁺(G)} (det K)^{δ/2} e^{−tr(KΦ)} dK.

Then

f(x^{(n)} | G) = h(δ + n, Φ + W^{(n)}; G) / h(δ, Φ; G),

where

h(δ, Φ; G) = ∏_{Q∈Q} h(δ, Φ_Q; G_Q) / ∏_{S∈S} h(δ, Φ_S; S)^{ν_G(S)}.
For chordal graphs all terms reduce to known Wishart constants.
In general, Monte-Carlo simulation or similar methods must be used (Atay-Kayis and Massam, 2005).
Bayes factors for forests Trees and forests are decomposable graphs, so for a forest φ we get

π∗(φ) ∝ ∏_{e∈E(φ)} f(x_e^{(n)}) / ∏_{v∈V} f(x_v^{(n)})^{d_φ(v)−1} ∝ ∏_{e∈E(φ)} BF(e),
where BF(e) is the Bayes factor for independence along the edge e:

BF(e) = f(x_u^{(n)}, x_v^{(n)}) / ( f(x_u^{(n)}) f(x_v^{(n)}) ).
MAP estimates of forests can thus be computed with an MWSF algorithm, using w(e) = log BF(e) as weights.
When φ is restricted to contain a single tree, the normalization constant can be
explicitly obtained via the Matrix Tree Theorem, see e.g. Bollobás (1998).
Algorithms exist for generating random spanning trees (Aldous, 1990), so full posterior analysis is in principle possible for trees.
For structures other than trees, only heuristics are available for MAP estimation or for maximizing penalized likelihoods such as AIC or BIC.
Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control 19:716–723
Aldous D (1990) A random walk construction of uniform spanning trees and uniform labelled
trees. SIAM Journal on Discrete Mathematics 3(4):450–465
Andersen SK, Olesen KG, Jensen FV, Jensen F (1989) Hugin - a shell for building Bayesian belief
universes for expert systems. In: Sridharan NS (ed) Proceedings of the 11th International Joint
Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, pp 1080–
1085
Asmussen S, Edwards D (1983) Collapsibility and response variables in contingency tables.
Biometrika 70:567–578
Atay-Kayis A, Massam H (2005) A Monte Carlo method for computing the marginal likelihood in
non-decomposable graphical Gaussian models. Biometrika 92:317–335
Bahl L, Cocke J, Jelinek F, Raviv J (1974) Optimal decoding of linear codes for minimizing symbol
error rate. IEEE Transactions on Information Theory 20:284–287
Barndorff-Nielsen OE (1978) Information and Exponential Families in Statistical Theory. John
Wiley and Sons, New York
Baum LE (1972) An equality and associated maximization technique in statistical estimation for
probabilistic functions of Markov processes. Inequalities 3:1–8
Berge C (1973) Graphs and Hypergraphs. North-Holland, Amsterdam, The Netherlands, translated
from French by E. Minieka
Berry A, Bordat JP, Cogis O (2000) Generating all the minimal separators of a graph. International
Journal of Foundations of Computer Science 11:397–403
Bickel PJ, Hammel EA, O’Connell JW (1973) Sex bias in graduate admissions: Data from Berke-
ley. Science 187(4175):398–404
Bollobás B (1998) Modern Graph Theory. Springer-Verlag, New York
Bøttcher SG (2001) Learning Bayesian networks with mixed variables. In: Proceedings of the
Eighth International Workshop in Artificial Intelligence and Statistics, pp 149–156
Bøttcher SG, Dethlefsen C (2003) deal: A package for learning Bayesian networks. Journal of
Statistical Software 8:1–40
Bouchitté V, Todinca I (2001) Treewidth and minimum fill-in: Grouping the minimal separators.
SIAM Journal on Computing 31:212–232
Buhl SL (1993) On the existence of maximum likelihood estimators for graphical Gaussian models.
Scandinavian Journal of Statistics 20:263–270
Cannings C, Thompson EA, Skolnick MH (1976) Recursive derivation of likelihoods on pedigrees
of arbitrary complexity. Advances in Applied Probability 8:622–625
Chickering DM (2002) Optimal structure identification with greedy search. Journal of Machine
Learning Research 3:507–554
Chow CK, Liu CN (1968) Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory 14:462–467
Chow CK, Wagner TJ (1978) Consistency of an estimate of tree-dependent probability distribu-
tions. IEEE Transactions on Information Theory 19:369–371
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks
from data. Machine Learning 9:309–347
Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ (1999) Probabilistic Networks and Expert
Systems. Springer-Verlag, New York
Dawid AP (1979) Conditional independence in statistical theory (with discussion). Journal of the
Royal Statistical Society, Series B 41:1–31
Dawid AP (1980) Conditional independence for statistical operations. The Annals of Statistics
8:598–617
Dawid AP, Lauritzen SL (1993) Hyper Markov laws in the statistical analysis of decomposable
graphical models. The Annals of Statistics 21:1272–1317
Diaconis P, Ylvisaker D (1979) Conjugate priors for exponential families. The Annals of Statistics
7:269–281
Diestel R (1987) Simplicial decompositions of graphs – some uniqueness results. Journal of Com-
binatorial Theory, Series B 42:133–145
Diestel R (1990) Graph Decompositions. Clarendon Press, Oxford, United Kingdom
Dirac GA (1961) On rigid circuit graphs. Abhandlungen Mathematisches Seminar Hamburg
25:71–76
Edwards D (2000) Introduction to Graphical Modelling, 2nd edn. Springer-Verlag, New York
Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Human
Heredity 21:523–542
Frydenberg M (1990a) The chain graph Markov property. Scandinavian Journal of Statistics
17:333–353
Frydenberg M (1990b) Marginalization and collapsibility in graphical interaction models. The An-
nals of Statistics 18:790–805
Geiger D, Heckerman D (1994) Learning Gaussian networks. In: de Mantaras RL, Poole D (eds)
Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kauf-
mann Publishers, San Francisco, CA, pp 235–243
Geiger D, Heckerman D (1997) A characterization of the Dirichlet distribution through global and
local independence. The Annals of Statistics 25:1344–1369
Geiger D, Heckerman D (2002) Parameter priors for directed acyclic graphical models and the
characterization of several probability distributions. The Annals of Statistics 30:1412–1440
Geiger D, Verma TS, Pearl J (1990) Identifying independence in Bayesian networks. Networks
20:507–534
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: The combination
of knowledge and statistical data. Machine Learning 20:197–243
Jensen F (2002) HUGIN API Reference Manual Version 5.4. HUGIN Expert Ltd., Aalborg, Den-
mark
Jensen F, Jensen FV, Dittmer SL (1994) From influence diagrams to junction trees. In: de Mantaras
RL, Poole D (eds) Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence,
Morgan Kaufmann Publishers, San Francisco, CA, pp 367–373
Jensen FV, Jensen F (1994) Optimal junction trees. In: de Mantaras RL, Poole D (eds) Proceedings
of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers,
San Francisco, CA, pp 360–366
Jensen FV, Lauritzen SL, Olesen KG (1990) Bayesian updating in causal probabilistic networks
by local computation. Computational Statistics Quarterly 4:269–282
Jiroušek R, Přeučil R (1995) On the effective implementation of the iterative proportional fitting
procedure. Computational Statistics and Data Analysis 19:177–189
Kalman RE, Bucy R (1961) New results in linear filtering and prediction. Journal of Basic Engi-
neering 83 D:95–108
Kong A (1986) Multivariate belief functions and graphical models. Ph.D. Thesis, Department of
Statistics, Harvard University, Massachusetts
Kruskal Jr JB (1956) On the shortest spanning subtree of a graph and the travelling salesman
problem. Proceedings of the American Mathematical Society 7:48–50
Lauritzen SL (1996) Graphical Models. Clarendon Press, Oxford, United Kingdom
Lauritzen SL, Jensen FV (1997) Local computation with valuations from a commutative semi-
group. Annals of Mathematics and Artificial Intelligence 21:51–69
Lauritzen SL, Nilsson D (2001) Representing and solving decision problems with limited informa-
tion. Management Science 47:1238–1251
Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical struc-
tures and their application to expert systems (with discussion). Journal of the Royal Statistical
Society, Series B 50:157–224
Lauritzen SL, Speed TP, Vijayan K (1984) Decomposable graphs and hypergraphs. Journal of the
Australian Mathematical Society, Series A 36:12–29
Lauritzen SL, Dawid AP, Larsen BN, Leimer HG (1990) Independence properties of directed
Markov fields. Networks 20:491–505
Leimer HG (1993) Optimal decomposition by clique separators. Discrete Mathematics 113:99–123
Matúš F (1992) On equivalence of Markov properties over undirected graphs. Journal of Applied
Probability 29:745–749
Meek C (1995) Strong completeness and faithfulness in Bayesian networks. In: Besnard P, Hanks
S (eds) Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Morgan
Kaufmann Publishers, San Francisco, CA, pp 411–418
Moussouris J (1974) Gibbs and Markov random systems with constraints. Journal of Statistical
Physics 10:11–33
Nilsson D (1998) An efficient algorithm for finding the M most probable configurations in a prob-
abilistic expert system. Statistics and Computing 8:159–173
Parter S (1961) The use of linear graphs in Gauss elimination. SIAM Review 3:119–130
Pearl J (1986) Fusion, propagation and structuring in belief networks. Artificial Intelligence
29:241–288
Pearl J (1988) Probabilistic Inference in Intelligent Systems. Morgan Kaufmann Publishers, San
Mateo, CA
Pearl J, Paz A (1987) Graphoids: A graph based logic for reasoning about relevancy relations.
In: Boulay BD, Hogg D, Steel L (eds) Advances in Artificial Intelligence – II, North-Holland,
Amsterdam, The Netherlands, pp 357–363
Richardson TS (2003) Markov properties for acyclic directed mixed graphs. Scandinavian Journal
of Statistics 30:145–158
Robinson RW (1977) Counting unlabelled acyclic digraphs. In: Little CHC (ed) Lecture Notes in
Mathematics: Combinatorial Mathematics V, vol 622, Springer-Verlag, New York
Rose DJ, Tarjan RE, Lueker GS (1976) Algorithmic aspects of vertex elimination on graphs. SIAM
Journal on Computing 5:266–283
Schwarz G (1978) Estimating the dimension of a model. The Annals of Statistics 6:461–464
Shenoy PP, Shafer G (1986) Propagating belief functions using local propagation. IEEE Expert
1:43–52
Shenoy PP, Shafer G (1990) Axioms for probability and belief–function propagation. In: Shachter
RD, Levitt TS, Kanal LN, Lemmer JF (eds) Uncertainty in Artificial Intelligence 4, North-
Holland, Amsterdam, The Netherlands, pp 169–198
Shoiket K, Geiger D (1997) A practical algorithm for finding optimal triangulations. In: Proceed-
ings of the Fourteenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park,
California, pp 185–190
Spiegelhalter DJ, Lauritzen SL (1990) Sequential updating of conditional probabilities on directed
graphical structures. Networks 20:579–605
Spirtes P, Glymour C, Scheines R (1993) Causation, Prediction and Search. Springer-Verlag, New
York, reprinted by MIT Press