
Steffen L. Lauritzen

Elements of Graphical Models


Lectures from the XXXVIth International
Probability Summer School in Saint-Flour,
France, 2006

September 4, 2011

Springer
Your dedication goes here
Preface

Here come the golden words

place(s), First name Surname


month year First name Surname

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Markov Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 General conditional independence . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Markov Properties for Undirected Graphs . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Markov Properties for Directed Acyclic Graphs . . . . . . . . . . . . . . . . . 15
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Graph Decompositions and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


3.1 Graph Decompositions and Markov Properties . . . . . . . . . . . . . . . . . . 25
3.2 Chordal Graphs and Junction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Probability Propagation and Junction Tree Algorithms . . . . . . . . . . . . 32
3.4 Local Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Specific Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


4.1 Log-linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.1 Interactions and factorization . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.2 Dependence graphs and factor graphs . . . . . . . . . . . . . . . . . . . 42
4.1.3 Data and likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Gaussian Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 The multivariate Gaussian distribution . . . . . . . . . . . . . . . . . . . 50
4.2.2 The Wishart distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.3 Gaussian graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5 Further Statistical Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


5.1 Hyper Markov Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Meta Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


5.2.1 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71


5.2.2 Hyper inverse Wishart and Dirichlet laws . . . . . . . . . . . . . . . . 72
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6 Estimation of Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1 Estimation of Structure and Bayes Factors . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Estimating Trees and Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Learning Bayesian networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.1 Model search methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3.2 Constraint-based search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 1
Introduction

Conditional independence

The notion of conditional independence is fundamental for graphical models.


For three random variables X, Y and Z we denote this as X ⊥⊥ Y | Z and graphically as

X --- Z --- Y
If the random variables have density w.r.t. a product measure µ, the conditional
independence is reflected in the relation

f (x, y, z) f (z) = f (x, z) f (y, z),

where f is a generic symbol for the densities involved.

Graphical models
[Figure: an undirected graph on the vertices 1–7; the same graph appears as Fig. 2.1.]
For several variables, complex systems of conditional independence can be de-
scribed by undirected graphs.
Then a set of variables A is conditionally independent of set B, given the values
of a set of variables C if C separates A from B.


[Figure: a directed graphical model with nodes "Visit to Asia?", "Has tuberculosis", "Smoker?", "Has lung cancer", "Has bronchitis", "Tuberculosis or cancer", "Positive X-ray?" and "Dyspnoea?".]

Fig. 1.1 An example of a directed graphical model describing the relationship between risk fac-
tors, lung diseases and symptoms. This model was used by Lauritzen and Spiegelhalter (1988) to
illustrate important concepts in probabilistic expert systems.

A pedigree

Graphical model for a pedigree from study of Werner’s syndrome. Each node is
itself a graphical model.

A highly complex pedigree



Family relationship of 1641 members of Greenland Eskimo population.


Chapter 2
Markov Properties

2.1 Conditional Independence

2.1.1 Basic properties

Graphical models are based on the notion of conditional independence:


Definition 2.1 (Conditional Independence). Random variables X and Y are condi-
tionally independent given the random variable Z if the conditional distribution of
X given Y and Z is the same as the conditional distribution of X given Z alone, i.e.

L (X |Y, Z) = L (X | Z). (2.1)

We then write X ⊥⊥ Y | Z or X ⊥⊥P Y | Z if we want to emphasize that this depends


on a specific probability measure P. Alternatively, conditional independence can be
formulated in terms of σ -algebras:
Definition 2.2 (Conditional independence of σ -algebras). The σ -algebras A and
B are conditionally independent given the σ -algebra C if the conditional expecta-
tion satisfies

E(1A∩B | C ) = E(1A | C )E(1B | C ) for all A ∈ A , B ∈ B. (2.2)

We then write A ⊥⊥ B | C and we have X ⊥⊥ Y | Z ⇐⇒ σ (X) ⊥⊥ σ (Y ) | σ (Z), i.e.


if the corresponding σ -algebras are independent.
It is not difficult to show that if the joint distribution of (X,Y, Z) has density
with respect to a product measure, conditional independence is equivalent to the
factorizations

X ⊥⊥ Y | Z ⇐⇒ f (x, y, z) f (z) = f (x, z) f (y, z) (2.3)


⇐⇒ ∃a, b : f (x, y, z) = a(x, z)b(y, z). (2.4)
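As an illustration (added here, not part of the notes), criterion (2.3) can be checked mechanically for a finite discrete distribution. The sketch below uses an arbitrary example in which X and Y are independent fair coins and Z is their exclusive or, so that X ⊥⊥ Y holds but X ⊥⊥ Y | Z fails.

from itertools import product

def cond_indep(f, xs, ys, zs, tol=1e-12):
    """Check X ⊥⊥ Y | Z via (2.3): f(x,y,z) f(z) = f(x,z) f(y,z) for all x, y, z."""
    fz  = {z: sum(f[x, y, z] for x in xs for y in ys) for z in zs}
    fxz = {(x, z): sum(f[x, y, z] for y in ys) for x in xs for z in zs}
    fyz = {(y, z): sum(f[x, y, z] for x in xs) for y in ys for z in zs}
    return all(abs(f[x, y, z] * fz[z] - fxz[x, z] * fyz[y, z]) < tol
               for x, y, z in product(xs, ys, zs))

vals = (0, 1)
f = {(x, y, z): (0.25 if z == x ^ y else 0.0)
     for x in vals for y in vals for z in vals}
print(cond_indep(f, vals, vals, vals))   # False: X and Y are dependent given Z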

Similarly, one can show that for random variables X, Y , Z, and W it holds


(C1) if X ⊥⊥ Y | Z then Y ⊥⊥ X | Z;
(C2) if X ⊥⊥ Y | Z and U = g(Y ), then X ⊥⊥ U | Z;
(C3) if X ⊥⊥ Y | Z and U = g(Y ), then X ⊥⊥ Y | (Z,U);
(C4) if X ⊥⊥ Y | Z and X ⊥⊥ W | (Y, Z), then X ⊥⊥ (Y,W ) | Z;
If the joint distribution of the random variables has a density w.r.t. a product measure which is strictly positive, it further holds that
(C5) if X ⊥⊥ Y | (Z,W ) and X ⊥⊥ Z | (Y,W ) then X ⊥⊥ (Y, Z) |W .
Without additional conditions on the joint distribution, (C5) does not hold, but pos-
itivity of the density is not necessary for (C5). For example, in the case where W is
constant it is enough that f (y, z) > 0 for all (y, z) or f (x, z) > 0 for all (x, z). In the
discrete and finite case it is sufficient that the bipartite graphs G+ = (Y ∪ Z , E+ )
defined by
y ∼+ z ⇐⇒ f (y, z) > 0,
are all connected, or alternatively if the same condition is satisfied with X replacing
Y.
Conditional independence can be seen as encoding irrelevance in a fundamental
way. If we give A ⊥⊥ B |C the interpretation: Knowing C, A is irrelevant for learning
B, the properties (C1)–(C4) translate to:
(I1) If, knowing C, learning A is irrelevant for learning B, then B is irrelevant for
learning A;
(I2) If, knowing C, learning A is irrelevant for learning B, then A is irrelevant for
learning any part D of B;
(I3) If, knowing C, learning A is irrelevant for learning B, it remains irrelevant
having learnt any part D of B;
(I4) If, knowing C, learning A is irrelevant for learning B and, having also learnt
A, D remains irrelevant for learning B, then both of A and D are irrelevant for
learning B.
The property (C5) does not have immediate intuitive appeal for general irrelevance.
Also the symmetry (C1) is a special property of probabilistic conditional indepen-
dence, rather than of general irrelevance, so (I1) does not have the same immediate
appeal as the others.

2.1.2 General conditional independence

The general interpretation of conditional independence suggests the usefulness of an abstract study of algebraic structures satisfying these properties. So consider the set of subsets of a finite set V and a ternary relation ⊥σ among those subsets.
Definition 2.3 (Graphoid). The relation ⊥σ is said to be a graphoid if for all dis-
joint subsets A, B, C, and D of V :

(S1) if A ⊥σ B |C then B ⊥σ A |C;


(S2) if A ⊥σ B |C and D ⊆ B, then A ⊥σ D |C;
(S3) if A ⊥σ B |C and D ⊆ B, then A ⊥σ B | (C ∪ D);
(S4) if A ⊥σ B |C and A ⊥σ D | (B ∪C), then A ⊥σ (B ∪ D) |C;
(S5) if A ⊥σ B | (C ∪ D) and A ⊥σ C | (B ∪ D) then A ⊥σ (B ∪C) | D.
The relation is called a semigraphoid if only (S1)–(S4) hold.
The properties (S1)–(S4) are known as the semigraphoid axioms and similarly
(S1)–(S5) as the graphoid axioms. They originate in a slightly different form with
Dawid (1979, 1980). It was conjectured by Pearl (1988) that they could be used as
complete axioms for probabilistic conditional independence but this has been shown
to be false; in fact, there is no finite axiom system which is complete for conditional
independence (Studený 1992).
It is possible to consider more general (semi)graphoid relations, defined on other
lattices than the lattice of subsets of a set, for example on the lattice of sub-σ -
algebras of a σ -algebra. A completely general discussion of conditional indepen-
dence structures can be based on the notion of imsets (Studený 1993).
In the following we shall give several examples of graphoids and semigraphoids
and more examples will appear later in the notes.
Let V be a finite set and X = (Xv , v ∈ V ) random variables taking values in Xv .
For A ⊆ V we let XA = (Xv , v ∈ A) and similarly xA = (xv , v ∈ A) ∈ XA = ×v∈A Xv .
If we abbreviate as
A ⊥⊥ B | S ⇐⇒ XA ⊥⊥ XB | XS ,
the basic properties of conditional independence imply that the relation ⊥⊥ on sub-
sets of V is a semigraphoid and if f (x) > 0 for all x, the relation ⊥⊥ is also a
graphoid. This is a probabilistic semigraphoid. Not all (semi)graphoids ⊥σ are
probabilistically representable in the sense that there is a joint distribution so that

A ⊥σ B | S ⇐⇒ XA ⊥⊥ XB | XS ;

see Studený (1993) for further discussion of this point.

Second order conditional independence

Sets of random variables A and B are partially uncorrelated for fixed C if their
residuals after linear regression on XC are uncorrelated:

Cov{XA − E∗ (XA | XC ), XB − E∗ (XB | XC )} = 0,

in other words, if the partial correlations ρAB·C are equal to zero. If this holds we
write A ⊥2 B |C. The relation ⊥2 satisfies the semigraphoid axioms (S1) -(S4), and
the graphoid axioms if there is no non-trivial linear relation between the variables
in V .
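A small numerical sketch of this definition (added here, not part of the notes): regress the variables in A and B on XC by least squares and check that the residuals are uncorrelated. The simulated Gaussian variables below are purely illustrative.

import numpy as np

def residuals(Y, XC):
    """Residuals of each column of Y after least-squares regression on XC (with intercept)."""
    XC1 = np.column_stack([np.ones(len(XC)), XC])
    beta, *_ = np.linalg.lstsq(XC1, Y, rcond=None)
    return Y - XC1 @ beta

rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)                 # the conditioning variable, C = {Z}
X = Z + rng.normal(size=n)             # depends on Z only
Y = 2 * Z + rng.normal(size=n)         # depends on Z only
rX = residuals(X[:, None], Z[:, None]).ravel()
rY = residuals(Y[:, None], Z[:, None]).ravel()
print(round(np.corrcoef(rX, rY)[0, 1], 3))   # approximately 0, i.e. X ⊥2 Y | Z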

Separation in undirected graphs

Let G = (V, E) be a finite and simple undirected graph (no self-loops, no multiple
edges). For subsets A, B, S of V , let A ⊥G B | S denote that S separates A from B in
G , i.e. that all paths from A to B intersect S. It then holds that the relation ⊥G on
subsets of V is a graphoid. Indeed, this is the reason for choosing this name for such
separation relations.
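The relation ⊥G is easy to compute: delete S and look for a path from A to B. The following sketch (added here for illustration, not part of the notes) does this by breadth-first search; the edge list is the graph of Fig. 2.1, read off from the cliques listed in Example 2.3.

from collections import deque

def separates(adj, A, B, S):
    """adj: vertex -> set of neighbours. True iff S separates A from B in the graph."""
    A, B, S = set(A) - set(S), set(B) - set(S), set(S)
    seen, queue = set(A), deque(A)
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in adj[v]:
            if w not in S and w not in seen:
                seen.add(w)
                queue.append(w)
    return True

edges = [(1,2),(1,3),(2,4),(2,5),(3,5),(3,6),(5,6),(4,7),(5,7),(6,7)]
adj = {v: set() for v in range(1, 8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)
print(separates(adj, {1}, {7}, {4, 5, 6}))   # True:  1 ⊥G 7 | {4,5,6}
print(separates(adj, {2}, {6}, {3, 4, 5}))   # True:  2 ⊥G 6 | {3,4,5}
print(separates(adj, {1}, {7}, {2, 5}))      # False: 1 and 7 stay connected via 3 and 6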

Geometric orthogonality

As another fundamental example, consider geometric orthogonality in Euclidean


vector spaces or Hilbert spaces. Let L, M, and N be linear subspaces of a Hilbert
space H and define

L ⊥ M | N ⇐⇒ (L ⊖ N) ⊥ (M ⊖ N),

where L ⊖ N = L ∩ N⊥. If this condition is satisfied, L and M are said to meet orthogonally in N. This relation has properties
(O1) If L ⊥ M | N then M ⊥ L | N;
(O2) If L ⊥ M | N and U is a linear subspace of L, then U ⊥ M | N;
(O3) If L ⊥ M | N and U is a linear subspace of M, then L ⊥ M | (N +U);
(O4) If L ⊥ M | N and L ⊥ R | (M + N), then L ⊥ (M + R) | N.
The analogue of (S5) does not hold in general; for example if M = N we may have

L ⊥ M | N and L ⊥ N | M,

but in general it is false that L ⊥ (M + N). Thus ⊥ is a semigraphoid relation on the


lattice of closed subspaces of a Hilbert space.

Variation independence

Let U ⊆ X = ×v∈V Xv and define, for S ⊆ V and u∗S ∈ XS, the S-section U^{u∗S} of U as

U^{u∗S} = {uV\S : u ∈ U, uS = u∗S}.

Define further the conditional independence relation ‡U as

A ‡U B | S ⇐⇒ ∀u∗S : U^{u∗S} = {U^{u∗S}}A × {U^{u∗S}}B,

i.e. if and only if the S-sections all have the form of a product space. The relation
‡U satisfies the semigraphoid axioms. Note in particular that A ‡U B | S holds if U
is the support of a probability measure satisfying A ⊥⊥ B | S.

2.2 Markov Properties for Undirected Graphs

Graphs can be used to generate conditional independence structures in the form


of Markov properties, typically described through the separation properties of the
graph. Here we consider a simple undirected graph G = (V, E) and a conditional
independence relation ⊥σ on the subsets of V which we assume satisfies the semi-
graphoid axioms. The Markov properties associated with an undirected graph G are
known as pairwise, local and global, to be detailed in the following.

[Figure: undirected graph on vertices 1–7 with edges 1–2, 1–3, 2–4, 2–5, 3–5, 3–6, 4–7, 5–6, 5–7, 6–7 (cf. the cliques listed in Example 2.3).]
Fig. 2.1 Undirected graph used to illustrate the different Markov properties

The pairwise Markov property

The semigraphoid relation ⊥σ satisfies the pairwise Markov property w.r.t. G if


non-adjacent vertices are conditionally independent given the remaining, i.e.

α ≁ β ⇒ α ⊥σ β | V \ {α, β }.

For example, in Fig. 2.1 the pairwise Markov property states that

1 ⊥σ 5 | {2, 3, 4, 6, 7} and 4 ⊥σ 6 | {1, 2, 3, 5, 7}.

If the relation ⊥σ satisfies the pairwise Markov property, we also write that ⊥σ
satisfies (P).

The local Markov property

The semigraphoid relation ⊥σ satisfies the local Markov property w.r.t. G if every
variable is conditionally independent of the remaining, given its neighbours

∀α ∈ V : α ⊥σ V \ cl(α) | bd(α).

For example, if ⊥σ satisfies the local Markov property w.r.t. the graph in Fig. 2.1
it holds that 5 ⊥σ {1, 4} | {2, 3, 6, 7} and 7 ⊥σ {1, 2, 3} | {4, 5, 6}. If the relation ⊥σ
satisfies the local Markov property, we also write that ⊥σ satisfies (L).

The global Markov property

The semigraphoid relation ⊥σ satisfies the global Markov property w.r.t. G if any
two sets which are separated by a third are conditionally independent given the
separating set
A ⊥G B | S ⇒ A ⊥σ B | S.
To identify conditional independence relations in the graph of Fig. 2.1 one should
look for separating sets, such as {2, 3}, {4, 5, 6}, or {2, 5, 6}. For example, it follows
that 1 ⊥σ 7 | {4, 5, 6} and 2 ⊥σ 6 | {3, 4, 5}. If the relation ⊥σ satisfies the global
Markov property, we also write that ⊥σ satisfies (G).

Structural relations among Markov properties

The various Markov properties are related, but different in general:


Theorem 2.1. For any semigraphoid relation ⊥σ it holds that

(G) ⇒ (L) ⇒ (P).

If ⊥σ satisfies graphoid axioms it further holds that

(P) ⇒ (G)

so that in the graphoid case

(G) ⇐⇒ (L) ⇐⇒ (P).

The latter holds in particular for ⊥⊥ , when f (x) > 0, so that for probability distri-
butions with positive densities, all the Markov properties coincide.
Proof. Since this result is so fundamental and the proof illustrates the use of
graphoid axioms very well, we give the full argument here, following Lauritzen
(1996).

(G) implies (L):

This holds because bd(α) separates α from V \ cl(α).

(L) implies (P):

Assume (L). Then β ∈ V \ cl(α) because α ≁ β. Thus

bd(α) ∪ ((V \ cl(α)) \ {β}) = V \ {α, β},

and hence by (L) and (S3) we get that

α ⊥σ (V \ cl(α)) |V \ {α, β }.

(S2) then gives α ⊥σ β |V \ {α, β } which is (P).

(P) implies (G) for graphoids:

The proof uses reverse induction to establish this for a general undirected graph.
Before we proceed to give this proof, due to Pearl and Paz (1987), it is helpful to
note that the graphoid condition (S5):

A ⊥σ B | (C ∪ D) and A ⊥σ C | (B ∪ D) ⇒ A ⊥σ (B ∪C) | D

exactly expresses that the pairwise Markov property (P) implies the global Markov
property (G) on the graph in Fig. 2.2.

[Figure 2.2: an undirected graph on the four vertices W, X, Y, Z.]

Fig. 2.2 The graphoid condition (S5) expresses that the pairwise Markov property (P) implies the
global Markov property (G) on this particular graph.

Assume (P) and A ⊥G B | S. We must show A ⊥σ B | S. Without loss of generality


we assume that A and B are non-empty. The proof is reverse induction on n = |S|.
If n = |V | − 2 then A and B are singletons and (P) yields A ⊥σ B | S directly.
Assume next that |S| = n < |V | − 2 and the conclusion has been established for
|S| > n. Consider first the case V = A ∪ B ∪ S. Then either A or B has at least two
elements, say A. If α ∈ A then B ⊥G (A \ {α}) | (S ∪ {α}) and also α ⊥G B | (S ∪ A \
{α}) (as ⊥G is a semi-graphoid). Thus by the induction hypothesis

(A \ {α}) ⊥σ B | (S ∪ {α}) and {α} ⊥σ B | (S ∪ A \ {α}).

Now (S5) gives A ⊥σ B | S.


For A ∪ B ∪ S ⊂ V we choose α ∈ V \ (A ∪ B ∪ S). Then A ⊥G B | (S ∪ {α}) and
hence the induction hypothesis yields A ⊥σ B | (S ∪ {α}). Further, either A ∪ S sep-
arates B from {α} or B ∪ S separates A from {α}. Assuming the former gives
α ⊥σ B | A ∪ S. Using (S5) we get (A ∪ {α}) ⊥σ B | S and from (S2) we derive that
A ⊥σ B | S. The latter case is similar. □
The Markov properties are genuinely different in general if the graphoid axioms are
not satisfied, as demonstrated by the examples below.

Example 2.1 (Pairwise Markov but not local Markov). Let X = Y = Z with P{X =
1} = P{X = 0} = 1/2. This distribution satisfies (P) but not (L) with respect to the
graph below.
[Graph: X is an isolated vertex; Y and Z are joined by an edge.]
The pairwise Markov property says that X ⊥⊥ Y | Z and X ⊥⊥ Z |Y , which both are
satisfied. However, we have that bd(X) = 0/ so (L) would imply X ⊥⊥ (Y, Z) which
is false.
It can be shown that (L) ⇐⇒ (P) if and only if the dual graph Ǧ has no induced subgraph ǦA = (A, ĚA) with |A| = 3 and |ĚA| ∈ {2, 3} (Matúš 1992). Here the dual graph Ǧ is defined by α ∼ β in Ǧ if and only if α ≁ β in G, i.e. Ǧ has edges exactly where G does not.

Example 2.2 (Local Markov but not global Markov). Let U and Z be independent
with
P(U = 1) = P(Z = 1) = P(U = 0) = P(Z = 0) = 1/2,
W = U, Y = Z, and X = WY . This satisfies (L) but not (G) w.r.t. the graph below.
U --- W --- X --- Y --- Z
The local Markov property follows because all variables depend deterministically
on their neighbours. But the global Markov property fails; for example it is false
that W ⊥⊥ Y | X.
It can be shown that (G) ⇐⇒ (L) if and only if the dual graph Ǧ does not have
the 4-cycle as an induced subgraph (Matúš 1992).

Factorization and Markov properties

For a ⊆ V , ψa (x) is a function depending on xa only, i.e.

xa = ya ⇒ ψa (x) = ψa (y).

We can then write ψa (x) = ψa (xa ) without ambiguity.


The distribution of X factorizes w.r.t. G, or satisfies (F), if its density f w.r.t. product measure on X has the form

f(x) = ∏a∈A ψa(x),

where A is a collection of complete subsets of G or, equivalently, if

f(x) = ∏c∈C ψ̃c(x),

where C are the cliques of G.



Example 2.3. The cliques of the graph in Fig. 2.1 are the maximal complete subsets
{1, 2}, {1, 3}, {2, 4}, {2, 5}, {3, 5, 6}, {4, 7}, and {5, 6, 7} and a complete set is
any subset of these sets, for example {2} or {5, 7}. The graph corresponds to a
factorization as

f (x) = ψ12 (x1 , x2 )ψ13 (x1 , x3 )ψ24 (x2 , x4 )ψ25 (x2 , x5 )


× ψ356 (x3 , x5 , x6 )ψ47 (x4 , x7 )ψ567 (x5 , x6 , x7 ).

Consider a distribution with density w.r.t. a product measure and let (G), (L) and
(P) denote Markov properties w.r.t. the semigraphoid relation ⊥⊥ .

Theorem 2.2. It holds that


(F) ⇒ (G)
and if the density is strictly positive it further holds that (P) ⇒ (F), such that then
all the Markov properties coincide:

(F) ⇐⇒ (G) ⇐⇒ (L) ⇐⇒ (P).

Proof. See Lauritzen (1996, pp. 35–36).

Without the positivity restriction (G) and (F) are genuinely different, as illustrated
in the example below, due to Moussouris (1974).

[Figure 2.3: the eight configurations (0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1) displayed on the corners of the 4-cycle.]
Fig. 2.3 The distribution which is uniform on these 8 configurations satisfies (G) w.r.t. the 4-cycle.
Yet it does not factorize with respect to this graph.

Example 2.4 (Global but not factorizing). Consider the uniform distribution on the
8 configurations displayed in Fig. 2.3. Conditioning on opposite corners renders one
corner deterministic and therefore the global Markov property is satisfied.
However, the density does not factorize. To see this we assume the density fac-
torizes. Then e.g.

0 ≠ 1/8 = f(0, 0, 0, 0) = ψ12(0, 0)ψ23(0, 0)ψ34(0, 0)ψ41(0, 0)
so these factors are all positive. Continuing for all possible 8 configurations yields
that all factors ψa (x) are strictly positive, since all four possible configurations are
possible for every clique.
But this contradicts the fact that only 8 out of 16 possible configurations have
positive probability. □
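As an added check (not in the original text), the sketch below verifies by enumeration that the uniform distribution on these eight configurations satisfies the conditional independences implied by the global Markov property on the 4-cycle, namely X1 ⊥⊥ X3 | (X2, X4) and X2 ⊥⊥ X4 | (X1, X3).

from itertools import product

configs = [(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)]
p = {x: (1/8 if x in configs else 0.0) for x in product((0, 1), repeat=4)}

def marg(p, keep):
    out = {}
    for x, v in p.items():
        key = tuple(x[k] for k in keep)
        out[key] = out.get(key, 0.0) + v
    return out

def ci(p, i, j, cond):
    """Check X_i ⊥⊥ X_j | X_cond via the factorization criterion (2.3)."""
    fij, fi = marg(p, [i, j] + cond), marg(p, [i] + cond)
    fj, fc = marg(p, [j] + cond), marg(p, cond)
    return all(abs(fij[(a, b) + c] * fc[c] - fi[(a,) + c] * fj[(b,) + c]) < 1e-12
               for a in (0, 1) for b in (0, 1)
               for c in product((0, 1), repeat=len(cond)))

print(ci(p, 0, 2, [1, 3]))   # True: X1 ⊥⊥ X3 | (X2, X4)
print(ci(p, 1, 3, [0, 2]))   # True: X2 ⊥⊥ X4 | (X1, X3)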
In fact, we shall see later that (F) ⇐⇒ (G) if and only if G is chordal, i.e. does not
have an n-cycle as an induced subgraph with n ≥ 4.

Instability of conditional independence under weak limits

Consider a sequence Pn , n = 1, 2, . . . of probability measures on X and assume that


A ⊥⊥Pn B |C. If Pn → P weakly, it does not hold in general that A ⊥⊥P B |C. A simple
counterexample is as follows: Consider X = (X1 , X2 , X3 ) ∼ N3 (0, Σn ) with
   
         ( 1      1/√n    1/2  )        ( 1    0    1/2 )
    Σn = ( 1/√n   2/n     1/√n )   →    ( 0    0    0   )
         ( 1/2    1/√n    1    )        ( 1/2  0    1   )

so in the limit it is not true that 1 ⊥⊥P 3 | 2. The concentration matrix Kn is

                (  2     −√n     0   )
    Kn = Σn⁻¹ =  ( −√n    3n/2   −√n )
                (  0     −√n     2   )

so for all n it holds that 1 ⊥⊥Pn 3 | 2. The critical feature is that Kn does not converge, hence the densities do not converge.
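A short numerical sketch (added here) of the computation above: for every n the inverse of Σn equals the displayed Kn, whose (1,3)-entry is zero, while the limiting covariance matrix is singular with covariance 1/2 between X1 and X3.

import numpy as np

def sigma(n):
    s = np.sqrt(n)
    return np.array([[1.0, 1/s,   0.5],
                     [1/s, 2.0/n, 1/s],
                     [0.5, 1/s,   1.0]])

for n in (4, 100, 10000):
    s = np.sqrt(n)
    K = np.linalg.inv(sigma(n))
    K_expected = np.array([[ 2.0, -s,     0.0],
                           [-s,    1.5*n, -s ],
                           [ 0.0, -s,     2.0]])
    # K_n matches the display above and its (1,3)-entry vanishes: 1 ⊥⊥ 3 | 2 under P_n
    print(n, np.allclose(K, K_expected, atol=1e-6), abs(K[0, 2]) < 1e-6)

# In the limit X2 is degenerate while Cov(X1, X3) = 1/2, so 1 ⊥⊥ 3 | 2 is lost.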

Stability of conditional independence under limits

If X is discrete and finite and Pn → P pointwise, conditional independence is pre-


served: This follows from the fact that

X ⊥⊥Pn Y | Z ⇐⇒ fn (x, y, z) fn (z) = fn (x, z) fn (y, z)

and this relation is clearly stable under pointwise limits. Hence (G), (L) and (P) are
closed under pointwise limits in the discrete case.
In general, conditional independence is preserved if Pn → P in total variation (A.
Klenke, personal communication, St Flour 2006).
Example 2.5 (Instability of factorization under limits). Even in the discrete case,
(F) is not closed under pointwise limits in general. Consider four binary variables
X1 , X2 , X3 , X4 with joint distribution

fn(x1, x2, x3, x4) = n^(x1 x2 + x2 x3 + x3 x4 − x1 x4 − x2 − x3 + 1) / (8 + 8n).
This factorizes w.r.t. the graph below.

[Graph: the 4-cycle with edges 1–2, 2–3, 3–4 and 4–1.]

It holds that fn (x) = n/(8 + 8n) for each of the configurations below

(0, 0, 0, 0) (1, 0, 0, 0) (1, 1, 0, 0) (1, 1, 1, 0)


(0, 0, 0, 1) (0, 0, 1, 1) (0, 1, 1, 1) (1, 1, 1, 1),

whereas fn (x) = 1/(8 + 8n) for the remaining 8 configurations. Thus, when n → ∞
the density fn converges to f (x) = 1/8 for each of the configurations above and
f (x) = 0 otherwise, i.e. to the distribution in Example 2.4 which is globally Markov
but does not factorize.

Markov faithfulness

A distribution P is said to be Markov faithful to a graph G if it holds that

A ⊥G B | S ⇐⇒ A ⊥⊥P B | S.

It can be shown by a dimensional argument that if |Xv | ≥ 2 for all v ∈ V , then


there is a distribution P which is Markov faithful to G . For a Markov faithful P, the
graphoids ⊥G and ⊥⊥P are isomorphic.
In fact, in the discrete and finite case, the set of Markov distributions which are
not faithful to a given graph is a Lebesgue null-set in the set of Markov distributions.
No formal proof seems to be published, but Meek (1995) gives a proof for the case
of directed acyclic graphs and indicates how this can be extended to undirected
graphs.

2.3 Markov Properties for Directed Acyclic Graphs

A directed acyclic graph D over a finite set V is a simple graph with all edges directed and no directed cycles, in the sense that, following the arrows in the graph, it is impossible to return to any point.
Graphical models based on DAGs have proved fundamental and useful in a
wealth of interesting applications, including expert systems, genetics, complex
biomedical statistics, causal analysis, and machine learning, see for example Fig. 1.1
and other examples in Chapter 1.

The directed Markov properties are straightforward generalizations of the notion of a Markov chain with Xi+1 ⊥⊥ {X1, . . . , Xi−1} | Xi for i = 2, . . . , n − 1:

X1 → X2 → X3 → X4 → X5 → ⋯ → Xn

Local directed Markov property

A semigraphoid relation ⊥σ satisfies the local Markov property (L) w.r.t. a directed acyclic graph D if every variable is conditionally independent of its non-descendants given its parents:

∀α ∈ V : α ⊥σ {nd(α) \ pa(α)} | pa(α).

Here nd(α) denotes the non-descendants of α.


[Figure: directed acyclic graph on vertices 1–7 with arrows 1→2, 1→3, 2→4, 2→5, 3→5, 3→6, 4→7, 5→6, 5→7, 6→7 (cf. the factorization given for Fig. 2.4 below).]

Fig. 2.4 A directed, acyclic graph

The local Markov property for the DAG in Fig. 2.4 yields, for example, that
4 ⊥σ {1, 3, 5, 6} | 2, 5 ⊥σ {1, 4} | {2, 3}, and 3 ⊥σ {2, 4} | 1.

Ordered Markov property

Suppose the vertices V of a DAG D are well-ordered in the sense that they are
linearly ordered in a way which is compatible with D, i.e. so that

α ∈ pa(β ) ⇒ α < β .

We then say that the semigraphoid relation ⊥σ satisfies the ordered Markov prop-
erty (O) w.r.t. a well-ordered DAG D if

∀α ∈ V : α ⊥σ {pr(α) \ pa(α)} | pa(α).

Here pr(α) are the predecessors of α, i.e. those which are before α in the well-
ordering.
The numbering in Fig. 2.4 corresponds to a well-ordering. The ordered Markov
property says for example that 4 ⊥σ {1, 3} | 2, 5 ⊥σ {1, 4} | {2, 3}, and 3 ⊥σ {2} | 1.

Separation in DAGs

The global Markov property for directed acyclic graphs is expressed in terms of a
type of separation which is somewhat involved compared to the undirected case.
A trail τ from α to β is a sequence v1, v2, . . . , vn of vertices with α = v1, β = vn and all consecutive vertices being adjacent. A trail τ in D is blocked by a set S if it contains a vertex γ ∈ τ such that
• either γ ∈ S and the edges of τ do not meet head-to-head at γ, or
• neither γ nor any of its descendants is in S, and the edges of τ meet head-to-head at γ.
A trail that is not blocked is active. Two subsets A and B of vertices are d-separated
by S if all trails from A to B are blocked by S. We write A ⊥D B | S.
In the DAG of Fig. 2.4 we have, for example, that for S = {5}, the trail
(4, 2, 5, 3, 6) is active, whereas the trails (4, 2, 5, 6) and (4, 7, 6) are blocked. For
S = {3, 5} all these trails are blocked. Hence it holds that 4 ⊥D 6 | 3, 5, but it is not
true that 4 ⊥D 6 | 5 nor that 4 ⊥D 6.

Global directed Markov property

A semigraphoid relation ⊥σ satisfies the global Markov property (G) w.r.t. a di-
rected acyclic graph D if

A ⊥D B | S ⇒ A ⊥σ B | S.

In Fig. 2.4 the global Markov property thus entails that 4 ⊥⊥ 6 | 3, 5 and 2 ⊥⊥ 3 | 1.

Equivalence of Markov properties

In the directed case the relationship between the alternative Markov properties is
much simpler than in the undirected case.
Proposition 2.1. It holds for any directed acyclic graph D and any semigraphoid
relation ⊥σ that all directed Markov properties are equivalent:

(G) ⇐⇒ (L) ⇐⇒ (O).

We omit the proof of this fact and refer to Lauritzen et al (1990) for details. FiXme
Fatal: give the proof here?
There is also a pairwise property (P), but it is less natural than in the undirected
case and it is weaker than the others, see Lauritzen (1996, page 51).

Factorisation with respect to a DAG

A probability distribution P over X = XV factorizes over a DAG D if its density f


w.r.t. some product measure µ has the form

(F): f(x) = ∏v∈V kv(xv | xpa(v)),

where kv ≥ 0 and ∫Xv kv(xv | xpa(v)) µv(dxv) = 1. It can be easily shown by induction that (F) is equivalent to (F∗), where

(F∗): f(x) = ∏v∈V f(xv | xpa(v)),

i.e. it follows from (F) that the kv in fact are conditional densities. The graph in Fig. 2.4
thus corresponds to the factorization

f (x) = f (x1 ) f (x2 | x1 ) f (x3 | x1 ) f (x4 | x2 )


× f (x5 | x2 , x3 ) f (x6 | x3 , x5 ) f (x7 | x4 , x5 , x6 ).
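To make (F∗) concrete (an illustration added here, not in the notes), the sketch below assembles a joint density of binary variables for the DAG of Fig. 2.4; the parent sets are read off from the factorization above, and the conditional tables are arbitrary.

from itertools import product

parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (2, 3), 6: (3, 5), 7: (4, 5, 6)}

def p_one(v, xpa):
    """An arbitrary illustrative table P(X_v = 1 | x_pa(v)) taking values in (0, 1)."""
    return 0.2 + 0.6 * (sum(xpa) % 2)

def joint(x):
    """f(x) = prod_v f(x_v | x_pa(v)) for a configuration x = (x_1, ..., x_7)."""
    f = 1.0
    for v, pa in parents.items():
        p1 = p_one(v, tuple(x[u - 1] for u in pa))
        f *= p1 if x[v - 1] == 1 else 1.0 - p1
    return f

total = sum(joint(x) for x in product((0, 1), repeat=7))
print(round(total, 10))   # 1.0: a product of conditional densities is a joint density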

Markov properties and factorization

Assume that the probability distribution P has a density w.r.t. some product measure
on X . It is then always true that (F) holds if and only if ⊥⊥P satisfies (G), so all
directed Markov properties are equivalent to the factorization property!

(F) ⇐⇒ (G) ⇐⇒ (L) ⇐⇒ (O). (2.5)


FiXme Fatal: give the proof here?

Ancestral marginals

The directed Markov properties are closed under marginalization to ancestral subsets, i.e. sets A which contain the parents of all their vertices:

α ∈ A ⇒ pa(α) ⊆ A.

Proposition 2.2. If P factorizes w.r.t. D and A ⊆ V is ancestral, it factorizes w.r.t.


DA .

Proof. Induction on |V|, using that if A is ancestral and A ≠ V, there is a terminal vertex α0 with α0 ∉ A. Hence the A-marginal can be obtained by first marginalizing to V′ = V \ {α0} and subsequently marginalizing to A from V′, which has one vertex less than V. □

Moralization and undirected factorizations

The moral graph D m of a DAG D is obtained by adding undirected edges between


unmarried parents and subsequently dropping directions, as illustrated in Fig. 2.5.

[Figure 2.5: the DAG of Fig. 2.4; the same graph with undirected edges added between unmarried parents; and the resulting undirected moral graph.]

Fig. 2.5 Illustration of the moralization process. Undirected edges are added to parents with a
common child. Directions on edges are subsequently dropped.

Markov properties of directed and undirected graphs are different in general.


However, there are obvious important connections between directed and undirected
factorizations. We have for example the following
Proposition 2.3. If P factorizes w.r.t. D, it factorizes w.r.t. the moralized graph D m .
Proof. This is seen directly from the factorization:

f(x) = ∏v∈V f(xv | xpa(v)) = ∏v∈V ψ{v}∪pa(v)(x),

since the sets {v} ∪ pa(v) are all complete in Dm. □
Hence if P satisfies any of the directed Markov properties w.r.t. D, it satisfies all
Markov properties for D m .

Perfect DAGs

The skeleton σ (D) of a DAG is the undirected graph obtained from D by ignoring
directions.
A DAG D is perfect if all parents are married or, in other words if σ (D) = D m .
It follows directly from Proposition 2.3 that the directed and undirected properties
are identical for a perfect DAG D:
Corollary 2.1. P factorizes w.r.t a perfect DAG D if and only if it factorizes w.r.t. its
skeleton σ (D).
Note that a rooted tree with arrows pointing away from the root is a perfect DAG.
Thus for such a rooted tree the directed and undirected Markov properties are the
same.
In particular this yields the well-known fact that any Markov chain is also a
Markov field.

We shall later see that an undirected graph G can be oriented to form a perfect
DAG if and only if G is chordal.

Alternative equivalent separation criterion

The criterion of d-separation can be difficult to verify in some cases, although efficient algorithms to settle d-separation queries exist. For example, Geiger et al (1990) describe an algorithm with worst case complexity O(|E|) for finding all vertices α which satisfy α ⊥D B | S for fixed sets B and S.
Algorithms for settling such queries can also be based on the following alterna-
tive separation criterion given by Lauritzen et al (1990) which is based on Proposi-
tions 2.2 and 2.3. For a query involving three sets A, B, S we perform the following

[Figure 2.6: the subgraph of the DAG in Fig. 2.4 induced by the ancestors of {3, 4, 5, 6}; the same graph during moralization; and the resulting undirected graph.]
Fig. 2.6 To settle the query “4 ⊥m 6 | 3, 5?” we first form the subgraph induced by all ancestors of
vertices involved. The moralization adds an undirected edge between 2 and 3 with common child
5 and drops directions. Since {3, 5} separates 4 from 6 in the resulting graph, we conclude that
4 ⊥m 6 | 3, 5.

operations:
1. Reduce to subgraph induced by ancestral set DAn(A∪B∪S) of A ∪ B ∪ S;
2. Moralize to form (DAn(A∪B∪S) )m ;
3. Say that S m-separates A from B and write A ⊥m B | S if and only if S separates
A from B in this undirected graph.
The procedure is illustrated in Fig. 2.6. It now follows directly from Propositions
2.2 and 2.3 that
Corollary 2.2. If P factorizes w.r.t. D it holds that

A ⊥m B | S ⇒ A ⊥⊥ B | S.

Proof. This holds because then P factorizes w.r.t. (DAn(A∪B∪S))m and hence satisfies (G) for this graph. □
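The three-step criterion translates directly into code. The sketch below (added here, not part of the notes) computes the smallest ancestral set, moralizes it, and tests separation; it is run on the DAG of Fig. 2.4, whose parent sets are taken from its factorization, and reproduces the query of Fig. 2.6.

from collections import deque

def ancestral(pa, W):
    """Smallest ancestral set containing W; pa maps vertex -> set of parents."""
    A, stack = set(W), list(W)
    while stack:
        for p in pa[stack.pop()]:
            if p not in A:
                A.add(p); stack.append(p)
    return A

def moralize(pa, V):
    """Undirected adjacency of the moral graph of the DAG restricted to V."""
    adj = {v: set() for v in V}
    for v in V:
        ps = [p for p in pa[v] if p in V]
        for p in ps:                      # keep parent-child edges, drop directions
            adj[v].add(p); adj[p].add(v)
        for i, p in enumerate(ps):        # marry parents with a common child
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    return adj

def m_separated(pa, A, B, S):
    V = ancestral(pa, set(A) | set(B) | set(S))
    adj = moralize(pa, V)
    seen, queue = set(A) - set(S), deque(set(A) - set(S))
    while queue:
        v = queue.popleft()
        if v in B:
            return False
        for w in adj[v] - set(S) - seen:
            seen.add(w); queue.append(w)
    return True

pa = {1: set(), 2: {1}, 3: {1}, 4: {2}, 5: {2, 3}, 6: {3, 5}, 7: {4, 5, 6}}
print(m_separated(pa, {4}, {6}, {3, 5}))   # True, as in Fig. 2.6
print(m_separated(pa, {4}, {6}, {5}))      # False: the trail 4, 2, 5, 3, 6 is active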

Indeed the concepts of m-separation and d-separation are equivalent:


Proposition 2.4. A ⊥m B | S if and only if A ⊥D B | S.

Proof. This is Proposition 3.25 of Lauritzen (1996). □

Note however that Richardson (2003) has pointed out that the proof given in Lau-
ritzen et al (1990) and Lauritzen (1996) needs to allow self-intersecting paths to be
correct. FiXme Fatal: give the correct proof here
It holds for any DAG D that ⊥D (and hence ⊥m ) satisfies graphoid axioms
(Verma and Pearl 1990).
To show this is true, it is sometimes easy to use ⊥m , sometimes ⊥D . For exam-
ple, (S2) is trivial for ⊥D , whereas (S5) is trivial for ⊥m . So, equivalence of ⊥D
and ⊥m can be very useful.

Faithfulness

As in the undirected case, a distribution P is said to be Markov faithful for a DAG


D if it holds that
A ⊥D B | S ⇐⇒ A ⊥⊥P B | S.
For a Markov faithful P, the graphoids ⊥D and ⊥⊥P are isomorphic.
If |Xv | ≥ 2 for all v ∈ V , then there is a distribution P which is Markov faith-
ful for D, and it holds further that the set of directed Markov distributions which
are not faithful is a Lebesgue null-set in the set of directed Markov distributions
(Meek 1995), confirming in particular that the criterion of d-separation is indeed
the strongest possible.

Markov equivalence

Two DAGs D and D′ are said to be Markov equivalent if the separation relations ⊥D and ⊥D′ are identical. Markov equivalence between DAGs is easy to identify, as shown by Frydenberg (1990a) and Verma and Pearl (1990).
[Figure 2.7: two pairs of DAGs; the pair on the left is Markov equivalent, the pair on the right is not.]

Fig. 2.7 The two DAGs to the left are Markov equivalent whereas those to the right are not. Al-
though those to the right have the same skeleton they do not share the same unmarried parents.

Proposition 2.5. Two directed acyclic graphs D and D′ are Markov equivalent if and only if D and D′ have the same skeleton and the same unmarried parents.
The use of this result is illustrated in Fig. 2.7.
A DAG D is Markov equivalent to an undirected G if the separation relations
⊥D and ⊥G are identical.
This happens if and only if D is perfect and G = σ (D). So the graphs below are
all equivalent

[Graphs: the undirected three-vertex chain and the three DAGs obtained by orienting it without creating unmarried parents.]

but not equivalent to the directed acyclic graph below

[Graph: the three-vertex DAG with both arrows pointing into the middle vertex.]

2.4 Summary

We conclude by a summary of the most important definitions and facts given in the
present chapter.

Markov properties for undirected graphs

(P) Pairwise Markov property: α ≁ β ⇒ α ⊥⊥ β | V \ {α, β };


(L) Local Markov property: α ⊥⊥ V \ cl(α) | bd(α);
(G) Global Markov property: A ⊥G B | S ⇒ A ⊥⊥ B | S;
(F) Factorization property: f (x) = ∏a∈A ψa (x), A being complete subsets of V .
It then holds that
(F) ⇒ (G) ⇒ (L) ⇒ (P).
If f (x) > 0 even
(F) ⇐⇒ (G) ⇐⇒ (L) ⇐⇒ (P).

Markov properties for directed acyclic graphs

(O) Ordered Markov property: α ⊥⊥ {pr(α) \ pa(α)} | pa(α);


(L) Local Markov property: α ⊥⊥ {nd(α) \ pa(α)} | pa(α);
(G) Global Markov property: A ⊥D B | S ⇒ A ⊥⊥ B | S.
(F) Factorization property: f (x) = ∏v∈V f (xv | xpa(v) ).
It then always holds that

(F) ⇐⇒ (G) ⇐⇒ (L) ⇐⇒ (O).

Relation between Markov properties on different graphs

If P is directed Markov w.r.t. D then P factorizes w.r.t. D m .


A DAG D is perfect if its skeleton satisfies G = σ(D) = Dm, implying that the directed and undirected separation properties are identical, i.e. A ⊥G B | S ⇐⇒ A ⊥D B | S.
An undirected graph G is the skeleton of a perfect DAG D, i.e. G = σ (D) = D m ,
if and only if G is chordal.

Two DAGs D and D′ are Markov equivalent, i.e. A ⊥D B | S ⇐⇒ A ⊥D′ B | S, if and only if σ(D) = σ(D′) and D and D′ have the same unmarried parents.
Chapter 3
Graph Decompositions and Algorithms

One important feature of graphical models is modularity; probabilistic information


in complex stochastic systems is distributed over smaller modules exploiting con-
ditional independence relations. The present chapter is specifically concerned with
such aspects.

3.1 Graph Decompositions and Markov Properties

Definition 3.1 (Graph decomposition). A partitioning of V into a triple (A, B, S) of


subsets of V forms a decomposition of an undirected graph G if both of the following hold:
(i) A ⊥G B | S;
(ii) S is complete.
The decomposition is proper if A ≠ ∅ and B ≠ ∅. The components of G are the induced subgraphs GA∪S and GB∪S. A graph is said to be prime if no proper decomposition exists. Examples of prime graphs and graph decompositions are given in Fig. 3.1 and Fig. 3.2.

[Figure: an undirected graph on vertices 1–7 with no complete separators.]

Fig. 3.1 An example of a prime graph. This graph has no complete separators.

Any finite undirected graph can be recursively decomposed into its uniquely defined prime components (Wagner 1937; Tarjan 1985; Diestel 1987, 1990), as illustrated in Fig. 3.3.


[Figure: the graph of Fig. 2.1 and the two components GA∪S and GB∪S of its decomposition.]

Fig. 3.2 Decomposition with A = {1, 3}, B = {4, 6, 7} and S = {2, 5}.

[Figure: recursive decomposition of the graph of Fig. 2.1 into its prime components.]
Fig. 3.3 Recursive decomposition of a graph into its unique prime components.

Definition 3.2 (Decomposable graph). A graph is said to be decomposable if its


prime components are cliques.
It would make more sense to say that such a graph is fully decomposable and reserve
the term decomposable for a graph that is not prime. However, this has not been the
tradition in the statistical literature.

Decomposition of Markov properties

Graph decompositions are important because they correspond to decomposition and


thus modularization of the Markov properties, as captured in the following result.
Proposition 3.1. Let (A, B, S) be a decomposition of G . Then P factorizes w.r.t. G if
and only if both of the following hold:
(i) PA∪S and PB∪S factorize w.r.t. GA∪S and GB∪S ;
(ii) f (x) fS (xS ) = fA∪S (xA∪S ) fB∪S (xB∪S ).

Proof. This is Proposition 3.16 of Lauritzen (1996). □

Recursive decomposition of a decomposable graph yields:

f(x) ∏S∈S fS(xS)^ν(S) = ∏C∈C fC(xC).

Here S is the set of complete separators occurring in the decomposition process and ν(S) the number of times a given S appears. More generally, if Q denotes the prime components of G we have:

f(x) ∏S∈S fS(xS)^ν(S) = ∏Q∈Q fQ(xQ). (3.1)

Combinatorial consequences

If in (3.1) we let Xv = {0, 1} and f be uniform, i.e. f(x) = 2^−|V|, this yields

2^−|V| ∏S∈S 2^−|S|ν(S) = ∏Q∈Q 2^−|Q|

and hence we must have

∑Q∈Q |Q| = ∑S∈S |S| ν(S) + |V|. (3.2)

Similarly the right and left hand sides of (3.1) must have the same number of factors
as every decomposition yields an extra factor on both sides of the equation and
hence it holds that

|Q| = ∑S∈S ν(S) + 1.

These identities were also derived in a slightly different form in Lauritzen et al


(1984).

Example 3.1. An undirected tree τ is decomposable with prime components being


the edges and the separators equal to the vertices, with the multiplicity ν(v) being
one less than the degree of the vertex

ν(v) = deg(v) − 1 = |bd(v)| − 1,

and the combinatorial identities above reduce to

2|E| = ∑v∈V deg(v),    |E| = ∑v∈V (deg(v) − 1) + 1 = 2|E| − |V| + 1,

implying in particular

|E| = |V| − 1.

3.2 Chordal Graphs and Junction Trees

Properties associated with decomposability

A numbering V = {1, . . . , |V |} of the vertices of an undirected graph is said to be


perfect if the induced oriented graph is a perfect DAG or, equivalently, if

∀ j = 2, . . . , |V | : bd( j) ∩ {1, . . . , j − 1} is complete in G .



An undirected graph G is said to be chordal if it has no n-cycle with n ≥ 4 as


an induced subgraph. Chordal graphs are also known as rigid circuit graphs (Dirac
1961) or triangulated graphs (Berge 1973). A set S is called an (α, β )-separator if
α ⊥G β | S. Chordal graphs can now be characterized as follows.
Proposition 3.2. The following conditions are equivalent for any undirected graph
G.
(i) G is chordal;
(ii) G is decomposable;
(iii) G admits a perfect numbering;
(iv) All minimal (α, β )-separators are complete.
Two important lemmas will be used in the following and will therefore be quoted
here. They are simple reformulations of Lemma 2.19 and Lemma 2.21 of Lauritzen
(1996) and the reader is referred to this text for formal proofs:
Lemma 3.1. Let G = (V, E) be a chordal graph and G′ = (V, E′) a subgraph with exactly one less edge. Then G′ is chordal if and only if the endpoints of this edge are contained in exactly one clique of G.
The next lemma ensures that any chordal subgraph of a chordal graph can be ob-
tained from the larger by removing a single edge at a time without violating chordal-
ity.
Lemma 3.2. Let G = (V, E) be a chordal graph and G′ = (V, E′) a chordal subgraph with k fewer edges. Then there exists a sequence of chordal graphs G′ = G0 ⊂ G1 ⊂ · · · ⊂ Gk = G which all differ by a single edge.

Identifying chordal graphs

There are several algorithms for identifying chordal graphs. Here is a greedy algo-
rithm for checking chordality based on the fact that chordal graphs are those that
admit perfect numberings:
Algorithm 3.1. Greedy algorithm for checking chordality of a
graph and identifying a perfect numbering:
1. Look for a vertex v∗ with bd(v∗ ) complete.
If no such vertex exists, the graph is not chordal.
2. Form the subgraph GV \v∗ and let v∗ = |V |;
3. Repeat the process under 1;
4. If the algorithm continues until only one vertex is left,
the graph is chordal and the numbering is perfect.
The worst-case complexity of this algorithm is O(|V|²) as up to |V| − k vertices must be queried to find the vertex to be numbered as |V| − k. The algorithm is illustrated in Fig. 3.4 and Fig. 3.5.
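A direct transcription of Algorithm 3.1 into code (added here as an illustration, not part of the notes); the test graph is the chordal graph of Fig. 3.7, with edges read off from the cliques listed in Example 3.2.

def greedy_perfect_numbering(adj):
    """adj: vertex -> set of neighbours. Returns a perfect numbering
    (vertex -> number) or None if the graph is not chordal."""
    adj = {v: set(nb) for v, nb in adj.items()}          # work on a copy
    number, k = {}, len(adj)
    while adj:
        v_star = next((v for v, nb in adj.items()
                       if all(b in adj[a] for a in nb for b in nb if a != b)),
                      None)
        if v_star is None:
            return None                   # no vertex with complete boundary remains
        number[v_star] = k; k -= 1
        for w in adj.pop(v_star):         # remove v_star and continue on the subgraph
            adj[w].discard(v_star)
    return number

edges = [(1,2),(1,3),(2,3),(1,4),(3,4),(3,5),(4,5),(2,6),(6,7)]
adj = {v: set() for v in range(1, 8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)
print(greedy_perfect_numbering(adj))      # a perfect numbering, e.g. {5: 7, 4: 6, 1: 5, ...}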
[Figure: three stages of the greedy algorithm on a non-chordal graph; vertices 5, 6 and 7 are numbered before the algorithm gets stuck.]

Fig. 3.4 The greedy algorithm at work. This graph is not chordal, as there is no candidate for
number 4.

[Figure: successive stages of the greedy algorithm on a chordal graph, ending with a perfect numbering of the vertices 1–7.]

Fig. 3.5 The greedy algorithm at work. Initially the algorithm proceeds as in Fig. 3.4. This graph
is chordal and the numbering obtained is a perfect numbering.

Maximum cardinality search

This simple algorithm is due to Tarjan and Yannakakis (1984) and has complexity
O(|V | + |E|). It checks chordality of the graph and generates a perfect numbering
if the graph is chordal. In addition, as we shall see in a moment, the cliques of the
chordal graph can be identified as the algorithm runs.
Algorithm 3.2 (Maximum Cardinality Search). Checking chordality of
a graph and identifying a perfect numbering:
1. Choose v0 ∈ V arbitrarily and let v0 = 1;
2. When vertices {1, 2, . . . , j} have been numbered, choose as
vertex j + 1 a vertex in V \ {1, 2, . . . , j} with the highest
number of numbered neighbours;
3. If bd( j + 1) ∩ {1, 2, . . . , j} is not complete, G is not chordal;
4. Repeat from 2;
5. If the algorithm continues until all vertices have been
numbered, the graph is chordal and the numbering is perfect.
The algorithm is illustrated in Fig. 3.7 and Fig. 3.6.
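The following sketch (added here, not part of the notes) implements Algorithm 3.2; ties in step 2 are broken arbitrarily, and non-chordality is reported as soon as a vertex is numbered whose set of already numbered neighbours is not complete.

def mcs(adj, start=None):
    """adj: vertex -> set of neighbours. Returns (order, chordal), where order
    lists the vertices in MCS order 1, 2, ..., |V|."""
    unnumbered = set(adj)
    counter = {v: 0 for v in adj}            # numbered neighbours of each vertex
    order, numbered = [], set()
    v = start if start is not None else next(iter(adj))
    while True:
        order.append(v); numbered.add(v); unnumbered.discard(v)
        B = adj[v] & numbered                # boundary among already numbered vertices
        if not all(b in adj[a] for a in B for b in B if a != b):
            return order, False              # step 3: boundary not complete
        if not unnumbered:
            return order, True
        for w in adj[v] & unnumbered:
            counter[w] += 1
        v = max(unnumbered, key=counter.get)  # step 2: most numbered neighbours

edges = [(1,2),(1,3),(2,3),(1,4),(3,4),(3,5),(4,5),(2,6),(6,7)]
adj = {v: set() for v in range(1, 8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)
print(mcs(adj, start=1))   # an MCS ordering of the chordal graph of Fig. 3.7, and True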

[Figure: successive stages of Maximum Cardinality Search on a non-chordal graph; counters for unnumbered neighbours are marked with ∗.]

Fig. 3.6 Maximum Cardinality Search at work. When a vertex is numbered, a counter for each of
its unnumbered neighbours is increased with one, marked here with the symbol ∗. The counters
keep track of the numbered neighbours of any vertex and are used to identify the next vertex to
be numbered. This graph is not chordal as discovered at the last step because 7 does not have a
complete boundary.
[Figure: an MCS numbering 1–7 of the chordal graph whose cliques are listed in Example 3.2.]

Fig. 3.7 MCS numbering for a chordal graph. The algorithm runs essentially as in the non-chordal
case.

Finding the cliques

Finding the cliques of a general graph is an NP-complete problem. But the cliques
of a chordal graph can be found in a simple fashion from a MCS numbering V =
{1, . . . , |V |}. More precisely we let

Bλ = bd(λ) ∩ {1, . . . , λ − 1}

and πλ = |Bλ|. Say that λ is a ladder vertex if λ = |V| or if πλ+1 < πλ + 1, and let Λ be the set of ladder vertices.
It then holds that the cliques of G are Cλ = {λ} ∪ Bλ, λ ∈ Λ. For a proof of this
assertion see e.g. Cowell et al (1999, page 56).

Example 3.2. For the MCS ordering in Fig. 3.7 we find πλ = (0, 1, 2, 2, 2, 1, 1) yield-
ing the ladder nodes {3, 4, 5, 6, 7} and the corresponding cliques

C = {{1, 2, 3}, {1, 3, 4}, {3, 4, 5}, {2, 6}, {6, 7}}.
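The ladder-vertex construction can be coded in a few lines (an added sketch, not from the notes); run on the MCS ordering of Example 3.2 it reproduces the cliques listed there.

def cliques_from_mcs(adj, order):
    """adj: vertex -> set of neighbours; order: the vertices in MCS order."""
    pos = {v: i for i, v in enumerate(order)}
    B = {v: {w for w in adj[v] if pos[w] < pos[v]} for v in order}
    pi = [len(B[v]) for v in order]
    ladder = [v for i, v in enumerate(order)
              if i == len(order) - 1 or pi[i + 1] < pi[i] + 1]
    return [{v} | B[v] for v in ladder]

edges = [(1,2),(1,3),(2,3),(1,4),(3,4),(3,5),(4,5),(2,6),(6,7)]
adj = {v: set() for v in range(1, 8)}
for a, b in edges:
    adj[a].add(b); adj[b].add(a)
print(cliques_from_mcs(adj, order=[1, 2, 3, 4, 5, 6, 7]))
# [{1, 2, 3}, {1, 3, 4}, {3, 4, 5}, {2, 6}, {6, 7}], as in Example 3.2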

Junction tree

Let A be a collection of finite subsets of a set V . A junction tree T of sets in A


is an undirected tree with A as a vertex set, satisfying the junction tree property: If
A, B ∈ A and C is on the unique path in T between A and B, then A ∩ B ⊆ C.
If the sets in A are pairwise incomparable, they can be arranged in a junction
tree if and only if A = C where C are the cliques of a chordal graph.
The junction tree can be constructed directly from the MCS ordering Cλ, λ ∈ Λ. More precisely, since

Bλ = bd(λ) ∩ {1, . . . , λ − 1}

is complete for all λ ∈ Λ, it holds that

Cλ ∩ (∪λ′<λ Cλ′) = Cλ ∩ Cλ∗ = Sλ

for some λ∗ < λ. A junction tree is now easily constructed by attaching Cλ to any Cλ∗ satisfying the above. Although λ∗ may not be uniquely determined, Sλ is. Indeed, the sets Sλ are the minimal complete separators and the numbers ν(S) are ν(S) = |{λ ∈ Λ : Sλ = S}|. Junction trees can be constructed in many other ways as well (Jensen and Jensen 1994). FiXme Fatal: make figure to illustrate

3.3 Probability Propagation and Junction Tree Algorithms

Junction trees of prime components

In general, the prime components of any undirected graph can be arranged in a


junction tree in a similar way using an algorithm of Tarjan (1985), see also Leimer
(1993).
Then every pair of neighbours (C, D) in the junction tree represents a decompo-
sition of G into GC̃ and GD̃ , where C̃ is the set of vertices in cliques connected to C
but separated from D in the junction tree, and similarly with D̃.
Tarjan’s algorithm is based on first numbering the vertices by a slightly more
sophisticated algorithm (Rose et al 1976) known as Lexicographic Search (LEX)
which runs in O(|V |2 ) time.

Markov properties of junction tree

The factorization property of an undirected graph can be seen as an ‘outer’ fac-


torization over the junction tree into prime components, combined with ‘inner’ or
‘local’ factorizations on each prime component. More precisely, if we let Q ∈ Q be
the prime components of a graph G , arranged in a junction tree T and use that any
graph decomposition also yields a decomposition of the Markov properties we have
the following.
Proposition 3.3. The distribution of X = (Xv , v ∈ V ) factorizes w.r.t. G if and only
if the distribution of XQ , Q ∈ Q factorizes w.r.t. T and each of XQ factorizes w.r.t.
GQ .
If G is decomposable, X = (Xv , v ∈ V ) factorizes w.r.t. G if and only if XC ,C ∈ C
factorizes w.r.t. T .

3.4 Local Computation

Local computation algorithms similar to probability propagation have been devel-


oped independently in a number of areas with a variety of purposes. This includes,
for example:
• Kalman filter and smoother (Thiele 1880; Kalman and Bucy 1961);
• Solving sparse linear equations (Parter 1961);
• Decoding digital signals (Viterbi 1967; Bahl et al 1974);
• Estimation in hidden Markov models (Baum 1972);
• Peeling in pedigrees (Elston and Stewart 1971; Cannings et al 1976);
• Belief function evaluation (Kong 1986; Shenoy and Shafer 1986);
• Probability propagation (Pearl 1986; Lauritzen and Spiegelhalter 1988; Jensen
et al 1990);

• Optimizing decisions (Jensen et al 1994; Lauritzen and Nilsson 2001).


All these algorithms use, explicitly or implicitly, a graph decomposition and a
junction tree or a similar structure to organize the computations.

An abstract perspective

Before we describe the local computation algorithms for probability propagation in


detail, it is helpful to look at things from an abstract perspective.
We consider a large finite set V and a collection C of small subsets of V . Our
elementary objects φC ,C ∈ C are valuations with domain C. These can be combined
as
φA ⊗ φB
to form more complex valuations with domain A ∪ B. The combination operation ⊗
is assumed to be commutative and associative:

φA ⊗ φB = φB ⊗ φA ,    (φA ⊗ φB ) ⊗ φC = φA ⊗ (φB ⊗ φC ). (3.3)

Valuations can be marginalised: For A ⊂ V , φ↓A denotes the A-marginal of φ;
φ↓A has domain A. The marginalisation is assumed to satisfy consonance:

φ↓(A∩B) = (φ↓B)↓A (3.4)

and distributivity:

(φ ⊗ φC )↓B = φ↓B ⊗ φC   if C ⊆ B. (3.5)

The conditions (3.3), (3.4) and (3.5) are known as the Shenoy–Shafer axioms after
Shenoy and Shafer (1990) who first studied local computation in an abstract per-
spective. The specific algorithms described here only work when the semigroup of
valuations is also separative, i.e. satisfies

φA ⊗ φB = φA ⊗ φA = φB ⊗ φB ⇒ φA = φB ,

which implies that division of valuations can be partially defined (Lauritzen and
Jensen 1997).

Computational challenge

The computational challenge is to calculate marginals ψA = φ ↓A of a joint valuation

φ = ⊗C∈C φC

with domain V = ∪C∈C C.



We are interested in cases where the direct computation of φ↓A is impossible if V
is large. Hence we wish to calculate φ↓A using only local operations, i.e. operating
on factors ψB with domain B ⊆ C for some C ∈ C , taking advantage of the fact that
the sets in C are rather small.
Typically there is also a second purpose of the calculation. Let us consider some
examples.
Example 3.3 (Probability propagation). We consider a factorizing density on X =
×v∈V Xv with V and Xv finite:

p(x) = ∏_{C∈C} φC (x).

The potentials φC (x) depend on xC = (xv , v ∈ C) only. The basic task is to calculate a
marginal (likelihood)

p(xE∗ ) = p↓E (xE∗ ) = ∑_{yV\E} p(xE∗ , yV\E )

for E ⊆ V and fixed xE∗ , but the sum has too many terms. A second purpose is to
calculate the predictive probabilities p(xv | xE∗ ) = p(xv , xE∗ )/p(xE∗ ) for v ∈ V .

Example 3.4 (Sparse linear equations). Here valuations φC are equation systems in-
volving variables with labels C. The combination operation φA ⊗ φB concatenates
equation systems. The marginal φB↓A eliminates variables in B \ A, resulting in an
equation system involving only variables in A. The marginal φ ↓A of the joint valu-
ation thus reduces the system of equations to a smaller one. A second computation
finds a solution of the equation system.

Example 3.5 (Constraint satisfaction). Here the valuations φC represent constraints


involving variables in C; the combination φA ⊗ φB concatenates the constraints; the
marginal φB↓A finds implied constraints. The second computation identifies jointly
feasible configurations. If represented by indicator functions, ⊗ is ordinary product
and φ ↓E (xE∗ ) = ⊕yV \E φ (xE∗ , yV \E ), where 1 ⊕ 1 = 1 ⊕ 0 = 0 ⊕ 1 = 1 and 0 ⊕ 0 = 0.

Computational structure

Algorithms all implicitly or explicitly arrange the collection of sets C in a junction
tree T . Thus the algorithms work if and only if C are the cliques of a chordal graph G .
If this is not so from the outset, a triangulation is used to construct a chordal
graph G ′ with E ⊆ E ′. This triangulation can be made in different ways with dif-
ferent computational complexities resulting. Typically, what must be controlled is
the maximal clique size, i.e. the cardinality of the largest C ∈ C . Optimizing this
step is known to be NP-complete (Yannakakis 1981), but there are several heuristic
algorithms which find good triangulations. In fact, there are algorithms, which in
most cases run at reasonable computational speed and are guaranteed to return an

optimal triangulation. Such an algorithm has been implemented in version 6 of the


commercially available software HUGIN (Andersen et al 1989). This algorithm is
based on work of Shoikhet and Geiger (1997), Berry et al (2000), and Bouchitté and
Todinca (2001), and it is described in Jensen (2002).
Clearly, in a probabilistic perspective, if P factorizes w.r.t. G it factorizes w.r.t.
G 0 . Henceforth we assume such a triangulation has been made so we work with a
chordal graph G .

Setting up the structure

In many applications P is initially factorizing over a directed acyclic graph D. The


computational structure is then set up in several steps:
1. Moralization: Constructing D m , exploiting that if P factorizes on D, it factor-
izes over D m .
2. Triangulation: Adding edges to find chordal graph G with D m ⊆ G as men-
tioned above;
3. Constructing junction tree: Arranging the cliques of G in a junction tree;
4. Initialization: Assigning potential functions φC to cliques.

Basic computation

The basic computation now involves the following steps:

1. Incorporating observations: If XE = xE∗ is observed, we modify potentials as

φC (xC ) ← φC (xC ) ∏_{e∈E∩C} δ (xe∗ , xe ),

with δ (u, v) = 1 if u = v and else δ (u, v) = 0. Then:

p(x | XE = xE∗ ) = ∏_{C∈C} φC (xC ) / p(xE∗ ).

2. Marginals p(xE∗ ) and p(xC | xE∗ ) are then calculated by a local message passing
algorithm, to be described in further detail below.

Assigning potentials

Between any two cliques C and D which are neighbours in the junction tree their
intersection S = C ∩ D is one of the minimal separators appearing in the decomposi-
tion sequence. We now explicitly represent these separators in the junction tree and
also assign potentials to them, initially φS ≡ 1 for all S ∈ S , where S is the set of
separators. We also let
κ(x) = ∏_{C∈C} φC (xC ) / ∏_{S∈S} φS (xS ), (3.6)
and now it holds that p(x | xE∗ ) = κ(x)/p(xE∗ ). The expression (3.6) will be invariant
under the message passing.

Marginalization

The A-marginal of a potential φB for A ⊆ B is

φB↓A (x) = ∑_{yB : yA = xA} φB (y).

If φB depends on x through xB only and B ⊆ V is ‘small’, the marginal can be computed
easily. The marginalisation clearly satisfies consonance (3.4) and distributivity (3.5).

Messages

When C sends a message to D, the following happens:

Before:  φC      φS       φD
After:   φC      φC↓S     φD φC↓S / φS

Note that this computation is local, involving only variables within the pair of
cliques. The expression in (3.6) is invariant under the message passing since
φC φD /φS is unchanged:

φC · (φD φC↓S / φS) / φC↓S = φC φD / φS .

After the message has been sent, D contains the D-marginal of φC φD /φS . To see
this, we calculate

(φC φD / φS)↓D = (φD / φS) φC↓D = (φD / φS) φC↓S ,
where we have used distributivity and consonance.
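The following numpy sketch performs this single message for two cliques C = {1, 2} and D = {2, 3} of binary variables with separator S = {2}; the potential values are arbitrary illustrative numbers, and the assertions check the two facts just derived: D ends up with the D-marginal of φC φD /φS and the expression (3.6) is unchanged.

import numpy as np

# Binary variables; axes: phi_C over (x1, x2), phi_D over (x2, x3), phi_S over (x2,).
phi_C = np.array([[1.0, 2.0], [3.0, 4.0]])   # illustrative potential values
phi_D = np.array([[0.5, 1.5], [2.0, 1.0]])
phi_S = np.ones(2)                            # separator initialised to 1

joint = phi_C[:, :, None] * phi_D[None, :, :] / phi_S[None, :, None]  # phi_C phi_D / phi_S

# Message from C to D: marginalise C onto S, rescale D, store the new separator.
message = phi_C.sum(axis=0)                   # phi_C marginalised to S, indexed by x2
phi_D = phi_D * (message / phi_S)[:, None]
phi_S = message

# D now contains the D-marginal of the joint valuation.
assert np.allclose(phi_D, joint.sum(axis=0))
# The represented joint phi_C phi_D / phi_S is unchanged (invariance of (3.6)).
assert np.allclose(phi_C[:, :, None] * phi_D[None, :, :] / phi_S[None, :, None], joint)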

Second message

Before we proceed to discuss the case of a general junction tree, we shall investigate
what happens when D returns a message to C:

First message:    φC                 φC↓S     φD φC↓S / φS
Second message:   φC φ↓S / φC↓S      φ↓S      φD φC↓S / φS

Now all sets contain the relevant marginal of φ = φC φD /φS , including the separator.
This is seen as follows. The separator contains
φ↓S = (φ↓D)↓S = (φD φC↓S / φS)↓S = φC↓S φD↓S / φS .

The clique C contains

φC φ↓S / φC↓S = (φC / φS) φD↓S = φ↓C

since, as before,

(φC φD / φS)↓C = (φC / φS) φD↓C = (φC / φS) φD↓S .
Note that now further messages between C and D are neutral. Nothing will change
if a message is repeated.

Message passing schedule

To describe the message passing algorithm fully we need to arrange for a schedul-
ing of messages to be delivered. As we have seen above, it never harms to send a
message, since the expression (3.6) is invariant under the operation. However, for
computational efficiency it is desirable to send messages in such a way that redun-
dant messages are avoided. The schedule to be described here is used in HUGIN and
has two phases:

CollInfo:
In this first phase, messages are sent from the leaves towards an arbitrarily chosen root R.
It then holds that after CollInfo, the root potential satisfies φR (xR ) = p(xR , xE∗ ).

DistInfo:
In the second phase messages are sent from the root R towards the leaves of the
junction tree. After CollInfo and the subsequent DistInfo, it holds that

φB (xB ) = p(xB , xE∗ ) for all B ∈ C ∪ S . (3.7)

Hence p(xE∗ ) = ∑_{xS} φS (xS ) for any S ∈ S and p(xv | xE∗ ) can readily be computed
from any φS with v ∈ S.

Alternative scheduling of messages

Another efficient way of scheduling the messages is via local control. We then allow
a clique to send a message to a neighbour if and only if it has already received messages
from all its other neighbours. Such messages are live. Using this protocol, there will be one
clique which first receives messages from all its neighbours. This is effectively the
root R in CollInfo and DistInfo. Exactly two live messages along every branch
are needed to ensure that (3.7) holds.

Maximization

Another interesting task is to find the configuration with maximum probability, also
known as the MAP. To solve this, we simply replace the standard sum-marginal with
max-marginal:
φB↓A (x) = max_{yB : yA = xA} φB (y).

This marginalization also satisfies consonance and distributivity, and hence the same
message passing schemes as above will apply. After CollInfo and the subsequent
DistInfo, the potentials satisfy

φB (xB ) = max_{yV\(B∪E)} p(xB , xE∗ , yV\(B∪E) ) = p(xB , xE∗ , x̂V\(B∪E) ) for all B ∈ C ∪ S ,

where x̂ is the most probable configuration. Hence

max_{yV\E} p(yV\E , xE∗ ) = p(x̂V\E , xE∗ ) = max_{xS} φS (xS ) for any S ∈ S

and the most probable configuration can now readily be identified (Cowell et al
1999, page 98). Viterbi’s decoding algorithm for Hidden Markov Models (Viterbi
1967) is effectively a special instance of max-propagation.
It is also possible to find the k most probable configurations by a local computa-
tion algorithm (Nilsson 1998).
Since (3.6) remains invariant, one can switch freely between max- and sum-
propagation without reloading original potentials.
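Continuing the small two-clique illustration above with max in place of sum gives a minimal sketch of max-propagation (again with purely illustrative potentials):

import numpy as np

phi_C = np.array([[1.0, 2.0], [3.0, 4.0]])    # same illustrative potentials as before
phi_D = np.array([[0.5, 1.5], [2.0, 1.0]])
joint = phi_C[:, :, None] * phi_D[None, :, :]          # separator potential is 1

# Max-marginal message C -> D; D then holds the max-marginal of the joint.
message = phi_C.max(axis=0)
phi_D_max = phi_D * message[:, None]
assert np.allclose(phi_D_max, joint.max(axis=0))

# For this tiny table the most probable configuration can be read off directly.
print(tuple(int(i) for i in np.unravel_index(joint.argmax(), joint.shape)))
# (1, 1, 0): the MAP configuration is x1 = 1, x2 = 1, x3 = 0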

Random propagation

Another variant of the message passing scheme picks a random configuration with
distribution p(x | xE∗ ). Recall that after CollInfo, the root potential is φR (xR ) ∝
p(xR | xE∗ ). We then modify DistInfo as follows:
1. Pick random configuration x̌R from φR ;
2. Send message to neighbours C as x̌R∩C = x̌S where S = C ∩ R is the separator;
3. Continue by picking x̌C according to φC (xC\S , x̌S ) and send message further
away from root.
When the sampling stops at the leaves of the junction tree, a configuration x̌ has
been generated from p(x | xE∗ ).
There is an abundance of variants of the basic propagation algorithm; see Cowell
et al (1999) for many of these.

3.5 Summary

Graph decompositions

A partitioning (A, B, S) of V forms a decomposition if S is complete and A ⊥G B | S.


A graph is prime if no proper decomposition exists. The prime components
of a graph are the prime induced subgraphs and any finite undirected graph can be
recursively decomposed into its prime components.

Chordal graphs

A graph is chordal if it has no induced cycles of length greater than three. The
following are equivalent for any undirected graph G .
(i) G is chordal;
(ii) G is decomposable;
(iii) All prime components of G are cliques;
(iv) G admits a perfect numbering;
(v) Every minimal (α, β )-separator is complete.
Trees are chordal graphs and thus decomposable. The prime components are the
branches.
Maximum Cardinality Search (MCS) (Tarjan and Yannakakis 1984) identifies
whether a graph is chordal or not. If a graph G is chordal, MCS yields a perfect
numbering of the vertices. In addition it finds the cliques of G .

Junction tree

A junction tree T of sets A is an undirected tree with A as a vertex set, satisfying


the junction tree property:
If A, B ∈ A and C is on the unique path in T between A and B it holds that
A ∩ B ⊂ C.
If the sets in A are pairwise incomparable, they can be arranged in a junction
tree if and only if A = C where C are the cliques of a chordal graph.
The junction tree can be constructed directly from the MCS ordering Cλ , λ ∈ Λ .

Message passing

Initially the junction tree has potentials φA , A ∈ C ∪ S so that the joint distribution
of interest satisfies

p(x | xE∗ ) ∝ ∏_{C∈C} φC (xC ) / ∏_{S∈S} φS (xS ).
The expression on the right-hand side is invariant under message passing. A mes-
sage sent from a clique which has already received messages from all its other
neighbours is live. When exactly two live messages have been sent along every
branch of the junction tree it holds that

φB (xB ) = p(xB , xE∗ ) for all B ∈ C ∪ S ,

from which most quantities of interest can be directly calculated.


Chapter 4
Specific Graphical Models

4.1 Log-linear Models

4.1.1 Interactions and factorization

Let A be a set of subsets of V . A density or function f is said to factorize w.r.t. A
if there exist functions ψa (x) which depend on x through xa only and

f (x) = ∏_{a∈A} ψa (x).

The set of distributions PA which factorize w.r.t. A is the hierarchical log–linear


model generated by A . The set A is the generating class of the log-linear model.
Typically the sets in A are taken to be pairwise incomparable under inclusion,
so that no set in A is a subset of another set in A . This need not necessarily be so
but avoids redundancy in the representation.
The traditional notation used for contingency tables lets mi jk denote the mean of
the counts Ni jk in the cell (i, j, k) which is then expanded as e.g.

log mi jk = αi + β j + γk (4.1)

or
log mi jk = αi j + β jk (4.2)
or
log mi jk = αi j + β jk + γik , (4.3)
or (with redundancy)

log mi jk = γ + δi + φ j + ηk + αi j + β jk + γik , (4.4)

To make the connection between this notation and the one used here, we assume
that we have observations X 1 = x1 , . . . , X n = xn and V = {I, J, K}. We then write


i = 1, . . . , |I| for the possible values of XI etc. and

Ni jk = |{ν : xν = (i, j, k)}|.

Then we have mi jk = n f (x) and if f is strictly positive and factorizes w.r.t. A =


{{I, J}, {J, K}}, it holds that

log f (x) = log ψIJ (xI , xJ ) + log ψJK (xJ , xK ).

Thus if we let

αi j = log n + log ψIJ (xI , xJ ), β jk = log ψJK (xJ , xK )

we have
log mi jk = αi j + β jk .
The main difference is the assumption of positivity needed for the logarithm to be
well defined. This is not necessary when using the multiplicative definition above. It
is typically an advantage to relax the restriction of positivity although it also creates
technical difficulties.
The logarithm of the factors φa = log ψa are known as interaction terms of or-
der |a| − 1 or |a|-factor interactions. Interaction terms of 0th order are called main
effects. In the following we also refer to the factors themselves as interactions and
main effects, rather than their logarithms.

4.1.2 Dependence graphs and factor graphs

Any joint probability distribution P of X = (Xv , v ∈ V ) has a dependence graph


G = G(P) = (V, EP ). This is defined by letting α 6∼ β in G(P) exactly when

α ⊥⊥P β |V \ {α, β }.

X will then satisfy the pairwise Markov property w.r.t. G(P), and G(P) is the smallest graph
with this property, i.e. P is pairwise Markov w.r.t. G iff

G(P) ⊆ G .

The dependence graph G(P) for a family P of probability measures is the smallest
graph G so that all P ∈ P are pairwise Markov w.r.t. G :

α ⊥⊥P β |V \ {α, β } for all P ∈ P.

For any generating class A we construct the dependence graph G(A ) = G(PA )
of the log–linear model PA . This is determined by the relation

α ∼ β ⇐⇒ ∃a ∈ A : α, β ∈ a.

Sets in A are clearly complete in G(A ) and therefore distributions in PA fac-


torize according to G(A ). They are thus also global, local, and pairwise Markov
w.r.t. G(A ).

Some simple examples

Example 4.1 (Independence). The log–linear model specified by (4.1) is known as


the main effects model. It has generating class A = {{I}, {J}, {K}} consisting of
singletons only; its dependence graph is the empty graph on {I, J, K} (no edges).

Thus it corresponds to complete independence.


Example 4.2 (Conditional independence). The log–linear model specified by (4.2)
has no interaction between I and K. It has generating class A = {{I, J}, {J, K}} and
dependence graph I — J — K, i.e. the graph with edges {I, J} and {J, K} only.

Thus it corresponds to the conditional independence I ⊥⊥ K | J.


Example 4.3 (No second-order interaction). The log–linear model specified by (4.3)
has no second-order interaction. It has generating class A = {{I, J}, {J, K}, {I, K}}
and its dependence graph

is the complete graph. Thus it has no conditional independence interpretation.

Conformal log-linear models

Just as a generating class defines a dependence graph G(A ), a graph defines a generating
class: the set C (G ) of cliques of G is a generating class for the log–linear model of
distributions which factorize w.r.t. G .
If the dependence graph completely summarizes the restrictions imposed by A ,
i.e. if A = C (G(A )), we say that A is conformal. The generating classes for the
models given by (4.1) and (4.2) are conformal, whereas this is not the case for (4.3).

Factor graphs

The factor graph of A is the bipartite graph with vertices V ∪ A and edges defined
by
α ∼ a ⇐⇒ α ∈ a.
Using this graph even non-conformal log–linear models admit a simple visual rep-
resentation, as illustrated in Figure 4.1, which displays the factor graph of the non-
conformal model in Example 4.3 with no second-order interaction.

Fig. 4.1 The factor graph of the model in Example 4.3 with no second-order interaction.

If F = F(A ) is the factor graph for A and G = G(A ) the corresponding de-
pendence graph, it is not difficult to see that for A, B, S being subsets of V

A ⊥G B | S ⇐⇒ A ⊥F B | S

and hence conditional independence properties can be read directly off the factor
graph also. In that sense, the factor graph is more informative than the dependence
graph.

4.1.3 Data and likelihood function

Data in list form

Consider a sample X 1 = x1 , . . . , X n = xn from a distribution with probability mass


function p. We refer to such data as being in list form, e.g. as

Case  Admitted?  Sex
1     Yes        Male
2     Yes        Female
3     No         Male
4     Yes        Male
...   ...        ...

Contingency Table

Data often presented in the form of a contingency table or cross-classification, ob-


tained from the list by sorting according to category:
Sex
Admitted? Male Female
Yes 1198 557
No 1493 1278
This is a two-way table (or two-way classification) with categorical variables A:
Admitted? and S: Sex. In this case it is a 2 × 2-table. The numerical entries are cell
counts
n(x) = |{ν : xν = x}|
and the total number of observations is n = ∑x∈X n(x).

Likelihood function

Assume now p ∈ PA but otherwise unknown. The likelihood function can be ex-
pressed as

L(p) = ∏_{ν=1}^{n} p(xν ) = ∏_{x∈X} p(x)^{n(x)} .

In contingency table form the data follow a multinomial distribution

P{N(x) = n(x), x ∈ X } = n! / {∏_{x∈X} n(x)!} · ∏_{x∈X} p(x)^{n(x)}

but this only affects the likelihood function by a constant factor. The likelihood func-
tion is clearly continuous as a function of the (|X |-dimensional vector) unknown
probability distribution p. Since the closure P̄A is compact (bounded and closed),
L attains its maximum on P̄A (not necessarily on PA itself).

Uniqueness of the MLE

Indeed, it is also true that L has a unique maximum over P̄A , essentially because
the likelihood function is log-concave. The proof is indirect: Assume p1 , p2 ∈ P̄A
with p1 ≠ p2 and
L(p1 ) = L(p2 ) = sup_{p∈P̄A} L(p). (4.5)

Define
p12 (x) = c √{p1 (x)p2 (x)},
where c^{−1} = ∑_x √{p1 (x)p2 (x)} is a normalizing constant. Then p12 ∈ P̄A because

p12 (x) = c √{p1 (x)p2 (x)} = lim_{n→∞} c √{ ∏_{a∈A} ψ^1_{an}(x) ψ^2_{an}(x) } = lim_{n→∞} ∏_{a∈A} ψ^{12}_{an}(x),

where e.g. ψ^{12}_{an}(x) = c^{1/|A|} √{ψ^1_{an}(x) ψ^2_{an}(x)} and ψ^1_{an}, ψ^2_{an} are the factors of sequences in PA converging to p1 and p2. The Cauchy–Schwarz inequality yields

c^{−1} = ∑_x √{p1 (x)p2 (x)} < √{∑_x p1 (x)} √{∑_x p2 (x)} = 1.

Hence

L(p12 ) = ∏_x p12 (x)^{n(x)} = ∏_x [c √{p1 (x)p2 (x)}]^{n(x)}
        = c^n ∏_x √{p1 (x)}^{n(x)} ∏_x √{p2 (x)}^{n(x)}
        = c^n √{L(p1 )L(p2 )} > √{L(p1 )L(p2 )} = L(p1 ) = L(p2 ),

which contradicts (4.5). Hence we conclude p1 = p2 .

Likelihood equations

A simple application of the information inequality yields:


Proposition 4.1. The maximum likelihood estimate p̂ of p is the unique element of
PA which satisfies the system of equations

n p̂(xa ) = n(xa ), ∀a ∈ A , xa ∈ Xa . (4.6)

Here g(xa ) = ∑y:ya =xa g(y) is the a-marginal of the function g.


Proof. See Lauritzen (1996, Thm. 4.8) for a formal proof of this fact.
The system of equations (4.6) expresses the fitting of the marginals in A . This
is also an instance of the familiar result that in an exponential family (log-linear
∼ exponential), the MLE is found by equating the sufficient statistics (marginal
counts) to their expectation.

Iterative proportional scaling

To show that the equations (4.6) indeed have a solution, we simply describe a
convergent algorithm which solves it. This cycles (repeatedly) through all the a-
marginals in A and fits them one by one. For a ∈ A define the following scaling
operation on p:

(Ta p)(x) ← p(x) n(xa ) / {np(xa )}, x ∈ X ,

where 0/0 = 0 and b/0 is undefined if b ≠ 0.

Fitting the marginals

The operation Ta fits the a-marginal if p(xa ) > 0 when n(xa ) > 0:

n(Ta p)(xa ) = n ∑_{y:ya =xa} p(y) n(ya ) / {np(ya )}
             = n · n(xa ) / {np(xa )} · ∑_{y:ya =xa} p(y)
             = n · n(xa ) / {np(xa )} · p(xa ) = n(xa ).

Next, we make an ordering of the generators A = {a1 , . . . , ak }. We define S by a


full cycle of scalings
Sp = Tak · · · Ta2 Ta1 p
and consider the iteration

p0 (x) ← 1/|X |, pn = Spn−1 , n = 1, . . . .

Proposition 4.2. The iteration specified is convergent:

lim pn = p̂,
n→∞

where p̂ is the unique maximum likelihood estimate of p ∈ PA .


In other words, the limit p̂ is the unique solution of the equation system (4.6).
Proof. The key elements in the proof of this result are:
1. If p ∈ PA , so is Ta p;
2. Ta is continuous at any point p of PA with p(xa ) 6= 0 whenever n(xa ) = 0;
3. L(Ta p) ≥ L(p), with equality if and only if the corresponding equation in (4.6)
is satisfied, so the likelihood increases at every step;
4. p̂ is the unique common fixpoint of the operations Ta (and of S);
5. PA is compact.
We abstain from giving further details here.
The algorithm is known as Iterative Proportional Scaling, the IPS-algorithm, Iter-
ative Proportional Fitting or the IPF-algorithm. It has numerous implementations,
for example in R (inefficiently) in loglin with front end loglm in MASS (Ven-
ables and Ripley 2002).
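A hedged Python sketch of the scaling step Ta and a full IPS cycle is shown below, with tables stored as numpy arrays having one axis per variable; it illustrates the update p(x) ← p(x) n(xa )/{np(xa )} and is not the loglin implementation. It is applied to the admissions data of Example 4.4 below.

import numpy as np

def fit_margin(p, counts, axes_a):
    """One scaling step T_a: fit the a-marginal of n*p to the observed counts.

    p and counts are arrays with one axis per variable in V; axes_a lists the
    axes that make up the subset a.  Cells with zero marginal stay at zero.
    """
    n = counts.sum()
    other = tuple(i for i in range(p.ndim) if i not in axes_a)
    p_marg = p.sum(axis=other, keepdims=True)        # p(x_a)
    n_marg = counts.sum(axis=other, keepdims=True)   # n(x_a)
    ratio = np.divide(n_marg, n * p_marg, out=np.zeros_like(p_marg),
                      where=p_marg > 0)
    return p * ratio

def ips(counts, generators, cycles=50):
    """Iterative proportional scaling for the model generated by `generators`."""
    p = np.full(counts.shape, 1.0 / counts.size)     # uniform starting point p0
    for _ in range(cycles):
        for a in generators:
            p = fit_margin(p, counts, a)
    return p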
Example 4.4. We illustrate the steps of the algorithm by a simple example:
Admitted?
Sex Yes No S-marginal
Male 1198 1493 2691
Female 557 1278 1835
A-marginal 1755 2771 4526

These data are concerned with student admissions from Berkeley (Bickel et al 1973)
and adapted by Edwards (2000). We consider the model with A ⊥⊥ S, correspond-
ing to A = {{A}, {S}}. We should then fit the A-marginal and the S-marginal. For
illustration we shall do so iteratively. The initial values are uniform:

Admitted?
Sex Yes No S-marginal
Male 1131.5 1131.5 2691
Female 1131.5 1131.5 1835
A-marginal 1755 2771 4526

Initially all entries are equal to 4526/4, giving the initial values of np0 . Next, we fit the
S-marginal:

Admitted?
Sex Yes No S-marginal
Male 1345.5 1345.5 2691 .
Female 917.5 917.5 1835
A-marginal 1755 2771 4526

We have calculated the entries as


1345.5 = 1131.5 · 2691 / (1131.5 + 1131.5)
and so on. Subsequently we fit the A-marginal:

Admitted
Sex Yes No S-marginal
Male 1043.46 1647.54 2691 .
Female 711.54 1123.46 1835
A-marginal 1755 2771 4526

For example
711.54 = 917.5 · 1755 / (917.5 + 1345.5)
and so on. The algorithm has now converged, so there is no need to use more steps.
If we wish, we can normalize to obtain probabilities. Dividing everything by 4526
yields p̂.

Admitted?
Sex Yes No S-marginal
Male 0.231 0.364 0.595
Female 0.157 0.248 0.405
A-marginal 0.388 0.612 1

In this example it is unnecessary to use the IPS algorithm as there is an explicit


formula. We shall later elaborate on that issue.
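The two scaling steps of this example are easily reproduced in a few lines of numpy; the counts are those of the table above and the printed values match the fitted tables and p̂.

import numpy as np

# Observed counts n(x): rows = Sex (Male, Female), columns = Admitted? (Yes, No).
counts = np.array([[1198.0, 1493.0],
                   [557.0, 1278.0]])
n = counts.sum()                                   # 4526

m = np.full((2, 2), n / 4)                         # n*p0: all entries 1131.5

# Fit the S-marginal (row sums), then the A-marginal (column sums).
m = m * (counts.sum(axis=1, keepdims=True) / m.sum(axis=1, keepdims=True))
print(m[0, 0])                                     # 1345.5
m = m * (counts.sum(axis=0, keepdims=True) / m.sum(axis=0, keepdims=True))
print(m.round(2))                                  # [[1043.46 1647.54] [ 711.54 1123.46]]

p_hat = m / n
print(p_hat.round(3))                              # [[0.231 0.364] [0.157 0.248]]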

IPS by probability propagation

The IPS-algorithm performs the scaling operations Ta :

p(x) ← p(x) n(xa ) / {np(xa )}, x ∈ X . (4.7)

This moves through all possible values of x ∈ X , which in general is a huge set,
making direct computation infeasible.
Jiroušek and Přeučil (1995) realized that the algorithm could be implemented
using probability propagation as follows: A chordal graph G with cliques C so that
for all a ∈ A , a are complete subsets of G is a chordal cover of A . The steps of the
efficient implementation are now:
1. Find chordal cover G of A ;
2. Arrange cliques C of G in a junction tree;
3. Represent p implicitly as

p(x) = ∏_{C∈C} ψC (xC ) / ∏_{S∈S} ψS (xS );

4. Replace the step (4.7) with

ψC (xC ) ← ψC (xC ) n(xa ) / {np(xa )}, xC ∈ XC ,

where a ⊆ C and p(xa ) is calculated by probability propagation.


Since the scaling only involves XC , this is feasible if maxC∈C |XC | is of a reasonable
size.

Closed form maximum likelihood

In some cases the IPS algorithm converges after a finite number of cycles. An ex-
plicit formula is then available for the MLE of p ∈ PA .

A generating class A is called decomposable if A = C (i.e. A is conformal)


and C are the cliques of a chordal graph G . It can be shown that the IPS-algorithm
converges after a finite number of cycles (at most two) if and only if A is decom-
posable.
Thus A = {{1, 2}, {2, 3}, {1, 3}} is the smallest non-conformal generating class,
demanding proper iteration. When A is decomposable the IPS-algorithm converges in
a finite number of steps, so there must be an explicit expression for calculating the
MLE in this case; it is given below.
Let S be the set of minimal separators of the chordal graph G . The MLE for p
under the log-linear model with generating class A = C (G ) is

p̂(x) = ∏_{C∈C} n(xC ) / {n ∏_{S∈S} n(xS )^{ν(S)}} (4.8)

where ν(S) is the number of times S appears as an intersection a ∩ b of neighbours


in a junction tree T with A as vertex set.
A simple inductive argument shows that p̂ given above indeed satisfies the like-
lihood equation (4.6) and hence this must be the MLE. Contrast this result with the
factorization of the probability function itself:

p(x) = ∏_{C∈C} p(xC ) / ∏_{S∈S} p(xS )^{ν(S)} .
For the specific case where G is a tree, (4.8) reduces to

p̂(x) = ∏_{e∈E} n(xe ) / {n ∏_{v∈V} n(xv )^{deg(v)−1}} = (1/n) ∏_{uv∈E} {n(xuv ) / (n(xu )n(xv ))} ∏_{v∈V} n(xv ), (4.9)

where we have used that the degree of a vertex is exactly equal to the number of
times this vertex occurs as an endpoint of an edge.

4.2 Gaussian Graphical Models

4.2.1 The multivariate Gaussian distribution

Definition and density

A d-dimensional random vector X = (X1 , . . . , Xd ) has a multivariate Gaussian dis-


tribution or normal distribution on R d if there is a vector ξ ∈ R d and a d × d matrix
Σ such that
λ > X ∼ N (λ > ξ , λ > Σ λ ) for all λ ∈ Rd . (4.10)
We then write X ∼ Nd (ξ , Σ ). Taking λ = ei or λ = ei + e j where ei is the unit vector
with i-th coordinate 1 and the remaining equal to zero yields:

Xi ∼ N (ξi , σii ), Cov(Xi , X j ) = σi j .

The definition (4.10) makes sense if and only if λ > Σ λ ≥ 0, i.e. if Σ is positive
semidefinite.
If Σ is positive definite, i.e. if λ > Σ λ > 0 for λ 6= 0, the multivariate distribution
has density w.r.t. Lebesgue measure on R d
f (x | ξ , Σ ) = (2π)^{−d/2} (det K)^{1/2} exp{−(x − ξ )⊤K(x − ξ )/2}, (4.11)

where K = Σ −1 is the concentration matrix of the distribution. We then also say that
Σ is regular.

Marginal and conditional distributions

Partition X into X1 and X2 , where X1 ∈ R r and X2 ∈ R s with r + s = d and partition


mean vector, concentration and covariance matrix accordingly as
    ( ξ1 )       ( K11  K12 )       ( Σ11  Σ12 )
ξ = (    ) , K = (          ) , Σ = (          )
    ( ξ2 )       ( K21  K22 )       ( Σ21  Σ22 )

so that Σ11 is r × r and so on. Then, if X ∼ Nd (ξ , Σ ) it holds that

X2 ∼ Ns (ξ2 , Σ22 )

and
X1 | X2 = x2 ∼ Nr (ξ1|2 , Σ1|2 ),
where
ξ1|2 = ξ1 + Σ12 Σ22⁻ (x2 − ξ2 ) and Σ1|2 = Σ11 − Σ12 Σ22⁻ Σ21 .

Here Σ22⁻ is an arbitrary generalized inverse of Σ22 , i.e. any matrix which
satisfies
Σ22 Σ22⁻ Σ22 = Σ22 .
In the regular case it also holds that
K11⁻¹ = Σ11 − Σ12 Σ22⁻¹ Σ21 (4.12)

and
K11⁻¹ K12 = −Σ12 Σ22⁻¹ , (4.13)
so then,
ξ1|2 = ξ1 − K11⁻¹ K12 (x2 − ξ2 ) and Σ1|2 = K11⁻¹ .
In particular, if Σ12 = 0, X1 and X2 are independent.
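These block relations are easy to verify numerically; the sketch below uses a randomly generated positive definite Σ (purely synthetic, for checking) and confirms (4.12), (4.13) and the two equivalent forms of the conditional mean.

import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 2                                     # X1 has dimension r, X2 dimension d - r
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                 # a regular (positive definite) Sigma
K = np.linalg.inv(Sigma)

S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
K11, K12 = K[:r, :r], K[:r, r:]

# (4.12):  K11^{-1} = Sigma_{1|2} = S11 - S12 S22^{-1} S21
Sigma_1given2 = S11 - S12 @ np.linalg.inv(S22) @ S21
assert np.allclose(np.linalg.inv(K11), Sigma_1given2)

# (4.13):  K11^{-1} K12 = -S12 S22^{-1}
assert np.allclose(np.linalg.inv(K11) @ K12, -S12 @ np.linalg.inv(S22))

# Hence the conditional mean can be written either way.
xi = rng.standard_normal(d)
x2 = rng.standard_normal(d - r)
mean_a = xi[:r] + S12 @ np.linalg.inv(S22) @ (x2 - xi[r:])
mean_b = xi[:r] - np.linalg.inv(K11) @ K12 @ (x2 - xi[r:])
assert np.allclose(mean_a, mean_b)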

Factorization of the multivariate Gaussian

Consider a multivariate Gaussian random vector X ∼ NV (ξ , Σ ) with Σ regular so it
has density

f (x | ξ , Σ ) = (2π)^{−|V |/2} (det K)^{1/2} exp{−(x − ξ )⊤K(x − ξ )/2},

where K = Σ −1 is the concentration matrix of the distribution. Thus the Gaussian


density factorizes w.r.t. G if and only if

α 6∼ β ⇒ kαβ = 0

i.e. if the concentration matrix has zero entries for non-adjacent vertices.

Gaussian likelihood functions

Consider ξ = 0 and a sample X^1 = x^1 , . . . , X^n = x^n from Nd (0, Σ ) with Σ regular. Using
(4.11), we get the likelihood function

L(K) = (2π)^{−nd/2} (det K)^{n/2} exp{−∑_{ν=1}^{n} (x^ν )⊤K x^ν /2}
     ∝ (det K)^{n/2} exp[−∑_{ν=1}^{n} tr{K x^ν (x^ν )⊤}/2]
     = (det K)^{n/2} exp[−tr{K ∑_{ν=1}^{n} x^ν (x^ν )⊤}/2]
     = (det K)^{n/2} exp{−tr(KW )/2}, (4.14)

where
W = ∑_{ν=1}^{n} x^ν (x^ν )⊤

is the matrix of sums of squares and products.

4.2.2 The Wishart distribution

The Wishart distribution is the sampling distribution of the matrix of sums of squares
and products. More precisely, a random d × d matrix W has a d-dimensional Wishart
distribution with parameter Σ and n degrees of freedom if W has the same distribution as

∑_{ν=1}^{n} X^ν (X^ν )⊤

where X^ν ∼ Nd (0, Σ ) independently. We then write

W ∼ Wd (n, Σ ).

The Wishart distribution is the multivariate analogue of the χ 2 :

W1 (n, σ 2 ) = σ 2 χ 2 (n).

Basic properties of the Wishart distribution

If W ∼ Wd (n, Σ ) its mean is E(W ) = nΣ . If W1 and W2 are independent with Wi ∼


Wd (ni , Σ ), then
W1 +W2 ∼ Wd (n1 + n2 , Σ ).
If A is an r × d matrix and W ∼ Wd (n, Σ ), then

AWA> ∼ Wr (n, AΣ A> ).

For r = 1 we get that when W ∼ Wd (n, Σ ) and λ ∈ Rd ,

λ >W λ ∼ σλ2 χ 2 (n),

where σλ2 = λ > Σ λ .

Wishart density

If W ∼ Wd (n, Σ ), where Σ is regular, then W is regular with probability one if and


only if n ≥ d. When n ≥ d the Wishart distribution has density

fd (w | n, Σ ) = c(d, n)^{−1} (det Σ )^{−n/2} (det w)^{(n−d−1)/2} exp{−tr(Σ^{−1} w)/2}

w.r.t. Lebesgue measure on the set of positive definite matrices. The Wishart con-
stant c(d, n) is

c(d, n) = 2^{nd/2} π^{d(d−1)/4} ∏_{i=1}^{d} Γ {(n + 1 − i)/2}.

4.2.3 Gaussian graphical models

Conditional independence in the multivariate Gaussian distribution

Consider X = (X1 , . . . , XV ) ∼ N|V | (0, Σ ) with Σ regular and K = Σ −1 . The concen-


tration matrix of the conditional distribution of (Xα , Xβ ) given XV \{α,β } is
            ( kαα   kαβ )
K{α,β } =   (           ) .
            ( kβ α  kβ β )

Hence
α ⊥⊥ β |V \ {α, β } ⇐⇒ kαβ = 0.
Thus the dependence graph G (K) of a regular Gaussian distribution is given by

α 6∼ β ⇐⇒ kαβ = 0.

Graphical models

S (G ) denotes the symmetric matrices A with aαβ = 0 unless α = β or α ∼ β , and S + (G )
their positive definite elements.
A Gaussian graphical model for X specifies X as multivariate normal with K ∈
S + (G ) and otherwise unknown. Note that the density then factorizes as

log f (x) = constant − (1/2) ∑_{α∈V} kαα xα² − ∑_{{α,β }∈E} kαβ xα xβ ,

hence no interaction terms involve more than pairs. This is different from the dis-
crete case and generally makes things easier.

Likelihood function

The likelihood function based on a sample of size n is

L(K) ∝ (det K)n/2 e− tr(KW )/2 ,

where W is the Wishart matrix of sums of squares and products, W ∼ W|V | (n, Σ )
with Σ −1 = K ∈ S + (G ). For any matrix A we let A(G ) = {a(G )αβ } where
a(G )αβ = aαβ if α = β or α ∼ β , and a(G )αβ = 0 otherwise.

Then, as K ∈ S (G ) it holds for any A that

tr(KA) = tr{KA(G )}. (4.15)

Using this fact for A = W we can identify the family as a (regular and canonical)
exponential family with elements of W (G ) as canonical sufficient statistics and the
maximum likelihood estimate is therefore given as the unique solution to the system
of likelihood equations

E{W (G )} = nΣ (G ) = w(G )obs .

Alternatively we can write the equations as

nσ̂vv = wvv , nσ̂αβ = wαβ , v ∈ V, {α, β } ∈ E,

with the model restriction Σ −1 ∈ S + (G ). This ‘fits variances and covariances along
nodes and edges in G ’ so we can write the equations as

nΣ̂cc = wcc for all cliques c ∈ C (G ),

hence making the equations analogous to the discrete case. From (4.15) it follows
that for K̂ we have

tr{K̂W } = tr{K̂W (G )} = tr{K̂nΣ̂ (G )} = n tr{K̂ Σ̂ } = nd

so that the maximized likelihood function becomes

L(K̂) = (2π)−nd/2 (det K̂)n/2 e−n/2 ∝ (det K̂)n/2 . (4.16)

Iterative Proportional Scaling

For K ∈ S + (G ) and c ∈ C , define the operation of ‘adjusting the c-marginal’ as


follows. Let a = V \ c and
         ( n(wcc )⁻¹ + Kca (Kaa )⁻¹ Kac    Kca )
Tc K =   (                                     ) .    (4.17)
         ( Kac                             Kaa )

This operation is clearly well defined if wcc is positive definite. Exploiting that it
holds in general that
(K ⁻¹)cc = Σcc = {Kcc − Kca (Kaa )⁻¹ Kac }⁻¹ ,

we find the covariance Σ̃cc corresponding to the adjusted concentration matrix as

Σ̃cc = {(Tc K)⁻¹}cc = {n(wcc )⁻¹ + Kca (Kaa )⁻¹ Kac − Kca (Kaa )⁻¹ Kac }⁻¹ = wcc /n,

hence Tc K does indeed fit the marginals.


From (4.17) it is seen that the pattern of zeros in K is preserved under the oper-
ation Tc , and it can also be seen to stay positive definite. In fact, Tc scales propor-
tionally in the sense that
f {x | (Tc K)⁻¹} = f (x | K ⁻¹) · f (xc | wcc /n) / f (xc | Σcc ) .

This clearly demonstrates the analogy to the discrete case.


Next we choose any ordering (c1 , . . . , ck ) of the cliques in G . Choose further
K0 = I and define for r = 0, 1, . . .

Kr+1 = (Tc1 · · · Tck )Kr .

The following now holds


Proposition 4.3. Consider a sample from a covariance selection model with graph
G . Then
K̂ = lim Kr ,
r→∞

provided the maximum likelihood estimate K̂ of K exists.

Proof. This is Theorem 5.4 of Lauritzen (1996).
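A small numpy sketch of the adjustment Tc in (4.17) and the resulting iteration is given below; the graph, the seed and the data are illustrative choices (the 4-cycle 1 ∼ 2 ∼ 3 ∼ 4 ∼ 1 and a synthetic W), and the assertions check that the limit fits nΣ̂cc = wcc on the cliques while the non-edge concentrations stay at zero.

import numpy as np

def T_c(K, W, n, c):
    """One step (4.17): adjust the c-marginal of the concentration matrix K."""
    d = K.shape[0]
    a = [i for i in range(d) if i not in c]
    Knew = K.copy()
    Kca, Kaa, Kac = K[np.ix_(c, a)], K[np.ix_(a, a)], K[np.ix_(a, c)]
    wcc = W[np.ix_(c, c)]
    Knew[np.ix_(c, c)] = n * np.linalg.inv(wcc) + Kca @ np.linalg.inv(Kaa) @ Kac
    return Knew

rng = np.random.default_rng(1)
n, d = 20, 4
X = rng.standard_normal((n, d))
W = X.T @ X                                    # synthetic sums of squares and products

cliques = [[0, 1], [1, 2], [2, 3], [0, 3]]     # the edges (= cliques) of the 4-cycle
K = np.eye(d)
for _ in range(1000):
    for c in cliques:
        K = T_c(K, W, n, c)

Sigma_hat = np.linalg.inv(K)
for c in cliques:                              # fitted clique (edge) covariances
    assert np.allclose(n * Sigma_hat[np.ix_(c, c)], W[np.ix_(c, c)], atol=1e-6)
assert abs(K[0, 2]) < 1e-8 and abs(K[1, 3]) < 1e-8   # non-edges stay at zero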

The problem of existence of the MLE is non-trivial:


(i) If n < sup_{a∈A} |a| the MLE does not exist.
(ii) If n > sup_{C∈C} |C| − 1, where C are the cliques of a chordal cover of A , the
MLE exists with probability one.
The quantity τ(G ), the smallest possible value of the right-hand side sup_{C∈C (G ∗ )} |C| − 1
in (ii) over chordal covers G ∗ , is known as the tree-width of the graph G . Calculation of the
tree-width is NP-complete, but for any fixed k it can be decided in linear time
whether τ(G ) ≤ k.
For n between these values the general situation is unclear. For the k-cycle it
holds (Buhl 1993) that for n = 2,

P{MLE exists | Σ = I} = 1 − 2/(k − 1)! ,

whereas for n = 1 the MLE does not exist and for n ≥ 3 the MLE exists with prob-
ability one, as a k-cycle has tree-width 2.

Chordal graphs

If the graph G is chordal, we say that the graphical model is decomposable. We then
have the familiar factorization of densities

f (x | Σ ) = ∏_{C∈C} f (xC | ΣC ) / ∏_{S∈S} f (xS | ΣS )^{ν(S)} (4.18)

where ν(S) is the number of times S appears as an intersection between neighbour-


ing cliques of a junction tree for C .

Relations for trace and determinant

Using the factorization (4.18) we can match the expressions for the trace and deter-
minant to obtain that for a chordal graph G it holds that

tr(KW ) = ∑_{C∈C} tr(KC WC ) − ∑_{S∈S} ν(S) tr(KS WS )

and further

det Σ = {det(K)}⁻¹ = ∏_{C∈C} det{(K ⁻¹)C } / ∏_{S∈S} [det{(K ⁻¹)S }]^{ν(S)}
                   = ∏_{C∈C} det(ΣC ) / ∏_{S∈S} {det(ΣS )}^{ν(S)} .
If we let K = W = I in the first of these equations we obtain the identity

|V | = ∑_{C∈C} |C| − ∑_{S∈S} ν(S)|S|,

which is also a special case of (3.2).

Maximum likelihood estimates

For a |d| × |e| matrix A = {aγ µ }_{γ∈d,µ∈e} we let [A]^V denote the matrix obtained from
A by filling up with zero entries to obtain full dimension |V | × |V |, i.e.

([A]^V )_{γ µ} = aγ µ if γ ∈ d, µ ∈ e, and ([A]^V )_{γ µ} = 0 otherwise.

For a chordal graph it holds that the maximum likelihood estimate exists if and
only if n ≥ |C| for all C ∈ C . In that case, as in the discrete setting, the IPS-algorithm
converges in a finite number of steps.
The following simple formula then holds for the maximum likelihood estimate
of K:

K̂ = n { ∑_{C∈C} [(wC )⁻¹]^V − ∑_{S∈S} ν(S) [(wS )⁻¹]^V } (4.19)

and the determinant of the MLE is

det(K̂) = n^d ∏_{S∈S} {det(wS )}^{ν(S)} / ∏_{C∈C} det(wC ) . (4.20)
Note that setting W = I in the first identity yields another variant of (3.2):

1 = ∑_{C∈C} χC − ∑_{S∈S} ν(S)χS , (4.21)

where χA is the indicator function for the set A.
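For a concrete check of (4.19), the sketch below uses the chain 1 ∼ 2 ∼ 3 (cliques {1, 2} and {2, 3}, separator {2} with ν = 1) and a synthetic W; it verifies that the resulting Σ̂ = K̂⁻¹ fits the clique marginals and that the non-adjacent pair has zero concentration.

import numpy as np

def pad(M, idx, d):
    """[M]^V: embed the |idx| x |idx| matrix M into a d x d matrix of zeros."""
    out = np.zeros((d, d))
    out[np.ix_(idx, idx)] = M
    return out

rng = np.random.default_rng(2)
n, d = 30, 3
X = rng.standard_normal((n, d))
W = X.T @ X                                   # synthetic sums of squares and products

cliques = [[0, 1], [1, 2]]                    # the chain 1 - 2 - 3
separators = [([1], 1)]                       # separator {2} with multiplicity nu = 1

# Equation (4.19)
K_hat = n * (sum(pad(np.linalg.inv(W[np.ix_(c, c)]), c, d) for c in cliques)
             - sum(nu * pad(np.linalg.inv(W[np.ix_(s, s)]), s, d) for s, nu in separators))

Sigma_hat = np.linalg.inv(K_hat)
for c in cliques:                             # likelihood equations on the cliques
    assert np.allclose(n * Sigma_hat[np.ix_(c, c)], W[np.ix_(c, c)])
assert abs(K_hat[0, 2]) < 1e-10               # vertices 1 and 3 are not adjacent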

4.3 Summary

A brief summary of the contents of this chapter is given below.

Log–linear models

A density f factorizes w.r.t. a set A of subsets of V if

f (x) = ∏_{a∈A} ψa (x).

The set of distributions PA which factorize w.r.t. A is the hierarchical


log–linear model with generating class A .

Dependence graph

The dependence graph G (P) for a family of distributions P is the smallest graph
G so that
α ⊥⊥P β |V \ {α, β } for all P ∈ P.
The dependence graph of a log-linear model PA is determined by

α ∼ β ⇐⇒ ∃a ∈ A : α, β ∈ a.

Distributions in PA factorize according to G (A ) and are all global, local, and


pairwise Markov w.r.t. G (A ).

Conformal log-linear model

The set C (G ) of cliques of G is a generating class for the log–linear model of dis-
tributions which factorize w.r.t. G . If the dependence graph completely summarizes
the restrictions imposed by A , i.e. if A = C (G (A )), A is conformal.

Likelihood equations

For any generating class A it holds that the maximum likelihood estimate p̂ of p is
the unique element of PA which satisfies the system of equations

n p̂(xa ) = n(xa ), ∀a ∈ A , xa ∈ Xa .

The equations are solved by Iterative Proportional Scaling. For a ∈ A we let

(Ta p)(x) ← p(x) n(xa ) / {np(xa )}, x ∈ X .

and define S by
Sp = Tak · · · Ta2 Ta1 p.
Let p0 (x) ← 1/|X |, pn = Spn−1 , n = 1, . . . . It then holds that limn→∞ pn = p̂
where p̂ is the unique maximum likelihood estimate of p ∈ PA .

Closed form maximum likelihood

The generating class A is decomposable if A = C where C are the cliques of a


chordal graph.
The IPS-algorithm converges after a finite number of cycles (at most two) if
and only if A is decomposable.
The MLE for p under the decomposable log-linear model A = C (G ) is

p̂(x) = ∏_{C∈C} n(xC ) / {n ∏_{S∈S} n(xS )^{ν(S)}},

where ν(S) is the usual multiplicity of a separator.

Gaussian graphical models

The likelihood function based on a sample of size n is

L(K) ∝ (det K)n/2 e− tr(KW )/2 ,

where W is the Wishart matrix of sums of squares and products, W ∼ W|V | (n, Σ )
with Σ −1 = K ∈ S + (G ), where S + (G ) are the positive definite matrices with
α 6∼ β ⇒ kαβ = 0.
The MLE of K̂ is the unique element of S + (G ) satisfying

nΣ̂cc = wcc for all cliques c ∈ C (G ).

These equations are also solved by Iterative Proportional Scaling: For K ∈ S + (G )


and c ∈ C , let

         ( n(wcc )⁻¹ + Kca (Kaa )⁻¹ Kac    Kca )
Tc K =   (                                     ) .
         ( Kac                             Kaa )
Next choose an ordering (c1 , . . . , ck ) of the cliques in G , let K0 = I and define for
r = 0, 1, . . .
Kr+1 = (Tc1 · · · Tck )Kr .

It then holds that K̂ = limr→∞ Kr , provided the maximum likelihood estimate K̂ of K


exists.
If the graph G is chordal, we say that the graphical model is decomposable. In
this case, the IPS-algorithm converges in at most two cycles, as in the discrete case.
The MLE for decomposable models is given as
K̂ = n { ∑_{C∈C} [(wC )⁻¹]^V − ∑_{S∈S} ν(S) [(wS )⁻¹]^V } .
Chapter 5
Further Statistical Theory

5.1 Hyper Markov Laws

Special Wishart distributions

The formula for the maximum likelihood estimate (4.19) derived in the previous
chapter specifies Σ̂ as a random matrix. As we shall see, the sampling distribution
of this random Wishart-type matrix is partly reflecting Markov properties of the
graph G . Before we delve further into this, we shall need some more terminology.

Laws and distributions

Families of distributions may not always be simply parameterized, or we may want


to describe the families without specific reference to a parametrization. Generally
we think of a family of the form

P = {Pθ , θ ∈ Θ }

and sometimes identify P with Θ which is justified when the parametrization

θ → Pθ

is one-to-one and onto. For example, in a Gaussian graphical model θ = K ∈


S + (G ) is uniquely identifying any regular Gaussian distribution satisfying the
Markov properties w.r.t. G .
Parametrization of a hierarchical log-linear model when P = PA is more sub-
tle, and specific choices must be made to ensure a one-to-one correspondence be-
tween parameters and distributions; we omit the details here.
In any case, any probability measure on P (or on Θ ) represents a random el-
ement of P, i.e. a random distribution. The sampling distribution of a maximum


likelihood estimate such as p̂ is an example of such a measure, as are Bayesian prior


distributions on Θ (or P).
In the following we generally refer to a probability measure on P as a law,
whereas a distribution is used to signify a probability measure on X . Thus we shall
e.g. speak of the Wishart law as we emphasize that it specifies a distribution of
f (· | Σ ), by considering Σ to be random.

Hyper Markov laws

We identify θ ∈ Θ with Pθ ∈ P, so e.g. θA for A ⊆ V denotes the distribution of XA


under Pθ and θA | B the family of conditional distributions of XA given XB , etc. For a
law L on Θ we write

A ⊥⊥L B | S ⇐⇒ θA∪S ⊥⊥L θB∪S | θS .

A law L on Θ is said to be hyper Markov w.r.t. G if


(i) All θ ∈ Θ are globally Markov w.r.t. G ;
(ii) A ⊥⊥L B | S whenever S is complete and A ⊥G B | S.
Note that the conditional independence in (ii) is only required to hold for graph
decompositions as S is assumed to be complete. This implies in particular that a
law can be hyper Markov on G without being hyper Markov on G ∗ = (V, E ∗ ) even
if this has more edges than G , i.e. E ∗ ⊇ E. This is because G ∗ may have more
complete subsets than G and hence more independence statements are required to
be true. The complication is a consequence of the fact that we have deviated from
Dawid and Lauritzen (1993) and defined hyper Markov laws for graphs that are not
necessarily chordal; see Corollary 5.2 below.
If θ follows a hyper Markov law for the graph in Fig. 3.3 it holds for example
that
θ1235 ⊥⊥ θ24567 | θ25 .
We shall later show that this is true for θ̂ = p̂ and also for Σ̂ in the graphical model
with this graph, i.e. if W ∼ W7 (n, Σ ) with Σ −1 = K ∈ S + (G ) then it holds for
the maximum likelihood estimate

Σ̂ = (1/n) { ∑_{C∈C} [(wC )⁻¹]^V − ∑_{S∈S} ν(S) [(wS )⁻¹]^V }⁻¹

that, for example,
Σ̂1235 ⊥⊥ Σ̂24567 | Σ̂25 .

Consequences of the hyper Markov property

If A ⊥⊥L B | S we may further deduce that θA ⊥⊥L θB | θS , since θA and θB are


functions of θA∪S and θB∪S respectively. But the converse is false. The relation
θA ⊥⊥L θB | θS does not imply θA∪S ⊥⊥L θB∪S | θS , since θA∪S is not in general a
function of (θA , θS ). In contrast, XA∪B is a (one-to-one) function of (XA , XB ). How-
ever since θA | S and θB | S are functions of (θA∪S , θB∪S ), it generally holds that

A ⊥⊥L B | S ⇐⇒ θA | S ⊥⊥L θB | S | θS . (5.1)

Under some circumstances it is of interest to consider the notion of a strong hyper


Markov law, demanding complete mutual independence of conditional and marginal
distributions:
θA | S ⊥⊥L θB | S ⊥⊥L θS
whenever S is complete and separates A from B. This is clearly stronger than (5.1).
The notion is of particular importance for Bayesian analysis of graphical models
with chordal graphs.

Example 5.1. This little example is a special case where we can directly demonstrate
the hyper Markov property of the law of the maximum likelihood estimate. Consider
the conditional independence model with graph
I — J — K.
Here the MLE based on data X (n) = (X 1 , . . . , X n ) is

p̂i jk = Ni j+ N+ jk / (n N+ j+ )

and
p̂i j+ = Ni j+ /n, p̂+ jk = N+ jk /n, p̂+ j+ = N+ j+ /n.
Clearly, it holds that p̂ is Markov on G and
{Ni j+ } ⊥⊥ {N+ jk } | {X_J^{(n)}}.

But since e.g.

P({Ni j+ = ni j+ } | {X_J^{(n)}}) = ∏_j ( n+ j+ ! / ∏_i ni j+ ! ) ∏_i p_{i | j}^{ni j+} ,

where p_{i | j} = pi j+ /p+ j+ , we have

{Ni j+ } ⊥⊥ {X_J^{(n)}} | {N+ j+ }

and hence
{Ni j+ } ⊥⊥ {N+ jk } | {N+ j+ },

which yields the hyper Markov property of p̂. The law does not satisfy the strong
hyper Markov property as the range of, say, {Ni j+ } is constrained by the value of
{N+ j+ }.

Chordal graphs

For chordal graphs the hyper Markov property behaves much like the ordinary Markov
property. For example, it is true for chordal graphs that the hyper Markov property is
preserved when (chordal) supergraphs are formed.
Proposition 5.1. If G = (V, E) and G ∗ = (V, E ∗ ) are both chordal graphs and E ⊆
E ∗ , then any hyper Markov law L over G is hyper Markov over G ∗ .

Proof. This result is Theorem 3.10 of Dawid and Lauritzen (1993) but we shall give
a direct argument here. Firstly, as any Markov distribution over G is Markov over
the supergraph G ∗ , we only have to show the second condition for the law to be
hyper Markov.
Lemma 3.2 implies that it is sufficient to consider the case where E and E ∗ differ
by a single edge with endpoints {α, β } then contained in a single clique C∗ of G ∗
according to Lemma 3.1. The clique C∗ is the only complete separator in G ∗ which
is not a complete separator in G . So we have to show that for any hyper Markov law
L on G it holds that

A ⊥G ∗ B |C∗ ⇒ θA |C∗ ⊥⊥ θB |C∗ | θC∗ . (5.2)

We let C = C∗ \ {α, β } and realize that we must have α ⊥G β |C since G and
G ∗ are chordal and any path in G from α to β circumventing C would create a cycle in
G or in G ∗ . Let Aα be the vertices in A which are not separated from α by α ∪ C,
Aᾱ = A \ Aα , and similarly with Bα , Bᾱ . The same argument implies the separations

Aα ⊥G (Aᾱ ∪ B ∪ β ) | α ∪C,
Aᾱ ⊥G (Aα ∪ B ∪ α) | β ∪C
Bα ⊥G (Bᾱ ∪ A ∪ β ) | α ∪C
Bᾱ ⊥G (Bα ∪ A ∪ α) | β ∪C

In summary this means that the entire joint distribution θ can be represented as

θ = θC θα|C θβ |C θAα |α∪C θBα |α∪C θAᾱ |β ∪C θBᾱ |β ∪C

and also that its constituents satisfy the Markov property w.r.t. the graph in Fig. 5.1.
Using this Markov property in combination with the fact that

θA|C∗ = θAα |α∪C θAᾱ |β ∪C , θB|C∗ = θBα |α∪C θBᾱ | β ∪C , θC∗ = θα|C θβ |C θC ,

yields (5.2) and the proof is complete.


Fig. 5.1 The Markov structure of the joint law of the constituents of θ .

A consequence of this result is the following corollary, stating that for chordal
graphs it is not necessary to demand that S is a complete separator to obtain the
relevant conditional independence.
Proposition 5.2. If G is chordal and θ is hyper Markov on G , it holds that

A ⊥G B | S ⇒ A ⊥⊥L B | S.

Proof. Again, this is Theorem 2.8 of Dawid and Lauritzen (1993). It follows by
forming the graph G [S] connecting all pairs of vertices in S and connecting any other
pair α, β if and only if ¬(α ⊥G β | S). Then G [S] is a chordal graph with G [S] ≥ G
so that A ⊥G [S] B | S, and Proposition 5.1 applies.

If G is not chordal, we can form a chordal cover G ∗ by completing all prime


components of G . Then if θ is hyper Markov on G , it is also hyper Markov on G ∗
and thus
A ⊥G ∗ B | S ⇒ A ⊥⊥L B | S.
But the similar result would be false for an arbitrary chordal cover of G . The hyper
Markov property thus has a simple formulation in terms of junction trees: Arrange
the prime components Q of G in a junction tree T with complete separators S and
consider the extended junction tree T ∗ which is the (bipartite) tree with Q ∪ S as
vertices and edges from separators to prime components so that C ∼ S ∼ D in T ∗ if
and only if C ∼ D in T . Next, associate θA to A for each A ∈ Q ∪ S . It now holds
that

A ⊥T ∗ B | S ⇐⇒ A ⊥G ∗ B | S ⇐⇒ ∃S∗ ⊆ S : A ⊥G B | S∗ with S∗ complete,

implying that L is hyper Markov on G if and only if {θA , A ∈ Q ∪ S } is globally


Markov w.r.t. the extended junction tree T ∗ .

Directed hyper Markov property

We have similar notions and results in the directed case. Say that L = L (θ ) is
directed hyper Markov w.r.t. a DAG D if θ is directed Markov on D for all θ ∈ Θ
and
θv∪pa(v) ⊥⊥L θnd(v) | θpa(v) ,
or equivalently θv | pa(v) ⊥⊥L θnd(v) | θpa(v) , or equivalently for a well-ordering

θv∪pa(v) ⊥⊥L θpr(v) | θpa(v) .

It clearly holds that if v∗ is a terminal vertex in V and L is directed hyper Markov


over D, then LV \{v∗ } is directed hyper Markov over DV \{v∗ } . Repeated use of this
fact yields that if L is directed hyper Markov over D and A is an ancestral set, then
LA is directed hyper Markov over DA .
Indeed, if D is perfect, L is directed hyper Markov w.r.t. D if and only if L is
hyper Markov w.r.t. G = σ (D) = D m .

5.2 Meta Markov Models

Meta independence

Stochastic independence and conditional independence of parameters of marginal


and conditional distributions can only occur when the associated parameters are
variation independent. In the following we formalize such relationships among pa-
rameters of graphical models. We shall for A, B ⊆ V identify

θA∪B = (θB | A , θA ) = (θA | B , θB ),

i.e. any joint distribution of XA∪B is identified with a pair of further marginal and

conditional distributions. Define for S ⊆ V and θ ∗ ∈ Θ the S-section Θ^{θS∗} of Θ as

Θ^{θS∗} = {θ ∈ Θ : θS = θS∗ }.

The meta independence relation ‡P is defined as

A ‡P B | S ⇐⇒ ∀θS∗ ∈ ΘS : Θ^{θS∗} = Θ^{θS∗}_{A | S} × Θ^{θS∗}_{B | S} ,

In words, A and B are meta independent w.r.t. P given S, if the pair of conditional
distributions (θA | S , θB | S ) vary in a product space when θS is fixed. Equivalently,
fixing the values of θB | S and θS places the same restriction on θA | S as just fixing θS .
The relation ‡P satisfies the semigraphoid axioms as it is a special instance of
variation independence.

Meta Markov models

We say that a model determined by a family of distributions P, or its parametriza-


tion Θ , is meta Markov w.r.t. G if
(i) All θ ∈ Θ are globally Markov w.r.t. G ;
(ii) A ⊥G B | S ⇒ A ‡P B | S whenever S is complete.
Thus, a Markov model is meta Markov if and only if

A ⊥G ∗ B | S ⇒ A ‡P B | S,

where G ∗ is obtained from G by completing all prime components. Note that if G


is chordal, we have G ∗ = G and hence it holds for any meta Markov model P that

A ⊥G B | S ⇒ A ‡P B | S.

Hyper Markov laws and meta Markov models

Note that for any triple (A, B, S) and any law L on Θ it holds that

A ⊥⊥L B | S ⇒ A ‡P B | S

for if θA | S ⊥⊥L θB | S | θS it must in particular be true that (θA | S , θB | S ) vary in a prod-


uct space for every fixed value of θS . Thus hyper Markov laws live on meta Markov
models: If a law L on Θ is hyper Markov w.r.t. G , Θ is meta Markov w.r.t. G .
In particular, if a Markov model is not meta Markov, it cannot carry a hyper
Markov law without further restricting to Θ0 ⊂ Θ .
A Gaussian graphical model with graph G is meta Markov on G . This follows
for example from results on collapsibility of Gaussian graphical models in Fryden-
berg (1990b), who shows that in such a model, the conditional distribution θV \C|C is
variation independent of the marginal distribution θC if and only if the boundary of
every connected component of V \C is complete, which trivially holds when C itself
is complete.

Log-linear meta Markov models

Using results on collapsibility of log-linear models (Asmussen and Edwards 1983),


it follows that a log-linear model PA is meta Markov on its dependence graph
G (A ) if and only if for any minimal complete separator S of G (A ) there is
an a ∈ A with S ⊆ a. In particular, if A is conformal, PA is meta Markov.
Example 5.2. The log-linear model with generating class

A = {ab, ac, ad, bc, bd, be, cd, ce, de}



has dependence graph with cliques C = {abcd, bcde}, displayed in Fig. 5.2. Since
the complete separator bcd is not in A , this model is not meta Markov.

Fig. 5.2 Dependence and factor graph of the generating class A in Example 5.2.

Example 5.3. The model with generating class

A ′ = {ab, ac, ad, bcd, be, ce, de}

has the same dependence graph G (A ′) = G (A ) but even though A ′ is not confor-
mal, PA ′ is meta Markov on G (A ′).

Example 5.4. The model with generating class

A ′′ = {ab, ac, bc, bd, cd, ce, de}

has a different dependence graph G (A ′′), see Fig. 5.4. The separator bcd is not in
A ′′, but PA ′′ is meta Markov on G (A ′′), as both minimal separators bc and cd are
in A ′′.

Fig. 5.3 Dependence and factor graph of the generating class A ′ in Example 5.3.

Fig. 5.4 Factor graph of the generating class A ′′ in Example 5.4. The dependence graph looks
identical to the factor graph when edge labels are removed.

Meta Markov properties on supergraphs

If θ is globally Markov w.r.t. the graph G , it is also Markov w.r.t. any super graph
G ′ = (V, E ′) with E ⊆ E ′.
The similar fact is not true for meta Markov models. For example, the Gaussian
graphical model for the 4-cycle G with adjacencies 1 ∼ 2 ∼ 3 ∼ 4 ∼ 1, is meta
Markov on G , because it has no complete separators.
But the same model is not meta Markov w.r.t. the larger graph G ′ with cliques
{124, 234}, since for any K ∈ S + (G ),

σ24 = σ12 σ14 /σ11 + σ13 σ34 /σ33 .
So fixing the value of σ24 restricts the remaining parameters in a complex way.

Maximum likelihood in meta Markov models

Under certain conditions, the MLE θ̂ of the unknown distribution θ will follow a
hyper Markov law over Θ under Pθ . These are
(i) Θ is meta Markov w.r.t. G ;
(ii) For any prime component Q of G , the MLE θ̂Q for θQ based on X_Q^{(n)} is suffi-
cient for ΘQ and boundedly complete.
A sufficient condition for (ii) is that ΘQ is a full and regular exponential family in
the sense of Barndorff-Nielsen (1978). In particular, these conditions are satisfied
for any Gaussian graphical model and any meta Markov log-linear model.

Canonical construction of hyper Markov laws

The distributions of maximum likelihood estimators are important examples of hy-


per Markov laws. But for chordal graphs there is a canonical construction of such
laws.
Let C be the cliques of a chordal graph G and let LC ,C ∈ C be a family of laws
over ΘC ⊆ P(XC ). The family of laws is hyperconsistent if for any C and D with
C ∩ D = S ≠ ∅, LC and LD induce the same law for θS .
If LC ,C ∈ C are hyperconsistent, there is a unique hyper Markov law L over G
with L (θC ) = LC ,C ∈ C .

Strong hyper and meta Markov properties

In some cases it is of interest to consider a stronger version of the hyper and meta
Markov properties.
A meta Markov model is strongly meta Markov if θA | S ‡P θS for all complete
separators S.
Similarly, a hyper Markov law is strongly hyper Markov if θA | S ⊥⊥L θS for all
complete separators S.
A directed hyper Markov law is strongly directed hyper Markov if θv | pa(v) ⊥⊥L θpa(v)
for all v ∈ V .
Gaussian graphical models and log-linear meta Markov models are strong meta
Markov models.

5.2.1 Bayesian inference

Parameter θ ∈ Θ , data X = x, likelihood


L(θ | x) ∝ p(x | θ ) = dPθ (x)/dµ(x).

Express knowledge about θ through a prior π on θ . Use also π to denote density of


prior w.r.t. some measure ν on Θ .
Inference about θ from x is then represented through posterior distribution
π ∗ (θ ) = p(θ | x). Then, from Bayes’ formula

π ∗ (θ ) = p(x | θ )π(θ )/p(x) ∝ L(θ | x)π(θ )

so the likelihood function is equal to the density of the posterior w.r.t. the prior
modulo a constant.
Example 5.5 (Bernoulli experiments). Data X1 = x1 , . . . , Xn = xn independent and
Bernoulli distributed with parameter θ , i.e.

P(Xi = 1 | θ ) = 1 − P(Xi = 0) = θ .

Represent this as a directed acyclic graph with θ as the only parent of all nodes xi , i =


1, . . . , n. Use a beta prior:

π(θ | a, b) ∝ θ a−1 (1 − θ )b−1 .

If we let x = ∑ xi , we get the posterior:

π ∗ (θ ) ∝ θ x (1 − θ )n−x θ a−1 (1 − θ )b−1


= θ x+a−1 (1 − θ )n−x+b−1 .

So the posterior is also beta with parameters (a + x, b + n − x).
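This conjugate updating is immediate to mirror in code. The following is a minimal sketch (the function name is ours, not from any package), assuming the Beta(a, b) parametrisation used above.

```python
import numpy as np

def beta_bernoulli_update(a, b, x):
    """Posterior Beta parameters after observing 0/1 data x under a Beta(a, b) prior."""
    x = np.asarray(x)
    return a + int(x.sum()), b + x.size - int(x.sum())

# Prior Beta(1, 1) (uniform) and ten Bernoulli observations with seven ones:
a_post, b_post = beta_bernoulli_update(1, 1, [1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
print(a_post, b_post, a_post / (a_post + b_post))   # 8 4 and posterior mean 2/3
```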

Closure under sampling

A family P of laws on Θ is said to be closed under sampling from x if

π ∈ P ⇒ π ∗ ∈ P.

The family of beta laws is closed under Bernoulli sampling. If the family of priors
is parametrised:
P = {Pα , α ∈ A }
we sometimes say that α is a hyperparameter. Then, Bayesian inference can be
made by just updating hyperparameters. The terminology of hyperparameter breaks
down in more complex models, corresponding to large directed graphs, where all

parent variables can be seen as ‘parameters’ for their children. Thus the division
into three levels, with data, parameters, and hyperparameters is not helpful.
For a k-dimensional exponential family
p(x | θ ) = b(x) exp{θ ⊤ t(x) − ψ(θ )}

the standard conjugate family (Diaconis and Ylvisaker 1979) is


π(θ | a, κ) ∝ exp{θ ⊤ a − κψ(θ )}

for (a, κ) ∈ A ⊆ R k × R+ , where A is determined so that the normalisation con-


stant is finite. Posterior updating from (x1 , . . . , xn ) with t = ∑i t(xi ) is then made as
(a∗ , κ ∗ ) = (a + t, κ + n).
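As a small illustrative sketch (our notation; only the case of a one-dimensional sufficient statistic is shown), the updating rule (a∗ , κ ∗ ) = (a + t, κ + n) can be coded directly. For the Bernoulli family in natural parametrisation θ = log p/(1 − p), with t(x) = x and ψ(θ ) = log(1 + exp θ ), one can check that π(θ | a, κ) corresponds to a Beta(a, κ − a) law on p, so the update reproduces the beta updating of Example 5.5.

```python
def conjugate_update(a, kappa, t_values):
    """Standard conjugate updating (a, kappa) -> (a + sum of t(x_i), kappa + n), scalar case."""
    t_values = list(t_values)
    return a + sum(t_values), kappa + len(t_values)

# Bernoulli data of Example 5.5 with prior (a, kappa) = (1, 2), i.e. a Beta(1, 1) prior on p:
x = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
a_star, kappa_star = conjugate_update(1, 2, x)
print(a_star, kappa_star)   # (8, 12), i.e. a Beta(8, 4) posterior law on p, as before
```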

Closure under sampling of hyper Markov properties

The hyper Markov property is closed under sampling in wide generality. If L is
a prior law over Θ and X = x an observation from θ , let L ∗ = L (θ | X = x)
denote the posterior law over Θ . It then holds that if L is hyper Markov w.r.t. G , so
is L ∗ .
Further, if L is strongly hyper Markov w.r.t. G , so is L ∗ ; thus the strong
hyper Markov property is also preserved.
In the latter case, the update of L is even local to prime components, i.e.

L ∗ (θQ ) = LQ∗ (θQ ) = LQ (θQ | XQ = xQ )

and the marginal distribution p of X is globally Markov w.r.t. G , where


p(x) = ∫Θ P(X = x | θ ) L (dθ ).

FiXme Fatal: write more about the strong hyper Markov property, either
here or earlier

5.2.2 Hyper inverse Wishart and Dirichlet laws

Gaussian graphical models are canonical exponential families. The standard family
of conjugate priors has densities

π(K | Φ, δ ) ∝ (det K)^(δ/2) exp{−tr(KΦ)}, K ∈ S + (G ).

These laws are termed hyper inverse Wishart laws as Σ follows an inverse Wishart
law for complete graphs.
For chordal graphs, each marginal law LC of ΣC is inverse Wishart.

For any meta Markov model where Θ and ΘQ are full and regular exponential
families for all prime components Q, it follows directly from Barndorff-Nielsen
(1978), page 149, that the standard conjugate prior law is strongly hyper Markov
w.r.t. G .
This is in particular true for the hyper inverse Wishart laws.
The analogous prior distributions for log-linear meta Markov models are likewise
termed hyper Dirichlet laws.
They are also strongly hyper Markov and if G is chordal, each induced marginal
law LC is a standard Dirichlet law.

Conjugate prior laws are strong hyper Markov

If Θ is meta Markov and ΘQ are full and regular exponential families for all prime
components Q, the standard conjugate prior law is strongly hyper Markov w.r.t. G .
This is in particular true for the hyper inverse Wishart laws and the hyper Dirich-
let laws.
Thus, for the hyper inverse Wishart and hyper Dirichlet laws we have simple local updat-
ing based on conjugate priors for Bayesian inference.

5.3 Summary

Laws and distributions

A statistical model involves a family P of distributions, often parametrized as

P = {Pθ , θ ∈ Θ }.

We refer to a probability measure on P or Θ as a law, whereas a distribution is a


probability measure on X .

Hyper Markov Laws

A law L on Θ is hyper Markov w.r.t. G if


(i) All θ ∈ Θ are globally Markov w.r.t. G ;
(ii) A ⊥⊥L B | S whenever S is complete and A ⊥G B | S.

Directed hyper Markov property

L = L (θ ) is directed hyper Markov w.r.t. a DAG D if θ is directed Markov on D


for all θ ∈ Θ and

θv∪pa(v) ⊥⊥L θnd(v) | θpa(v) ,


or equivalently
θv | pa(v) ⊥⊥L θnd(v) | θpa(v) .
If D is perfect, L is directed hyper Markov w.r.t. D if and only if L is hyper
Markov w.r.t. G = σ (D) = D m .

Meta Markov models

For A, B ⊆ V identify

θA∪B = (θB | A , θA ) = (θA | B , θB ).

A and B are meta independent w.r.t. P given S, denoted A ‡P B | S, if the pair of


conditional distributions (θA | S , θB | S ) vary in a product space when θS is fixed.
The family P, or Θ , is meta Markov w.r.t. G if
(i) All θ ∈ Θ are globally Markov w.r.t. G ;
(ii) A ⊥G B | S ⇒ A ‡P B | S whenever S is complete.

Hyper Markov laws and meta Markov models

Hyper Markov laws live on meta Markov models.


A Gaussian graphical model with graph G is meta Markov on G .
A log-linear model PA is meta Markov on its dependence graph G (A ) if and
only if S ∈ A for any minimal complete separator S of G (A ).
In particular, if A is conformal, PA is meta Markov.

Maximum likelihood in meta Markov models

If the following conditions are satisfied:


(i) Θ is meta Markov w.r.t. G ;
(ii) For any prime component Q of G , ΘQ is a full and regular exponential family,
the MLE θ̂ of the unknown distribution θ will follow a hyper Markov law over Θ
under Pθ .
In particular, this holds for any Gaussian graphical model and any meta Markov
log-linear model.

Strong hyper and meta Markov properties

A hyper Markov law is strongly hyper Markov if θA | S ⊥⊥L θS for all com-
plete separators S.
A directed hyper Markov law is strongly directed hyper Markov if θv | pa(v) ⊥⊥L θpa(v)
for all v ∈ V .
A meta Markov model is strongly meta Markov if θA | S ‡P θS for all complete
separators S.
Gaussian graphical models and log-linear meta Markov models are strong meta
Markov models.

Closure under sampling of hyper Markov properties

If L is a prior law over Θ and X = x is an observation from θ , L ∗ = L (θ | X = x)


denotes the posterior law over Θ .
If L is hyper Markov w.r.t. G so is L ∗ .
If L is strongly hyper Markov w.r.t. G so is L ∗ .
In the latter case, the update of L is local to prime components, i.e.

L ∗ (θQ ) = LQ∗ (θQ ) = LQ (θQ | XQ = xQ )

and the marginal distribution p of X is globally Markov w.r.t. G , where


p(x) = ∫Θ P(X = x | θ ) L (dθ ).

Hyper inverse Wishart and Dirichlet laws

Gaussian graphical models are canonical exponential families. The standard family
of conjugate priors has densities

π(K | Φ, δ ) ∝ (det K)^(δ/2) exp{−tr(KΦ)}, K ∈ S + (G ).

These laws are termed hyper inverse Wishart laws as Σ follows an inverse Wishart
law for complete graphs. For chordal graphs, each marginal law LC , C ∈ C , of ΣC is
inverse Wishart.
The standard conjugate prior laws for log-linear meta Markov models are termed
hyper Dirichlet laws. If G is chordal, each induced marginal law LC ,C ∈ C is a
standard Dirichlet law.
Chapter 6
Estimation of Structure

6.1 Estimation of Structure and Bayes Factors

Previous chapters have considered the situation where the graph G defining the
model has been known and the inference problems were concerned with an un-
known Pθ with θ ∈ Θ . This chapter discusses inference concerning the graph G ,
specifying only a family Γ of possible graphs.
It is important to ensure that any methods used scale well with the size of the problem:
we typically need to consider many structures and also huge collections of high-
dimensional data.
What we here choose to term structure estimation is also known under other
names such as model selection (mainstream statistics), system identification (engineer-
ing), or structural learning (AI or machine learning). Different situations occur de-
pending on the type of assumptions concerning Γ . Common assumptions include
that Γ is the set of undirected graphs over V ; the set of chordal graphs over V ; the
set of forests over V ; the set of trees over V ; the set of directed acyclic graphs over
V ; or potentially other types of conditional independence structure.

Why estimation of structure?

It may be worthwhile to dwell somewhat on the rationale behind structure estima-


tion. We think of it as a method to get a quick overview of relations between a huge
set of variables in a complex stochastic system and see it in many ways as a parallel
to e.g. histograms or density estimation, which give a rough overview of the features
of univariate data. It will typically be used in areas such as general data
mining, identification of gene regulatory networks, or reconstruction of family trees
from DNA information. Established methods exist and are in daily routine use, but
there is a clear need for better understanding of their statistical properties.
We begin by showing a few simple examples of structure estimation to indicate
that the task is not a priori impossible.


Example 6.1 (Markov mesh model). Figure 6.1 shows the graph of a so-called
Markov mesh model with 36 variables. All variables are binary and the only variable

Fig. 6.1 Graph of a Markov mesh model with 36 binary variables.

without parents, in the upper left-hand corner, is uniformly distributed. The remain-
ing variables on the upper and left sides of the 6 × 6 square have a single parent, and
each takes the same state as its parent with probability 3/4. The remaining nodes
have two parents; if these are in identical states, the child will have that state with
probability 3/4, whereas it will otherwise follow the upper parent with probability
2/3.
Figure 6.2 shows two different attempts at estimating the structure based on the
same 10,000 simulated cases. The two methods are to be described in more de-
tail later, but it is apparent that the estimated structures in both cases have a strong
similarity to the true one. In fact, one of the methods reconstructs the Markov mesh
model perfectly. Both methods search for a DAG structure which is compatible
with the data.

Fig. 6.2 Structure estimate of Markov mesh model from 10000 simulated cases. The left-hand
side shows the estimate using the crudest algorithm (PC) implemented in HUGIN. The right-hand
side shows the Bayesian estimate using greedy equivalence search (GES) as implemented in WinMine.

Example 6.2 (Tree model). The graph of this example has a particularly simple struc-
ture, namely that of a rooted tree. Since a rooted tree with arrows pointing away

from a root is a perfect DAG, the associated structure is equivalent to the corre-
sponding undirected tree. The state at the root is uniformly distributed and any other
node reproduces the state of the parent node with probability 3/4.
Figure 6.3 shows the structure estimate of the tree based on 10,000 simulated
cases and using the same methods as for the Markov mesh model. In both cases,
the method has attempted to estimate the structure based on the assumption that the
structure was a DAG. Note that in this case it is the first method which reconstructs
the tree correctly, whereas there are too many links in the second case.

Fig. 6.3 Estimates of a tree model with 30 variables based on 10000 observations. The graph to
the left represents the estimate using the PC algorithm and yields a 100% correct reconstruction.
The graph to the right represents the Bayesian estimate using GES.

Example 6.3 (Chest clinic). The next example is taken from Lauritzen and Spiegel-
halter (1988) and reflects the structure involving risk factors and symptoms for lung-
disease. The (fictitious) description given by the authors of the associated medical
knowledge is as follows
“Shortness–of–breath (dyspnoea) may be due to tuberculosis, lung cancer or bronchitis, or
none of them, or more than one of them. A recent visit to Asia increases the chances of
tuberculosis, while smoking is known to be a risk factor for both lung cancer and bron-
chitis. The results of a single chest X–ray do not discriminate between lung cancer and
tuberculosis, as neither does the presence or absence of dyspnoea.”

The actual probabilities involved in this example are given in the original reference
and we abstain from repeating them here.
Figure 6.4 displays the network structure reflecting the knowledge as given above
and three different structure estimates. Note that this problem is obviously more
difficult than the previous examples, in particular because some of the diseases are
rare and larger data sets as well as more refined structure estimators are needed to
even get close to the original structure.

Fig. 6.4 A Bayesian network model for lung disease and estimates of the model based on simulated
cases. The structure generating the data is in the upper left corner. Then, clockwise, estimates
using the same data but different estimation algorithms: the PC algorithm, Bayesian GES, the NPC
algorithm. In the latter case 100,000 cases were used.

Types of approach

Essentially all structure estimation methods combine a specification of potentially


interesting structures with a way of judging the adequacy of a structure, and a search
strategy, which explores the space of possible structures, evaluating a large number of them.
As detailed further in the following sections, methods of judging adequacy in-
clude using
• tests of significance;
• penalised likelihood scores;

Iκ (G ) = log L̂ − κ dim(G )

with κ = 1 for AIC (Akaike 1974), or κ = (1/2) log N for BIC (Schwarz 1978);
• Bayesian posterior probabilities.
The search strategies are more or less based on heuristics, which all attempt to over-
come the fundamental problem that a crude global search among all potential struc-
tures is not feasible as the number of structures is astronomical.
FiXme Fatal: elaborate on each of these or rearrange

Bayes factors

For G ∈ Γ , ΘG is the associated parameter space, so that P factorizes w.r.t. G if and only
if P = Pθ for some θ ∈ ΘG ; LG is a prior law on ΘG .
The Bayes factor (likelihood ratio) for discriminating between G1 and G2 based
on observations X (n) = x(n) is

BF(G1 : G2 ) = f (x(n) | G1 ) / f (x(n) | G2 ),

where
f (x(n) | G ) = ∫ΘG f (x(n) | G , θ ) LG (dθ )

is known as the marginal likelihood of G .

Posterior distribution over graphs

If π(G ) is a prior probability distribution over a given set of graphs Γ , the posterior
distribution is determined as

π ∗ (G ) = π(G | x(n) ) ∝ f (x(n) | G )π(G )

or equivalently
π ∗ (G1 )/π ∗ (G2 ) = BF(G1 : G2 ) π(G1 )/π(G2 ).
Bayesian analysis looks for the MAP estimate G ∗ maximizing π ∗ (G ) over Γ , or
attempts to sample from the posterior using e.g. Monte-Carlo methods.

6.2 Estimating Trees and Forests

Estimating trees

Let us assume that the distribution P of X = (Xv , v ∈ V ) over a discrete state space X
factorizes w.r.t. an unknown tree τ and that we have observations X 1 = x1 , . . . , X n =
xn , independent and identically distributed according to P.
Chow and Liu (1968) showed that the maximum likelihood estimate τ̂ of τ is a
maximal weight spanning tree (MWST), where the weight of a tree τ is

λ (τ) = ∑e∈E(τ) λn (e) = ∑e∈E(τ) Hn (e)

and Hn (e) is the empirical cross-entropy or mutual information between endpoint


variables of the edge e = {u, v}:

Hn (e) = ∑xu ,xv [n(xu , xv )/n] log{ [n(xu , xv )/n] / [n(xu )n(xv )/n²] }
       = (1/n) ∑xu ,xv n(xu , xv ) log{ n n(xu , xv ) / [n(xu )n(xv )] }.

This result is easily extended to Gaussian graphical models, just with the weight
λn (e) of an edge in a tree determined as any strictly increasing function of the em-
pirical cross-entropy along the edge
Hn (e) = −(1/2) log(1 − re²),
where re² is the squared empirical correlation coefficient along the edge e = {u, v}:

re² = (∑ni=1 xui xvi )² / [(∑ni=1 (xui )²)(∑ni=1 (xvi )²)] = w²uv / (wuu wvv ).

To see this, use the expression (4.20) for the determinant of the MLE which in
the case of a tree reduces to

det(K̂) = n^d ∏v∈V (wvv )^(deg(v)−1) / ∏e∈E det(we )
        ∝ ∏v∈V (wvv )^(−1) ∏{u,v}∈E wuu wvv / (wuu wvv − w²uv ) ∝ ∏e∈E (1 − re²)^(−1) .

From (4.16) we know that the maximized likelihood function for a fixed tree is pro-
portional to a power of this determinant and hence is maximized when the logarithm
of the determinant is maximized. But since we then have

log det K̂(τ) = 2 ∑e∈E(τ) Hn (e) + c = 2λ (τ) + c,

where c does not depend on τ, maximizing L̂(τ) over all possible trees is equivalent to maximizing λ (τ).


The highest AIC or BIC scoring forest is also available as a maximal weight spanning
forest (MWSF), with modified weights

wn^pen (e) = n wn (e) − κn dfe ,

with κn = 1 for AIC, κn = (1/2) log n for BIC and dfe the degrees of freedom for indepen-
dence along e.
Fast algorithms (Kruskal Jr. 1956) compute a maximal weight spanning tree (or
forest) from the weights W = (wuv , u, v ∈ V ).
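To make the procedure concrete, the following minimal sketch (function names are ours) computes the empirical mutual information weights Hn (e) for discrete data and extracts a maximal weight spanning tree with Kruskal's algorithm; a library routine such as networkx.maximum_spanning_tree could be substituted for the spanning-tree step.

```python
import itertools
import numpy as np

def empirical_mi(x, u, v):
    """Empirical mutual information H_n(e) between columns u and v of a discrete data matrix x."""
    xu, xv = x[:, u], x[:, v]
    n = len(xu)
    mi = 0.0
    for a in np.unique(xu):
        for b in np.unique(xv):
            n_ab = np.sum((xu == a) & (xv == b))
            if n_ab > 0:
                mi += (n_ab / n) * np.log(n * n_ab / (np.sum(xu == a) * np.sum(xv == b)))
    return mi

def chow_liu_tree(x):
    """Maximal weight spanning tree (Kruskal's algorithm) with mutual information edge weights."""
    p = x.shape[1]
    edges = sorted(((empirical_mi(x, u, v), u, v)
                    for u, v in itertools.combinations(range(p), 2)), reverse=True)
    parent = list(range(p))            # union-find forest used to detect cycles
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                   # the edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Small check on data simulated from the chain 0 - 1 - 2; with high probability the
# estimated tree consists of the edges {0, 1} and {1, 2}.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = (x0 + (rng.random(500) < 0.25)) % 2
x2 = (x1 + (rng.random(500) < 0.25)) % 2
print(chow_liu_tree(np.column_stack([x0, x1, x2])))
```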
Chow and Wagner (1978) show a.s. consistency in total variation of P̂: If P
factorises w.r.t. τ, then

supx |p(x) − p̂(x)| → 0 for n → ∞,

so if τ is unique for P, τ̂ = τ for all n > N for some N.


If P does not factorize w.r.t. a tree, P̂ converges to the closest tree approximation P̃
to P (in Kullback-Leibler distance).

Strong hyper Markov prior laws

For strong hyper Markov prior laws, X (n) is itself marginally Markov so
f (x(n) | G ) = ∏Q∈Q f (xQ(n) | G ) / ∏S∈S f (xS(n) | G )^νG (S) ,        (6.1)

where Q are the prime components and S the minimal complete separators of G .

Hyper inverse Wishart laws

Denote the normalisation constant of the hyper inverse Wishart density as


h(δ , Φ; G ) = ∫S + (G ) (det K)^(δ/2) exp{−tr(KΦ)} dK,

i.e. the usual Wishart constant if Q = C is a clique.


Combining with the Gaussian likelihood, it is easily seen that for Gaussian graph-
ical models we have
f (x(n) | G ) = h(δ + n, Φ + W (n) ; G ) / h(δ , Φ; G ).

Comparing with (6.1) leads to a similar factorization of the normalising constant

h(δ , Φ; G ) = ∏Q∈Q h(δ , ΦQ ; GQ ) / ∏S∈S h(δ , ΦS ; S)^νG (S) .
For chordal graphs all terms in this expression reduce to known Wishart constants,
and we can thus calculate the normalization constant explicitly.
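Concretely, for a complete prime component on p vertices the integral defining h is a standard Wishart-type integral; under the density convention used above one gets h(δ , Φ) = Γp ((δ + p + 1)/2) (det Φ)^(−(δ+p+1)/2), with Γp the multivariate gamma function (conventions for δ vary between references, so the following should be read as a sketch under that convention). Assuming the cliques and separators of a chordal G , with multiplicities νG (S), are supplied by the caller, the computation is then a few lines (function names are ours):

```python
import numpy as np
from scipy.special import multigammaln

def log_h_complete(delta, phi):
    """log of the Wishart-type integral of (det K)^(delta/2) exp(-tr(K Phi)) over PD matrices."""
    p = phi.shape[0]
    a = (delta + p + 1) / 2.0
    _, logdet = np.linalg.slogdet(phi)
    return multigammaln(a, p) - a * logdet

def log_h_chordal(delta, phi, cliques, separators):
    """log h(delta, Phi; G) for a chordal G; `separators` is a list of (separator, multiplicity) pairs."""
    val = sum(log_h_complete(delta, phi[np.ix_(c, c)]) for c in cliques)
    val -= sum(nu * log_h_complete(delta, phi[np.ix_(s, s)]) for s, nu in separators)
    return val

# Example: chordal graph with cliques {0,1,2} and {1,2,3}; the separator {1,2} has multiplicity 1
phi = np.eye(4) + 0.1
print(log_h_chordal(3.0, phi, cliques=[[0, 1, 2], [1, 2, 3]], separators=[([1, 2], 1)]))
```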
In general, Monte-Carlo simulation or similar methods must be used (Atay-Kayis
and Massam 2005).
The marginal distribution of W (n) is (weak) hyper Markov w.r.t. G . It was termed
the hyper matrix F law by Dawid and Lauritzen (1993).

Bayes factors for forests

Trees and forests are decomposable graphs, so for a forest φ we get


f (φ | x(n) ) ∝ ∏e∈E(φ ) f (xe(n) ) / ∏v∈V f (xv(n) )^(dφ (v)−1) ,

since all minimal complete separators are singletons and νφ ({v}) = dφ (v) − 1.
Multiplying the right-hand side with ∏v∈V f (xv(n) ) yields

∏e∈E(φ ) f (xe(n) ) / ∏v∈V f (xv(n) )^(dφ (v)−1) = ∏v∈V f (xv(n) ) ∏e∈E(φ ) BF(e),

where BF(e) is the Bayes factor for independence along the edge e:
BF(e) = f (xu(n) , xv(n) ) / [ f (xu(n) ) f (xv(n) ) ].

Thus the posterior distribution of φ is

π ∗ (φ ) ∝ ∏e∈E(φ ) BF(e).

In the case where φ is restricted to contain a single tree, the normalization constant
for this distribution can be explicitly obtained via the Matrix Tree Theorem, see e.g.
Bollobás (1998).

Bayesian analysis

MAP estimates of forests can thus be computed using an MWSF algorithm, using
w(e) = log BF(e) as weights.
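For discrete data with hyper Dirichlet priors, the quantities entering BF(e) are Dirichlet-multinomial marginal likelihoods, with the marginal Dirichlet parameters obtained by summing the joint ones, as in α(xa ) = ∑y:ya =xa α(y). A minimal sketch (function names are ours) producing the weights w(e) = log BF(e) from a two-way table of counts:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alphas):
    """log Dirichlet-multinomial marginal likelihood of a count table under Dirichlet(alphas)."""
    counts = np.asarray(counts, dtype=float).ravel()
    alphas = np.asarray(alphas, dtype=float).ravel()
    return float(gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
                 + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

def log_bf_edge(joint_counts, joint_alphas):
    """log BF(e) = log f(x_u, x_v) - log f(x_u) - log f(x_v); the marginal Dirichlet
    parameters are the margins of the joint parameter table."""
    joint_counts = np.asarray(joint_counts, dtype=float)
    joint_alphas = np.asarray(joint_alphas, dtype=float)
    return (log_marginal(joint_counts, joint_alphas)
            - log_marginal(joint_counts.sum(axis=1), joint_alphas.sum(axis=1))
            - log_marginal(joint_counts.sum(axis=0), joint_alphas.sum(axis=0)))

# Example: a 2 x 2 table of counts n(x_u, x_v) with a uniform master Dirichlet, alpha(x) = 1/4
counts = np.array([[30., 10.], [12., 48.]])
print(log_bf_edge(counts, np.full((2, 2), 0.25)))   # positive: evidence in favour of the edge
```

Feeding such weights to an MWSF routine and keeping only edges with positive weight yields the MAP forest.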
Algorithms exist for generating random spanning trees (Aldous 1990), so full
posterior analysis is in principle possible for trees.
These work less well for weights occurring with typical Bayes factors, as most
of these are essentially zero, so methods based on the Matrix Tree Theorem seem
currently more useful.
For structures other than trees, only heuristic methods are available for MAP esti-
mation or for maximizing penalized likelihoods such as AIC or BIC.

Some challenges for undirected graphs

• Find a feasible algorithm for (perfect) simulation from a distribution over chordal
graphs of the form
p(G ) ∝ ∏C∈C w(C) / ∏S∈S w(S)^νG (S) ,
where w(A), A ⊆ V are a prescribed set of positive weights.
• Find a feasible algorithm for obtaining the MAP estimate in the decomposable case. This may not
be universally possible, as the problem most likely is NP-complete.

6.3 Learning Bayesian networks

6.3.1 Model search methods

Directed hyper Markov property

L = L (θ ) is directed hyper Markov w.r.t. a DAG D if θ is directed Markov on D


for all θ ∈ Θ and
θv | pa(v) ⊥⊥L θnd(v) | θpa(v) .
A law L is directed hyper Markov on D if and only if LA is hyper Markov on
(DA )m for any ancestral set A ⊆ V .
L is strongly directed hyper Markov if in addition θv | pa(v) ⊥⊥L θpa(v) for all v or,
equivalently if the conditional distributions θv | pa(v) , v ∈ V are mutually independent.
Graphically, this is most easily displayed by introducing one additional parent
θv | pa(v) for every vertex v in D, so that

f (x | θ ) = ∏v∈V f (xv | xpa(v) , θv | pa(v) ).

Exploiting independence and taking expectations over θ yields that also marginally,
f (x | D) = ∫ΘD f (x | θ ) LD (dθ ) = ∏v∈V f (xv | xpa(v) ).

If L is strongly directed hyper Markov, it holds that also the posterior law L ∗ is
strongly directed hyper Markov and

L ∗ (θv | pa(v) ) ∝ f (xv | xpa(v) , θv | pa(v) ) L (θv | pa(v) )

(Spiegelhalter and Lauritzen 1990).

Markov equivalence

D and D 0 are equivalent if and only if:


1. D and D 0 have the same skeleton (ignoring directions);
2. D and D 0 have the same unmarried parents, i.e. the same pairs of non-adjacent parents with a common child.
[Two pairs of small DAGs illustrate this: the DAGs in the first pair are Markov equivalent, those in the second pair are not.]

Searching equivalence classes

In general, there is no hope of distinguishing Markov equivalent DAGs, so D can at


best be identified up to Markov equivalence.
The number Dn of labelled DAGs with n vertices is given by the recursion (Robinson
1977)

Dn = ∑_{i=1}^{n} (−1)^{i+1} \binom{n}{i} 2^{i(n−i)} D_{n−i} ,    with D0 = 1,

which grows superexponentially. For n = 10, Dn ≈ 4.2 × 10^18 . The number of
equivalence classes is smaller, but is conjectured still to grow superexponentially.
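The recursion is immediate to evaluate; the following small check (our code) reproduces the count quoted for n = 10.

```python
from math import comb

def num_dags(n):
    """Number of labelled DAGs on n vertices via the recursion, with D_0 = 1."""
    d = [1]
    for m in range(1, n + 1):
        d.append(sum((-1) ** (i + 1) * comb(m, i) * 2 ** (i * (m - i)) * d[m - i]
                     for i in range(1, m + 1)))
    return d[n]

print(num_dags(3))    # 25
print(num_dags(10))   # 4175098976430598143, roughly 4.2 x 10^18
```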

Conjugate priors for DAGs

In the discrete case, the obvious conjugate prior is for fixed v to let

{θv | paD (v) (xv | x∗paD (v) ), xv ∈ Xv }

be Dirichlet distributed and independent for v ∈ V and x∗paD (v) ∈ XpaD (v) (Spiegelhalter
and Lauritzen 1990).
We can derive these Dirichlet distributions from a fixed master Dirichlet distri-
bution D(α), where α = α(x), x ∈ X , by letting

{θv | paD (v) (xv | x∗paD (v) ), xv ∈ Xv } ∼ D(α(xv , x∗paD (v) )),

where as usual α(xa ) = ∑y:ya =xa α(y).


Typically, α is specified by letting α(x) = λ p0 (x) where p0 is an initial guess on the
joint distribution, for example specified through a DAG D0 , and λ is the equivalent
sample size for the prior information.

The values α(xv , x∗paD (v) ) = λ p0 (xv , x∗paD (v) ) can then be calculated by probability
propagation.
Common default values are λ = 1 and α(x) = 1/|X |.
A similar construction is possible in the Gaussian case using the Wishart dis-
tribution (Geiger and Heckerman 1994) and for mixed discrete-Gaussian networks
(Bøttcher 2001), the latter implemented in the R-package DEAL (Bøttcher and Deth-
lefsen 2003).
In all cases, it was shown by Geiger and Heckerman (1997, 2002) that prior distri-
butions constructed in this way are the only distributions which are
1. modular:
paD (v) = paD 0 (v) ⇒ θv | paD (v) ∼ θv | paD 0 (v) ;
2. score equivalent:

D ≡ D 0 ⇒ f (x(n) | D) = f (x(n) | D 0 ).

Marginal likelihoods and Bayes factors derived from these strongly directed hyper
Dirichlet priors have a simple form:

f (x(n) | D) = ∏v ∏xpaD (v) [ Γ (α(xpaD (v) )) / Γ (α(xpaD (v) ) + n(xpaD (v) )) ]
              × ∏xv [ Γ (α(xv∪paD (v) ) + n(xv∪paD (v) )) / Γ (α(xv∪paD (v) )) ] .

(Cooper and Herskovits 1992; Heckerman et al 1995).
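In code, the marginal likelihood is a product of local terms, one per family (v, paD (v)). The following is a minimal sketch for discrete data (function names are ours); it uses the uniform-master-Dirichlet default α(x) = λ /|X | mentioned above, under which α(xv , xpaD (v) ) = λ /(|Xv ||XpaD (v) |), so it represents one particular choice rather than the general prescription.

```python
import numpy as np
from scipy.special import gammaln

def log_family_score(data, v, parents, levels, lam=1.0):
    """log of the local factor of f(x^(n) | D) for vertex v and the given parent set.

    `data` is an (n, |V|) integer array whose column j is coded 0, ..., levels[j] - 1;
    alpha(x_v, x_pa) = lam / (levels[v] * q) corresponds to a uniform master Dirichlet."""
    r_v = levels[v]
    q = int(np.prod([levels[p] for p in parents])) if parents else 1
    alpha_vp = lam / (r_v * q)     # prior weight per (x_v, x_pa) cell
    alpha_p = lam / q              # prior weight per parent configuration
    counts = np.zeros((q, r_v))
    if parents:
        pa_index = np.ravel_multi_index(data[:, parents].T, [levels[p] for p in parents])
    else:
        pa_index = np.zeros(len(data), dtype=int)
    np.add.at(counts, (pa_index, data[:, v]), 1)
    n_p = counts.sum(axis=1)
    return float(np.sum(gammaln(alpha_p) - gammaln(alpha_p + n_p))
                 + np.sum(gammaln(alpha_vp + counts) - gammaln(alpha_vp)))

def log_marginal_likelihood(data, dag_parents, levels, lam=1.0):
    """log f(x^(n) | D); `dag_parents` maps each vertex to its list of parents in D."""
    return sum(log_family_score(data, v, pa, levels, lam) for v, pa in dag_parents.items())

# Example: three binary variables generated from the DAG 0 -> 1 -> 2
rng = np.random.default_rng(1)
x0 = rng.integers(0, 2, 200)
x1 = (x0 ^ (rng.random(200) < 0.2)).astype(int)
x2 = (x1 ^ (rng.random(200) < 0.2)).astype(int)
data = np.column_stack([x0, x1, x2])
print(log_marginal_likelihood(data, {0: [], 1: [0], 2: [1]}, levels=[2, 2, 2]))
```

Scores of this additive form are what greedy searches such as GES compare across neighbouring structures.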


Challenge: find a good algorithm for sampling from the full posterior over DAGs
or over equivalence classes of DAGs. An issue here is whether the prior should be uniform
over equivalence classes or over DAGs.

Greedy equivalence class search

1. Initialize with the empty DAG.
2. Repeatedly search among equivalence classes with a single additional edge and
move to the class with the highest score, until no improvement.
3. Repeatedly search among equivalence classes with a single edge fewer and move
to the one with the highest score, until no improvement.
For BIC or a Bayesian posterior score with directed hyper Dirichlet priors, this algo-
rithm yields a consistent estimate of the equivalence class for P (Chickering 2002).

6.3.2 Constraint-based search

An alternative type of search algorithm is known as constraint-based search.


Essentially, the search methods generate queries of the type “A ⊥⊥ B | S?”, and
the answer to such a query divides Γ into those graphs conforming with the query
and those that do not.
These types of methods were originally designed by computer scientists in a
context where P was fully available, so queries could be answered without error.
The advantage of this type of method is that relatively few queries are needed to
identify a DAG D (or rather its equivalence class).
The disadvantage is that there seems to be no computationally feasible, coherent
and principled method of answering the queries in the presence of statistical uncertainty.

SGS and PC algorithms

SGS-algorithm Spirtes et al (1993):



Step 1: Identify the skeleton using the fact that, for P faithful,

u ≁ v ⇐⇒ ∃S ⊆ V \ {u, v} : Xu ⊥⊥ Xv | XS .

Begin with the complete graph, check for S = ∅ and remove edges when indepen-
dence holds. Then continue for increasing |S|.
The PC-algorithm (same reference) exploits that only S with S ⊆ bd(u) \ {v} or
S ⊆ bd(v) \ {u} needs checking, where bd refers to the current skeleton.
Step 2: Identify directions to be consistent with independence relations found in
Step 1.
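A minimal sketch of Step 1 (function names are ours), written against an abstract independence oracle indep(u, v, S); in practice the oracle is replaced by a statistical test, which is where the statistical difficulties mentioned above arise.

```python
from itertools import chain, combinations

def pc_skeleton(vertices, indep):
    """Step 1 of the PC algorithm against an independence oracle indep(u, v, S).

    Start from the complete graph; for k = 0, 1, 2, ... remove the edge {u, v} as soon as
    some S of size k contained in bd(u) or bd(v) (minus the endpoints, in the current
    skeleton) makes u and v independent."""
    vertices = list(vertices)
    adj = {v: set(vertices) - {v} for v in vertices}
    sepset = {}
    k = 0
    while any(len(adj[u] - {v}) >= k for u in vertices for v in adj[u]):
        for u, v in combinations(vertices, 2):
            if v not in adj[u]:
                continue
            cand_sets = chain(combinations(sorted(adj[u] - {v}), k),
                              combinations(sorted(adj[v] - {u}), k))
            for S in cand_sets:
                if indep(u, v, set(S)):
                    adj[u].discard(v); adj[v].discard(u)
                    sepset[frozenset((u, v))] = set(S)
                    break
        k += 1
    return adj, sepset

# Toy oracle encoding the DAG a -> b -> c, where a and c are independent given {b}:
vertices = ["a", "b", "c"]
def indep(u, v, S):
    return {u, v} == {"a", "c"} and "b" in S
print(pc_skeleton(vertices, indep))   # skeleton a - b - c with separating set {b} for {a, c}
```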

Exact properties of PC-algorithm

If P is faithful to DAG D, PC-algorithm finds D 0 equivalent to D.


It uses N independence checks where N is at most

N ≤ 2 \binom{|V |}{2} ∑_{i=0}^{d} \binom{|V | − 1}{i} ≤ |V |^{d+1} / (d − 1)! ,

where d is the maximal degree of any vertex in D.


So the worst-case complexity is exponential, but the algorithm is fast for sparse graphs.
Sampling properties are less well understood although consistency results exist.
The general idea has these elements:
1. When a query is decided negatively, ¬(A ⊥⊥ B | S), it is taken at face value;
When a query is decided positively, A ⊥⊥ B | S, it is recorded with care;
2. If at some later stage, the PC algorithm would remove an edge so that a negative
query ¬(A ⊥⊥ B | S) would conflict with A ⊥D B | S, the removal of this edge is
suppressed.
This leads to unresolved queries which are then passed to the user.

6.4 Summary

Types of approach
• Methods for judging adequacy of structure such as
– Tests of significance
– Penalised likelihood scores

Iκ (G ) = log L̂ − κ dim(G )

with κ = 1 for AIC (Akaike 1974), or κ = (1/2) log n for BIC (Schwarz 1978).
– Bayesian posterior probabilities.

• Search strategies through space of possible structures, more or less based on


heuristics.
Bayes factors
For G ∈ Γ , ΘG is the associated parameter space, so that P factorizes
w.r.t. G if and only if P = Pθ for some θ ∈ ΘG ; LG is a prior law on ΘG .
The Bayes factor for discriminating between G1 and G2 based on X (n) = x(n) is

BF(G1 : G2 ) = f (x(n) | G1 ) / f (x(n) | G2 ),

where
f (x(n) | G ) = ∫ΘG f (x(n) | G , θ ) LG (dθ )

is known as the marginal likelihood of G .

Posterior distribution over graphs
If π(G ) is a prior probability distribution over a given set of graphs Γ , the posterior
distribution is determined as

π ∗ (G ) = π(G | x(n) ) ∝ f (x(n) | G )π(G )

or equivalently
π ∗ (G1 )/π ∗ (G2 ) = BF(G1 : G2 ) π(G1 )/π(G2 ).
The BIC is an O(1)-approximation to log BF, obtained by applying Laplace's method
for integrals to the marginal likelihood.
Bayesian analysis looks for the MAP estimate G ∗ maximizing π ∗ (G ) over Γ , or
attempts to sample from the posterior using e.g. Monte-Carlo methods.

Estimating trees
Assume P factorizes w.r.t. an unknown tree τ. The MLE τ̂ of τ has maximal
weight, where the weight of τ is

w(τ) = ∑e∈E(τ) wn (e) = ∑e∈E(τ) Hn (e)

and Hn (e) is the empirical cross-entropy or mutual information between endpoint


variables of the edge e = {u, v}. For Gaussian trees this becomes

wn (e) = −(1/2) log(1 − re²),
where re² is the squared empirical correlation coefficient along the edge e = {u, v}.
The highest AIC or BIC scoring forest is also available as an MWSF, with modified
weights
wn^pen (e) = n wn (e) − κn dfe ,
with κn = 1 for AIC, κn = (1/2) log n for BIC and dfe the degrees of freedom for inde-
pendence along e.
Use a maximal weight spanning tree (or forest) algorithm based on the weights W =
(wuv , u, v ∈ V ).

Hyper inverse Wishart laws
Denote the normalisation constant of the hyper inverse Wishart density as

h(δ , Φ; G ) = ∫S + (G ) (det K)^(δ/2) exp{−tr(KΦ)} dK.

The marginal likelihood is then

f (x(n) | G ) = h(δ + n, Φ + W (n) ; G ) / h(δ , Φ; G ),

where
h(δ , Φ; G ) = ∏Q∈Q h(δ , ΦQ ; GQ ) / ∏S∈S h(δ , ΦS ; S)^νG (S) .
For chordal graphs all terms reduce to known Wishart constants.
In general, Monte-Carlo simulation or similar methods must be used (Atay-Kayis
and Massam 2005).
Bayes factors for forests
Trees and forests are decomposable graphs, so for a forest φ we get

π ∗ (φ ) ∝ ∏e∈E(φ ) f (xe(n) ) / ∏v∈V f (xv(n) )^(dφ (v)−1) ∝ ∏e∈E(φ ) BF(e),

where BF(e) is the Bayes factor for independence along the edge e:
BF(e) = f (xu(n) , xv(n) ) / [ f (xu(n) ) f (xv(n) ) ].

MAP estimates of forests can thus be computed using an MWSF algorithm, using
w(e) = log BF(e) as weights.
When φ is restricted to contain a single tree, the normalization constant can be
explicitly obtained via the Matrix Tree Theorem, see e.g. Bollobás (1998).
Algorithms exist for generating random spanning trees Aldous (1990), so full
posterior analysis is in principle possible for trees.
For structures other than trees, only heuristic methods are available for MAP esti-
mation or for maximizing penalized likelihoods such as AIC or BIC.
References

Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control 19:716–723
Aldous D (1990) A random walk construction of uniform spanning trees and uniform labelled
trees. SIAM Journal on Discrete Mathematics 3(4):450–465
Andersen SK, Olesen KG, Jensen FV, Jensen F (1989) Hugin - a shell for building Bayesian belief
universes for expert systems. In: Sridharan NS (ed) Proceedings of the 11th International Joint
Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo, CA, pp 1080–
1085
Asmussen S, Edwards D (1983) Collapsibility and response variables in contingency tables.
Biometrika 70:567–578
Atay-Kayis A, Massam H (2005) A Monte Carlo method for computing the marginal likelihood in
non-decomposable graphical Gaussian models. Biometrika 92:317–335
Bahl L, Cocke J, Jelinek F, Raviv J (1974) Optimal decoding of linear codes for minimizing symbol
error rate. IEEE Transactions on Information Theory 20:284–287
Barndorff-Nielsen OE (1978) Information and Exponential Families in Statistical Theory. John
Wiley and Sons, New York
Baum LE (1972) An equality and associated maximization technique in statistical estimation for
probabilistic functions of Markov processes. Inequalities 3:1–8
Berge C (1973) Graphs and Hypergraphs. North-Holland, Amsterdam, The Netherlands, translated
from French by E. Minieka
Berry A, Bordat JP, Cogis O (2000) Generating all the minimal separators of a graph. International
Journal of Foundations of Computer Science 11:397–403
Bickel PJ, Hammel EA, O’Connell JW (1973) Sex bias in graduate admissions: Data from Berke-
ley. Science 187(4175):398–404
Bollobás B (1998) Modern Graph Theory. Springer-Verlag, New York
Bøttcher SG (2001) Learning Bayesian networks with mixed variables. In: Proceedings of the
Eighth International Workshop in Artificial Intelligence and Statistics, pp 149–156
Bøttcher SG, Dethlefsen C (2003) deal: A package for learning Bayesian networks. Journal of
Statistical Software 8:1–40
Bouchitté V, Todinca I (2001) Treewidth and minimum fill-in: Grouping the minimal separators.
SIAM Journal on Computing 31:212–232
Buhl SL (1993) On the existence of maximum likelihood estimators for graphical Gaussian models.
Scandinavian Journal of Statistics 20:263–270
Cannings C, Thompson EA, Skolnick MH (1976) Recursive derivation of likelihoods on pedigrees
of arbitrary complexity. Advances in Applied Probability 8:622–625
Chickering DM (2002) Optimal structure identification with greedy search. Journal of Machine
Learning Research 3:507–554


Chow CK, Liu CN (1968) Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory 14:462–467
Chow CK, Wagner TJ (1978) Consistency of an estimate of tree-dependent probability distribu-
tions. IEEE Transactions on Information Theory 19:369–371
Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks
from data. Machine Learning 9:309–347
Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ (1999) Probabilistic Networks and Expert
Systems. Springer-Verlag, New York
Dawid AP (1979) Conditional independence in statistical theory (with discussion). Journal of the
Royal Statistical Society, Series B 41:1–31
Dawid AP (1980) Conditional independence for statistical operations. The Annals of Statistics
8:598–617
Dawid AP, Lauritzen SL (1993) Hyper Markov laws in the statistical analysis of decomposable
graphical models. The Annals of Statistics 21:1272–1317
Diaconis P, Ylvisaker D (1979) Conjugate priors for exponential families. The Annals of Statistics
7:269–281
Diestel R (1987) Simplicial decompositions of graphs – some uniqueness results. Journal of Com-
binatorial Theory, Series B 42:133–145
Diestel R (1990) Graph Decompositions. Clarendon Press, Oxford, United Kingdom
Dirac GA (1961) On rigid circuit graphs. Abhandlungen Mathematisches Seminar Hamburg
25:71–76
Edwards D (2000) Introduction to Graphical Modelling, 2nd edn. Springer-Verlag, New York
Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Human
Heredity 21:523–542
Frydenberg M (1990a) The chain graph Markov property. Scandinavian Journal of Statistics
17:333–353
Frydenberg M (1990b) Marginalization and collapsibility in graphical interaction models. The An-
nals of Statistics 18:790–805
Geiger D, Heckerman D (1994) Learning Gaussian networks. In: de Mantaras RL, Poole D (eds)
Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kauf-
mann Publishers, San Francisco, CA, pp 235–243
Geiger D, Heckerman D (1997) A characterization of the Dirichlet distribution through global and
local independence. The Annals of Statistics 25:1344–1369
Geiger D, Heckerman D (2002) Parameter priors for directed acyclic graphical models and the
characterization of several probability distributions. The Annals of Statistics 30:1412–1440
Geiger D, Verma TS, Pearl J (1990) Identifying independence in Bayesian networks. Networks
20:507–534
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: The combination
of knowledge and statistical data. Machine Learning 20:197–243
Jensen F (2002) HUGIN API Reference Manual Version 5.4. HUGIN Expert Ltd., Aalborg, Den-
mark
Jensen F, Jensen FV, Dittmer SL (1994) From influence diagrams to junction trees. In: de Mantaras
RL, Poole D (eds) Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence,
Morgan Kaufmann Publishers, San Francisco, CA, pp 367–373
Jensen FV, Jensen F (1994) Optimal junction trees. In: de Mantaras RL, Poole D (eds) Proceedings
of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers,
San Francisco, CA, pp 360–366
Jensen FV, Lauritzen SL, Olesen KG (1990) Bayesian updating in causal probabilistic networks
by local computation. Computational Statistics Quarterly 4:269–282
Jiroušek R, Přeučil R (1995) On the effective implementation of the iterative proportional fitting
procedure. Computational Statistics and Data Analysis 19:177–189
Kalman RE, Bucy R (1961) New results in linear filtering and prediction. Journal of Basic Engi-
neering 83 D:95–108

Kong A (1986) Multivariate belief functions and graphical models. Ph.D. Thesis, Department of
Statistics, Harvard University, Massachusetts
Kruskal Jr JB (1956) On the shortest spanning subtree of a graph and the travelling salesman
problem. Proceedings of the American Mathematical Society 7:48–50
Lauritzen SL (1996) Graphical Models. Clarendon Press, Oxford, United Kingdom
Lauritzen SL, Jensen FV (1997) Local computation with valuations from a commutative semi-
group. Annals of Mathematics and Artificial Intelligence 21:51–69
Lauritzen SL, Nilsson D (2001) Representing and solving decision problems with limited informa-
tion. Management Science 47:1238–1251
Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical struc-
tures and their application to expert systems (with discussion). Journal of the Royal Statistical
Society, Series B 50:157–224
Lauritzen SL, Speed TP, Vijayan K (1984) Decomposable graphs and hypergraphs. Journal of the
Australian Mathematical Society, Series A 36:12–29
Lauritzen SL, Dawid AP, Larsen BN, Leimer HG (1990) Independence properties of directed
Markov fields. Networks 20:491–505
Leimer HG (1993) Optimal decomposition by clique separators. Discrete Mathematics 113:99–123
Matúš F (1992) On equivalence of Markov properties over undirected graphs. Journal of Applied
Probability 29:745–749
Meek C (1995) Strong completeness and faithfulness in Bayesian networks. In: Besnard P, Hanks
S (eds) Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Morgan
Kaufmann Publishers, San Francisco, CA, pp 411–418
Moussouris J (1974) Gibbs and Markov random systems with constraints. Journal of Statistical
Physics 10:11–33
Nilsson D (1998) An efficient algorithm for finding the M most probable configurations in a prob-
abilistic expert system. Statistics and Computing 8:159–173
Parter S (1961) The use of linear graphs in Gauss elimination. SIAM Review 3:119–130
Pearl J (1986) Fusion, propagation and structuring in belief networks. Artificial Intelligence
29:241–288
Pearl J (1988) Probabilistic Inference in Intelligent Systems. Morgan Kaufmann Publishers, San
Mateo, CA
Pearl J, Paz A (1987) Graphoids: A graph based logic for reasoning about relevancy relations.
In: Boulay BD, Hogg D, Steel L (eds) Advances in Artificial Intelligence – II, North-Holland,
Amsterdam, The Netherlands, pp 357–363
Richardson TS (2003) Markov properties for acyclic directed mixed graphs. Scandinavian Journal
of Statistics 30:145–158
Robinson RW (1977) Counting unlabelled acyclic digraphs. In: Little CHC (ed) Lecture Notes in
Mathematics: Combinatorial Mathematics V, vol 622, Springer-Verlag, New York
Rose DJ, Tarjan RE, Lueker GS (1976) Algorithmic aspects of vertex elimination on graphs. SIAM
Journal on Computing 5:266–283
Schwarz G (1978) Estimating the dimension of a model. The Annals of Statistics 6:461–464
Shenoy PP, Shafer G (1986) Propagating belief functions using local propagation. IEEE Expert
1:43–52
Shenoy PP, Shafer G (1990) Axioms for probability and belief–function propagation. In: Shachter
RD, Levitt TS, Kanal LN, Lemmer JF (eds) Uncertainty in Artificial Intelligence 4, North-
Holland, Amsterdam, The Netherlands, pp 169–198
Shoiket K, Geiger D (1997) A practical algorithm for finding optimal triangulations. In: Proceed-
ings of the Fourteenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park,
California, pp 185–190
Spiegelhalter DJ, Lauritzen SL (1990) Sequential updating of conditional probabilities on directed
graphical structures. Networks 20:579–605
Spirtes P, Glymour C, Scheines R (1993) Causation, Prediction and Search. Springer-Verlag, New
York, reprinted by MIT Press

Studený M (1992) Conditional independence relations have no finite complete characterization.


In: Transactions of the 11th Prague Conference on Information Theory, Statistical Decision
Functions and Random Processes, Academia, Prague, Czech Republic, pp 377–396
Studený M (1993) Structural semigraphoids. International Journal of General Systems 22:207–217
Tarjan RE (1985) Decomposition by clique separators. Discrete Mathematics 55:221–232
Tarjan RE, Yannakakis M (1984) Simple linear-time algorithms to test chordality of graphs, test
acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Com-
puting 13:566–579
Thiele TN (1880) Om Anvendelse af mindste Kvadraters Methode i nogle Tilfælde, hvor en Kom-
plikation af visse Slags uensartede tilfældige Fejlkilder giver Fejlene en ‘systematisk’ Karakter.
Vidensk Selsk Skr 5 Rk, naturvid og mat Afd 12:381–408, french version: Sur la Compensa-
tion de quelques Erreurs quasi-systématiques par la Méthode des moindres Carrés. Reitzel,
København, 1880.
Venables WN, Ripley BD (2002) Modern Applied Statistics with S, 4th edn. Springer-Verlag, New
York
Verma T, Pearl J (1990) Equivalence and synthesis of causal models. In: Bonissone P, Henrion M,
Kanal LN, Lemmer JF (eds) Proceedings of the 6th Conference on Uncertainty in Artificial
Intelligence, North-Holland, Amsterdam, pp 255–270
Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding
algorithm. IEEE Transactions on Information Theory 13:260–269
Wagner K (1937) Über eine Eigenschaft der ebenen Komplexe. Mathematische Annalen 114:570–
590
Yannakakis M (1981) Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic
and Discrete Methods 2:77–79
