
Artificial Intelligence 1

Winter Semester 2024/25


– Lecture Notes –

Prof. Dr. Michael Kohlhase


Professur für Wissensrepräsentation und -verarbeitung
Informatik, FAU Erlangen-Nürnberg
[email protected]

2025-02-06

0.1 Preface
0.1.1 Course Concept
Objective: The course aims at giving students a solid (and often somewhat theoretically oriented)
foundation of the basic concepts and practices of artificial intelligence. The course will
predominantly cover symbolic AI – also sometimes called “good old-fashioned AI” (GofAI) – in
the first semester and offers the very foundations of statistical approaches in the second. Indeed, a
full account of subsymbolic, machine-learning-based AI deserves its own specialization courses and
needs much more mathematical prerequisites than we can assume in this course.
Context: The course “Artificial Intelligence” (AI 1 & 2) at FAU Erlangen is a two-semester
course in the “Wahlpflichtbereich” (specialization phase) in semesters 5/6 of the bachelor program
“Computer Science” at FAU Erlangen. It is also available as a (somewhat remedial) course in the
“Vertiefungsmodul Künstliche Intelligenz” in the Computer Science Master’s program.
Prerequisites: AI-1 & 2 builds on the mandatory courses in the FAU bachelor’s program, in
particular the course “Grundlagen der Logik in der Informatik” [Glo], which already covers a lot
of the material usually presented in the “knowledge and reasoning” part of an introductory AI
course. The AI 1 & 2 course therefore minimizes overlap with that course.
The course is relatively elementary; we expect that any student who attended the mandatory
CS courses at FAU Erlangen can follow it.
Open to external students: Other bachelor programs are increasingly co-opting the course as a
specialization option. There is no inherent restriction to computer science students in this course.
Students with other study biographies – e.g. students from other bachelor programs or external
Master’s students – should be able to pick up the prerequisites when needed.

0.1.2 Course Contents


Goal: To give students a solid foundation of the basic concepts and practices of the field of
Artificial Intelligence. The course will be based on Russell/Norvig’s book “Artificial Intelligence:
A Modern Approach” [RN09].
Artificial Intelligence I (the first semester): introduces AI as an area of study, discusses
“rational agents” as a unifying conceptual paradigm for AI and covers problem solving, search,
constraint propagation, logic, knowledge representation, and planning.
Artificial Intelligence II (the second semester): is more oriented towards exposing students
to the basics of statistically based AI: We start out with reasoning under uncertainty, setting the
foundation with Bayesian Networks and extending this to rational decision theory. Building on
this we cover the basics of machine learning.

0.1.3 This Document


Format: The document mixes the slides presented in class with comments of the instructor to
give students a more complete background reference.
Caveat: This document is made available for the students of this course only. It is still very
much a draft and will develop over the current course and in coming academic years.
Licensing: This document is licensed under a Creative Commons license that requires
attribution, allows commercial use, and allows derivative works as long as these are licensed
under the same license.
Knowledge Representation Experiment: This document is also an experiment in knowledge
representation. Under the hood, it uses the STEX package [Koh08; sTeX], a TEX/LATEX extension
for semantic markup, which allows exporting the contents into active documents that adapt to the
reader and can be instrumented with services based on the explicitly represented meaning of the
documents.

0.1.4 Acknowledgments
Materials: Most of the material in this course is based on Russell/Norvig’s book “Artificial
Intelligence – A Modern Approach” (AIMA [RN95]). Even the slides are based on a LATEX-based
slide set, but heavily edited. The section on search algorithms is based on materials obtained from
Bernhard Beckert (then Uni Koblenz), which are in turn based on AIMA. Some extensions have
been inspired by an AI course by Jörg Hoffmann and Wolfgang Wahlster at Saarland University
in 2016. Finally, Dennis Müller suggested and supplied some extensions on AGI. Florian Rabe,
Max Rapp and Katja Berčič have carefully re-read the text and pointed out problems.
All course materials have been restructured and semantically annotated in the STEX format,
so that we can base additional semantic services on them.
AI Students: The following students have submitted corrections and suggestions to this and
earlier versions of the notes: Rares Ambrus, Ioan Sucan, Yashodan Nevatia, Dennis Müller,
Simon Rainer, Demian Vöhringer, Lorenz Gorse, Philipp Reger, Benedikt Lorch, Maximilian Lösch,
Luca Reeb, Marius Frinken, Peter Eichinger, Oskar Herrmann, Daniel Höfer, Stephan Mattejat,
Matthias Sonntag, Jan Urfei, Tanja Würsching, Adrian Kretschmer, Tobias Schmidt, Maxim Onciul,
Armin Roth, Liam Corona, Tobias Völk, Lena Voigt, Yinan Shao, Michael Girstl, Matthias Vietz,
Anatoliy Cherepantsev, Stefan Musevski, Matthias Lobenhofer, Philipp Kaludercic, Diwarkara Reddy,
Martin Helmke, Stefan Müller, Dominik Mehlich, Paul Martini, Vishwang Dave, Arthur Miehlich,
Christian Schabesberger, Vishaal Saravanan, Simon Heilig, Michelle Fribrance, Wenwen Wang,
Xinyuan Tu, Lobna Eldeeb.

0.1.5 Recorded Syllabus


The recorded syllabus – a record of the progress of the course in the academic year 2024/25 – is
on the course page in the ALeA system at https://courses.voll-ki.fau.de/course-home/ai-1.
The table of contents in the AI-1 notes at https://courses.voll-ki.fau.de indicates
the material covered to date in yellow.
The recorded syllabus of AI-2 can be found at https://courses.voll-ki.fau.de/course-home/ai-2.
For the topics planned for this course, see ??.
Contents

0.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.1 Course Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.2 Course Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.3 This Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.4 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
0.1.5 Recorded Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

1 Preliminaries 1
1.1 Administrative Ground Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Getting Most out of AI-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Learning Resources for AI-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 AI-Supported Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 AI – Who?, What?, When?, Where?, and Why? 19


2.1 What is Artificial Intelligence? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Artificial Intelligence is here today! . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Ways to Attack the AI Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Strong vs. Weak AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 AI Topics Covered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6 AI in the KWARC Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

I Getting Started with AI: A Conceptual Framework 33


3 Logic Programming 37
3.1 Introduction to Logic Programming and ProLog . . . . . . . . . . . . . . . . . . . 37
3.2 Programming as Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 Running Prolog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Knowledge Bases and Backtracking . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Programming Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.4 Advanced Relational Programming . . . . . . . . . . . . . . . . . . . . . . . 46

4 Recap of Prerequisites from Math & Theoretical Computer Science 49


4.1 Recap: Complexity Analysis in AI? . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Recap: Formal Languages and Grammars . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Mathematical Language Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Rational Agents: An AI Framework 65


5.1 Introduction: Rationality in Artificial Intelligence . . . . . . . . . . . . . . . . . . . 65
5.2 Agent/Env. as a Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Good Behavior ⇝ Rationality . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Classifying Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Types of Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75


5.6 Representing the Environment in Agents . . . . . . . . . . . . . . . . . . . . . . . . 82


5.7 Rational Agents: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

II General Problem Solving 85


6 Problem Solving and Search 89
6.1 Problem Solving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Problem Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4 Uninformed Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4.1 Breadth-First Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4.2 Depth-First Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4.3 Further Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.5 Informed Search Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.5.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.5.2 Heuristics and their Properties . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.5.3 A-Star Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.5.4 Finding Good Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.6 Local Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7 Adversarial Search for Game Playing 133


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 Minimax Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3 Evaluation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4 Alpha-Beta Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.5 Monte-Carlo Tree Search (MCTS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.6 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8 Constraint Satisfaction Problems 169


8.1 Constraint Satisfaction Problems: Motivation . . . . . . . . . . . . . . . . . . . . . 169
8.2 The Waltz Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
8.3 CSP: Towards a Formal Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.4 Constraint Networks: Formalizing Binary CSPs . . . . . . . . . . . . . . . . . . . . 180
8.5 CSP as Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
8.6 Conclusion & Preview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

9 Constraint Propagation 189


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.2 Constraint Propagation/Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.3 Forward Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.4 Arc Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.5 Decomposition: Constraint Graphs, and Three Simple Cases . . . . . . . . . . . . . 204
9.6 Cutset Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.7 Constraint Propagation with Local Search . . . . . . . . . . . . . . . . . . . . . . . 211
9.8 Conclusion & Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212

III Knowledge and Inference 215


10 Propositional Logic & Reasoning, Part I: Principles 219
10.1 Introduction: Inference with Structured State Representations . . . . . . . . . . . 219
10.1.1 A Running Example: The Wumpus World . . . . . . . . . . . . . . . . . . . 219
10.1.2 Propositional Logic: Preview . . . . . . . . . . . . . . . . . . . . . . . . . . 222

10.1.3 Propositional Logic: Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . 224


10.2 Propositional Logic (Syntax/Semantics) . . . . . . . . . . . . . . . . . . . . . . . . 224
10.3 Inference in Propositional Logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.4 Propositional Natural Deduction Calculus . . . . . . . . . . . . . . . . . . . . . . . 233
10.5 Predicate Logic Without Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

11 Formal Systems 243

12 Machine-Oriented Calculi for Propositional Logic 247


12.1 Test Calculi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.1.1 Normal Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.2 Analytical Tableaux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.2.1 Analytical Tableaux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.2.2 Practical Enhancements for Tableaux . . . . . . . . . . . . . . . . . . . . . 253
12.2.3 Soundness and Termination of Tableaux . . . . . . . . . . . . . . . . . . . . 254
12.3 Resolution for Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
12.3.1 Resolution for Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . 256
12.3.2 Killing a Wumpus with Propositional Inference . . . . . . . . . . . . . . . . 259
12.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

13 Propositional Reasoning: SAT Solvers 263


13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
13.2 Davis-Putnam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
13.3 DPLL ≙ (A Restricted Form of) Resolution . . . . . . . . . . . . . . . . . . . 267
13.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

14 First-Order Predicate Logic 273


14.1 Motivation: A more Expressive Language . . . . . . . . . . . . . . . . . . . . . . . 273
14.2 First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
14.2.1 First-Order Logic: Syntax and Semantics . . . . . . . . . . . . . . . . . . . 277
14.2.2 First-Order Substitutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
14.3 First-Order Natural Deduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
14.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

15 Automated Theorem Proving in First-Order Logic 291


15.1 First-Order Inference with Tableaux . . . . . . . . . . . . . . . . . . . . . . . . . . 291
15.1.1 First-Order Tableau Calculi . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
15.1.2 First-Order Unification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.1.3 Efficient Unification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
15.1.4 Implementing First-Order Tableaux . . . . . . . . . . . . . . . . . . . . . . 303
15.2 First-Order Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
15.2.1 Resolution Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.3 Logic Programming as Resolution Theorem Proving . . . . . . . . . . . . . . . . . 308
15.4 Summary: ATP in First-Order Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 311

16 Knowledge Representation and the Semantic Web 313


16.1 Introduction to Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . . 313
16.1.1 Knowledge & Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.1.2 Semantic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
16.1.3 The Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
16.1.4 Other Knowledge Representation Approaches . . . . . . . . . . . . . . . . . 325
16.2 Logic-Based Knowledge Representation . . . . . . . . . . . . . . . . . . . . . . . . 326
16.2.1 Propositional Logic as a Set Description Language . . . . . . . . . . . . . . 327
16.2.2 Ontologies and Description Logics . . . . . . . . . . . . . . . . . . . . . . . 330

16.2.3 Description Logics and Inference . . . . . . . . . . . . . . . . . . . . . . . . 332


16.3 A simple Description Logic: ALC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
16.3.1 Basic ALC: Concepts, Roles, and Quantification . . . . . . . . . . . . . . . 335
16.3.2 Inference for ALC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
16.3.3 ABoxes, Instance Testing, and ALC . . . . . . . . . . . . . . . . . . . . . . 346
16.4 Description Logics and the Semantic Web . . . . . . . . . . . . . . . . . . . . . . . 348

IV Planning & Acting 357


17 Planning I: Framework 361
17.1 Logic-Based Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.2 Planning: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
17.3 Planning History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
17.4 STRIPS Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
17.5 Partial Order Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
17.6 PDDL Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
17.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400

18 Planning II: Algorithms 401


18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
18.2 How to Relax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
18.3 Delete Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
18.4 The h+ Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
18.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433

19 Searching, Planning, and Acting in the Real World 435


19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
19.2 The Furniture Coloring Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
19.3 Searching/Planning with Non-Deterministic Actions . . . . . . . . . . . . . . . . . 438
19.4 Agent Architectures based on Belief States . . . . . . . . . . . . . . . . . . . . . . . 441
19.5 Searching/Planning without Observations . . . . . . . . . . . . . . . . . . . . . . . 443
19.6 Searching/Planning with Observation . . . . . . . . . . . . . . . . . . . . . . . . . 446
19.7 Online Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
19.8 Replanning and Execution Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . 454

20 Semester Change-Over 459


20.1 What did we learn in AI 1? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
20.2 Administrativa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
20.3 Overview over AI and Topics of AI-II . . . . . . . . . . . . . . . . . . . . . . . . . 468
20.3.1 What is Artificial Intelligence? . . . . . . . . . . . . . . . . . . . . . . . . . 468
20.3.2 Artificial Intelligence is here today! . . . . . . . . . . . . . . . . . . . . . . . 470
20.3.3 Ways to Attack the AI Problem . . . . . . . . . . . . . . . . . . . . . . . . 474
20.3.4 AI in the KWARC Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
20.3.5 Agents and Environments in AI2 . . . . . . . . . . . . . . . . . . . . . . . . 477

V Reasoning with Uncertain Knowledge 487


21 Quantifying Uncertainty 491
21.1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
21.2 Probabilistic Reasoning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 500

22 Probabilistic Reasoning: Bayesian Networks 513


22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
22.2 Constructing Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
22.3 Inference in Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
22.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525

23 Making Simple Decisions Rationally 527


23.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
23.2 Decision Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
23.3 Preferences and Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
23.4 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
23.5 Multi-Attribute Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
23.6 The Value of Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537

24 Temporal Probability Models 541


24.1 Modeling Time and Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
24.2 Inference: Filtering, Prediction, and Smoothing . . . . . . . . . . . . . . . . . . . . 545
24.3 Hidden Markov Models – Extended Example . . . . . . . . . . . . . . . . . . . . . 551
24.4 Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553

25 Making Complex Decisions 557


25.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
25.2 Utilities over Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
25.3 Value/Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
25.4 Partially Observable MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
25.5 Online Agents with POMDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572

VI Machine Learning 577

26 Learning from Observations 581


26.1 Forms of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
26.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
26.3 Learning Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
26.4 Using Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
26.5 Evaluating and Choosing the Best Hypothesis . . . . . . . . . . . . . . . . . . . . . 591
26.6 Computational Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
26.7 Regression and Classification with Linear Models . . . . . . . . . . . . . . . . . . . 603
26.8 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
26.9 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613

27 Statistical Learning 625


27.1 Full Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
27.2 Approximations of Bayesian Learning . . . . . . . . . . . . . . . . . . . . . . . . . 628
27.3 Parameter Learning for Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . 629

28 Reinforcement Learning 633


28.1 Reinforcement Learning: Introduction & Motivation . . . . . . . . . . . . . . . . . 633
28.2 Passive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
28.3 Active Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638

29 Knowledge in Learning 641


29.1 Logical Formulations of Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
29.2 Inductive Logic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
29.2.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
29.2.2 Top-Down Inductive Learning: FOIL . . . . . . . . . . . . . . . . . . . . . . 648
29.2.3 Inverse Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650

VII Natural Language 653


30 Natural Language Processing 657
30.1 Introduction to NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
30.2 Natural Language and its Meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
30.3 Looking at Natural Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
30.4 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
30.5 Part of Speech Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
30.6 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670
30.7 Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
30.8 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676

31 Deep Learning for NLP 679


31.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
31.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
31.3 Sequence-to-Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686
31.4 The Transformer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
31.5 Large Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691

32 What did we learn in AI 1/2? 695

VIII Excursions 707


A Completeness of Calculi for Propositional Logic 711
A.1 Abstract Consistency and Model Existence . . . . . . . . . . . . . . . . . . . . . . 711
A.2 A Completeness Proof for Propositional Tableaux . . . . . . . . . . . . . . . . . . . 717

B Conflict Driven Clause Learning 719


B.1 UP Conflict Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
B.2 Clause Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
B.3 Phase Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 728

C Completeness of Calculi for First-Order Logic 733


C.1 Abstract Consistency and Model Existence . . . . . . . . . . . . . . . . . . . . . . 733
C.2 A Completeness Proof for First-Order ND . . . . . . . . . . . . . . . . . . . . . . . 739
C.3 Soundness and Completeness of First-Order Tableaux . . . . . . . . . . . . . . . . 741
C.4 Soundness and Completeness of First-Order Resolution . . . . . . . . . . . . . . . . 742
Chapter 1

Preliminaries

In this chapter, we want to get all the organizational matters out of the way, so that we can get into
the discussion of artificial intelligence content unencumbered. We will talk about the necessary
administrative details, go into how students can get most out of the course, talk about where the
various resources provided with the course can be found, and finally introduce the ALeA system,
an experimental – using AI methods – learning support system for the AI course.

1.1 Administrative Ground Rules


We will now go through the ground rules for the course. This is a kind of social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible.

Prerequisites for AI-1


 Content Prerequisites: The mandatory courses in CS@FAU; Sem. 1-4, in par-
ticular:
 Course “Algorithmen und Datenstrukturen”. (Algorithms & Data Structures)
 Course “Grundlagen der Logik in der Informatik” (GLOIN). (Logic in CS)
 Course “Berechenbarkeit und Formale Sprachen”. (Theoretical CS)

 Skillset Prerequisite: Coping with mathematical formulation of the structures


 Mathematics is the language of science (in particular computer science)
 It allows us to be very precise about what we mean. (good for you)

 Intuition: (take them with a kilo of salt)


 This is what I assume you know! (I have to assume something)
 In most cases, the dependency on these is partial and “in spirit”.
 If you have not taken these (or do not remember), read up on them as needed!

 Real Prerequisites: Motivation, interest, curiosity, hard work.(AI-1 is non-trivial)


 You can do this course if you want! (and I hope you are successful)

Michael Kohlhase: Artificial Intelligence 1 1 2025-02-06


Note: I do not literally presuppose the courses on the slide above – most of you do not have a
bachelor’s degree from FAU, so you cannot have taken them. And indeed some of the content of
these courses is irrelevant for AI-1. Stating these courses is just the easiest way of specifying what
content I will be building on – and any graduate course has to build on something.
Many of you will have taken the moral equivalent of these courses in your undergraduate studies
at your home university. If you did not, you will have to somehow catch up on the content as we
go along in AI-1. This should be possible with enough motivation.
There are essentially three skillsets that are essential for AI-1:
1. A solid understanding and practical skill in programming (whatever programming language)

2. A good understanding and practice in using mathematical language to represent complex structures
3. A solid understanding of formal languages and grammars, as well as applied complexity theory
(basics of theoretical computer science).
Without (catching up on) these the AI-1 course will be quite frustrating and hard.
We will briefly go over the most important topics in ?? to synchronize concepts and notation.
Note that if you do not have a formal education in courses like the ones mentioned above you will
very probably have to do significant remedial work.
Now we come to a topic that is always interesting to the students: the grading scheme.

Assessment, Grades
 Overall (Module) Grade:

 Grade via the exam (Klausur) ; 100% of the grade.


 Up to 10% bonus on top for an exam with ≥ 50% points. (< 50% ⇝ no bonus)
 Bonus points ≙ percentage sum of the best 10 prepquizzes divided by 100.
 Exam: 90 minutes exam conducted in presence on paper! (∼ April 1. 2025)

 Retake Exam: 90 min exam six months later. (∼ October 1. 2025)

 Register for exams in https://campo.fau.de. (there is a deadline!)


 Note: You can de-register from an exam on https://campo.fau.de up to three
working days before the exam. (do not miss that if you are not prepared)

Michael Kohlhase: Artificial Intelligence 1 2 2025-02-06
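
To make the bonus rule on the slide above concrete, here is a minimal illustrative sketch (in Python) of how
I read it; the function name and the assumption that the bonus is expressed in percentage points on top of
the exam result are mine – for the binding rules, consult the official module description.

    def exam_bonus(exam_percent, prepquiz_percentages):
        # No bonus below 50% in the exam.
        if exam_percent < 50:
            return 0.0
        # Sum of the best 10 prepquiz percentages, divided by 100;
        # since at most 10 quizzes count, the bonus is capped at 10 points.
        best_ten = sorted(prepquiz_percentages, reverse=True)[:10]
        return sum(best_ten) / 100.0

    print(exam_bonus(62.0, [80.0] * 12))  # ten best quizzes at 80% each -> 8.0 bonus points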

Preparedness Quizzes
 PrepQuizzes: Every Tuesday at 16:15 we start the lecture with a 10 min online quiz
– the PrepQuiz – about the material from the previous week. (starts in week 2)
 Motivations: We do this to

 keep you prepared and working continuously. (primary)


 update the ALeA learner model (fringe benefit)
 The prepquiz will be given in the ALeA system

 https://courses.voll-ki.fau.de/quiz-dash/ai-1
 You have to be logged into ALeA! (via FAU IDM)

 You can take the prepquiz on your laptop or phone, . . .


 . . . in the lecture or at home . . .
 . . . via WLAN or 4G Network. (do not overload)
 Prepquizzes will only be available 16:15-16:25!

Michael Kohlhase: Artificial Intelligence 1 3 2025-02-06

This Thursday: Pretest

 This Thursday we will try out the prepquiz infrastructure with a pretest!

 Presence: bring your laptop or cellphone.


 Online: you can and should take the pretest as well.
 Have a recent Firefox or Chrome (Chrome: newer than March 2023)
 Make sure that you are logged into ALeA (via FAU IDM; see below)

 Definition 1.1.1. A pretest is an assessment for evaluating the preparedness of


learners for further studies.
 Concretely: This pretest
 establishes a baseline for the competency expectations in AI-1 and
 tests the ALeA quiz infrastructure for the prepquizzes.
 Participation in the pretest is optional; it will not influence grades in any way.
 The pretest covers the prerequisites of AI-1 and some of the material that may have
been covered in other courses.

 The test will also be used to refine the ALeA learner model, which may make the
learning experience in ALeA better. (see below)

Michael Kohlhase: Artificial Intelligence 1 4 2025-02-06



Due to the current AI hype, the course Artificial Intelligence is very popular and thus many degree
programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the
CS program does not fit the others very well; therefore there are some special conditions, which I
want to state here.

Special Admin Conditions


 Some degree programs do not “import” the course Artificial Intelligence 1, and thus
you may not be able to register for the exam via https://campo.fau.de.

 Just send me an e-mail and come to the exam. (we do the necessary admin)
 Tell your program coordinator about AI-1/2 so that they remedy this situation
 In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.

 ECTS credits need to be divisible by five ⇝ 7.5 + 7.5 = 15.

Michael Kohlhase: Artificial Intelligence 1 5 2025-02-06

I can only warn you about what I am aware of, so if your degree program makes you jump through extra hoops,
please tell me and then I can mention them here.

1.2 Getting Most out of AI-1


In this section we will discuss a couple of measures that students may want to consider to get
the most out of the AI-1 course.
None of the things discussed in this section – homeworks, tutorials, study groups, and
attendance – are mandatory (we cannot force you to do them; we offer them to you as learning
opportunities), but most of them are very clearly correlated with success (i.e. passing the exam
and getting a good grade), so taking advantage of them may be in your own interest.

AI-1 Homework Assignments


 Goal: Homework assignments reinforce what was taught in lectures.
 Homework Assignments: Small individual problem/programming/proof task
 but take time to solve (at least read them directly ⇝ questions)

 Didactic Intuition: Homework assignments give you material to test your under-
standing and show you how to apply it.
 Homeworks give no points, but without trying you are unlikely to pass the exam.
 Homeworks will be mainly peer-graded in the ALeA system.

 Didactic Motivation: Through peer grading students are able to see mistakes
in their thinking and can correct any problems in future assignments. By grading
assignments, students may learn how to complete assignments more accurately and
how to improve their future results. (not just us being lazy)

Michael Kohlhase: Artificial Intelligence 1 6 2025-02-06

It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and will take very little home from the course. Just sitting in the course and nodding is not
enough!

AI-1 Homework Assignments – Howto

 Homework Workflow: in ALeA (see below)


 Homework assignments will be published on Thursdays: see
https://courses.voll-ki.fau.de/hw/ai-1
 Submission of solutions via the ALeA system in the week after
 Peer grading/feedback (and master solutions) via answer classes.
 Quality Control: TAs and instructors will monitor and supervise peer grading.
 Experiment: Can we motivate enough of you to make peer assessment self-
sustaining?
 I am appealing to your sense of community responsibility here . . .
 You should only expect others to grade your submission if you grade theirs
(cf. Kant’s “Moral Imperative”)
 Make no mistake: The grader usually learns at least as much as the gradee.

 Homework/Tutorial Discipline:
 Start early! (many assignments need more than one evening’s work)
 Don’t start by sitting at a blank screen (talking & study groups help)
 Humans will be trying to understand the text/code/math when grading it.
 Go to the tutorials, discuss with your TA! (they are there for you!)

Michael Kohlhase: Artificial Intelligence 1 7 2025-02-06

If you have questions please make sure you discuss them with the instructor, the teaching assistants,
or your fellow students. There are three sensible venues for such discussions: online in the lectures,
in the tutorials, which we discuss now, or in the course forum – see below. Finally, it is always a
very good idea to form study groups with your friends.

Tutorials for Artificial Intelligence 1


 Approach: Weekly tutorials and homework assignments (first one in week two)
 Goal 1: Reinforce what was taught in the lectures. (you need practice)

 Goal 2: Allow you to ask any question you have in a protected environment.
 Instructor/Lead TA: Florian Rabe (KWARC Postdoc)
 Room: 11.137 @ Händler building, [email protected]
 Tutorials: One each taught by Florian Rabe (lead); Yasmeen Shawat, Hatem
Mousa, Xinyuan Tu, and Florian Guthmann.

 Life-saving Advice: Go to your tutorial, and prepare for it by having looked at


the slides and the homework assignments!

Michael Kohlhase: Artificial Intelligence 1 8 2025-02-06

Collaboration
 Definition 1.2.1. Collaboration (or cooperation) is the process of groups of agents
acting together for common, mutual benefit, as opposed to acting in competition
for selfish benefit. In a collaboration, every agent contributes to the common goal
and benefits from the contributions of others.

 In learning situations, the benefit is “better learning”.


 Observation: In collaborative learning, the overall result can be significantly better
than in competitive learning.
 Good Practice: Form study groups. (long- or short-term)

1. those learners who work most, learn most!


2. freeloaders – individuals who only watch – learn very little!
 It is OK to collaborate on homework assignments in AI-1! (no bonus points)
 Choose your study group well! (We will (eventually) help via ALeA)

Michael Kohlhase: Artificial Intelligence 1 9 2025-02-06

As we said above, almost all of the components of the AI-1 course are optional. That even applies
to attendance. But make no mistake, attendance is important to most of you. Let me explain, . . .

Do I need to attend the AI-1 Lectures


 Attendance is not mandatory for the AI-1 course. (official version)

 Note: There are two ways of learning: (both are OK, your mileage may vary)
 Approach B: Read a book/papers (here: lecture notes)
 Approach I: come to the lectures, be involved, interrupt the instructor whenever
you have a question.
The only advantage of I over B is that books/papers do not answer questions

 Approach S: come to the lectures and sleep does not work!


 The closer you get to research, the more we need to discuss!

Michael Kohlhase: Artificial Intelligence 1 10 2025-02-06

1.3 Learning Resources for AI-1


But what if you are not in a lecture or tutorial and want to find out more about the AI-1 topics?

Textbook, Handouts and Information, Forums, Videos


 Textbook: Russell/Norvig: Artificial Intelligence, A Modern Approach [RN09].
 basically “broad but somewhat shallow”
 great to get intuitions on the basics of AI
Make sure that you read an edition ≥ 3 ⇝ vastly improved over ≤ 2.
 Lecture notes: will be posted at https://kwarc.info/teaching/AI

 more detailed than [RN09] in some areas


 I mostly prepare them as we go along (semantically preloaded ⇝ research resource)
 please e-mail me any errors/shortcomings you notice. (improve for the group)
 Course Videos: AI-1 will be streamed/recorded at https://fau.tv/course/id/4047
 Organized: Video course nuggets are available at https://fau.tv/course/id/1690
(short; organized by topic)
 Backup: The lectures from WS 2016/17 to SS 2018 have been recorded
(in English and German), see https://www.fau.tv/search/term.html?q=Kohlhase
 Do not let the videos mislead you: Coming to class is highly correlated with
passing the exam!
 StudOn Forum: https://www.studon.fau.de/crs5832535.html for

 announcements, homeworks (my view on the forum)


 questions, discussion among your fellow students (your forum too, use it!)

Michael Kohlhase: Artificial Intelligence 1 11 2025-02-06

FAU has issued a very insightful guide on using lecture videos. It is a good idea to heed these
recommendations, even if they seem annoying at first.

Practical recommendations on Lecture Videos


 Excellent Guide: [Nor+18a] (German version at [Nor+18b])

[Poster: “Using lecture recordings – a guide for students”: Attend lectures. Take notes. Be specific.
Catch up. Ask for help. Don’t cut corners.]

Michael Kohlhase: Artificial Intelligence 1 12 2025-02-06

NOT a Resource for : LLMs – AI-based tools like ChatGPT


 Definition 1.3.1. A large language model (LLM) is a computational model capable
of language generation or other natural language processing tasks.
 Example 1.3.2. OpenAI’s GPT, Google’s Bard, and Meta’s Llama.

 Definition 1.3.3. A chatbot is a software application or web interface that is


designed to mimic human conversation through text or voice interactions. Modern
chatbots are usually based on LLMs.
 Example 1.3.4 (ChatGPT talks about AI-1). (but remains vague)

 Note: LLM-based chatbots invent every word! (surprisingly often correct)


 Example 1.3.5 (In the AI-1 exam). ChatGPT scores ca. 50% of the points.
 ChatGPT can almost pass the exam . . . (We could award it a Master’s degree)
 But can you? (the AI-1 exams will be in person on paper)

You will only pass the exam if you can do AI-1 yourself!
 Intuition: AI tools like ChatGPT, Copilot, etc. (see also [She24])
 can help you solve problems, (valuable tools in production situations)
 but hinder learning if used for homeworks/quizzes, etc. (like driving instead of
jogging)
 What (not) to do: (to get the most out of the brave new AI-supported world)
 try out these tools to get a first-hand intuition what they can/cannot do
 challenge yourself while learning so that you can also do it (mind over matter!)

Michael Kohlhase: Artificial Intelligence 1 13 2025-02-06

1.4 AI-Supported Learning


In this section we introduce the ALeA (Adaptive Learning Assistant) system, a learning support
system we have developed using symbolic AI methods – the stuff we learn about in AI-1 – and
which we will use to support students in the course. As such, ALeA does double duty in the AI-1
course: it supports learning activities and serves as a showcase of what symbolic AI methods can
do in an important application.

ALeA: Adaptive Learning Assistant

 Idea: Use AI methods to help teach/learn AI (AI4AI)


 Concretely: Provide HTML versions of the AI-1 slides/lecture notes and embed
learning support services into them. (for pre/postparation of lectures)
 Definition 1.4.1. Call a document active, iff it is interactive and adapts to specific
information needs of the readers. (lecture notes on steroids)
 Intuition: ALeA serves active course materials. (PDF mostly inactive)
 Goal: Make ALeA more like an instructor + study group than like a book!
 Example 1.4.2 (Course Notes). ≙ Slides + Comments

⇝ yellow parts in the table of contents (left) are already covered in lectures.



Michael Kohlhase: Artificial Intelligence 1 14 2025-02-06

The central idea in the AI4AI approach – using AI to support learning AI – and thus the ALeA
system is that we want to make course materials – i.e. what we give to students for preparing and
postparing lectures – more like teachers and study groups (only available 24/7) than like static
books.

VoLL-KI Portal at https://courses.voll-ki.fau.de


 Portal for ALeA Courses: https://courses.voll-ki.fau.de

 AI-1 in ALeA: https://courses.voll-ki.fau.de/course-home/ai-1


 All details for the course.
 recorded syllabus (keep track of material covered in course)
 syllabus of the last semesters (for over/preview)

 ALeA Status: The ALeA system is deployed at FAU for over 1000 students
taking eight courses
 (some) students use the system actively (our logs tell us)
 reviews are mostly positive/enthusiastic (error reports pour in)

Michael Kohlhase: Artificial Intelligence 1 15 2025-02-06

The ALeA AI-1 page is the central entry point for working with the ALeA system. You can get
to all the components of the system, including two presentations of the course contents (notes-
and slides-centric ones), the flashcards, the localized forum, and the quiz dashboard.
We now come to the heart of the ALeA system: its learning support services, which we will now
briefly introduce. Note that this presentation is not really sufficient to understand what you may
be getting out of them; you will have to try them, and interact with them sufficiently that the
learner model can get a good estimate of your competencies and adapt the results to you.

Learning Support Services in ALeA


 Idea: Embed learning support services into active course materials.
 Example 1.4.3 (Definition on Hover). Hovering on a (cyan) term reference
reminds us of its definition. (even works recursively)

 Example 1.4.4 (More Definitions on Click). Clicking on a (cyan) term reference


shows us more definitions from other contexts.

 Example 1.4.5 (Guided Tour). A guided tour for a concept c assembles defini-
tions/etc. into a self-contained mini-course culminating at c.

[Screenshot: a guided tour for c = countable]

 . . . your idea here . . . (the sky is the limit)

Michael Kohlhase: Artificial Intelligence 1 16 2025-02-06

Note that this is only an initial collection of learning support services; we are constantly working
on additional ones. Look out for feature notifications on the upper right hand side of
the ALeA screen.

(Practice/Remedial) Problems Everywhere


 Problem: Learning requires a mix of understanding and test-driven practice.
 Idea: ALeA supplies targeted practice problems everywhere.

 Concretely: Revision markers at the end of sections.


 A relatively non-intrusive overview over competency

 Click to extend it for details.



 Practice problems as usual. (targeted to your specific competency)

Michael Kohlhase: Artificial Intelligence 1 17 2025-02-06

While the learning support services up to now have been addressed to individual learners, we
now turn to services addressed to communities of learners, ranging from study groups with three
learners, to whole courses, and even – eventually – all the alumni of a course, if they have not
de-registered from ALeA.
Currently, the community aspect of ALeA only consists in localized interactions with the course
materials.
The ALeA system uses the semantic structure of the course materials to localize some interactions
that otherwise often live in separate applications. Here we see two:
1. one for reporting content errors – and thus making the material better for all learners – and
2. a localized course forum, where forum threads can be attached to learning objects.

Localized Interactions with the Community


 Selecting text brings up localized – i.e. anchored on the selection – interactions:
 post a (public) comment or take (private) note
 report an error to the course authors/instructors

 Localized comments induce a thread in the ALeA forum (like the StudOn
Forum, but targeted towards specific learning objects.)

 Answering questions gives karma ≙ a public measure of user helpfulness.
 Notes can be anonymous (⇝ generate no karma)

Michael Kohlhase: Artificial Intelligence 1 18 2025-02-06

Let us briefly look into how the learning support services introduced above might work, focusing
on where the necessary information might come from. Even though some of the concepts in the
discussion below may be new to AI-1 students, it is worth looking into them. Bear with us as we
try to explain the AI components of the ALeA system.

ALeA ≙ Data-Driven & AI-enabled Learning Assistance

 Idea: Do what a teacher does! Use/maintain four models.
[Figure: the four models – learner model, rhetoric/didactic model, domain model, and formulation
model – illustrated with a small example theory graph on DyBN, MDP, POMDP, utilities, and posets.]

 Ingredient 1: Domain model ≙ knowledge/theory graph

 Ingredient 2: Learner model ≙ adding competency estimations

 Ingredient 3: A collection of ready-formulated learning objects

 Ingredient 4: Educational dialogue planner ⇝ guided tours


(Good) teachers

 understand the objects and their properties they are talking about
 have ready-made formulations of how to convey them best

 and understand how these best work together


 model what the learners already know/understand and adapt to it accordingly

A theory graph provides (modular representation of the domain)


 symbols with URIs for all concepts, objects, and relations
 definitions, notations, and verbalizations for all symbols

 “object-oriented inheritance” and views between theories.


The learner model is a function from learner IDs × symbol URIs to competency values
 competency comes in six cognitive dimensions: remember, understand, analyze,
evaluate, apply, and create.

 ALeA logs all learner interactions (keeps data learner-private)


 each interaction updates the learner model function.
Learning objects are the text fragments learners see and interact with; they are struc-
tured by

 didactic relations, e.g. tasks have prerequisites and learning objectives


 rhetoric relations, e.g. introduction, elaboration, and transition
The dialogue planner assembles learning objects into active course material using

 the domain model and didactic relations to determine the order of LOs
 the learner model to determine what to show
 the rhetoric relations to make the dialogue coherent

Michael Kohlhase: Artificial Intelligence 1 19 2025-02-06
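
To make the learner model described on the slide above a bit more concrete, here is a purely illustrative
sketch (in Python) of such a function from learner IDs and symbol URIs to competency values in the six
cognitive dimensions; the data structure, the update rule, and all names are my own assumptions and not
the actual ALeA implementation.

    from collections import defaultdict

    DIMENSIONS = ("remember", "understand", "analyze", "evaluate", "apply", "create")

    # learner model: (learner ID, symbol URI) -> {cognitive dimension: competency in [0, 1]}
    learner_model = defaultdict(lambda: {d: 0.0 for d in DIMENSIONS})

    def record_interaction(learner_id, symbol_uri, dimension, evidence, rate=0.2):
        # Each logged interaction nudges the competency estimate towards the observed
        # evidence, e.g. a correct/incorrect prepquiz answer mapped to 1.0/0.0.
        old = learner_model[(learner_id, symbol_uri)][dimension]
        learner_model[(learner_id, symbol_uri)][dimension] = old + rate * (evidence - old)

    # Example (hypothetical learner ID and symbol URI):
    record_interaction("fau_id_xyz", "https://example.org/ai1/minimax", "apply", 1.0)

A dialogue planner in the sense of the slide could then consult these values (together with the prerequisite
relations from the domain model) to decide which learning objects to show next.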

We can use the same four models discussed in the context of guided tours to deploy additional
learning support services, which we now discuss.

New Feature: Drilling with Flashcards


 Flashcards challenge you with a task (term/problem) on the front. . .

. . . and the definition/answer is on the back.

 Self-assessment updates the learner model (before/after)



 Idea: Challenge yourself to a card stack, keep drilling/assessing flashcards until


the learner model eliminates all.
 Bonus: Flashcards can be generated from existing semantic markup (educational
equivalent to free beer)

Michael Kohlhase: Artificial Intelligence 1 20 2025-02-06
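
The drill loop sketched on the slide can be imagined roughly as follows – again only a toy reading in
Python, not the actual ALeA flashcard implementation; the card format, the single-number competency
per symbol, the threshold 0.8, and the update rate are all assumptions made for illustration.

    def drill(cards, competency, threshold=0.8, rate=0.3):
        # Keep presenting cards whose competency estimate is below the threshold;
        # the learner's self-assessment feeds back into the (toy) learner model.
        while True:
            due = [c for c in cards if competency.get(c["symbol"], 0.0) < threshold]
            if not due:
                break  # the learner model has "eliminated" all cards in the stack
            for card in due:
                print("Front:", card["front"])
                input("Think of the answer, then press Enter ...")
                print("Back:", card["back"])
                ok = input("Did you know it? [y/n] ").strip().lower() == "y"
                old = competency.get(card["symbol"], 0.0)
                competency[card["symbol"]] = old + rate * ((1.0 if ok else 0.0) - old)

    cards = [{"symbol": "minimax", "front": "What does minimax compute?",
              "back": "The game value under optimal play by both players."}]
    drill(cards, competency={})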

We have already seen above how the learner model can drive the drilling with flashcards. It can
also be used for the configuration of card stacks, by configuring a domain – e.g. a section in the
course materials – and a competency threshold. We now come to a very important issue that we
always face when we build AI systems that interface with humans. Most web technology companies
either take the approach “the user pays for the services with their personal data, which is sold
on” or integrate advertising for remuneration. Both are not acceptable in a university setting.
But abstaining from monetizing personal data still leaves the problem of how to protect it from
intentional or accidental misuse. Even though the GDPR has quite extensive exceptions for
research, the ALeA system – a research prototype – adheres to the principles and mandates of
the GDPR. In particular it makes sure that personal data of the learners is only used in learning
support services directly or indirectly initiated by the learners themselves.

Learner Data and Privacy in ALeA


 Observation: Learning support services in ALeA use the learner model; they
 need the learner model data to adapt to the individual learner!
 collect learner interaction data (to update the learner model)
 Consequence: You need to be logged in (via your FAU IDM credentials) for useful
learning support services!
 Problem: Learner model data is highly sensitive personal data!

 ALeA Promise: The ALeA team does the utmost to keep your personal data
safe. (SSO via FAU IDM/eduGAIN, ALeA trust zone)
 ALeA Privacy Axioms:
1. ALeA only collects learner model data about logged-in users.
2. Personally identifiable learner model data is only accessible to its subject
(delegation possible)
3. Learners can always query the learner model about its data.
4. All learner model data can be purged without negative consequences (except
usability deterioration)
5. Logging into ALeA is completely optional.
 Observation: Authentication for bonus quizzes is somewhat less optional, but
you can always purge the learner model later.

Michael Kohlhase: Artificial Intelligence 1 21 2025-02-06

So, now that you have an overview of what the ALeA system can do for you, let us see what
you concretely have to do to be able to use it.

Concrete Todos for ALeA


 Recall: You will use ALeA for the prepquizzes (or lose bonus points)
All other use is optional. (but AI-supported pre/postparation can be helpful)

 To use the ALeA system, you will have to log in via SSO: (do it now)
 go to https://courses.voll-ki.fau.de/course-home/ai-1,

 in the upper right hand corner you see ,


 log in via your FAU IDM credentials. (you should have them by now)

 You get access to your personal ALeA profile via


(plus feature notifications, manual, and language chooser)
 Problem: Most ALeA services depend on the learner model. (to adapt to you)
 Solution: Initialize your learner model with your educational history!

 Concretely: enter taken CS courses (FAU equivalents) and grades.


 ALeA uses that to estimate your CS/AI competencies. (for your benefit)
 then ALeA knows about you; I don’t! (ALeA trust zone)

Michael Kohlhase: Artificial Intelligence 1 22 2025-02-06

Even if you did not understand some of the AI jargon or the underlying methods (yet), you should
be good to go for using the ALeA system in your day-to-day work.
Chapter 2

Artificial Intelligence – Who?,


What?, When?, Where?, and Why?

We start the course by giving an overview of (the problems, methods, and issues of ) Artificial
Intelligence, and what has been achieved so far.
Naturally, this will dwell mostly on philosophical aspects – we will try to understand what
the important issues might be, what questions we should even be asking, what the most
important avenues of attack may be, and where AI research is being carried out.
In particular the discussion will be very non-technical – we have very little basis to discuss
technicalities yet. But stay with me, this will drastically change very soon. A Video Nugget
covering the introduction of this chapter can be found at https://fau.tv/clip/id/21467.

Plot for this chapter


 Motivation, overview, and finding out what you already know
 What is Artificial Intelligence?
 What has AI already achieved?
 A (very) quick walk through the AI-1 topics.
 How can you get involved with AI at KWARC?

Michael Kohlhase: Artificial Intelligence 1 23 2025-02-06

2.1 What is Artificial Intelligence?


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21701.
The first question we have to ask ourselves is “What is Artificial Intelligence?”, i.e. how can we
define it. And already that poses a problem, since the natural definition – like human intelligence,
but artificially realized – presupposes a definition of intelligence, which is equally problematic; even
psychologists and philosophers – the subjects nominally “in charge” of natural intelligence – have
problems defining it, as witnessed by the plethora of theories e.g. found at [WHI].

What is Artificial Intelligence? Definition


 Definition 2.1.1 (According to Wikipedia). Artificial Intelligence (AI) is intelligence
exhibited by machines.

 Definition 2.1.2 (also). Artificial Intelligence (AI) is a sub-field of computer science
that is concerned with the automation of intelligent behavior.

 BUT: it is already difficult to define intelligence precisely.

 Definition 2.1.3 (Elaine Rich). Artificial Intelligence (AI) studies how we can make
the computer do things that humans can still do better at the moment.
Michael Kohlhase: Artificial Intelligence 1 24 2025-02-06

Maybe we can get around the problems of defining “what artificial intelligence is”, by just describing
the necessary components of AI (and how they interact). Let’s have a try to see whether that is
more informative.

What is Artificial Intelligence? Components


 Elaine Rich: AI studies how we can make the computer do things that humans
can still do better at the moment.

 This needs a combination of

the ability to learn

Inference

Perception

Language understanding

Emotion

Michael Kohlhase: Artificial Intelligence 1 25 2025-02-06

Note that this list of components is controversial as well. Some say that it lumps together cognitive
capacities that should be distinguished or forgets others, . . . . We state it here much more to get
AI-1 students to think about the issues than to make it normative.

2.2 Artificial Intelligence is here today!


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21697.
The components of Artificial Intelligence are quite daunting, and none of them are fully understood,
much less achieved artificially. But for some tasks we can get by with much less. And indeed that
is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal around. This
practice of “trying to achieve AI in selected and restricted domains” (cf. the discussion starting
with slide 32) has borne rich fruits: systems that meet or exceed human capabilities in such areas.
Such systems are in common use in many domains of application.

Artificial Intelligence is here today!



 in outer space
   in outer space systems need autonomous control: remote control impossible due to time lag
 in artificial limbs
   the user controls the prosthesis via existing nerves, can e.g. grip a sheet of paper.
 in household appliances
   The iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges.
   general robotic household help is on the horizon.
 in hospitals
   in the USA 90% of the prostate operations are carried out by RoboDoc
   Paro is a cuddly robot that eases solitude in nursing homes.

Michael Kohlhase: Artificial Intelligence 1 26 2025-02-06

We will conclude this section with a note of caution.

The AI Conundrum
 Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
 But: researchers at the Dartmouth Conference (1956) really thought they would
solve/reach AI in two/three decades.

 Consequence: AI still asks the big questions. (and still promises answers soon)
 Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
 AI Conundrum: Once AI solves a subfield it is called “computer science”.
(becomes a separate subfield of CS)
 Example 2.2.1. Functional/Logic Programming, automated theorem proving,
Planning, machine learning, Knowledge Representation, . . .
 Still Consequence: AI research was alternatingly flooded with money and cut off
brutally.

Michael Kohlhase: Artificial Intelligence 1 27 2025-02-06

All of these phenomena can be seen in the growth of AI as an academic discipline over the course
of its now over 70 year long history.

The current AI Hype — Part of a longer Story


 The history of AI as a discipline has been very much tied to the amount of funding
– that allows us to do research and development.
 Funding levels are tied to public perception of success (especially for AI)
 Definition 2.2.2. An AI winter is a time period of low public perception and
funding for AI,
mostly because AI has failed to deliver on its – sometimes overblown – promises
An AI summer is a time period of high public perception and funding for AI
 A potted history of AI (AI summers and winters)

AI becomes
scarily effective,
ubiquitous

Excitement fades;
some applications
AI-conse- profit a lot
quences,
Biases, AI-bubble bursts,
Regulation the next AI winter
Lighthill report WWW ; comes
Dartmouth Conference Data/-
Turing Test Computing
AI Winter 2
AI Winter 1 Explosion
1987-1994
1974-1980

1950 1960 1970 1980 1990 2000 2010 2021

Michael Kohlhase: Artificial Intelligence 1 28 2025-02-06

Of course, the future of AI is still unclear; we are currently in a massive hype caused by the advent
of deep neural networks being trained on all the data of the Internet, using the computational
power of huge compute farms owned by an oligopoly of massive technology companies – we are
definitely in an AI summer.
But AI as an academic community and the tech industry also make outrageous promises, and
the media pick it up and distort it out of proportion, . . . So public opinion could flip again, sending
AI into the next winter.

2.3 Ways to Attack the AI Problem


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21717.
There are currently three main avenues of attack to the problem of building artificially intelligent
systems. The (historically) first is based on the symbolic representation of knowledge about the
world and uses inference-based methods to derive new knowledge on which to base action decisions.
The second uses statistical methods to deal with uncertainty about the world state and learning
methods to derive new (uncertain) world assumptions to act on.

Four Main Approaches to Artificial Intelligence


 Definition 2.3.1. Symbolic AI is a subfield of AI based on the assumption that
many aspects of intelligence can be achieved by the manipulation of symbols, com-
bining them into meaning-carrying structures (expressions) and manipulating them
(using processes) to produce new expressions.
 Definition 2.3.2. Statistical AI remedies the two shortcomings of symbolic AI
approaches: that all concepts represented by symbols are crisply defined, and that
all aspects of the world are knowable/representable in principle. Statistical AI adopts
sophisticated mathematical models of uncertainty and uses them to create more
accurate world models and reason about them.
 Definition 2.3.3. Subsymbolic AI (also called connectionism or neural AI) is a
subfield of AI that posits that intelligence is inherently tied to brains, where infor-
mation is represented by a simple sequence of pulses that are processed in parallel via
simple calculations realized by neurons, and thus concentrates on neural computing.

 Definition 2.3.4. Embodied AI posits that intelligence cannot be achieved by


reasoning about the state of the world (symbolically, statistically, or connectivist),
but must be embodied i.e. situated in the world, equipped with a “body” that can

interact with it via sensors and actuators. Here, the main method for realizing
intelligent behavior is by learning from the world.

Michael Kohlhase: Artificial Intelligence 1 29 2025-02-06

As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersection
of computer science (logic, programming, applied statistics), Cognitive Science (psychology, neu-
roscience), philosophy (can machines think, what does that mean?), linguistics (natural language
understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.

Two ways of reaching Artificial Intelligence?


 We can classify the AI approaches by their coverage and the analysis depth (they
are complementary)

Analysis ↑ vs. Coverage →     Narrow                   Wide
Deep                          symbolic (AI-1)          not there yet (cooperation?)
Shallow                       no-one wants this        statistical/subsymbolic (AI-2)

 This semester we will cover foundational aspects of symbolic AI (deep/narrow


processing)
 next semester concentrate on statistical/subsymbolic AI.
(shallow/wide-coverage)

Michael Kohlhase: Artificial Intelligence 1 30 2025-02-06

We combine the topics in this way in this course, not only because this reproduces the histor-
ical development but also as the methods of statistical and subsymbolic AI share a common
basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI
have their respective fortes correspond to natural application areas.

Environmental Niches for both Approaches to AI


 Observation: There are two kinds of applications/tasks in AI

 Consumer tasks: consumer grade applications have tasks that must be fully
generic and wide coverage. ( e.g. machine translation like Google Translate)
 Producer tasks: producer grade applications must be high-precision, but can be

domain-specific (e.g. multilingual documentation, machinery-control, program


verification, medical technology)

[Figure: precision vs. coverage; producer tasks sit near 100% precision at a coverage of about 10^(3±1) concepts, consumer tasks near 50% precision at about 10^(6±1) concepts; after Aarne Ranta [Ran17].]

 General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
 A domain of producer tasks I am interested in: mathematical/technical documents.

Michael Kohlhase: Artificial Intelligence 1 31 2025-02-06

An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also provide comprehensive
machine operation manuals, a non-trivial undertaking, since no two machines are identical and
they must be translated into many languages, leading to hundreds of documents. As those manuals
share a lot of semantic content, their management should be supported by AI techniques. It is
critical that these methods maintain a high precision, since operation errors can easily lead to very
costly machine damage and loss of production. On the other hand, the domain of these manuals is
quite restricted. A machine tool has only a couple of hundred components that can be described
by only a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.

2.4 Strong vs. Weak AI


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21724.
To get this out of the way before we begin: We now come to a distinction that is often mud-
dled in popular discussions about “Artificial Intelligence”, but should be crystal clear to students
of the course AI-1 – after all, you are upcoming “AI-specialists”.

Strong AI vs. Narrow AI


 Definition 2.4.1. With the term narrow AI (also weak AI, instrumental AI, applied
AI) we refer to the use of software to study or accomplish specific problem solving
or reasoning tasks (e.g. playing chess/go, controlling elevators, composing music,
...)
 Definition 2.4.2. With the term strong AI (also full AI, AGI) we denote the quest
for software performing at the full range of human cognitive abilities.
 Definition 2.4.3. Problems requiring strong AI to solve are called AI hard, and AI
complete, iff AGI should be able to solve them all.

 In short: We can characterize the difference intuitively:


 narrow AI: What (most) computer scientists think AI is / should be.
 strong AI: What Hollywood authors think AI is / should be.

 Needless to say we are only going to cover narrow AI in this course!

Michael Kohlhase: Artificial Intelligence 1 32 2025-02-06

One can usually defuse public worries about “is AI going to take control over the world” by just
explaining the difference between strong AI and weak AI clearly.
I would like to add a few words on AGI, that – if you adopt them; they are not universally accepted
– will strengthen the arguments differentiating between strong and weak AI.

A few words on AGI. . .


 The conceptual and mathematical framework (agents, environments etc.) is the
same for strong AI and weak AI.
 AGI research focuses mostly on abstract aspects of machine learning (reinforce-
ment learning, neural nets) and decision/game theory (“which goals should an AGI
pursue?”).
 Academic respectability of AGI fluctuates massively, recently increased (again).
(correlates somewhat with AI winters and golden years)
 Public attention increasing due to talk of “existential risks of AI” (e.g. Hawking,
Musk, Bostrom, Yudkowsky, Obama, . . . )

 Kohlhase’s View: Weak AI is here, strong AI is very far off. (not in my lifetime)
 But: even if that is true, weak AI will affect all of us deeply in everyday life.
 Example 2.4.4. You should not train to be an accountant or truck driver!
(bots will replace you soon)

Michael Kohlhase: Artificial Intelligence 1 33 2025-02-06

I want to conclude this section with an overview of the recent protagonists – both personal and
institutional – of AGI.

AGI Research and Researchers


 “Famous” research(ers) / organizations
 MIRI (Machine Intelligence Research Institute), Eliezer Yudkowsky (Formerly
known as “Singularity Institute”)
 Future of Humanity Institute Oxford (Nick Bostrom),
 Google (Ray Kurzweil),
 AGIRI / OpenCog (Ben Goertzel),
 petrl.org (People for the Ethical Treatment of Reinforcement Learners).
(Obviously somewhat tongue-in-cheek)
 Be highly skeptical about any claims with respect to AGI! (Kohlhase’s View)

Michael Kohlhase: Artificial Intelligence 1 34 2025-02-06

2.5 AI Topics Covered


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21719.
We will now preview the topics covered by the course “Artificial Intelligence” in the next two
semesters.

Topics of AI-1 (Winter Semester)


 Getting Started
 What is Artificial Intelligence? (situating ourselves)
 Logic programming in Prolog (An influential paradigm)
 Intelligent Agents (a unifying framework)
 Problem Solving
 Problem Solving and search (Black Box World States and Actions)
 Adversarial search (Game playing) (A nice application of search)
 constraint satisfaction problems (Factored World States)
 Knowledge and Reasoning
 Formal Logic as the mathematics of Meaning
 Propositional logic and satisfiability (Atomic Propositions)
 First-order logic and theorem proving (Quantification)
 Logic programming (Logic + Search; Programming)
 Description logics and semantic web
 Planning

 Planning Frameworks
 Planning Algorithms
 Planning and Acting in the real world

Michael Kohlhase: Artificial Intelligence 1 35 2025-02-06

Topics of AI-2 (Summer Semester)


 Uncertain Knowledge and Reasoning
 Uncertainty
 Probabilistic reasoning
 Making Decisions in Episodic Environments
 Problem Solving in Sequential Environments
 Foundations of machine learning

 Learning from Observations


 Knowledge in Learning
 Statistical Learning Methods

 Communication (If there is time)


 Natural Language Processing
 Natural Language for Communication

Michael Kohlhase: Artificial Intelligence 1 36 2025-02-06

AI1SysProj: A Systems/Project Supplement to AI-1


 The AI-1 course concentrates on concepts, theory, and algorithms of symbolic AI.

 Problem: Engineering/Systems Aspects of AI are very important as well.


 Partial Solution: Getting your hands dirty in the homeworks and the Kalah
Challenge
 Full Solution: AI1SysProj: AI-1 Systems Project (10 ECTS, 30-50 places)

 For each topic of AI-1, there will be a mini-project in AI1SysProj


 e.g. for game-play there will be Chinese Checkers (more difficult than Kalah)
 e.g. for CSP we will schedule TechFak courses or exams (from real data)
 solve challenges by implementing the AI-1 algorithms or use SoA systems

 Question: Should I take AI1SysProj in my first semester? (i.e. now)


 Answer: It depends . . . (on your situation)
 most master’s programs require a 10-ECTS “Master’s Project”(Master AI: two)
 there will be a great pressure on project places (so reserve one early)
 BUT 10 ECTS ≙ 250-300 hours involvement by definition (1/3 of your time/ECTS)
 BTW: There will also be an AI2SysProj next semester! (another chance)

Michael Kohlhase: Artificial Intelligence 1 37 2025-02-06

2.6 AI in the KWARC Group


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21725.
Now allow me to beat my own drum. In my research group at FAU, we do research on
a particular kind of Artificial Intelligence: logic, language, and information. This may not be
the most fashionable or well-hyped area in AI, but it is challenging, well-respected, and – most
importantly – fun.

The KWARC Research Group



 Observation: The ability to represent knowledge about the world and to draw
logical inferences is one of the central components of intelligent behavior.
 Thus: reasoning components of some form are at the heart of many AI systems.

 KWARC Angle: Scaling up (web-coverage) without dumbing down (too much)


 Content markup instead of full formalization (too tedious)
 User support and quality control instead of “The Truth” (elusive anyway)
 use Mathematics as a test tube (Mathematics ≙ Anything Formal)
 care more about applications than about philosophy (we cannot help getting
this right anyway as logicians)
 The KWARC group was established at Jacobs Univ. in 2004, moved to FAU Erlan-
gen in 2016
 see http://kwarc.info for projects, publications, and links

Michael Kohlhase: Artificial Intelligence 1 38 2025-02-06

Research in the KWARC group covers a variety of topics, which range from foundations of
mathematics to relatively applied web information systems. I will try to organize them into three
pillars here.

Overview: KWARC Research and Projects

Applications: eMath 3.0, Active Documents, Active Learning, Semantic Spreadsheets/CAD/CAM,
Change Management, Global Digital Math Library, Math Search Systems, SMGloM: Semantic
Multilingual Math Glossary, Serious Games, . . .

Foundations of Math:
 MathML, OpenMath
 advanced Type Theories
 Mmt: Meta Meta Theory
 Logic Morphisms/Atlas
 Theorem Prover/CAS Interoperability
 Mathematical Models/Simulation

KM & Interaction:
 Semantic Interpretation (aka. Framing)
 math-literate interaction
 MathHub: math archives & active docs
 Active documents: embedded semantic services
 Model-based Education

Semantization:
 LATEXML: LATEX ; XML
 STEX: Semantic LATEX
 invasive editors
 Context-Aware IDEs
 Mathematical Corpora
 Linguistics of Math
 ML for Math Semantics Extraction

Foundations: Computational Logic, Web Technologies, OMDoc/Mmt

Michael Kohlhase: Artificial Intelligence 1 39 2025-02-06

For all of these areas, we are looking for bright and motivated students to work with us. This
can take various forms, theses, internships, and paid students assistantships.

Research Topics in the KWARC Group


 We are always looking for bright, motivated KWARCies.
 We have topics in for all levels! (Enthusiast, Bachelor, Master, Ph.D.)

 List of current topics: https://gl.kwarc.info/kwarc/thesis-projects/


 Automated Reasoning: Maths Representation in the Large
 Logics development, (Meta)n -Frameworks
 Math Corpus Linguistics: Semantics Extraction
 Serious Games, Cognitive Engineering, Math Information Retrieval, Legal Rea-
soning, . . .
 . . . last but not least: KWARC is the home of ALeA!
 We always try to find a topic at the intersection of your and our interests.
 We also sometimes have positions! (HiWi, Ph.D.: 1/2 E-13, PostDoc: full E-13)

Michael Kohlhase: Artificial Intelligence 1 40 2025-02-06

Sciences like physics or geology, and engineering need high-powered equipment to perform mea-
surements or experiments. Computer science and in particular the KWARC group needs high-
powered human brains to build systems and conduct thought experiments.
The KWARC group may not always have as much funding as other AI research groups, but
we are very dedicated to give the best possible research guidance to the students we supervise.
So if this appeals to you, please come by and talk to us.
Part I

Getting Started with AI: A


Conceptual Framework


This part of the lecture notes sets the stage for the technical parts of the course by establishing
a common framework (Rational Agents) that gives context and ties together the various methods
discussed in the course.
After having seen what AI can do and where AI is being employed today (see ??), we will now

1. introduce a programming language to use in the course,


2. prepare a conceptual framework in which we can think about “intelligence” (natural and arti-
ficial), and
3. recap some methods and results from theoretical computer science that we will need throughout
the course.
ad 1. Prolog: For the programming language we choose Prolog, historically one of the most
influential “AI programming languages”. While the other AI programming language: Lisp which
gave rise to the functional programming programming paradigm has been superseded by typed
languages like SML, Haskell, Scala, and F#, Prolog is still the prime example of the declarative
programming paradigm. So using Prolog in this course gives students the opportunity to explore
this paradigm. At the same time, Prolog is well-suited for trying out algorithms in symbolic AI the
topic of this semester since it internalizes the more complex primitives of the algorithms presented
here.
ad 2. Rational Agents: The conceptual framework centers around rational agents which
combine aspects of purely cognitive architectures (an original concern for the field of AI) with the
more recent realization that intelligence must interact with the world (embodied AI) to grow and
learn. The cognitive architectures aspect allows us to place and relate the various algorithms and
methods we will see in this course. Unfortunately, the “situated AI” aspect will not be covered in
this course due to the lack of time and hardware.
ad 3. Topics of Theoretical Computer Science: When we evaluate the methods and
algorithms introduced in AI-1, we will need to judge their suitability as agent functions. The main
theoretical tool for that is complexity theory; we will give a short motivation and overview of the
main methods and results as far as they are relevant for AI-1 in ??.
In the second half of the semester we will transition from search-based methods for problem
solving to inference-based ones, i.e. where the problem formulation is described as expressions of a
formal language which are transformed until an expression is reached from which the solution can
be read off. Phrase structure grammars are the method of choice for describing such languages;
we will introduce/recap them in ??.

Enough philosophy about “Intelligence” (Artificial or Natural)


 So far we had a nice philosophical chat, about “intelligence” et al.

 As of today, we look at technical stuff!


 Before we go into the algorithms and data structures proper, we will
1. introduce a programming language for AI-1
2. prepare a conceptual framework in which we can think about “intelligence” (nat-
ural and artificial), and
3. recap some methods and results from theoretical computer science.

Michael Kohlhase: Artificial Intelligence 1 41 2025-02-06


36
Chapter 3

Logic Programming

We will now learn a new programming paradigm: logic programming, which is one of the most
influential paradigms in AI. We are going to study Prolog (the oldest and most widely used) as a
concrete example of ideas behind logic programming and use it for our homeworks in this course.
As Prolog is a representative of a programming paradigm that is new to most students, pro-
gramming will feel weird and tedious at first. But subtracting the unusual syntax and program
organization logic programming really only amounts to recursive programming just as in func-
tional programming (the other declarative programming paradigm). So the usual advice applies,
keep staring at it and practice on easy examples until the pain goes away.

3.1 Introduction to Logic Programming and ProLog


Logic programming is a programming paradigm that differs from functional and imperative pro-
gramming in the basic procedural intuition. Instead of transforming the state of the memory by
issuing instructions (as in imperative programming), or computing the value of a function on some
arguments, logic programming interprets the program as a body of knowledge about the respective
situation, which can be queried for consequences.
This is actually a very natural conception of a program; after all, we usually run (imperative or
functional) programs if we want some question answered. Video Nuggets covering this section
can be found at https://fau.tv/clip/id/21752 and https://fau.tv/clip/id/21753.

Logic Programming
 Idea: Use logic as a programming language!

 We state what we know about a problem (the program) and then ask for results
(what the program would compute).
 Example 3.1.1.

Program   Leibniz is human               x + 0 = x
          Sokrates is human              If x + y = z then x + s(y) = s(z)
          Sokrates is a greek            3 is prime
          Every human is fallible
Query     Are there fallible greeks?     is there a z with s(s(0)) + s(0) = z
Answer    Yes, Sokrates!                 yes, s(s(s(0)))


 How to achieve this? Restrict a logic calculus sufficiently that it can be used as a
computational procedure.
 Remark: This idea leads to a totally new programming paradigm: logic programming.

 Slogan: Computation = Logic + Control (Robert Kowalski 1973; [Kow97])


 We will use the programming language Prolog as an example.

Michael Kohlhase: Artificial Intelligence 1 42 2025-02-06

We now formally define the language of Prolog, starting off the atomic building blocks.

Prolog Terms and Literals


 Definition 3.1.2. Prolog expresses knowledge about the world via

 constants denoted by lowercase strings,


 variables denoted by strings starting with an uppercase letter or _, and
 functions and predicates (lowercase strings) applied to terms.
 Definition 3.1.3. A Prolog term is

 a Prolog variable, or constant, or


 a Prolog function applied to terms.
A Prolog literal is a constant or a predicate applied to terms.
 Example 3.1.4. The following are

 Prolog terms: john, X, _, father(john), . . .


 Prolog literals: loves(john,mary), loves(john,_), loves(john,wife_of(john)),. . .

Michael Kohlhase: Artificial Intelligence 1 43 2025-02-06

Now we build up Prolog programs from those building blocks.

Prolog Programs: Facts and Rules


 Definition 3.1.5. A Prolog program is a sequence of clauses, i.e.
 facts of the form l., where l is a literal, (a literal and a dot)
 rules of the form h:−b1 ,. . .,bn ., where n > 0. h is called the head literal (or
simply head) and the bi are together called the body of the rule.
A rule h:−b1 ,. . .,bn ., should be read as h (is true) if b1 and . . . and bn are.
 Example 3.1.6. Write “something is a car if it has a motor and four wheels” as
car(X) :− has_motor(X),has_wheels(X,4). (variables are uppercase)
This is just an ASCII notation for m(x) ∧ w(x, 4) ⇒ car(x).
 Example 3.1.7. The following is a Prolog program:
human(leibniz).
human(sokrates).

greek(sokrates).
fallible(X):−human(X).

The first three lines are Prolog facts and the last a rule.

Michael Kohlhase: Artificial Intelligence 1 44 2025-02-06

The whole point of writing down a knowledge base (a Prolog program with knowledge about the
situation), if we do not have to write down all the knowledge, but a (small) subset, from which
the rest follows. We have already seen how this can be done: with logic. For logic programming
we will use a logic called “first-order logic” which we will not formally introduce here.

Prolog Programs: Knowledge bases


 Intuition: The knowledge base given by a Prolog program is the set of facts that
can be derived from it under the if/and reading above.

 Definition 3.1.8. The knowledge base given by a Prolog program is the set of facts
that can be derived from it by Modus Ponens (MP), ∧I and instantiation.

    A   A ⇒ B              A   B                 A
  ------------- MP       --------- ∧I       ----------- Subst
        B                  A ∧ B              [B/X](A)

Michael Kohlhase: Artificial Intelligence 1 45 2025-02-06

?? introduces a very important distinction: that between a Prolog program and the knowledge
base it induces. Whereas the former is a finite, syntactic object (essentially a string), the latter
may be an infinite set of facts, which represents the totality of knowledge about the world or the
aspects described by the program.
As knowledge bases can be infinite, we cannot pre-compute them. Instead, logic programming
languages compute fragments of the knowledge base by need, i.e. whenever a user wants to check
membership. We call this approach querying: the user enters a query expression and the system
answers yes or no. This answer is computed in a depth-first search process.

Querying the Knowledge Base: Size Matters


 Idea: We want to see whether a fact is in the knowledge base.
 Definition 3.1.9. A query is a list of Prolog literals called goal literals (also subgoals
or simply goals). We write a query as ?−A1 , . . ., An . where Ai are goals.

 Problem: Knowledge bases can be big and even infinite. (cannot pre-compute)
 Example 3.1.10. The knowledge base induced by the Prolog program
nat(zero).
nat(s(X)) :− nat(X).

contains the facts nat(zero), nat(s(zero)), nat(s(s(zero))), . . .

Michael Kohlhase: Artificial Intelligence 1 46 2025-02-06



Querying the Knowledge Base: Backchaining


 Definition 3.1.11. Given a query Q: ?− A1 , . . ., An . and rule R: h:− b1 ,. . .,bn ,
backchaining computes a new query by
1. finding terms for all variables in h to make h and A1 equal and
2. replacing A1 in Q with the body literals of R, where all variables are suitably
replaced.
 Backchaining motivates the names goal/subgoal:
 the literals in the query are “goals” that have to be satisfied,
 backchaining does that by replacing them by new “goals”.

 Definition 3.1.12. The Prolog interpreter keeps backchaining from the top to the
bottom of the program until the query
 succeeds, i.e. contains no more goals, or (answer: true)
 fails, i.e. backchaining becomes impossible. (answer: false)

 Example 3.1.13 (Backchaining). We continue ??


?− nat(s(s(zero))).
?− nat(s(zero)).
?− nat(zero).
true

Michael Kohlhase: Artificial Intelligence 1 47 2025-02-06

Note that backchaining replaces the current query with the body of the rule suitably instantiated.
For rules with a long body this extends the list of current goals, but for facts (rules without a
body), backchaining shortens the list of current goals. Once there are no goals left, the Prolog
interpreter finishes and signals success by issuing the string true.
If no rules match the current subgoal, then the interpreter terminates and signals failure with the
string false.

Querying the Knowledge Base: Failure


 If no instance of a query can be derived from the knowledge base, then the Prolog
interpreter reports failure.
 Example 3.1.14. We vary ?? using 0 instead of zero.
?− nat(s(s(0))).
?− nat(s(0)).
?− nat(0).
FAIL
false

Michael Kohlhase: Artificial Intelligence 1 48 2025-02-06

We can extend querying from simple yes/no answers to programs that return values by simply
using variables in queries. In this case, the Prolog interpreter returns a substitution.

Querying the Knowledge base: Answer Substitutions


 Definition 3.1.15. If a query contains variables, then Prolog will return an answer
substitution as the result to the query, i.e the values for all the query variables
accumulated during repeated backchaining.
 Example 3.1.16. We talk about (Bavarian) cars for a change, and use a query
with a variable.
has_wheels(mybmw,4).
has_motor(mybmw).
car(X):−has_wheels(X,4),has_motor(X).
?− car(Y) % query
?− has_wheels(Y,4),has_motor(Y). % substitution X = Y
?− has_motor(mybmw). % substitution Y = mybmw
Y = mybmw % answer substitution
true

Michael Kohlhase: Artificial Intelligence 1 49 2025-02-06

In ?? the first backchaining step binds the variable X to the query variable Y, which gives us the
two subgoals has_wheels(Y,4),has_motor(Y). which again have the query variable Y. The next
backchaining step binds this to mybmw, and the third backchaining step exhausts the subgoals.
So the query succeeds with the (overall) answer substitution Y = mybmw. With this setup, we
can already do the “fallible Greeks” example from the introduction.

PROLOG: Are there Fallible Greeks?


 Program:
human(leibniz).
human(sokrates).
greek(sokrates).
fallible(X):−human(X).

 Example 3.1.17 (Query). ?−fallible(X),greek(X).

 Answer substitution: [sokrates/X]

Michael Kohlhase: Artificial Intelligence 1 50 2025-02-06

3.2 Programming as Search


In this section, we want to really use Prolog as a programming language, so let us first get our tools
set up. Video Nuggets covering this section can be found at https://fau.tv/clip/id/21754
and https://fau.tv/clip/id/21827.

3.2.1 Running Prolog


We will now discuss how to use a Prolog interpreter to get to know the language. The SWI
Prolog interpreter can be downloaded from http://www.swi-prolog.org/. Start the Prolog
interpreter with pl, prolog, or swipl from the shell. The SWI manual is available at
http://www.swi-prolog.org/pldoc/

We will introduce working with the interpreter using unary natural numbers as examples: we
first add the fact1 to the knowledge base
unat(zero).
which asserts that the predicate unat2 is true on the term zero. Generally, we can add a fact to
the knowledge base either by writing it into a file (e.g. example.pl) and then “consulting it” by
writing one of the following three commands into the interpreter:
[example]
consult(’example.pl’).
consult(’example’).
or by directly typing
assert(unat(zero)).
into the Prolog interpreter. Next tell Prolog about the following rule
assert(unat(suc(X)) :− unat(X)).
which gives the Prolog runtime an initial (infinite) knowledge base, which can be queried by
?− unat(suc(suc(zero))).
Even though we can use any text editor to program Prolog, running Prolog in a modern
editor with language support is incredibly nicer than at the command line, because you can see
the whole history of what you have done. It is better for debugging too.
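To put these pieces together, here is a minimal example session (a sketch; the file name is
illustrative and the exact output of your interpreter may differ slightly):
% file example.pl
unat(zero).
unat(suc(X)) :− unat(X).
% in the SWI Prolog interpreter
?− [example].
true.
?− unat(suc(suc(zero))).
true.
?− unat(suc(0)).
false.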

3.2.2 Knowledge Bases and Backtracking

Depth-First Search with Backtracking


 So far, all the examples led to direct success or to failure. (simple KB)

 Definition 3.2.1 (Prolog Search Procedure). The Prolog interpreter employs


top-down, left-right depth first search, concretely, Prolog search:
 works on the subgoals in left-to-right order.
 matches the first subgoal with the head literals of the clauses in the program in top-
down order.
 if there are no matches, fails and backtracks to the (chronologically) last back-
track point.
 otherwise backchain on the first match, keep the other matches in mind for
backtracking via backtrack points.

We say that a goal G matches a head H, iff we can make them equal by replacing
variables in H with terms.
 We can force backtracking to compute more answers by typing ;.

Michael Kohlhase: Artificial Intelligence 1 51 2025-02-06

Note: With the Prolog search procedure detailed above, computation can easily go into infinite
loops, even though the knowledge base could provide the correct answer. Consider for instance
the simple program
1 for “unary natural numbers”; we cannot use the predicate nat and the constructor function s here, since their meaning is predefined in Prolog.
2 for “unary natural numbers”.

p(X):− p(X).
p(X):− q(X).
q(X).

If we query this with ?− p(john), then DFS will go into an infinite loop, because Prolog by default
expands the first matching clause. However, we could conclude that p(john) is true if we started
expanding with the second clause.
In fact this is a necessary feature and not a bug for a programming language: we need to
be able to write non-terminating programs, since the language would not be Turing complete
otherwise. The argument can be sketched as follows: we have seen that for Turing machines the
halting problem is undecidable. So if all Prolog programs were terminating, then Prolog would be
weaker than Turing machines and thus not Turing complete.
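For illustration, simply swapping the two clauses for p/1 (a sketch of the same program) makes
the query terminate, because Prolog tries clauses in top-down order:
p(X):− q(X).
p(X):− p(X).
q(X).
Now ?− p(john). succeeds immediately via the first clause.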
We will now fortify our intuition about the Prolog search procedure by an example that extends
the setup from ?? by a new choice of a vehicle that could be a car (if it had a motor).

Backtracking by Example
 Example 3.2.2. We extend ??:
has_wheels(mytricycle,3).
has_wheels(myrollerblade,3).
has_wheels(mybmw,4).
has_motor(mybmw).
car(X):-has_wheels(X,3),has_motor(X). % cars sometimes have three wheels
car(X):-has_wheels(X,4),has_motor(X). % and sometimes four.
?- car(Y).
?- has_wheels(Y,3),has_motor(Y). % backtrack point 1
Y = mytricycle % backtrack point 2
?- has_motor(mytricycle).
FAIL % fails, backtrack to 2
Y = myrollerblade % backtrack point 2
?- has_motor(myrollerblade).
FAIL % fails, backtrack to 1
?- has_wheels(Y,4),has_motor(Y).
Y = mybmw
?- has_motor(mybmw).
Y=mybmw
true

Michael Kohlhase: Artificial Intelligence 1 52 2025-02-06

In general, a Prolog rule of the form A:−B,C reads as A, if B and C. If we want to express A if
B or C, we have to express this as two separate rules A:−B and A:−C and leave the choice which
one to use to the search procedure.
In ?? we indeed have two clauses for the predicate car/1; one each for the cases of cars with three
and four wheels. As the three-wheel case comes first in the program, it is explored first in the
search process.
Recall that at every point where the Prolog interpreter has the choice between two clauses for a
predicate, it chooses the first and leaves a backtrack point. In ?? this happens first for the predicate
car/1, where we explore the case of three-wheeled cars. The Prolog interpreter immediately has
to choose again – between the tricycle and the rollerblade, which both have three wheels. Again,
it chooses the first and leaves a backtrack point. But as tricycles do not have motors, the subgoal
has_motor(mytricycle) fails and the interpreter backtracks to the chronologically nearest backtrack
point (the second one) and tries to fulfill has_motor(myrollerblade). This fails again, and the next
backtrack point is point 1 – note the stack-like organization of backtrack points which is in keeping
with the depth-first search strategy – which chooses the case of four-wheeled cars. This ultimately
succeeds as before with Y = mybmw.

3.2.3 Programming Features


We now turn to a more classical programming task: computing with numbers. Here we turn
to our initial example: adding unary natural numbers. If we can do that, then we have to consider
Prolog a programming language.

Can We Use This For Programming?


 Question: What about functions? E.g. the addition function?
 Problem: We cannot define functions in Prolog!
 Idea (back to math): use a three-place predicate.
 Example 3.2.3. add(X,Y,Z) stands for X+Y=Z

 Now we can directly write the recursive equations X + 0 = X (base case) and
X + s(Y ) = s(X + Y ) into the knowledge base.
add(X,zero,X).
add(X,s(Y),s(Z)) :− add(X,Y,Z).

 Similarly with multiplication and exponentiation.


mult(X,zero,zero).
mult(X,s(Y),Z) :− mult(X,Y,W), add(X,W,Z).

expt(X,zero,s(zero)).
expt(X,s(Y),Z) :− expt(X,Y,W), mult(X,W,Z).

Michael Kohlhase: Artificial Intelligence 1 53 2025-02-06

Note: Viewed through the right glasses logic programming is very similar to functional program-
ming; the only difference is that we are using (n+1)-ary relations rather than n-ary functions. To see
how this works let us consider the addition function/relation example above: instead of a binary
function + we program a ternary relation add, where relation add(X,Y ,Z) means X + Y = Z. We
start with the same defining equations for addition, rewriting them to relational style.
The first equation is straight-forward via our correspondence and we get the Prolog fact
add(X,zero,X). For the equation X + s(Y ) = s(X + Y ) we have to work harder, the straight-
forward relational translation add(X,s(Y),s(X+Y)) is impossible, since we have only partially
replaced the function + with the relation add. Here we take refuge in a very simple trick that we
can always do in logic (and mathematics of course): we introduce a new name Z for the offending
expression X + Y (using a variable) so that we get the fact add(X,s(Y ),s(Z)). Of course this is
not universally true (remember that this fact would say that “X + s(Y ) = s(Z) for all X, Y , and
Z”), so we have to extend it to a Prolog rule add(X,s(Y),s(Z)):−add(X,Y,Z). which relativizes to
mean “X + s(Y ) = s(Z) for all X, Y , and Z with X + Y = Z”.
Indeed the rule implements addition as a recursive predicate, we can see that the recursion
relation is terminating, since the left hand sides have one more constructor for the successor
function. The examples for multiplication and exponentiation can be developed analogously, but
we have to use the naming trick twice.
We now apply the same principle of recursive programming with predicates to other examples
to reinforce our intuitions about the principles.

More Examples from elementary Arithmetic



 Example 3.2.4. We can also use the add relation for subtraction without changing
the implementation. We just use variables in the “input positions” and ground terms
in the other two. (possibly very inefficient “generate and test approach”)
?−add(s(zero),X,s(s(s(zero)))).
X = s(s(zero))
true

 Example 3.2.5. Computing the nth Fibonacci number (0, 1, 1, 2, 3, 5, 8, 13,. . . ;


add the last two to get the next), using the addition predicate above.
fib(zero,zero).
fib(s(zero),s(zero)).
fib(s(s(X)),Y):−fib(s(X),Z),fib(X,W),add(Z,W,Y).

 Example 3.2.6. Using Prolog’s internal floating-point arithmetic: a goal of the


form ?− D is e. — where e is a ground arithmetic expression binds D to the result
of evaluating e.
fib(0,0).
fib(1,1).
fib(X,Y):− D is X − 1, E is X − 2,fib(D,Z),fib(E,W), Y is Z + W.

Michael Kohlhase: Artificial Intelligence 1 54 2025-02-06

Note: The is relation does not allow “generate and test” inversion as it insists on the
right-hand side being ground. In our example above, this is not a problem if we call fib with
the first (“input”) argument a ground term. Indeed, if the last rule matches a goal ?− fib(g,Y).,
where g is a ground term, then g−1 and g−2 are ground and thus D and E are bound to the
(ground) result terms. This makes the input arguments in the two recursive calls ground, and we
get ground results for Z and W, which allows the last goal to succeed with a ground result for
Y. Note as well that re-ordering the body literals of the rule so that the recursive calls are made
before the computation literals will lead to failure.
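As an illustrative sketch of what goes wrong, consider the last rule with its body literals reordered:
fib(X,Y):− fib(D,Z), fib(E,W), D is X − 1, E is X − 2, Y is Z + W.
Here the recursive calls are made before D and E have been bound, so fib/2 is eventually called
with unbound arguments and the arithmetic goals are reached with non-ground right-hand sides;
in SWI Prolog this typically ends in an instantiation error rather than the intended answer.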
We will now add the primitive data structure of lists to Prolog; they are constructed by prepending
an element (the head) to an existing list (which becomes the rest list or “tail” of the constructed
one).

Adding Lists to Prolog


 Definition 3.2.7. In Prolog, lists are represented by list terms of the form
1. [a,b,c,. . .] for list literals, and
2. a first/rest constructor that represents a list with head F and rest list R as [F|R].
 Observation: Just as in functional programming, we can define list operations by
recursion, only that we program with relations instead of with functions.
 Example 3.2.8. Predicates for member, append and reverse of lists in default
Prolog representation.
member(X,[X|_]).
member(X,[_|R]):−member(X,R).

append([],L,L).
append([X|R],L,[X|S]):−append(R,L,S).

reverse([],[]).
reverse([X|R],L):−reverse(R,S),append(S,[X],L).

Michael Kohlhase: Artificial Intelligence 1 55 2025-02-06

Logic programming is the third large programming paradigm (together with functional program-
ming and imperative programming).

Relational Programming Techniques


 Example 3.2.9. Parameters have no unique direction “in” or “out”
?− rev(L,[1,2,3]).
?− rev([1,2,3],L1).
?− rev([1|X],[2|Y]).

 Example 3.2.10. Symbolic programming by structural induction:


rev([],[]).
rev([X|Xs],Ys) :− ...

 Example 3.2.11. Generate and test:


sort(Xs,Ys) :− perm(Xs,Ys), ordered(Ys).

Michael Kohlhase: Artificial Intelligence 1 56 2025-02-06

From a programming practice point of view it is probably best understood as “relational program-
ming” in analogy to functional programming, with which it shares a focus on recursion.
The major difference to functional programming is that “relational programming” does not have
a fixed input/output distinction, which makes the control flow in functional programs very direct
and predictable. Thanks to the underlying search procedure, we can sometimes make use of the
flexibility afforded by logic programming.
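For instance, the append/3 predicate defined above can be run in several “directions” (a small
illustrative sketch; the answers are what backtracking with ; would enumerate):
?− append([1],[2,3],L).      % L = [1,2,3]
?− append(X,[2,3],[1,2,3]).  % X = [1]
?− append(X,Y,[1,2]).        % X = [], Y = [1,2] ; X = [1], Y = [2] ; X = [1,2], Y = []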
If the problem solution involves search (and depth first search is sufficient), we can just get by
with specifying the problem and letting the Prolog interpreter do the rest. In ?? we just specify
that list Xs can be sorted into Ys, iff Ys is a permutation of Xs and Ys is ordered. Given a concrete
(input) list Xs, the Prolog interpreter will generate all permutations Ys of Xs via the predicate
perm/2 and then test whether they are ordered.
This is a paradigmatic example of logic programming. We can (sometimes) directly use the
specification of a problem as a program. This makes the argument for the correctness of the
program immediate, but may make the program execution non-optimal.
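The helper predicates perm/2 and ordered/1 are not spelled out above; a minimal sketch that
makes the specification runnable for lists of numbers (using SWI Prolog's library predicate
select/3) could look like this:
perm([],[]).
perm(L,[X|R]) :− select(X,L,S), perm(S,R).
ordered([]).
ordered([_]).
ordered([X,Y|R]) :− X =< Y, ordered([Y|R]).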

3.2.4 Advanced Relational Programming


It is easy to see that the running time of the Prolog program from ?? is not in O(n · log2(n)), which
is optimal for sorting algorithms. This is the flip side of the flexibility in logic programming. But
Prolog has ways of dealing with that: the cut operator, a Prolog atom which always
succeeds, but which cannot be backtracked over. This can be used to prune the search tree in
Prolog. We will not go into that here but refer the readers to the literature.
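To give a flavor (a small illustrative sketch with a made-up predicate, not something we will rely
on in this course): the cut is written ! and commits Prolog to the choices made so far in the
current clause.
first_even([X|_],X) :− 0 is X mod 2, !.
first_even([_|R],Y) :− first_even(R,Y).
?− first_even([1,3,4,6],X).
X = 4.
Without the cut, typing ; would also enumerate X = 6; with the cut, the search tree below the
first solution is pruned.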

Specifying Control in Prolog


 Remark 3.2.12. The running time of the program from ?? is not in O(n · log2(n)),
which is optimal for sorting algorithms.


sort(Xs,Ys) :− perm(Xs,Ys), ordered(Ys).

 Idea: Gain computational efficiency by shaping the search!

Michael Kohlhase: Artificial Intelligence 1 57 2025-02-06

Functions and Predicates in Prolog


 Remark 3.2.13. Functions and predicates have radically different roles in Prolog.
 Functions are used to represent data. (e.g. father(john) or s(s(zero)))
 Predicates are used for stating properties about and computing with data.
 Remark 3.2.14. In functional programming, functions are used for both.
(even more confusing than in Prolog if you think about it)
 Example 3.2.15. Consider again the reverse predicate for lists below:
An input datum is e.g. [1,2,3], then the output datum is [3,2,1].
reverse([],[]).
reverse([X|R],L):−reverse(R,S),append(S,[X],L).

We “define” the computational behavior of the predicate reverse, but the list constructors
[. . .] are just used to construct lists from arguments.
 Example 3.2.16 (Trees and Leaf Counting). We represent (unlabelled) trees via
the function t from tree lists to trees. For instance, a balanced binary tree of depth
2 is t([t([t([]),t([])]),t([t([]),t([])])]). We count leaves by
leafcount(t([]),1).
leafcount(t([V]),W) :− leafcount(V,W).
leafcount(t([X|R]),Y) :− leafcount(X,Z), leafcount(t(R),W), Y is Z + W.

Michael Kohlhase: Artificial Intelligence 1 58 2025-02-06

For more information on Prolog

RTFM (≙ “read the fine manuals”)

 RTFM Resources: There are also lots of good tutorials on the web,
 I personally like [Fis; LPN],
 [Fla94] has a very thorough logic-based introduction,

 consult also the SWI Prolog Manual [SWI],

Michael Kohlhase: Artificial Intelligence 1 59 2025-02-06


Chapter 4

Recap of Prerequisites from Math &


Theoretical Computer Science

In this chapter we will briefly recap some of the prerequisites from theoretical computer science
that are needed for understanding Artificial Intelligence 1.

4.1 Recap: Complexity Analysis in AI?


We now come to an important topic which is not really part of Artificial Intelligence but which
adds an important layer of understanding to this enterprise: We (still) live in the era of Moore’s
law (the computing power available on a single CPU doubles roughly every two years) leading to an
exponential increase. A similar rule holds for main memory and disk storage capacities. And the
production of computers (using CPUs and memory) is (still) growing very rapidly as well, giving
mankind as a whole, institutions, and individuals exponentially growing computational resources.
In public discussion, this development is often cited as the reason why (strong) AI is inevitable.
But the argument is fallacious if all the algorithms we have are of very high complexity (i.e. at
least exponential in either time or space). So, to judge the state of play in Artificial Intelligence,
we have to know the complexity of our algorithms.
In this section, we will give a very brief recap of some aspects of elementary complexity theory
and make a case for why this is generally important for computer scientists.
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21839 and
https://fau.tv/clip/id/21840.
To get a feeling what we mean by “fast algorithm”, we do some preliminary computations.

Performance and Scaling

 Suppose we have three algorithms to choose from. (which one to select)


 Systematic analysis reveals performance characteristics.
 Example 4.1.1. For a computational problem of size n we have


                          performance
size         linear      quadratic    exponential
n            100n µs     7n^2 µs      2^n µs
1            100 µs      7 µs         2 µs
5            .5 ms       175 µs       32 µs
10           1 ms        .7 ms        1 ms
45           4.5 ms      14 ms        1.1 Y
100          ...         ...          ...
1 000        ...         ...          ...
10 000       ...         ...          ...
1 000 000    ...         ...          ...

Michael Kohlhase: Artificial Intelligence 1 60 2025-02-06

The last number in the rightmost column may surprise you. Does the run time really grow that
fast? Yes, as a quick calculation shows; and it becomes much worse, as we will see.

What?! One year?

 2^10 = 1 024 (1 024 µs ≃ 1 ms)

 2^45 = 35 184 372 088 832 (3.5×10^13 µs ≃ 3.5×10^7 s ≃ 1.1 Y)
 Example 4.1.2. We denote all times that are longer than the age of the universe
with −

                          performance
size         linear      quadratic    exponential
n            100n µs     7n^2 µs      2^n µs
1            100 µs      7 µs         2 µs
5            .5 ms       175 µs       32 µs
10           1 ms        .7 ms        1 ms
45           4.5 ms      14 ms        1.1 Y
100          100 ms      7 s          10^16 Y
1 000        1 s         12 min       −
10 000       10 s        20 h         −
1 000 000    1.6 min     2.5 mon      −

Michael Kohlhase: Artificial Intelligence 1 61 2025-02-06

So it does make a difference for larger computational problems what algorithm we choose. Consid-
erations like the one we have shown above are very important when judging an algorithm. These
evaluations go by the name of “complexity theory”.
Let us now recapitulate some notions of elementary complexity theory: we are interested in the
worst-case growth of the resources (time and space) required by an algorithm in terms of the sizes
of its arguments. Mathematically we look at the functions from input size to resource size and
classify them into “big-O” classes, abstracting from constant factors (which depend on the machine
the algorithm runs on and which we cannot control) and initial (algorithm startup) factors.

Recap: Time/Space Complexity of Algorithms


 We are mostly interested in worst-case complexity in AI-1.

 Definition 4.1.3. We say that an algorithm α that terminates in time t(n) for all
inputs of size n has running time T (α) := t.
Let S ⊆ N → N be a set of natural number functions, then we say that α has time
complexity in S (written T (α)∈S or colloquially T (α)=S), iff t∈S. We say α has
space complexity in S, iff α uses only memory of size s(n) on inputs of size n and
s∈S.
 Time/space complexity depends on size measures. (no canonical one)

 Definition 4.1.4. The following sets are often used for S in T (α):

Landau set     class name     rank      Landau set     class name     rank
O(1)           constant       1         O(n^2)         quadratic      4
O(log2(n))     logarithmic    2         O(n^k)         polynomial     5
O(n)           linear         3         O(k^n)         exponential    6

where O(g) = {f | ∃k > 0. f ≤_a k · g} and f ≤_a g (f is asymptotically bounded by g),
iff there is an n_0 ∈ N, such that f(n) ≤ g(n) for all n > n_0.
 Lemma 4.1.5 (Growth Ranking). For k′ > 2 and k > 1 we have

O(1) ⊂ O(log2(n)) ⊂ O(n) ⊂ O(n^2) ⊂ O(n^k′) ⊂ O(k^n)

 For AI-1: I expect that given an algorithm, you can determine its complexity class.
(next)

Michael Kohlhase: Artificial Intelligence 1 62 2025-02-06

Advantage: Big-Oh Arithmetics


 Practical Advantage: Computing with Landau sets is quite simple. (good
simplification)

 Theorem 4.1.6 (Computing with Landau Sets).


1. O(c · f) = O(f) for any constant c ∈ N. (drop constant factors)
2. If O(f) ⊆ O(g), then O(f + g) = O(g). (drop low-complexity summands)
3. O(f · g) = O(f) · O(g). (distribute over products)

 These are not all of the “big-Oh calculation rules”, but they're enough for most purposes
 Applications: Convince yourselves using the result above that
 O(4n3 + 3n + 71000n ) = O(2n )
 O(n) ⊂ O(n · log2(n)) ⊂ O(n^2)

Michael Kohlhase: Artificial Intelligence 1 63 2025-02-06

OK, that was the theory, . . . but how do we use that in practice?
What I mean by this is that given an algorithm, we have to determine the time complexity.
This is by no means a trivial enterprise, but we can do it by analyzing the algorithm instruction
by instruction as shown below.

Determining the Time/Space Complexity of Algorithms


 Definition 4.1.7. Given a function Γ that assigns variables v to functions Γ(v)
and α an imperative algorithm, we compute the

 time complexity TΓ (α) of program α and


 the context CΓ (α) introduced by α
by joint induction on the structure of α:
 constant: can be accessed in constant time
If α = δ for a data constant δ, then TΓ (α)∈O(1).
 variable: need the complexity of the value
If α = v with v ∈ dom(Γ), then TΓ (α)∈O(Γ(v)).
 application: compose the complexities of the function and the argument
If α = φ(ψ) with TΓ (φ)∈O(f ) and TΓ∪CΓ (φ) (ψ)∈O(g), then TΓ (α)∈O(f ◦ g)
and CΓ (α) = CΓ∪CΓ (φ) (ψ).
 assignment: has to compute the value ; has its complexity
If α is v:= φ with TΓ (φ)∈S, then TΓ (α)∈S and CΓ (α) = Γ ∪ (v,S).
 composition: has the maximal complexity of the components
If α is φ ; ψ, with TΓ (φ)∈P and TΓ∪CΓ (φ) (ψ)∈Q, then TΓ (α)∈max {P , Q} and
CΓ (α) = CΓ∪CΓ (φ) (ψ).
 branching: has the maximal complexity of the condition and branches
If α is if γ then φ else ψ end, with TΓ (γ)∈C, TΓ∪CΓ (γ) (φ)∈P , and TΓ∪CΓ (γ) (ψ)∈Q,
then TΓ (α)∈max {C, P , Q} and CΓ (α) = Γ ∪ CΓ (γ) ∪ CΓ∪CΓ (γ) (φ) ∪
CΓ∪CΓ (γ) (ψ).
 looping: multiplies complexities
If α is while γ do φ end, with TΓ (γ)∈O(f ) and TΓ∪CΓ (γ) (φ)∈O(g), then TΓ (α)∈O(f (n)·
g(n)) and CΓ (α) = CΓ∪CΓ (γ) (φ).
 The time complexity T (α) is just T∅ (α), where ∅ is the empty function.
 Recursion is much more difficult to analyze ; recurrences and Master’s theorem.

Michael Kohlhase: Artificial Intelligence 1 64 2025-02-06

As instructions in imperative programs can introduce new variables, which have their own time
complexity, we have to carry them around via the introduced context, which has to be defined
co-recursively with the time complexity. This makes Definition 4.1.7 rather complex. The main two cases to
note here are
• the variable case, which “uses” the context Γ and

• the assignment case, which extends the introduced context by the time complexity of the value.
The other cases just pass around the given context and the introduced context systematically.
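To see the spirit of these rules in action, here is a small made-up program fragment (rendered in Python rather than the imperative toy language of the definition), with an instruction-by-instruction complexity sketch in the comments:

def count_pairs(items):
    total = 0                  # assignment of a constant: O(1)
    for x in items:            # loop over n items ...
        for y in items:        # ... times an inner loop over n items
            if x < y:          # condition and both branches are O(1)
                total += 1
    return total
# each loop contributes a factor of n, so by the looping rule the whole
# function runs in O(n) * O(n) * O(1) = O(n^2)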
Let us now put one motivation for knowing about complexity theory into the perspective of the
job market; here the job as a scientist.
Please excuse the chemistry pictures, public imagery for CS is really just quite boring, this is
what people think of when they say “scientist”. So, imagine that instead of a chemist in a lab, it’s
me sitting in front of a computer.

Why Complexity Analysis? (General)


 Example 4.1.8. Once upon a time I was trying to invent an efficient algorithm.
 My first algorithm attempt didn’t work, so I had to try harder.

 But my 2nd attempt didn’t work either, which got me a bit agitated.

 The 3rd attempt didn’t work either. . .

 And neither the 4th. But then:



 Ta-da . . . when, for once, I turned around and looked in the other direction–
CAN one actually solve this efficiently? – NP hardness was there to rescue me.

Michael Kohlhase: Artificial Intelligence 1 65 2025-02-06

The meat of the story is that there is no profit in trying to invent an algorithm which we could
have known cannot exist. Here is another image that may be familiar to you.

Why Complexity Analysis? (General)

 Example 4.1.9. Trying to find a sea route east to India (from Spain) (does not
exist)

 Observation: Complexity theory saves you from spending lots of time trying to
invent algorithms that do not exist.

Michael Kohlhase: Artificial Intelligence 1 66 2025-02-06

It’s like, you’re trying to find a route to India (from Spain), and you presume it’s somewhere to
the east, and then you hit a coast, but no; try again, but no; try again, but no; ... if you don’t
have a map, that’s the best you can do. But NP hardness gives you the map: you can check
that there actually is no way through here. But what is this notion of NP completeness alluded
to above? We observe that we can analyze the complexity of problems by the complexity of the
algorithms that solve them. This gives us a notion of what to expect from solutions to a given
problem class, and thus whether efficient (i.e. polynomial time) algorithms can exist at all.

Reminder (?): NP and PSPACE (details ; e.g. [GJ79])


 Turing Machine: Works on a tape consisting of cells, across which its Read/Write
head moves. The machine has internal states. There is a transition function that
specifies – given the current cell content and internal state – what the subsequent
internal state will be, and what the R/W head does (write a symbol and/or move).
Some internal states are accepting.
 Decision problems are in NP if there is a non deterministic Turing machine that
halts with an answer after time polynomial in the size of its input. Accepts if at
least one of the possible runs accepts.

 Decision problems are in NPSPACE, if there is a non deterministic Turing ma-


chine that runs in space polynomial in the size of its input.
 NP vs. PSPACE: Non-deterministic polynomial space can be simulated in deter-
ministic polynomial space. Thus PSPACE = NPSPACE, and hence (trivially)
NP ⊆ PSPACE.
It is commonly believed that NP ̸⊇ PSPACE, i.e. that the inclusion NP ⊆ PSPACE is proper. (similar to P ⊆ NP)

Michael Kohlhase: Artificial Intelligence 1 67 2025-02-06

The Utility of Complexity Knowledge (NP-Hardness)


 Assume: In 3 years from now, you have finished your studies and are working in
your first industry job. Your boss Mr. X gives you a problem and says Solve It!. By
which he means, write a program that solves it efficiently.
 Question: Assume further that, after trying in vain for 4 weeks, you got the next
meeting with Mr. X. How could knowing about NP hardness help?
 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 68 2025-02-06

4.2 Recap: Formal Languages and Grammars


One of the main ways of designing rational agents in this course will be to define formal languages
that represent the state of the agent environment and let the agent use various inference techniques
to predict effects of its observations and actions to obtain a world model. In this section we recap
the basics of formal languages and grammars that form the basis of a compositional theory for
them.

The Mathematics of Strings


 Definition 4.2.1. An alphabet A is a finite set; we call each element a ∈ A a
character, and an n tuple s ∈ An a string (of length n over A).

 Definition 4.2.2. Note that A0 = {⟨⟩}, where ⟨⟩ is the (unique) 0-tuple. With
the definition above we consider ⟨⟩ as the string of length 0 and call it the empty
string and denote it with ϵ.
 Note: Sets ̸= strings, e.g. {1, 2, 3} = {3, 2, 1}, but ⟨1, 2, 3⟩ ̸= ⟨3, 2, 1⟩.
 Notation: We will often write a string ⟨c1 , . . ., cn ⟩ as ”c1 . . .cn ”, for instance
”abc” for ⟨a, b, c⟩
 Example 4.2.3. Take A = {h, 1, /} as an alphabet. Each of the members h, 1,
and / is a character. The vector ⟨/, /, 1, h, 1⟩ is a string of length 5 over A.
 Definition 4.2.4 (String Length). Given a string s we denote its length with |s|.

 Definition 4.2.5. The concatenation conc(s, t) of two strings s = ⟨s1 , ..., sn ⟩ ∈ An


and t = ⟨t1 , ..., tm ⟩ ∈ Am is defined as ⟨s1 , ..., sn , t1 , ..., tm ⟩ ∈ An+m .
We will often write conc(s, t) as s + t or simply st
 Example 4.2.6. conc(”text”, ”book”) = ”text” + ”book” = ”textbook”

Michael Kohlhase: Artificial Intelligence 1 69 2025-02-06

We have multiple notations for concatenation, since it is such a basic operation, which is used
so often that we will need very short notations for it, trusting that the reader can disambiguate
based on the context.
Now that we have defined the concept of a string as a sequence of characters, we can go on to
give ourselves a way to distinguish between good strings (e.g. programs in a given programming
language) and bad strings (e.g. those with syntax errors). The way to do this is via the concept of a
formal language, which we are about to define.

Formal Languages
 Definition 4.2.7. Let A be an alphabet, then we define the sets A^+ := ⋃_(i∈N^+) A^i
of nonempty strings and A^∗ := A^+ ∪ {ϵ} of strings.
 Example 4.2.8. If A = {a, b, c}, then A∗ = {ϵ, a, b, c, aa, ab, ac, ba, . . . , aaa, . . . }.
 Definition 4.2.9. A set L ⊆ A∗ is called a formal language over A.

 Definition 4.2.10. We use c[n] for the string that consists of the character c
repeated n times.
 Example 4.2.11. #[5] = ⟨#, #, #, #, #⟩
 Example 4.2.12. The set M := {ba[n] | n ∈ N} of strings that start with character
b followed by an arbitrary numbers of a’s is a formal language over A = {a, b}.

 Definition 4.2.13. Let L1 , L2 , L ⊆ Σ^∗ be formal languages over Σ.

 Intersection and union: L1 ∩ L2 , L1 ∪ L2 .
 Language complement of L: L := Σ^∗ \L.
 The language concatenation of L1 and L2 : L1 ◦ L2 := {uw | u ∈ L1 , w ∈ L2 }.
We often use L1 L2 instead of L1 ◦ L2 .
 Language powers of L: L^0 := {ϵ}, L^(n+1) := L L^n , so L^n = {w1 . . .wn | wi ∈
L, for i = 1. . .n} (for n ∈ N).
 Language Kleene closure of L: L^∗ := ⋃_(n∈N) L^n and also L^+ := ⋃_(n∈N^+) L^n .
 The reflection of a language L: L^R := {w^R | w ∈ L}.

Michael Kohlhase: Artificial Intelligence 1 70 2025-02-06
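For finite languages, the operations of Definition 4.2.13 can be played with directly on a computer; the following Python sketch (ours, with made-up function names) illustrates language concatenation, powers, and a finite approximation of the Kleene closure:

def lang_concat(l1: set, l2: set) -> set:
    """Language concatenation: L1 L2 = {uw | u in L1, w in L2}."""
    return {u + w for u in l1 for w in l2}

def lang_power(l: set, n: int) -> set:
    """Language power: L^0 = {eps}, L^(n+1) = L L^n."""
    result = {""}                      # the empty string epsilon
    for _ in range(n):
        result = lang_concat(l, result)
    return result

def kleene_up_to(l: set, n: int) -> set:
    """Finite approximation of the Kleene closure: union of L^0 .. L^n."""
    return set().union(*(lang_power(l, i) for i in range(n + 1)))

print(lang_concat({"text"}, {"book"}))     # {'textbook'}
print(lang_power({"a", "b"}, 2))           # {'aa', 'ab', 'ba', 'bb'}
print(sorted(kleene_up_to({"a"}, 3)))      # ['', 'a', 'aa', 'aaa']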

There is a common misconception that a formal language is something that is difficult to under-
stand as a concept. This is not true; the only thing a formal language does is separate the “good”
from the “bad” strings. Thus we simply model a formal language as a set of strings: the “good”
strings are members, and the “bad” ones are not.
Of course this definition only shifts complexity to the way we construct specific formal languages
(where it actually belongs), and we have learned two (simple) ways of constructing them: by
repetition of characters, and by concatenation of existing languages. As mentioned above,
the purpose of a formal language is to distinguish “good” from “bad” strings. It is maximally
general, but not helpful, since it does not support computation and inference. In practice we
will be interested in formal languages that have some structure, so that we can represent formal
languages in a finite manner (recall that a formal language is a subset of A∗ , which may be infinite
and even undecidable – even though the alphabet A is finite).
To remedy this, we will now introduce phrase structure grammars (or just grammars), the stan-
dard tool for describing structured formal languages.

Phrase Structure Grammars (Theory)


 Recap: A formal language is an arbitrary set of symbol sequences.

 Problem: This may be infinite and even undecidable even if A is finite.


 Idea: Find a way of representing formal languages with structure finitely.
 Definition 4.2.14. A phrase structure grammar (also called type 0 grammar,
unrestricted grammar, or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where

 N is a finite set of nonterminal symbols,


 Σ is a finite set of terminal symbols, members of Σ ∪ N are called symbols.
 P is a finite set of production rules: pairs p := h → b (also written as h⇒b),
where h ∈ (Σ ∪ N )^∗ N (Σ ∪ N )^∗ and b ∈ (Σ ∪ N )^∗ . The string h is called the
head of p and b the body.
 S ∈ N is a distinguished symbol called the start symbol (also sentence symbol).
The sets N and Σ are assumed to be disjoint. Any word w ∈ Σ∗ is called a terminal
word.
 Intuition: Production rules map strings with at least one nonterminal to arbitrary
other strings.

 Notation: If we have n rules h → bi sharing a head, we often write h → b1 | . . . | bn


instead.

Michael Kohlhase: Artificial Intelligence 1 71 2025-02-06

We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.

Phrase Structure Grammars (cont.)


 Example 4.2.15. A simple phrase structure grammar G:

S → NP Vi
NP → Article N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .

Here S is the start symbol, and NP , Article, N , and Vi are nonterminals.


 Definition 4.2.16. A production rule whose head is a single non-terminal and
whose body consists of a single terminal is called lexical or a lexical insertion rule.
Definition 4.2.17. The subset of lexical rules of a grammar G is called the lexicon
of G and the set of body symbols the vocabulary (or alphabet). The nonterminals
in their heads are called lexical categories of G.
 Definition 4.2.18. The non-lexical production rules are called structural, and the
nonterminals in the heads are called phrasal or syntactic categories.

Michael Kohlhase: Artificial Intelligence 1 72 2025-02-06
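To make the components of Definition 4.2.14 concrete, here is one possible (purely illustrative) encoding of the example grammar G as a Python data structure; since G is context-free, heads are single nonterminals here, while a general type-0 grammar would need strings of symbols as heads as well:

GRAMMAR = {
    "N": {"S", "NP", "Article", "N", "Vi"},                              # nonterminal symbols
    "Sigma": {"the", "a", "an", "dog", "teacher", "sleeps", "smells"},   # terminal symbols
    "P": {                                                               # production rules: head -> alternative bodies
        "S": [["NP", "Vi"]],
        "NP": [["Article", "N"]],
        "Article": [["the"], ["a"], ["an"]],
        "N": [["dog"], ["teacher"]],
        "Vi": [["sleeps"], ["smells"]],
    },
    "start": "S",                                                        # start symbol
}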

Now we look at just how a grammar helps in analyzing formal languages. The basic idea is that
a grammar accepts a word, iff the start symbol can be rewritten into it using only the rules of the
grammar.

Phrase Structure Grammars (Theory)


 Idea: Each symbol sequence in a formal language can be analyzed/generated by
the grammar.

 Definition 4.2.19. Given a phrase structure grammar G := ⟨N , Σ, P , S ⟩, we say
G derives t ∈ (Σ ∪ N )^∗ from s ∈ (Σ ∪ N )^∗ in one step, iff there is a production
rule p ∈ P with p = h → b and there are u, v ∈ (Σ ∪ N )^∗ , such that s = uhv and
t = ubv. We write s →_G^p t (or s →G t if p is clear from the context) and use →_G^∗ for
the reflexive transitive closure of →G . We call s →_G^∗ t a G-derivation of t from s.

 Example (single- and multi-step derivations):

   A →G B          A →G B          S →G2 aSb
   C →G D          →G C            →G2 aaSbb
                   →G D            →G2 aaaSbbb
                                   →G2 aaaaSbbbb
                                   →G2 aaaabbbb

 Definition 4.2.20. Given a phrase structure grammar G := ⟨N , Σ, P , S ⟩, we say
that s ∈ (N ∪ Σ)^∗ is a sentential form of G, iff S →_G^∗ s. A sentential form that
does not contain nonterminals is called a sentence of G; we also say that G accepts
s. We say that G rejects s, iff it is not a sentence of G.
 Definition 4.2.21. The language L(G) of G is the set of its sentences. We say
that L(G) is generated by G.
Definition 4.2.22. We call two grammars equivalent, iff they have the same lan-
guages.
Definition 4.2.23. A grammar G is said to be universal if L(G) = Σ∗ .
 Definition 4.2.24. Parsing, syntax analysis, or syntactic analysis is the process of
analyzing a string of symbols, either in a formal or a natural language by means of
a grammar.

Michael Kohlhase: Artificial Intelligence 1 73 2025-02-06

Again, we fortify our intuitions with Example 4.2.25.

Phrase Structure Grammars (Example)


 Example 4.2.25. In the grammar G from Example 4.2.15 (repeated here for convenience):

   S → NP Vi
   NP → Article N
   Article → the | a | an | . . .
   N → dog | teacher | . . .
   Vi → sleeps | smells | . . .

1. Article teacher Vi is a sentential form:
   S →G NP Vi →G Article N Vi →G Article teacher Vi
2. The teacher sleeps is a sentence:
   S →∗G Article teacher Vi →G the teacher Vi →G the teacher sleeps

Michael Kohlhase: Artificial Intelligence 1 74 2025-02-06

Note that this process indeed defines a formal language given a grammar, but does not provide
an efficient algorithm for parsing, even for the simpler kinds of grammars we introduce below.
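While efficient parsing is beyond AI-1, generating sentences by repeatedly applying production rules is easy to sketch; the following function (ours) uses the hypothetical GRAMMAR dictionary sketched above and performs a random derivation:

import random

def generate(symbol, grammar):
    # derive a terminal word from `symbol` by repeatedly applying production rules;
    # works for context-free rules only and may not terminate for recursive grammars
    if symbol in grammar["Sigma"]:
        return [symbol]
    body = random.choice(grammar["P"][symbol])
    return [word for sym in body for word in generate(sym, grammar)]

print(" ".join(generate(GRAMMAR["start"], GRAMMAR)))   # e.g. "the teacher sleeps"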

Grammar Types (Chomsky Hierarchy [Cho65])



 Observation: The shape of the grammar determines the “size” of its language.
 Definition 4.2.26. We call a grammar:
1. context-sensitive (or type 1), if the bodies of production rules have no less symbols
than the heads,
2. context-free (or type 2), if the heads have exactly one symbol,
3. regular (or type 3), if additionally the bodies are empty or consist of a nonterminal,
optionally followed by a terminal symbol.
By extension, a formal language L is called context-sensitive/context-free/regular
(or type 1/type 2/type 3 respectively), iff it is the language of a respective grammar.
Context-free grammars are sometimes called CFGs and context-free languages CFLs.
 Example 4.2.27 (Context-sensitive). The language {a[n] b[n] c[n] } is accepted by

   S → a b c | A
   A → a A B c | a b c
   c B → B c
   b B → b b

 Example 4.2.28 (Context-free). The language {a[n] b[n] } is accepted by S → a S b | ϵ.
 Example 4.2.29 (Regular). The language {a[n] } is accepted by S → S a | ϵ.
 Observation: Natural languages are probably context-sensitive but parsable in
real time! (like languages low in the hierarchy)

Michael Kohlhase: Artificial Intelligence 1 75 2025-02-06

While the presentation of grammars from above is sufficient in theory, in practice the various
grammar rules are difficult and inconvenient to write down. Therefore computer science – where
grammars are important to e.g. specify parts of compilers – has developed extensions – notations
that can be expressed in terms of the original grammar rules – that make grammars more readable
(and writable) for humans. We introduce an important set now.

Useful Extensions of Phrase Structure Grammars


 Definition 4.2.30. The Backus Naur form or Backus normal form (BNF) is a
metasyntax notation for context-free grammars.
It extends the body of a production rule by multiple (admissible) constructors:
 alternative: s1 | . . . | sn ,
 repetition: s∗ (arbitrary many s) and s+ (at least one s),
 optional: [s] (zero or one times),
 grouping: (s1 ; . . . ; sn ), useful e.g. for repetition,
 character sets: [s−t] (all characters c with s≤c≤t for a given ordering on the
characters), and
 complements: [∧ s1 ,. . .,sn ], provided that the base alphabet is finite.

 Observation: All of these can be eliminated, e.g. (; many more rules)


 replace X → Z (s∗ ) W with the production rules X → Z Y W , Y → ϵ, and
Y → Y s.
 replace X → Z (s+ ) W with the production rules X → Z Y W , Y → s, and
Y → Y s.

Michael Kohlhase: Artificial Intelligence 1 76 2025-02-06

We will now build on the notion of BNF grammar notations and introduce a way of writing
down the (short) grammars we need in AI-1 that gives us even more of an overview over what is
happening.

A Grammar Notation for AI-1


 Problem: In grammars, notations for nonterminal symbols should be
 short and mnemonic (for the use in the body)
 close to the official name of the syntactic category (for the use in the head)

 In AI-1 we will only use context-free grammars (simpler, but problem still applies)
 in AI-1: I will try to give “grammar overviews” that combine those, e.g. the
grammar of first-order logic.

   variables             X ∈ V1
   function constants    f^k ∈ Σ^f_k
   predicate constants   p^k ∈ Σ^p_k
   terms                 t ::= X                       variable
                             | f^0                     constant
                             | f^k (t1 , . . ., tk )   application
   formulae              A ::= p^k (t1 , . . ., tk )   atomic
                             | ¬A                      negation
                             | A1 ∧ A2                 conjunction
                             | ∀X.A                    quantifier

Michael Kohlhase: Artificial Intelligence 1 77 2025-02-06

We will generally get by with context-free grammars, which have highly efficient parsing
algorithms, for the formal languages we use in this course, but we will not cover these algorithms in
AI-1.

4.3 Mathematical Language Recap


We already clarified above that we will use mathematical language as the main vehicle for speci-
fying the concepts underlying the AI algorithms in this course.
In this section, we will recap (or introduce if necessary) an important conceptual practice of
modern mathematics: the use of mathematical structures.

Mathematical Structures
 Observation: Mathematicians often cast classes of complex objects as mathemat-
ical structures.

 We have just seen an example of a mathematical structure: (repeated here for


convenience)
 Definition 4.3.1. A phrase structure grammar (also called type 0 grammar, unre-
stricted grammar, or just grammar) is a tuple ⟨N , Σ, P , S ⟩ where

 N is a finite set of nonterminal symbols,


 Σ is a finite set of terminal symbols, members of Σ ∪ N are called symbols.
 P is a finite set of production rules: pairs p := h → b (also written as h⇒b),
where h ∈ (Σ ∪ N )^∗ N (Σ ∪ N )^∗ and b ∈ (Σ ∪ N )^∗ . The string h is called the
head of p and b the body.
 S ∈ N is a distinguished symbol called the start symbol (also sentence symbol).
The sets N and Σ are assumed to be disjoint. Any word w ∈ Σ∗ is called a terminal
word.
 Intuition: All grammars share structure: they have four components, which again
share structure, which is further described in the definition above.
 Observation: Even though we call production rules “pairs” above, they are also
mathematical structures ⟨h, b⟩ with a funny notation h → b.

Michael Kohlhase: Artificial Intelligence 1 78 2025-02-06

Note that the idea of mathematical structures has been picked up by most programming lan-
guages in various ways and you should therefore be quite familiar with it once you realize the
parallelism.

Mathematical Structures in Programming


 Observation: Most programming languages have some way of creating “named
structures”. Referencing components is usually done via “dot notation”.
 Example 4.3.2 (Structs in C). C data structures for representing grammars:
struct grule {
  char *head;               /* head: a string of symbols */
  char *body;               /* body: a string of symbols */
};
struct grammar {
  char **nterminals;        /* nonterminal symbols */
  char **terminals;         /* terminal symbols */
  struct grule *grules;     /* production rules */
  char *start;              /* start symbol */
};
int main() {
  struct grule r1;
  r1.head = "foo";
  r1.body = "bar";
  return 0;
}

 Example 4.3.3 (Classes in OOP). Classes in object-oriented programming lan-


guages are based on the same ideas as mathematical structures, only that OOP
adds powerful inheritance mechanisms.

Michael Kohlhase: Artificial Intelligence 1 79 2025-02-06

Even if the idea of mathematical structures may be familiar from programming, it may be quite
intimidating to some students in the mathematical notation we will use in this course. Therefore
we will – when we get around to it – use a special overview notation in AI-1. We introduce it below.

In AI-1 we use a mixture between Math and Programming Styles


 In AI-1 we use mathematical notation, . . .
 Definition 4.3.4. A structure signature combines the components, their “types”,
and accessor names of a mathematical structure in a tabular overview.

 Example 4.3.5.

   grammar = ⟨ N    Set                             nonterminal symbols,
               Σ    Set                             terminal symbols,
               P    {h → b | . . . }                production rules,
               S    N                               start symbol          ⟩

   production rule h → b = ⟨ h   (Σ ∪ N )^∗ , N , (Σ ∪ N )^∗   head,
                             b   (Σ ∪ N )^∗                    body  ⟩

Read the first line “N Set nonterminal symbols” in the structure above as “N is in
an (unspecified) set and is a nonterminal symbol”.
Here – and in the future – we will use Set for the class of sets ; “N is a set”.

 I will try to give structure signatures where necessary.

Michael Kohlhase: Artificial Intelligence 1 80 2025-02-06


Chapter 5

Rational Agents: a Unifying Framework for Artificial Intelligence

In this chapter, we introduce a framework that gives a comprehensive conceptual model for the
multitude of methods and algorithms we cover in this course. The framework of rational agents
accommodates two traditions of AI.
Initially, the focus of AI research was on symbolic methods concentrating on the mental processes
of problem solving, starting from Newell/Simon’s “physical symbol system hypothesis”:
A physical symbol system has the necessary and sufficient means for general intelligent action.
[NS76]
Here a symbol is a representation of an idea, object, or relationship that is physically manifested in
(the brain of) an intelligent agent (human or artificial).
Later – in the 1980s – the proponents of embodied AI posited that most features of cognition,
whether human or otherwise, are shaped – or at least critically influenced – by aspects of the
entire body of the organism. The aspects of the body include the motor system, the perceptual
system, bodily interactions with the environment (situatedness) and the assumptions about the
world that are built into the structure of the organism. They argue that symbols are not always
necessary since
The world is its own best model. It is always exactly up to date. It always has every detail
there is to be known. The trick is to sense it appropriately and often enough. [Bro90]
The framework of rational agents – initially introduced by Russell and Wefald in [RW91] – ac-
commodates both: it situates agents with percepts and actions in an environment, but does not
preclude physical symbol systems – i.e. systems that manipulate symbols – as agent functions. Rus-
sell and Norvig make it the central metaphor of their book “Artificial Intelligence – A modern
approach” [RN03], which we follow in this course.

5.1 Introduction: Rationality in Artificial Intelligence


We now introduce the notion of rational agents as entities in the world that act optimally (given
the available information). We situate rational agents in the scientific landscape by looking at
variations of the concept that lead to slightly different fields of study.

What is AI? Going into Details


 Recap: AI studies how we can make the computer do things that humans can still
do better at the moment. (humans are proud to be rational)


 What is AI?: Four possible answers/facets: Systems that

      think like humans   |   think rationally
      act like humans     |   act rationally

expressed by four different definitions/quotes:

               Humanly                                        Rational
   Thinking    “The exciting new effort to make computers     “The formalization of mental faculties in
               think . . . machines with human-like minds”    terms of computational models” [CM85]
               [Hau85]
   Acting      “The art of creating machines that perform     “The branch of CS concerned with the
               actions requiring intelligence when performed  automation of appropriate behavior in
               by people” [Kur90]                              complex situations” [LS93]

 Idea: Rationality is performance-oriented rather than based on imitation.

Michael Kohlhase: Artificial Intelligence 1 81 2025-02-06

So, what does modern AI do?


 Acting Humanly: Turing test, not much pursued outside Loebner prize

 ≙ building pigeons that can fly so much like real pigeons that they can fool
pigeons
 Not reproducible, not amenable to mathematical analysis
 Thinking Humanly: ; Cognitive Science.

 How do humans think? How does the (human) brain work?


 Neural networks are an (extremely simple so far) approximation
 Thinking Rationally: Logics, Formalization of knowledge and inference
 You know the basics, we do some more, fairly widespread in modern AI

 Acting Rationally: How to make good action choices?


 Contains logics (one possible way to make intelligent decisions)
 We are interested in making good choices in practice (e.g. in AlphaGo)

Michael Kohlhase: Artificial Intelligence 1 82 2025-02-06

We now discuss all of the four facets in a bit more detail, as they all either contribute directly
to our discussion of AI methods or characterize neighboring disciplines.

Acting humanly: The Turing test


 Introduced by Alan Turing (1950) “Computing machinery and intelligence” [Tur50]:

 “Can machines think?” −→ “Can machines behave intelligently?”


 Definition 5.1.1. The Turing test is an operational test for intelligent behavior
based on an imitation game over teletext (arbitrary topic)

 It was predicted that by 2000, a machine might have a 30% chance of fooling a lay
person for 5 minutes.
 Note: In [Tur50], Alan Turing

 anticipated all major arguments against AI in following 50 years and


 suggested major components of AI: knowledge, reasoning, language understand-
ing, learning
 Problem: Turing test is not reproducible, constructive, or amenable to mathe-
matical analysis!

Michael Kohlhase: Artificial Intelligence 1 83 2025-02-06

Thinking humanly: Cognitive Science


 1960s: “cognitive revolution”: information processing psychology replaced prevail-
ing orthodoxy of behaviorism.
 Requires scientific theories of internal activities of the brain

 What level of abstraction? “Knowledge” or “circuits”?


 How to validate?: Requires
1. Predicting and testing behavior of human subjects or (top-down)
2. Direct identification from neurological data. (bottom-up)

 Definition 5.1.2. Cognitive science is the interdisciplinary, scientific study of the


mind and its processes. It examines the nature, the tasks, and the functions of
cognition.
 Definition 5.1.3. Cognitive neuroscience studies the biological processes and as-
pects that underlie cognition, with a specific focus on the neural connections in the
brain which are involved in mental processes.
 Both approaches/disciplines are now distinct from AI.
 Both share with AI the following characteristic: the available theories do not explain
(or engender) anything resembling human-level general intelligence

 Hence, all three fields share one principal direction!



Michael Kohlhase: Artificial Intelligence 1 84 2025-02-06

Thinking rationally: Laws of Thought


 Normative (or prescriptive) rather than descriptive
 Aristotle: what are correct arguments/thought processes?
 Several Greek schools developed various forms of logic: notation and rules of
derivation for thoughts; may or may not have proceeded to the idea of mechaniza-
tion.
 Direct line through mathematics and philosophy to modern AI
 Problems:

1. Not all intelligent behavior is mediated by logical deliberation


2. What is the purpose of thinking? What thoughts should I have out of all the
thoughts (logical or otherwise) that I could have?

Michael Kohlhase: Artificial Intelligence 1 85 2025-02-06

Acting Rationally
 Idea: Rational behavior ≙ doing the right thing!
 Definition 5.1.4. Rational behavior consists of always doing what is expected to
maximize goal achievement given the available information.
 Rational behavior does not necessarily involve thinking e.g., blinking reflex — but
thinking should be in the service of rational action.
 Aristotle: Every art and every inquiry, and similarly every action and pursuit, is
thought to aim at some good. (Nicomachean Ethics)

Michael Kohlhase: Artificial Intelligence 1 86 2025-02-06

The Rational Agents


 Definition 5.1.5. An agent is an entity that perceives and acts.
 Central Idea: This course is about designing agents that exhibit rational behavior,
i.e. for any given class of environments and tasks, we seek the agent (or class of
agents) with the best performance.

 Caveat: Computational limitations make perfect rationality unachievable


; design best program for given machine resources.

Michael Kohlhase: Artificial Intelligence 1 87 2025-02-06



5.2 Agents and Environments as a Framework for AI


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21843.
Given the discussion in the previous section, especially the ideas that “behaving rationally” could
be a suitable – since operational – goal for AI research, we build this into the paradigm “rational
agents” introduced by Stuart Russell and Eric H. Wefald in [RW91].

Agents and Environments


 Definition 5.2.1. An agent is anything that
 perceives its environment via sensors (a means of sensing the environment)
 acts on it with actuators (means of changing the environment).

Definition 5.2.2. Any recognizable, coherent employment of the actuators of an


agent is called an action.

 Example 5.2.3. Agents include humans, robots, softbots, thermostats, etc.

 Remark: The notion of an agent and its environment is intentionally designed to
be inclusive. We will classify and discuss subclasses of both later.

Michael Kohlhase: Artificial Intelligence 1 88 2025-02-06

One possible objection to this is that the agent and the environment are conceptualized as separate
entities; in particular, that the image suggests that the agent itself is not part of the environment.
Indeed that is intended, since it makes thinking about agents and environments easier and is of
little consequence in practice. In particular, the offending separation is relatively easily fixed if
needed.
Let us now try to express the agent/environment ideas introduced above in mathematical language
to add the precision we need to start the process towards the implementation of rational agents.

Modeling Agents Mathematically and Computationally


 Definition 5.2.4. A percept is the perceptual input of an agent at a specific time
instant.
 Definition 5.2.5. Any recognizable, coherent employment of the actuators of an
agent is called an action.

 Definition 5.2.6. The agent function f a of an agent a maps from percept histories
to actions:
f a : P∗ → A

 We assume that agents can always perceive their own actions. (but not necessarily
their consequences)
 Problem: Agent functions can become very big and may be uncomputable.
(theoretical tool only)

 Definition 5.2.7. An agent function can be implemented by an agent program


that runs on a (physical or hypothetical) agent architecture.

Michael Kohlhase: Artificial Intelligence 1 89 2025-02-06

Here we already see a problem that will recur often in this course: The mathematical formulation
gives us an abstract specification of what we want (here the agent function), but not directly a
way of how to obtain it. Here, the solution is to choose a computational model for agents (an
agent architecture) and see how the agent function can be implemented in an agent program.
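The following Python sketch (ours, not a definitive design) illustrates the distinction: the agent program stores the percept history and computes the next action from it, thus implementing the abstract agent function f_a : P* → A.

from abc import ABC, abstractmethod

class Agent(ABC):
    """A minimal agent program skeleton: percept history in, action out."""
    def __init__(self):
        self.percepts = []                 # the percept history seen so far

    def act(self, percept):
        self.percepts.append(percept)      # agents can always perceive their own input
        return self.choose_action()

    @abstractmethod
    def choose_action(self):
        """Concrete agent architectures (reflex, model-based, ...) differ here."""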

Agent Schema: Visualizing the Internal Agent Structure


 Agent Schema: We will use the following kind of agent schema to visualize the internal
structure of an agent:

   [Figure (cf. [RN03], Figure 2.1): an agent interacts with its environment through sensors
   (receiving percepts) and actuators (performing actions); different agents differ on the
   contents of the white box in the center.]

Michael Kohlhase: Artificial Intelligence 1 90 2025-02-06

Let us fortify our intuition about all of this with an example, which we will use often in the course
of the AI-1 course.

Example: Vacuum-Cleaner World and Agent

 percepts: location and contents, e.g., [A, Dirty]

 actions: Lef t, Right, Suck, N oOp

   A partial tabulation of a simple agent function:

   Percept sequence                          Action
   [A, Clean]                                Right
   [A, Dirty]                                Suck
   [B, Clean]                                Lef t
   [B, Dirty]                                Suck
   [A, Clean], [A, Clean]                    Right
   [A, Clean], [A, Dirty]                    Suck
   [A, Clean], [B, Clean]                    Lef t
   [A, Clean], [B, Dirty]                    Suck
   [A, Dirty], [A, Clean]                    Right
   [A, Dirty], [A, Dirty]                    Suck
   ...                                       ...
   [A, Clean], [A, Clean], [A, Clean]        Right
   [A, Clean], [A, Clean], [A, Dirty]        Suck
   ...                                       ...

 Science Question: What is the right agent function?

 AI Question: Is there an agent architecture and agent program that implements
it?

Michael Kohlhase: Artificial Intelligence 1 91 2025-02-06

The first implementation idea inspired by the table on the last slide would just be a table lookup
algorithm.

Table-Driven Agents
 Idea: We can just implement the agent function as a lookup table and lookup
actions.
 We can directly implement this:
function Table−Driven−Agent(percept) returns an action
persistent table /∗ a table of actions indexed by percept sequences ∗/
var percepts /∗ a sequence, initially empty ∗/
append percept to the end of percepts
action := lookup(percepts, table)
return action

 Problem: Why is this not a good idea?


 The table is much too large: even with n binary percepts whose order of occur-
rence does not matter, we have 2^n rows in the table.
 Who is supposed to write this table anyways, even if it “only” has a million
entries?

Michael Kohlhase: Artificial Intelligence 1 92 2025-02-06
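For the vacuum-cleaner world, the table-driven idea can be written down directly; the sketch below (ours, with made-up names) hard-codes a tiny fragment of the table from the previous slide and already hints at why the approach does not scale:

TABLE = {
    (("A", "Clean"),): "Right",
    (("A", "Dirty"),): "Suck",
    (("B", "Clean"),): "Left",
    (("B", "Dirty"),): "Suck",
    (("A", "Clean"), ("A", "Dirty")): "Suck",
    # ... one entry per possible percept sequence: exponentially many!
}

percepts = []
def table_driven_agent(percept):
    percepts.append(percept)
    return TABLE[tuple(percepts)]          # fails for any sequence not in the table

print(table_driven_agent(("A", "Clean")))  # Right
print(table_driven_agent(("A", "Dirty")))  # Suck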

Example: Vacuum-Cleaner Agent Program


 A much better implementation idea is to trigger actions from specific percepts.

 Example 5.2.8 (Agent Program).


procedure Reflex−Vacuum−Agent [location,status] returns an action
if status = Dirty then return Suck
else if location = A then return Right
else if location = B then return Left

 These are the kinds of agent programs we will be looking for in AI-1.

Michael Kohlhase: Artificial Intelligence 1 93 2025-02-06

5.3 Good Behavior ; Rationality


Now we try understand the mathematics of rational behavior in our quest to make the rational
agents paradigm implementable and take steps for realizing AI. A Video Nugget covering this
section can be found at https://fau.tv/clip/id/21844.

Rationality
 Idea: Try to design agents that are successful! (aka. “do the right thing”)

 Problem: What do we mean by “successful”, how do we measure “success”?


 Definition 5.3.1. A performance measure is a function that evaluates a sequence
of environments.
 Example 5.3.2. A performance measure for a vacuum cleaner could

 award one point per “square” cleaned up in time T ?


 award one point per clean “square” per time step, minus one per move?
 penalize for > k dirty squares?
 Definition 5.3.3. An agent is called rational, if it chooses whichever action max-
imizes the expected value of the performance measure given the percept sequence
to date.
 Critical Observation: We only need to maximize the expected value, not the
actual value of the performance measure!
 Question: Why is rationality a good quality to aim for?

Michael Kohlhase: Artificial Intelligence 1 94 2025-02-06
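The second performance measure from Example 5.3.2 can be rendered as a few lines of Python; the history representation (a list of (clean squares, action) pairs) is our own assumption for illustration:

def performance(history):
    """history: list of (clean_squares, action) pairs, one per time step."""
    score = 0
    for clean_squares, action in history:
        score += len(clean_squares)                # one point per clean square per step
        if action in ("Left", "Right"):            # minus one point per move
            score -= 1
    return score

print(performance([({"A"}, "Suck"), ({"A"}, "Right"), ({"A", "B"}, "Suck")]))  # 3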

Let us see how the observation that we only need to maximize the expected value, not the actual
value of the performance measure affects the consequences.

Consequences of Rationality: Exploration, Learning, Autonomy


 Note: A rational agent need not be perfect:
 It only needs to maximize expected value (rational ̸= omniscient)
 need not predict e.g. very unlikely but catastrophic events in the future
 Percepts may not supply all relevant information (rational ̸= clairvoyant)
 if we cannot perceive things we do not need to react to them.
 but we may need to try to find out about hidden dangers (exploration)

 Action outcomes may not be as expected (rational ̸= successful)


 but we may need to take action to ensure that they do (more often)
(learning)
 Note: Rationality may entail exploration, learning, autonomy (depending on the
environment / task)
 Definition 5.3.4. An agent is called autonomous, if it does not rely on the prior
knowledge of its designer about the environment.
 Autonomy avoids fixed behaviors that can become unsuccessful in a changing en-
vironment. (anything else would be
irrational)
 The agent may have to learn all relevant traits, invariants, properties of the envi-
ronment and actions.

Michael Kohlhase: Artificial Intelligence 1 95 2025-02-06

For the design of an agent for a specific task – i.e. choosing an agent architecture and designing an
agent program – we have to take into account the performance measure, the environment, and the
characteristics of the agent itself; in particular its actions and sensors.

PEAS: Describing the Task Environment


 Observation: To design a rational agent, we must specify the task environment in
terms of performance measure, environment, actuators, and sensors, together called
the PEAS components.
 Example 5.3.5. When designing an automated taxi:
 Performance measure: safety, destination, profits, legality, comfort, . . .
 Environment: US streets/freeways, traffic, pedestrians, weather, . . .
 Actuators: steering, accelerator, brake, horn, speaker/display, . . .
 Sensors: video, accelerometers, gauges, engine sensors, keyboard, GPS, . . .
 Example 5.3.6 (Internet Shopping Agent). The task environment:

 Performance measure: price, quality, appropriateness, efficiency


 Environment: current and future WWW sites, vendors, shippers
 Actuators: display to user, follow URL, fill in form
 Sensors: HTML pages (text, graphics, scripts)

Michael Kohlhase: Artificial Intelligence 1 96 2025-02-06

The PEAS criteria are essentially a laundry list of what an agent design task description should
include.

Examples of Agents: PEAS descriptions



   Agent Type             Performance measure       Environment              Actuators                    Sensors
   Chess/Go player        win/loose/draw            game board               moves                        board position
   Medical diagnosis      accuracy of diagnosis     patient, staff           display questions,           keyboard entry of
   system                                                                    diagnoses                    symptoms
   Part-picking robot     percentage of parts       conveyor belt with       jointed arm and hand         camera, joint angle
                          in correct bins           parts, bins                                           sensors
   Refinery controller    purity, yield, safety     refinery, operators      valves, pumps,               temperature, pressure,
                                                                             heaters, displays            chemical sensors
   Interactive English    student’s score on test   set of students,         display exercises,           keyboard entry
   tutor                                            testing accuracy         suggestions, corrections

Michael Kohlhase: Artificial Intelligence 1 97 2025-02-06

Agents
 Which are agents?
(A) James Bond.
(B) Your dog.
(C) Vacuum cleaner.
(D) Thermometer.

 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 98 2025-02-06

5.4 Classifying Environments


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21869.
It is important to understand that the kind of the environment has a very profound effect on the
agent design. Depending on the kind, different kinds of agents are needed to be successful. So be-
fore we discuss common kinds of agents in section 5.5, we will classify the kinds of environments.

Environment types
 Observation 5.4.1. Agent design is largely determined by the type of environment
it is intended for.

 Problem: There is a vast number of possible kinds of environments in AI.


 Solution: Classify along a few “dimensions”. (independent characteristics)
 Definition 5.4.2. For an agent a we classify the environment e of a by its type,
which is one of the following. We call e

1. fully observable, iff the a’s sensors give it access to the complete state of the
environment at any point in time, else partially observable.

2. deterministic, iff the next state of the environment is completely determined by


the current state and a’s action, else stochastic.
3. episodic, iff a’s experience is divided into atomic episodes, where it perceives and
then performs a single action. Crucially, the next episode does not depend on
previous ones. Non-episodic environments are called sequential.
4. dynamic, iff the environment can change without an action performed by a, else
static. If the environment does not change but a’s performance measure does,
we call e semidynamic.
5. discrete, iff the sets of e’s state and a’s actions are countable, else continuous.
6. single-agent, iff only a acts on e; else multi-agent (when must we count parts of
e as agents?)

Michael Kohlhase: Artificial Intelligence 1 99 2025-02-06

Some examples will help us understand the classification of environments better.

Environment Types (Examples)


 Example 5.4.3. Some environments classified:

Solitaire Backgammon Internet shopping Taxi


fully observable No Yes No No
deterministic Yes No Partly No
episodic No Yes No No
static Yes Semi Semi No
discrete Yes Yes Yes No
single-agent Yes No Yes (except auctions) No

 Note: Take the example above with a grain of salt. There are often multiple
interpretations that yield different classifications and different agents. (agent
designer’s choice)
 Example 5.4.4. Seen as a multi-agent game, chess is deterministic, as a single-
agent game, it is stochastic.
 Observation 5.4.5. The real world is (of course) a partially observable, stochastic,
sequential, dynamic, continuous, and multi-agent environment. (worst case for AI)
 Preview: We will concentrate on the “easy” environment types (fully observ-
able, deterministic, episodic, static, and single-agent) in AI-1 and extend them to
“realworld”-compatible ones in AI-2.

Michael Kohlhase: Artificial Intelligence 1 100 2025-02-06

In the AI-1 course we will work our way from the simpler environment types to the more general
ones. Each environment type will need its own agent types specialized for surviving and doing well
in them.

5.5 Types of Agents


We will now discuss the main types of agents we will encounter in this course, get an impression
of the variety, and what they can and cannot do. We will start from simple reflex agents, add
state, and utility, and finally add learning. A Video Nugget covering this section can be found
at https://fau.tv/clip/id/21926.

Agent Types
 Observation: So far we have described (and analyzed) agents only by their be-
havior (cf. agent function f : P ∗ → A).
 Problem: This does not help us to build agents. (the goal of AI)

 To build an agent, we need to fix an agent architecture and come up with an agent
program that runs on it.
 Preview: Four basic types of agent architectures in order of increasing generality:
1. simple reflex agents
2. model-based agents
3. goal-based agents
4. utility-based agents
All these can be turned into learning agents.

Michael Kohlhase: Artificial Intelligence 1 101 2025-02-06

Simple reflex agents


 Definition 5.5.1. A simple reflex agent is an agent a that only bases its actions
on the last percept: so the agent function simplifies to f a : P → A.
 Agent Schema:

   [Figure (cf. [RN03], Figure 2.9): schematic diagram of a simple reflex agent. Sensors
   determine "what the world is like now"; condition-action rules determine "what action I
   should do now", which is passed to the actuators.]


 Example 5.5.2 (Agent Program).

procedure Reflex−Vacuum−Agent [location,status] returns an action
  if status = Dirty then . . .

Michael Kohlhase: Artificial Intelligence 1 102 2025-02-06

Simple reflex agents (continued)


 General Agent Program:
function Simple−Reflex−Agent (percept) returns an action
persistent: rules /∗ a set of condition−action rules∗/
state := Interpret−Input(percept)
rule := Rule−Match(state,rules)
action := Rule−action[rule]
return action

 Problem: Simple reflex agents can only react to the perceived state of the envi-
ronment, not to changes.
 Example 5.5.3. Automobile tail lights signal braking by brightening. A simple
reflex agent would have to compare subsequent percepts to realize this.

 Problem: Partially observable environments get simple reflex agents into trouble.
 Example 5.5.4. Vacuum cleaner robot with defective location sensor ; infinite
loops.

Michael Kohlhase: Artificial Intelligence 1 103 2025-02-06

Model-based Reflex Agents: Idea


 Idea: Keep track of the state of the world we cannot see in an internal model.
 Agent Schema:

   [Figure (cf. [RN03], Figure 2.11): a model-based reflex agent. An internal state, updated
   using knowledge about "how the world evolves" and "what my actions do", represents
   "what the world is like now"; condition-action rules then determine "what action I should
   do now".]


Michael Kohlhase: Artificial Intelligence 1 104 2025-02-06

Model-based Reflex Agents: Definition

 Definition 5.5.5. A model-based agent is an agent whose actions depend on

 a world model: a set S of possible states.

 a sensor model S that given a state s and a percept p determines a new state
S(s, p).
 a transition model T , that predicts a new state T (s, a) from a state s and an
action a.
 An action function f that maps (new) states to an actions.
If the world model of a model-based agent A is in state s and A has taken action
a, A will transition to state s′ = T (S(s, p), a) and take action a′ = f (s′ ).
 Note: As different percept sequences lead to different states, so the agent function
f a : P ∗ → A no longer depends only on the last percept.

 Example 5.5.6 (Tail Lights Again). Model-based agents can handle Example 5.5.3 if the
states include a concept of tail light brightness.

Michael Kohlhase: Artificial Intelligence 1 105 2025-02-06
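The state-update loop of Definition 5.5.5 can be sketched directly in Python; the concrete sensor model, transition model, and action function passed in below are made-up placeholders:

class ModelBasedAgent:
    def __init__(self, initial_state, sensor_model, transition_model, action_fn):
        self.state = initial_state
        self.last_action = None      # no action taken yet; T must handle None
        self.S = sensor_model        # S(s, p): fold the percept into the state
        self.T = transition_model    # T(s, a): predict the effect of an action
        self.f = action_fn           # f(s): choose an action for a state

    def act(self, percept):
        # s' = T(S(s, p), a), then a' = f(s')
        self.state = self.T(self.S(self.state, percept), self.last_action)
        self.last_action = self.f(self.state)
        return self.last_action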

Model-Based Agents (continued)


 Observation 5.5.7. The agent program for a model-based agent is of the following
form:
function Model−Based−Agent (percept) returns an action
var state /∗ a description of the current state of the world ∗/
persistent rules /∗ a set of condition−action rules ∗/
var action /∗ the most recent action, initially none ∗/
state := Update−State(state,action,percept)
rule := Rule−Match(state,rules)
action := Rule−action(rule)
return action

 Problem: Having a world model does not always determine what to do (rationally).
 Example 5.5.8. Coming to an intersection, where the agent has to decide between
going left and right.

Michael Kohlhase: Artificial Intelligence 1 106 2025-02-06

Goal-based Agents
 Problem: A world model does not always determine what to do (rationally).
 Observation: Having a goal in mind does! (determines future actions)
 Agent Schema:
   [Figure (cf. [RN03], Figure 2.13): a model-based, goal-based agent. It keeps track of the
   world state ("what it will be like if I do action A") as well as a set of goals it is trying
   to achieve, and chooses an action that will (eventually) lead to the achievement of its
   goals.]

Michael Kohlhase: Artificial Intelligence 1 107 2025-02-06

Goal-based agents (continued)

 Definition 5.5.9. A goal-based agent is a model-based agent with transition model
T that deliberates actions based on goals and a world model: It employs

 a set G of goals and a goal function f that given a (new) state s selects an
action a to best reach G.

The action function is then s 7→ f (T (s), G).

 Observation: A goal-based agent is more flexible in the knowledge it can utilize.

 Example 5.5.10. A goal-based agent can easily be changed to go to a new desti-
nation, a model-based agent’s rules make it go to exactly one destination.

Michael Kohlhase: Artificial Intelligence 1 108 2025-02-06

Utility-based Agents

 Definition 5.5.11. A utility-based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.

 Agent Schema:

   [Figure (cf. [RN03], Figure 2.14): a model-based, utility-based agent. It uses a model of
   the world, along with a utility function that measures its preferences among states of the
   world ("how happy I will be in such a state"). Then it chooses the action that leads to the
   best expected utility, where expected utility is computed by averaging over all possible
   outcome states, weighted by the probability of the outcome.]

Michael Kohlhase: Artificial Intelligence 1 109 2025-02-06
Utility-based vs. Goal-based Agents


 Question: What is the difference between goal-based and utility-based agents?

 Utility-based Agents are a Generalization: We can always force goal-directedness
by a utility function that only rewards goal states.

 Goal-based Agents can do less: A utility function allows rational decisions where
mere goals are inadequate:

 conflicting goals (utility gives tradeoff to make rational decisions)
 goals obtainable by uncertain actions (utility × likelihood helps)

Michael Kohlhase: Artificial Intelligence 1 110 2025-02-06
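The “generalization” direction can be made concrete with a tiny sketch (illustrative only, not
part of the lecture notes; all names and numbers are made up): a goal set induces an indicator
utility function, and with uncertain actions the expected utility ranks actions in a way a bare
goal test cannot.

# Hypothetical sketch: turning goal-directedness into a utility function.
# A goal set G induces the indicator utility U(s) = 1 if s is a goal state, else 0.

def goal_utility(goal_states):
    """Return a utility function that only rewards goal states."""
    return lambda state: 1.0 if state in goal_states else 0.0

def expected_utility(utility, outcomes):
    """outcomes: list of (probability, successor_state) pairs for one action."""
    return sum(p * utility(s) for p, s in outcomes)

# With uncertain actions, utility x likelihood picks the action that reaches the
# goal most reliably -- something a plain goal test cannot express.
U = goal_utility({"at_destination"})
actions = {
    "fast_route": [(0.7, "at_destination"), (0.3, "stuck_in_traffic")],
    "safe_route": [(0.95, "at_destination"), (0.05, "stuck_in_traffic")],
}
best = max(actions, key=lambda a: expected_utility(U, actions[a]))  # "safe_route"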
Learning Agents

 Definition 5.5.12. A learning agent is an agent that augments the performance


element – which determines actions from percept sequences – with

 a learning element which makes improvements to the agent’s components,
 a critic which gives feedback to the learning element based on an external per-
formance standard,
 a problem generator which suggests actions that lead to new and informative
experiences.
 The performance element is what we took for the whole agent above.

Michael Kohlhase: Artificial Intelligence 1 111 2025-02-06

Learning Agents
 Agent Schema:
[Agent schema diagram, cf. Figure 2.15 “A general learning agent”: the critic compares sensor
input against a performance standard and gives feedback to the learning element, which makes
changes to the performance element and sets learning goals for the problem generator.]


Michael Kohlhase: Artificial Intelligence 1 112 2025-02-06

Learning Agents: Example

 Example 5.5.13 (Learning Taxi Agent). It has the components

 Performance element: the knowledge and procedures for selecting driving actions.
(this controls the actual driving)
 Critic: observes the world and informs the learning element (e.g. when
passengers complain about brutal braking)
 Learning element: modifies the braking rules in the performance element (e.g.
earlier, softer)
 Problem generator: might experiment with braking on different road surfaces

 The learning element can make changes to any “knowledge components” of the
diagram, e.g. in the
 model from the percept sequence (how the world evolves)
 success likelihoods by observing action outcomes (what my actions do)

 Observation: here, the passenger complaints serve as part of the “external perfor-
mance standard” since they correlate to the overall outcome – e.g. in form of tips
or blacklists.
Michael Kohlhase: Artificial Intelligence 1 113 2025-02-06
can be constructed to improve every part of the agent.
The critic tells the learning element how well the agent is doing with respect to a fixed
Domain-Specific vs. General Agents

Domain-Specific Agent vs. General Agent

Solver specific to a particular problem    vs.  Solver based on a description in a general
(“domain”).                                     problem-description language (e.g., the
                                                rules of any board game).
More efficient.                            vs.  Much less design/maintenance work.

 What kind of agent are you?

Michael Kohlhase: Artificial Intelligence 1 114 2025-02-06

5.6 Representing the Environment in Agents


We now come to a very important topic, which has a great influence on agent design: how does
the agent represent the environment? After all, all agent designs above (except the simple
reflex agent) maintain a notion of world state and how the world state evolves given percepts and
actions. The form of this model crucially influences the algorithms we can build. A Video
Nugget covering this section can be found at https://fau.tv/clip/id/21925.

Representing the Environment in Agents


 We have seen various components of agents that answer questions like

 What is the world like now?


 What action should I do now?
 What do my actions do?
 Next natural question: How do these work? (see the rest of the course)

 Important Distinction: How the agent implements the world model.


 Definition 5.6.1. We call a state representation
 atomic, iff it has no internal structure (black box)
 factored, iff each state is characterized by attributes and their values.
 structured, iff the state includes representations of objects, their properties and
relationships.
 Intuition: From atomic to structured, the representations give the agent designer more
flexibility and the algorithms more components to process.

 Also: The additional internal structure will make the algorithms more complex.

Michael Kohlhase: Artificial Intelligence 1 115 2025-02-06


5.7. RATIONAL AGENTS: SUMMARY 83

Again, we fortify our intuitions with an illustration and an example.

Atomic/Factored/Structured State Representations


 Schematically: We can visualize the three kinds by

[Figure: schematic visualizations of (a) atomic, (b) factored, and (c) structured state
representations.]

 Example 5.6.2. Consider the problem of finding a driving route from one end of
a country to the other via some sequence of cities.
 In an atomic representation the state is represented by the name of a city.
 In a factored representation we may have attributes “gps-location”, “gas”,. . .
(allows information sharing between states and uncertainty)
 But how to represent a situation where a large truck is blocking the road, since it
is trying to back into a driveway, but a loose cow is blocking its path? (attribute
“TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow” is unlikely)
 In a structured representation, we can have objects for trucks, cows, etc. and
their relationships. (at “run-time”)

Michael Kohlhase: Artificial Intelligence 1 116 2025-02-06

Note: The set of states in atomic representations and attributes in factored ones is determined
at design time, while the objects and their relationships in structured ones are discovered at
“runtime”.
Here – as always when we evaluate representations – the crucial aspect to look out for are the
identity conditions: when do we consider two representations equal, and when can we (or more
crucially algorithms) distinguish them?
For instance, factored representations make two world representations equal, iff the values of
the attributes – which are determined at agent design time and thus immutable by the agent –
are all equal. So the agent designer has to make sure to add all the attributes to the chosen
representation that are necessary to distinguish environments that the agent program needs to
treat differently.
It is tempting to think that the situation with atomic representations is easier, since we can
“simply” add enough states for the necessary distinctions, but in practice this set of states may
have to be infinite, while in factored or structured representations we can keep representations
finite.
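To make the three styles concrete, here is a minimal illustrative sketch (not from the lecture
notes; all names and values are invented) of how the same driving situation might be encoded:

# Minimal sketch: the same situation in the three representation styles.

# Atomic: a state is an opaque label; algorithms can only test states for equality.
atomic_state = "Arad"

# Factored: a state is a fixed set of attribute/value pairs chosen at design time.
factored_state = {"gps_location": (46.18, 21.31), "gas": 0.7, "truck_ahead": True}

# Structured: a state contains objects and relations between them, discovered at runtime.
structured_state = {
    "objects": {"truck1": "Truck", "cow1": "Cow", "driveway1": "Driveway"},
    "relations": [("blocks", "truck1", "road"),
                  ("backing_into", "truck1", "driveway1"),
                  ("blocks", "cow1", "truck1")],
}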

5.7 Rational Agents: Summary

Summary
 Agents interact with environments through actuators and sensors.
84 CHAPTER 5. RATIONAL AGENTS: AN AI FRAMEWORK

 The agent function describes what the agent does in all circumstances.
 The performance measure evaluates the environment sequence.
 A perfectly rational agent maximizes expected performance.

 Agent programs implement (some) agent functions.


 PEAS descriptions define task environments.
 Environments are categorized along several dimensions:
fully observable? deterministic? episodic? static? discrete? single-agent?

 Several basic agent architectures exist:


reflex, model-based, goal-based, utility-based

Michael Kohlhase: Artificial Intelligence 1 117 2025-02-06

Corollary: We are Agent Designers!


 State: We have seen (and will add more details to) different

 agent architectures,
 corresponding agent programs and algorithms, and
 world representation paradigms.
 Problem: Which one is the best?

 Answer: That really depends on the environment type they have to survive/thrive
in! The agent designer – i.e. you – has to choose!
 The course gives you the necessary competencies.

 There is often more than one reasonable choice.


 Often we have to build agents and let them compete to
see what really works.

 Consequence: The rational agents paradigm used in this course challenges you
to become a good agent designer.

Michael Kohlhase: Artificial Intelligence 1 118 2025-02-06


Part II

General Problem Solving

85
87

This part introduces search-based methods for general problem solving using atomic and factored
representations of states.
Concretely, we discuss the basic techniques of search-based symbolic AI. First in the shape of
classical and heuristic search and adversarial search paradigms. Then in constraint propagation,
where we see the first instances of inference-based methods.
88
Chapter 6

Problem Solving and Search

In this chapter, we will look at a class of algorithms called search algorithms. These are
algorithms that help in quite general situations where there is a precisely described problem that
needs to be solved. Hence the name “General Problem Solving” for the area.

6.1 Problem Solving


A Video Nugget covering this section can be found at https://fau.tv/clip/id/21927.
Before we come to the search algorithms themselves, we need to get a grip on the types of problems
themselves and how we can represent them, and on what the various types entail for the problem
solving process.
The first step is to classify the problem solving process by the amount of knowledge we have
available. It makes a difference, whether we know all the factors involved in the problem before
we actually are in the situation. In this case, we can solve the problem in the abstract, i.e. make
a plan before we actually enter the situation (i.e. offline), and then when the problem arises, only
execute the plan. If we do not have complete knowledge, then we can only make partial plans, and
have to be in the situation to obtain new knowledge (e.g. by observing the effects of our actions or
the actions of others). As this is much more difficult we will restrict ourselves to offline problem
solving.

Problem Solving: Introduction


 Recap: Agents perceive the environment and compute an action.
 In other words: Agents continually solve “the problem of what to do next”.

 AI Goal: Find algorithms that help solving problems in general.


 Idea: If we can describe/represent problems in a standardized way, we may have
a chance to find general algorithms.
 Concretely: We will use the following two concepts to describe problems

 States: A set of possible situations in our problem domain (=b environments)
 Actions: that get us from one state to another. (=b agents)
A sequence of actions is a solution, if it brings us from an initial state to a goal
state. Problem solving computes solutions from problem formulations.

89
90 CHAPTER 6. PROBLEM SOLVING AND SEARCH

 Definition 6.1.1. In offline problem solving an agent computes an action sequence
based on complete knowledge of the environment.
 Remark 6.1.2. Offline problem solving only works in fully observable, deterministic,
static, and episodic environments.

 Definition 6.1.3. In online problem solving an agent computes one action at a


time based on incoming perceptions.
 This Semester: We largely restrict ourselves to offline problem solving. (easier)

Michael Kohlhase: Artificial Intelligence 1 119 2025-02-06

We will use the following problem as a running example. It is simple enough to fit on one slide
and complex enough to show the relevant features of the problem solving algorithms we want to
talk about.

Example: Traveling in Romania


 Scenario: An agent is on holiday in Romania; currently in Arad; flight home leaves
tomorrow from Bucharest; how to get there?
We have a map:

[Figure 3.2: A simplified road map of part of Romania, with driving distances between
neighboring cities.]

 Formulate the Problem:

 States: various cities.
 Actions: drive between cities.
 Solution: Appropriate sequence of cities, e.g.: Arad, Sibiu, Fagaras, Bucharest
Michael Kohlhase: Artificial Intelligence 1 120 2025-02-06
Given this example to fortify our intuitions, we can now turn to the formal definition of problem
formulations and their solutions.

Problem Formulation

 Definition 6.1.4. A problem formulation models a situation using states and
actions at an appropriate level of abstraction. (do not model things like “put on my
left sock”, etc.)
 it describes the initial state (we are in Arad)

 it also limits the objectives by specifying goal states. (excludes, e.g. to stay
another couple of weeks.)
A solution is a sequence of actions that leads from the initial state to a goal state.
Problem solving computes solutions from problem formulations.

 Finding the right level of abstraction and the required (not more!) information is
often the key to success.

Michael Kohlhase: Artificial Intelligence 1 121 2025-02-06

The Math of Problem Formulation: Search Problems


 Definition 6.1.5. A search problem Π := ⟨S , A, T , I , G ⟩ consists of a set S of
states, a set A of actions, and a transition model T : A×S → P(S) that assigns to
any action a ∈ A and state s ∈ S a set of successor states.
Certain states in S are designated as goal states (also called terminal state) (G ⊆ S
with G ̸= ∅) and initial states I ⊆ S.
 Definition 6.1.6. We say that an action a ∈ A is applicable in state s ∈ S, iff
T (a, s) ̸= ∅ and that any s′ ∈ T (a, s) is a result of applying action a to state s.
We call Ta : S → P(S) with Ta (s) := T (a, s) the result relation for a and
TA := ⋃a∈A Ta the result relation of Π.

 Definition 6.1.7. The graph ⟨S, TA ⟩ is called the state space induced by Π.
 Definition 6.1.8. A solution for Π consists of a sequence a1 , . . ., an of actions
such that for all 1 ≤ i ≤ n

 ai is applicable to state si−1 , where s0 ∈ I, and
 si ∈ Tai (si−1 ), and sn ∈ G.
 Idea: A solution brings us from I to a goal state via applicable actions.

 Definition 6.1.9. Often we add a cost function c : A → R+ 0 that associates a step


cost c(a) to an action a ∈ A. The cost of a solution is the sum of the step costs of
its actions.

Michael Kohlhase: Artificial Intelligence 1 122 2025-02-06

Observation: The formulation of problems from ?? uses an atomic (black-box) state represen-
tation. It has enough functionality to construct the state space but nothing else. We will come
back to this in slide ??.
Remark 6.1.10. Note that search problems formalize problem formulations by making many of
the implicit constraints explicit.
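To make the definitions above concrete, here is a minimal sketch of the search problem structure
in Python; it is not part of the lecture notes, and all class and field names are chosen for
illustration only.

# Illustrative sketch of a search problem <S, A, T, I, G> with step costs.
from dataclasses import dataclass
from typing import Callable, List, Set

State = str
Action = str

@dataclass
class SearchProblem:
    states: Set[State]
    actions: Set[Action]
    transition: Callable[[Action, State], Set[State]]  # T : A x S -> P(S)
    initial: Set[State]                                 # I, the initial states
    goals: Set[State]                                   # G, the goal states
    cost: Callable[[Action], float] = lambda a: 1.0     # c : A -> non-negative reals

    def applicable(self, a: Action, s: State) -> bool:
        """a is applicable in s iff T(a, s) is non-empty."""
        return len(self.transition(a, s)) > 0

    def is_solution(self, actions: List[Action], s0: State) -> bool:
        """Check Definition 6.1.8 along one run (deterministic case, |T(a,s)| <= 1)."""
        s = s0
        for a in actions:
            succ = self.transition(a, s)
            if not succ:
                return False               # action not applicable
            s = next(iter(succ))
        return s in self.goals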

Structure Overview: Search Problem


 The structure overview for search problems:
                    ⟨ S    Set            states,
                      A    Set            actions,
search problem  =     T    A×S → P(S)     transition model,
                      I    S              initial state,
                      G    P(S)           goal states ⟩

Michael Kohlhase: Artificial Intelligence 1 123 2025-02-06

We will now specialize ?? to deterministic, fully observable environments, i.e. environments where
actions only have one – assured – outcome state.

Search Problems in deterministic, fully observable Environments


 This semester, we will restrict ourselves to search problems, where(extend in AI II)

 |T (a, s)| ≤ 1 for the transition models and (⇝ deterministic environment)


 I = {s0 } (⇝ fully observable environment)
Definition 6.1.11. We call a search problem with transition model T deterministic,
iff |T (a, s)| ≤ 1.


 Definition 6.1.12. In a deterministic search problem, Ta induces a partial function
Sa : S ⇀ S whose natural domain is the set of states where a is applicable: Sa (s):=s′
if Ta (s) = {s′ } and undefined at s otherwise. We call Sa the successor function for a
and Sa (s) the successor state of s.

 Definition 6.1.13. The predicate that tests for goal states is called a goal test.

Michael Kohlhase: Artificial Intelligence 1 124 2025-02-06

6.2 Problem Types


Note that the definition of a search problem is very general, it applies to many many real-world
problems. So we will try to characterize these by difficulty. A Video Nugget covering this
section can be found at https://fau.tv/clip/id/21928.

Problem types
 Definition 6.2.1. A search problem is called a single state problem, iff it is
 fully observable (at least the initial state)
 deterministic (unique successor states)
 static (states do not change other than by our own actions)
 discrete (a countable number of states)
 Definition 6.2.2. A search problem is called a multi state problem, iff it is
 partially observable (e.g. multiple initial states)
 deterministic, static, discrete
6.2. PROBLEM TYPES 93

 Definition 6.2.3. A search problem is called a contingency problem, iff


 the environment is non deterministic (solution can branch, depending on
contingencies)
 the state space is unknown (like a baby, agent has to learn about states and
actions)

Michael Kohlhase: Artificial Intelligence 1 125 2025-02-06

We will explain these problem types with another example. The problem P is very simple: We
have a vacuum cleaner and two rooms. The vacuum cleaner is in one room at a time. The floor
can be dirty or clean.
The possible states are determined by the position of the vacuum cleaner and the information,
whether each room is dirty or not. Obviously, there are eight states: S = {1, 2, 3, 4, 5, 6, 7, 8} for
simplicity.
The goal is to have both rooms clean, the vacuum cleaner can be anywhere. So the set G of
goal states is {7, 8}. In the single-state version of the problem, [right, suck] is the shortest
solution, but [suck, right, suck] is also one. In the multiple-state version we have

[right{2, 4, 6, 8}, suck{4, 8}, lef t{3, 7}, suck{7}]

Example: vacuum-cleaner world


 Single-state Problem:
 Start in 5
 Solution: [right, suck]

[Figure 3.3: The state space for the vacuum world. Links denote actions: L = Left, R = Right,
S = Suck.]

 Multiple-state Problem:
 Start in {1, 2, 3, 4, 5, 6, 7, 8}
 Solution: [right, suck, left, suck]
right → {2, 4, 6, 8}
suck → {4, 8}
left → {3, 7}
suck → {7}

Michael Kohlhase: Artificial Intelligence 1 126 2025-02-06
Example: Vacuum-Cleaner World (continued)

 Contingency Problem:
 Murphy’s Law: suck can dirty a clean carpet
 Local sensing: dirty/not dirty at location only
 Start in: {1, 3}
 Solution: [suck, right, suck]
suck → {5, 7}
right → {6, 8}
suck → {6, 8}
 better: [suck, right, if dirt then suck] (decide whether in 6 or 8 using local
sensing)

Michael Kohlhase: Artificial Intelligence 1 127 2025-02-06

In the contingency version of P a solution is the following:


[suck{5, 7}, right → {6, 8}, suck → {6, 8}, suck{5, 7}]

etc. Of course, local sensing can help: narrow {6, 8} to {6} or {8}, if we are in the first, then
suck.
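The multiple-state reasoning above is just computation on belief states (sets of possible states).
The following sketch is illustrative only and encodes states as (robot, dirtA, dirtB) triples
rather than the numbers 1–8 used above, to avoid committing to a particular numbering.

# Illustrative sketch: belief-state updates for the two-room vacuum world.
from itertools import product

STATES = set(product("AB", [True, False], [True, False]))   # 8 states
GOALS  = {s for s in STATES if not s[1] and not s[2]}        # both rooms clean

def result(action, state):
    robot, dirtA, dirtB = state
    if action == "right":
        return ("B", dirtA, dirtB)
    if action == "left":
        return ("A", dirtA, dirtB)
    if action == "suck":
        return (robot, False if robot == "A" else dirtA,
                       False if robot == "B" else dirtB)
    return state

def update(belief, action):
    """Progress a belief state (set of possible states) through one action."""
    return {result(action, s) for s in belief}

belief = set(STATES)                       # multiple-state problem: start anywhere
for a in ["right", "suck", "left", "suck"]:
    belief = update(belief, a)
print(belief <= GOALS)                     # True: the plan works from every start state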

Single-state problem formulation


 Defined by the following four items
1. Initial state: (e.g. Arad)
2. Successor function Sa (s): (e.g. SgoZer (Arad) = Zerind, SgoSib (Arad) = Sibiu, . . . )
3. Goal test: (e.g. x = Bucharest (explicit test) )
noDirt(x) (implicit test)
4. Path cost (optional):(e.g. sum of distances, number of operators executed, etc.)
 Solution: A sequence of actions leading from the initial state to a goal state.

Michael Kohlhase: Artificial Intelligence 1 128 2025-02-06

“Path cost”: There may be more than one solution and we might want to have the “best” one in
a certain sense.

Selecting a state space


 Abstraction: Real world is absurdly complex!
State space must be abstracted for problem solving.

 (Abstract) state: Set of real states.


 (Abstract) operator: Complex combination of real actions.
 Example: Arad → Zerind represents complex set of possible routes.
 (Abstract) solution: Set of real paths that are solutions in the real world.
6.2. PROBLEM TYPES 95

Michael Kohlhase: Artificial Intelligence 1 129 2025-02-06

“State”: e.g., we don’t care about tourist attractions found in the cities along the way. But this is
problem dependent. In a different problem it may well be appropriate to include such information
in the notion of state.
“Realizability”: one could also say that the abstraction must be sound wrt. reality.

Example: The 8-puzzle

[Figure 3.4: A typical instance of the 8-puzzle: a scrambled start state and the goal state with
the tiles 1–8 in order.]

States? Actions?. . .
States       integer locations of tiles
Actions      left, right, up, down
Goal test    = goal state?
Path cost    1 per move

Michael Kohlhase: Artificial Intelligence 1 130 2025-02-06

How many states are there? N factorial, so it is not obvious that the problem is in NP. One
needs to show, for example, that polynomial length solutions do always exist. Can be done by
combinatorial arguments on the state space graph (really?).
Some rule-books give a different goal state for the 8-puzzle: starting with 1, 2, 3 in the top row
and having the hole in the lower right corner. This is completely irrelevant for the example and
its significance to AI-1.
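As an illustrative sketch (not from the notes), the 8-puzzle actions can be implemented as moves
of the blank square – one common convention; a state is a 9-tuple read row by row with 0 for
the blank.

# Illustrative sketch: successor states of an 8-puzzle board.
MOVES = {"up": -3, "down": +3, "left": -1, "right": +1}

def successors(state):
    """Yield (action, new_state) pairs for all applicable blank moves."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for action, delta in MOVES.items():
        if (action == "up" and row == 0) or (action == "down" and row == 2) \
           or (action == "left" and col == 0) or (action == "right" and col == 2):
            continue                       # move would leave the board
        target = blank + delta
        new = list(state)
        new[blank], new[target] = new[target], new[blank]
        yield action, tuple(new)

goal  = (1, 2, 3, 4, 5, 6, 7, 8, 0)
start = (1, 2, 3, 4, 5, 6, 7, 0, 8)
print(dict(successors(start)))             # moving the blank right reaches the goal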
Example: Vacuum-cleaner

[Figure 2.2: A vacuum-cleaner world with just two locations.]

States? Actions?. . .
States       integer dirt and robot locations
Actions      left, right, suck, noOp
Goal test    notdirty?
Path cost    1 per operation (0 for noOp)

Michael Kohlhase: Artificial Intelligence 1 131 2025-02-06

Example: Robotic assembly

States? Actions?. . .
States real-valued coordinates of
robot joint angles and parts of the object to be assembled
Actions continuous motions of robot joints
Goal test assembly complete?
Path cost time to execute

Michael Kohlhase: Artificial Intelligence 1 132 2025-02-06

General Problems
 Question: Which are “Problems”?

(A) You didn’t understand any of the lecture.


(B) Your bus today will probably be late.
(C) Your vacuum cleaner wants to clean your apartment.
(D) You want to win a chess game.

 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 133 2025-02-06

6.3 Search
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21956.

Tree Search Algorithms


 Note: The state space of a search problem ⟨S , A, T , I , G ⟩ is a graph ⟨S, TA ⟩.

 As graphs are difficult to compute with, we often compute a corresponding tree


and work on that. (standard trick in graph algorithms)
 Definition 6.3.1. Given a search problem P := ⟨S , A, T , I , G ⟩, the tree search
algorithm consists of the simulated exploration of state space ⟨S, TA ⟩ in a search
tree formed by successively expanding already explored states. (offline algorithm)
procedure Tree−Search (problem, strategy) : <a solution or failure>

<initialize the search tree using the initial state of problem>


loop
if <there are no candidates for expansion> <return failure> end if
<choose a leaf node for expansion according to strategy>
if <the node contains a goal state> return <the corresponding solution>
else <expand the node and add the resulting nodes to the search tree>
end if
end loop
end procedure

We expand a node n by generating all successors of n and inserting them as children


of n in the search tree.

Michael Kohlhase: Artificial Intelligence 1 134 2025-02-06

Tree Search: Example

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Michael Kohlhase: Artificial Intelligence 1 135 2025-02-06

Let us now think a bit more about the implementation of tree search algorithms based on the
ideas discussed above. The abstract, mathematical notions of a search problem and the induced
tree search algorithm get further refined here.

Implementation: States vs. nodes


 Recap: A state is a (representation of) a physical configuration.

 Definition 6.3.2 (Implementing a Search Tree).
 A search tree node is a data structure that includes accessors for parent, children,
depth, path cost, insertion order, etc.
 A goal node (initial node) is a search tree node labeled with a goal state (initial
state).

[Figure 3.10: Nodes are the data structures from which the search tree is constructed. Each
has a parent, a state, and various bookkeeping fields. Arrows point from child to parent.]

 Observation: A set of search tree nodes that can all (recursively) reach a single
initial node form a search tree. (they implement it)
 Observation: Paths in the search tree correspond to paths in the state space.
 Definition 6.3.3. We define the path cost of a node n in a search tree T to be
the sum of the step costs on the path from n to the root of T .
 Observation: As a search tree node has access to parents, we can read off the
solution from a goal node.

Michael Kohlhase: Artificial Intelligence 1 136 2025-02-06

It is very important to understand the fundamental difference between a state in a search problem,
a node in the search tree employed by the tree search algorithm, and the implementation in a
search tree node. The implementation above is faithful in the sense that the implemented data
structures contain all the information needed in the tree search algorithm.

So we can use it to refine the idea of a tree search algorithm into an implementation.
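As a small illustrative sketch of what such a search tree node could look like in code (hypothetical
names, not the notes’ reference implementation):

# Illustrative sketch of search tree nodes (cf. Definition 6.3.2 / 6.3.3).
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Node:
    state: object
    parent: Optional["Node"] = None
    action: Optional[str] = None
    path_cost: float = 0.0          # sum of step costs from the root to this node

def child(parent: Node, action: str, state: object, step_cost: float = 1.0) -> Node:
    return Node(state, parent, action, parent.path_cost + step_cost)

def solution(goal_node: Node) -> List[str]:
    """Read off the action sequence by following parent pointers back to the root."""
    actions = []
    n = goal_node
    while n.parent is not None:
        actions.append(n.action)
        n = n.parent
    return list(reversed(actions))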

Implementation of Search Algorithms


 Definition 6.3.4 (Implemented Tree Search Algorithm).
procedure Tree_Search (problem,strategy)
fringe := insert(make_node(initial_state(problem)))
loop
if empty(fringe) fail end if
node := first(fringe,strategy)
if GoalTest(node) return node
else fringe := insert(expand(node,problem))
end if
end loop
end procedure

The fringe is the set of search tree nodes not yet expanded in tree search.
 Idea: We treat the fringe as an abstract data type with three accessors: the
 binary function first retrieves an element from the fringe according to a strategy.
 binary function insert adds a (set of) search tree node into a fringe.
 unary predicate empty to determine whether a fringe is the empty set.
 The strategy determines the behavior of the fringe (data structure) (see below)

Michael Kohlhase: Artificial Intelligence 1 137 2025-02-06


6.4. UNINFORMED SEARCH STRATEGIES 99

Note: The pseudocode in ?? is still relatively underspecified – it leaves many implementation
details open. Here are the specifications of the functions used above:

• make_node constructs a search tree node from a state.
• initial_state accesses the initial state of a search problem.
• State returns the state associated with its argument.
• GoalTest checks whether its argument is a goal node.
• expand creates new search tree nodes for all successor states of its argument.
Essentially, only the function first is non-trivial (as the strategy argument shows). In fact it is
the only place where the strategy is used in the algorithm.
An alternative implementation would have been to make the fringe a queue, and have insert order
the fringe as the strategy sees fit. Then first can just return the first element of the queue. This
would have led to a different signature, possibly different runtimes, but the same overall result
of the algorithm.
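One possible Python rendering of this scheme – a sketch under the assumptions just discussed,
not the course’s reference implementation; a `problem` object with `initial_state`, `is_goal`, and
`expand` (yielding (action, successor) pairs) is assumed.

# Illustrative sketch of the implemented tree search algorithm.
def tree_search(problem, strategy):
    """Generic tree search; `strategy` picks the index of the next fringe node."""
    fringe = [{"state": problem.initial_state, "parent": None, "action": None}]
    while fringe:
        node = fringe.pop(strategy(fringe))        # the only strategy-dependent step
        if problem.is_goal(node["state"]):
            return solution(node)
        for action, succ in problem.expand(node["state"]):
            fringe.append({"state": succ, "parent": node, "action": action})
    return None                                    # failure

def solution(node):
    """Read off the action sequence via the parent pointers."""
    actions = []
    while node["parent"] is not None:
        actions.append(node["action"])
        node = node["parent"]
    return list(reversed(actions))

# Strategies are just fringe disciplines:
bfs = lambda fringe: 0        # FIFO: oldest node first (breadth first search)
dfs = lambda fringe: -1       # LIFO: newest node first (depth first search)

Note how the uninformed strategies of the next section arise purely from the ordering of the
fringe; the loop itself stays unchanged.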

Search strategies
 Definition 6.3.5. A strategy is a function that picks a node from the fringe of a
search tree. (equivalently, orders the fringe and picks the first.)

 Definition 6.3.6 (Important Properties of Strategies).

completeness does it always find a solution if one exists?


time complexity number of nodes generated/expanded
space complexity maximum number of nodes in memory
optimality does it always find a least cost solution?

 Time and space complexity measured in terms of:

b maximum branching factor of the search tree


d minimal graph depth of a solution in the search tree
m maximum graph depth of the search tree (may be ∞)

Complexity means here always worst-case complexity!

Michael Kohlhase: Artificial Intelligence 1 138 2025-02-06

Note that there can be infinite branches, see the search tree for Romania.

6.4 Uninformed Search Strategies


Video Nuggets covering this section can be found at https://fau.tv/clip/id/21994 and
https://fau.tv/clip/id/21995.

Uninformed search strategies


 Definition 6.4.1. We speak of an uninformed search algorithm, if it only uses the
information available in the problem definition.
100 CHAPTER 6. PROBLEM SOLVING AND SEARCH

 Next: Frequently used search algorithms


 Breadth first search
 Uniform cost search
 Depth first search
 Depth limited search
 Iterative deepening search

Michael Kohlhase: Artificial Intelligence 1 139 2025-02-06

The opposite of uninformed search is informed or heuristic search that uses a heuristic function
that adds external guidance to the search process. In the Romania example, one could add the
heuristic to prefer cities that lie in the general direction of the goal (here SE).
Even though heuristic search is usually much more efficient, uninformed search is important
nonetheless, because many problems do not allow us to extract good heuristics.

6.4.1 Breadth-First Search Strategies

Breadth-First Search
 Idea: Expand the shallowest unexpanded node.

 Definition 6.4.2. The breadth first search (BFS) strategy treats the fringe as a
FIFO queue, i.e. successors go in at the end of the fringe.
 Example 6.4.3 (Synthetic).

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O
6.4. UNINFORMED SEARCH STRATEGIES 101

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

Michael Kohlhase: Artificial Intelligence 1 140 2025-02-06

We will now apply the breadth first search strategy to our running example: Traveling in Romania.
Note that we leave out the green dashed nodes that allow us a preview of what the search tree
will look like (if expanded). This gives a much cleaner picture; we assume that the readers have
already grasped the mechanism sufficiently.

Breadth-First Search: Romania


 Example 6.4.4.

Arad
102 CHAPTER 6. PROBLEM SOLVING AND SEARCH

Arad

Sibiu Timisoara Zerind

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Michael Kohlhase: Artificial Intelligence 1 141 2025-02-06

Breadth-first search: Properties

Completeness      Yes (if b is finite)
Time complexity   1 + b + b^2 + b^3 + . . . + b^d , so O(b^d ), i.e. exponential in d
Space complexity  O(b^d ) (fringe may be whole level)
Optimality        Yes (if cost = 1 per step), not optimal in general

 Disadvantage: Space is the big problem (can easily generate nodes at


500MB/sec =b 1.8TB/h)
 Optimal?: No! If cost varies for different steps, there might be better solutions
below the level of the first one.

 An alternative is to generate all solutions and then pick an optimal one. This works
only, if m is finite.

Michael Kohlhase: Artificial Intelligence 1 142 2025-02-06

The next idea is to let cost drive the search. For this, we will need a non-trivial cost function: we
will take the distance between cities, since this is very natural. Alternatives would be the driving
time, train ticket cost, or the number of tourist attractions along the way.
Of course we need to update our problem formulation with the necessary information.
6.4. UNINFORMED SEARCH STRATEGIES 103

68 Romania with Step Costs as Distances


Chapter 3. Solving Problems by Searching

Oradea
71
Neamt

Zerind 87
75 151
Iasi
Arad
140
92
Sibiu Fagaras
99
118
Vaslui
80
Rimnicu Vilcea
Timisoara
142
111 Pitesti 211
Lugoj 97
70 98
85 Hirsova
Mehadia 146 101 Urziceni
75 138 86
Bucharest
Drobeta 120
90
Craiova Eforie
Giurgiu

Figure 3.2 A simplified road map of part of Romania.


Michael Kohlhase: Artificial Intelligence 1 143 2025-02-06

Sometimes the goal is specified by an abstract property rather than an explicitly enumer-
ated set of states. For example, in chess, the goal is to reach a state called “checkmate,”
Uniform-cost search
where the opponent’s king is under attack and can’t escape.
PATH COST • A path cost function that assigns a numeric cost to each path. The problem-solving
 Idea: agent chooses a cost function that reflects its own performance measure. For the agent
Expand least cost unexpanded node.
trying to get to Bucharest, time is of the essence, so the cost of a path might be its length
in kilometers.
 Definition 6.4.5.InUniform-cost
this chapter, we search
assume that the cost
(UCS) of a path
is the can bewhere
strategy described
theasfringe
the is
sum of the costs of the individual actions along the path.3 The step cost of taking action
STEP COST
ordered by increasing path cost. ! !
a in state s to reach state s is denoted by c(s, a, s ). The step costs for Romania are
shown in Figure to
 Note: Equivalent 3.2breadth
as route distances. We assume
first search that step
if all step costs
costs nonnegative.4
areareequal.
The preceding elements define a problem and can be gathered into a single data structure
 Synthetic
that is givenExample:
as input to a problem-solving algorithm. A solution to a problem is an action
sequence that leads from the initial state to a goal state. Solution quality is measured by the
OPTIMAL SOLUTION path cost function, and an optimal solution has the lowest path cost among all solutions.
Arad

3.1.2 Formulating problems


Arad
In the preceding section we proposed a formulation of the problem of getting to Bucharest in
terms of the initial state, actions,
140 transition model,
118 goal test,
75 and path cost. This formulation
seems reasonable, but it is still a model—an abstract mathematical description—and not the
Sibiu Timisoara Zerind
3 This assumption is algorithmically convenient but also theoretically justifiable—see page 649 in Chapter 17.
4 The implications of negative costs are explored in Exercise 3.8.

Arad
140 118 75
Sibiu Timisoara Zerind
71 75

Oradea Arad
104 CHAPTER 6. PROBLEM SOLVING AND SEARCH

Arad
140 118 75
Sibiu Timisoara Zerind
118 111 71 75

Arad Lugoj Oradea Arad

Arad
140 118 75
Sibiu Timisoara Zerind
140 99 151 80 118 111 71 75

Arad Fagaras Oradea R. Vilcea Arad Lugoj Oradea Arad

Michael Kohlhase: Artificial Intelligence 1 144 2025-02-06

Note that we must sum the distances to each leaf. That is, we go back to the first level after the
third step.

Uniform-cost search: Properties

Completeness Yes (if step costs ≥ ϵ > 0)


Time complexity number of nodes with path cost less than that of opti-
mal solution
Space complexity ditto
Optimality Yes

Michael Kohlhase: Artificial Intelligence 1 145 2025-02-06

If step cost is negative, the same situation as in breadth first search can occur: later solutions may
be cheaper than the current one.
If step cost is 0, one can run into infinite branches. UCS then degenerates into depth first
search, the next kind of search algorithm we will encounter. Even if we have infinite branches,
where the sum of step costs converges, we can get into trouble, since the search is forced down
these infinite paths before a solution can be found.
Worst case is often worse than BFS, because large trees with small steps tend to be searched
first. If step costs are uniform, it degenerates to BFS.
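To make the ordering by path cost concrete, here is a small sketch of uniform cost search with a
priority queue (illustrative only; `initial_state`, `is_goal`, and an `expand` yielding
(action, successor, step cost) triples are assumed – they are not part of the notes).

# Illustrative sketch of uniform cost search with a priority queue.
import heapq

def uniform_cost_search(problem):
    frontier = [(0.0, 0, problem.initial_state, [])]   # (path cost, tiebreak, state, actions)
    counter = 1
    while frontier:
        cost, _, state, actions = heapq.heappop(frontier)   # least path cost first
        if problem.is_goal(state):
            return actions, cost
        for action, succ, step in problem.expand(state):
            heapq.heappush(frontier, (cost + step, counter, succ, actions + [action]))
            counter += 1
    return None

The counter only breaks ties between equal path costs, so that states themselves never need to
be comparable.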

6.4.2 Depth-First Search Strategies

Depth-first Search
 Idea: Expand deepest unexpanded node.

 Definition 6.4.6. Depth-first search (DFS) is the strategy where the fringe is
organized as a (LIFO) stack i.e. successors go in at front of the fringe.
 Definition 6.4.7. Every node that is pushed to the stack is called a backtrack
point. The action of popping a non-goal node from the stack and continuing the
search with the new top element of the stack (a backtrack point by construction)
is called backtracking, and correspondingly the DFS algorithm backtracking search.
6.4. UNINFORMED SEARCH STRATEGIES 105

 Note: Depth first search can perform infinite cyclic excursions


Need a finite, non cyclic state space (or repeated state checking)

Michael Kohlhase: Artificial Intelligence 1 146 2025-02-06

Depth-First Search
 Example 6.4.8 (Synthetic).

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O
106 CHAPTER 6. PROBLEM SOLVING AND SEARCH

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O
6.4. UNINFORMED SEARCH STRATEGIES 107

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

B C

D E F G

H I J K L M N O

Michael Kohlhase: Artificial Intelligence 1 147 2025-02-06


108 CHAPTER 6. PROBLEM SOLVING AND SEARCH

Depth-First Search: Romania


 Example 6.4.9 (Romania).

Arad

Arad

Sibiu Timisoara Zerind

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea

Arad

Sibiu Timisoara Zerind

Arad Fagaras Oradea R. Vilcea

Sibiu Timisoara Zerind

Michael Kohlhase: Artificial Intelligence 1 148 2025-02-06

Depth-first search: Properties

Completeness Yes: if search tree finite


No: if search tree contains infinite paths or
loops
Time complexity   O(b^m )   (we need to explore until max depth m in any case!)
Space complexity  O(b·m) (i.e. linear space)   (need to store at most m levels and at each
                  level at most b nodes)
Optimality No (there can be many better solutions in the
unexplored part of the search tree)

 Disadvantage: Time terrible if m much larger than d.


 Advantage: Time may be much less than breadth first search if solutions are
dense.
6.4. UNINFORMED SEARCH STRATEGIES 109

Michael Kohlhase: Artificial Intelligence 1 149 2025-02-06

Iterative deepening search


 Definition 6.4.10. Depth limited search is depth first search with a depth limit.

 Definition 6.4.11. Iterative deepening search (IDS) is depth limited search with
ever increasing depth limits. We call the difference between successive depth limits
the step size.

 procedure Tree_Search (problem)


<initialize the search tree using the initial state of problem>
for depth = 0 to ∞
result := Depth_Limited_search(problem,depth)
if result ̸= cutoff return result end if
end for
end procedure

Michael Kohlhase: Artificial Intelligence 1 150 2025-02-06

Illustration: Iterative Deepening Search at various Limit Depths

A A

A A A A

B C B C B C B C

A A A A

B C B C B C B C

D E F G D E F G D E F G D E F G

A A A A

B C B C B C B C

D E F G D E F G D E F G D E F G
110 CHAPTER 6. PROBLEM SOLVING AND SEARCH

Michael Kohlhase: Artificial Intelligence 1 151 2025-02-06

Iterative deepening search: Properties

Completeness Yes
Time complexity (d+1)·b^0 + d·b^1 + (d−1)·b^2 + . . . + b^d ∈ O(b^(d+1) )

Space complexity O(b · d)
Optimality Yes (if step cost = 1)

 Consequence: IDS is used in practice for search spaces of large, infinite, or unknown
depth.

Michael Kohlhase: Artificial Intelligence 1 152 2025-02-06

Note: To find a solution (at depth d) we have to search the whole tree up to d. Of course since
we do not save the search state, we have to re-compute the upper part of the tree for the next
level. This seems like a great waste of resources at first, however, IDS tries to be complete without
the space penalties.
However, the space complexity is as good as DFS, since we are using DFS along the way. Like
in BFS, the whole tree on level d (of optimal solution) is explored, so optimality is inherited from
there. Like BFS, one can modify this to incorporate uniform cost search behavior.
As a consequence, variants of IDS are the method of choice if we do not have additional
information.
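As an illustrative sketch (not the notes’ reference code), depth limited search can signal with a
special cutoff marker that the limit was hit, and iterative deepening simply reruns it with growing
limits; the `problem` interface is assumed as in the earlier sketches.

# Illustrative sketch of depth limited search and iterative deepening search.
CUTOFF = object()   # marker: the depth limit was reached somewhere below

def depth_limited_search(problem, state, limit, path=()):
    if problem.is_goal(state):
        return list(path)
    if limit == 0:
        return CUTOFF
    cutoff_occurred = False
    for action, succ in problem.expand(state):
        result = depth_limited_search(problem, succ, limit - 1, path + (action,))
        if result is CUTOFF:
            cutoff_occurred = True
        elif result is not None:
            return result
    return CUTOFF if cutoff_occurred else None

def iterative_deepening_search(problem, max_depth=50):
    for depth in range(max_depth + 1):        # ever increasing depth limits (step size 1)
        result = depth_limited_search(problem, problem.initial_state, depth)
        if result is not CUTOFF:
            return result                     # a solution, or None if none exists up to depth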

Comparison BFS (optimal) and IDS (not)


 Example 6.4.12. IDS may fail to be optimal at step sizes > 1.

[Figure: side-by-side search trees comparing breadth first search and iterative deepening search.]

Michael Kohlhase: Artificial Intelligence 1 153 2025-02-06

6.4.3 Further Topics

Tree Search vs. Graph Search


 We have only covered tree search algorithms.
 States duplicated in nodes are a huge problem for efficiency.

 Definition 6.4.13. A graph search algorithm is a variant of a tree search algorithm


that prunes nodes whose state has already been considered (duplicate pruning),
essentially using a DAG data structure.
 Observation 6.4.14. Tree search is memory intensive – it has to store the fringe – so
keeping a list of “explored states” does not lose much.

 Graph versions of all the tree search algorithms considered here exist, but are more
difficult to understand (and to prove properties about).
 The (time complexity) properties are largely stable under duplicate pruning. (no
gain in the worst case)

 Definition 6.4.15. We speak of a search algorithm, when we do not want to


distinguish whether it is a tree or graph search algorithm. (difference considered an
implementation detail)

Michael Kohlhase: Artificial Intelligence 1 154 2025-02-06
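A sketch of duplicate pruning (illustrative only; same assumed `problem` interface as before):
the only change to the breadth first tree search loop is a set of already explored states.

# Illustrative sketch: graph search = tree search + duplicate pruning.
from collections import deque

def graph_search_bfs(problem):
    frontier = deque([(problem.initial_state, [])])
    explored = set()                                  # states already considered
    while frontier:
        state, actions = frontier.popleft()
        if problem.is_goal(state):
            return actions
        if state in explored:
            continue                                  # duplicate pruning
        explored.add(state)
        for action, succ in problem.expand(state):
            if succ not in explored:
                frontier.append((succ, actions + [action]))
    return None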

Uninformed Search Summary


 Tree/Graph Search Algorithms: Systematically explore the state tree/graph
112 CHAPTER 6. PROBLEM SOLVING AND SEARCH

induced by a search problem in search of a goal state. Search strategies only differ
by the treatment of the fringe.
 Search Strategies and their Properties: We have discussed

Criterion         Breadth first   Uniform cost      Depth first   Iterative deepening
Completeness      Yes^1           Yes^2             No            Yes
Time complexity   b^d             ≈ b^d             b^m           b^(d+1)
Space complexity  b^d             ≈ b^d             b^m           b·d
Optimality        Yes∗            Yes               No            Yes∗
Conditions        ^1 b finite     ^2 0 < ϵ ≤ cost

Michael Kohlhase: Artificial Intelligence 1 155 2025-02-06

Search Strategies; the XKCD Take

 More Search Strategies?: (from https://xkcd.com/2407/)

Michael Kohlhase: Artificial Intelligence 1 156 2025-02-06

6.5 Informed Search Strategies

Summary: Uninformed Search/Informed Search


 Problem formulation usually requires abstracting away real-world details to define
a state space that can feasibly be explored.
 Variety of uninformed search strategies.
 Iterative deepening search uses only linear space and not much more time than
6.5. INFORMED SEARCH STRATEGIES 113

other uninformed algorithms.


 Next Step: Introduce additional knowledge about the problem (heuristic search)
 Best-first-, A∗ -strategies (guide the search by heuristics)
 Iterative improvement algorithms.
 Definition 6.5.1. A search algorithm is called informed, iff it uses some form of
external information – that is not part of the search problem – to guide the search.

Michael Kohlhase: Artificial Intelligence 1 157 2025-02-06

6.5.1 Greedy Search


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22015.

Best-first search
 Idea: Order the fringe by estimated “desirability” (Expand most desirable
unexpanded node)
 Definition 6.5.2. An evaluation function assigns a desirability value to each node
of the search tree.

 Note: An evaluation function is not part of the search problem, but must be added
externally.
 Definition 6.5.3. In best first search, the fringe is a queue sorted in decreasing
order of desirability.

 Special cases: Greedy search, A∗ search

Michael Kohlhase: Artificial Intelligence 1 158 2025-02-06

This is like UCS, but with an evaluation function related to the problem at hand replacing the path
cost function.
If the heuristic is arbitrary, we must expect incompleteness!
Whether the search behaves well depends on how we measure “desirability”.
Concrete examples follow.

Greedy search
 Idea: Expand the node that appears to be closest to the goal.
 Definition 6.5.4. A heuristic is an evaluation function h on states that estimates
the cost from the given state to the nearest goal state. We speak of heuristic search if the search
algorithm uses a heuristic in some way.
 Note: All nodes for the same state must have the same h-value!
 Definition 6.5.5. Given a heuristic h, greedy search is the strategy where the
fringe is organized as a queue sorted by increasing h value.

 Example 6.5.6. Straight-line distance from/to Bucharest.



 Note: Unlike uniform cost search the node evaluation function has nothing to do
with the nodes expanded so far

internal search control ; external search control


partial solution cost ; goal cost estimation

Michael Kohlhase: Artificial Intelligence 1 159 2025-02-06

In greedy search we replace the objective cost of constructing the current solution with a heuristic,
i.e. a subjective measure that we think gives a good idea of how far we are from a solution. Two
things have shifted:

• we went from an internal cost (determined only by features inherent in the search space) to an
external/heuristic cost,
• instead of measuring the cost to build the current partial solution, we estimate how far we are
from the desired goal.

A minimal implementation sketch of this strategy is given below.
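The following Python sketch (illustrative only, not from the lecture materials; the argument names
successors and h are assumptions about how the problem is encoded) shows greedy best-first search in a
directly runnable form: the fringe is a priority queue ordered by h alone, and the accumulated path
cost plays no role in the ordering.

import heapq, itertools

def greedy_best_first_search(initial, is_goal, successors, h):
    """Greedy best-first search: always expand the node with the smallest h-value.

    successors(state) yields (action, next_state) pairs; h(state) estimates the
    remaining cost to the nearest goal. Returns a list of actions or None.
    """
    counter = itertools.count()                    # tie-breaker, so states need not be comparable
    fringe = [(h(initial), next(counter), initial, [])]
    explored = set()                               # repeated state checking
    while fringe:
        _, _, state, path = heapq.heappop(fringe)
        if is_goal(state):
            return path
        if state in explored:
            continue
        explored.add(state)
        for action, nxt in successors(state):
            if nxt not in explored:
                heapq.heappush(fringe, (h(nxt), next(counter), nxt, path + [action]))
    return None                                    # no goal reachable

Replacing h(nxt) in the priority by the accumulated path cost would turn this back into uniform cost
search; that is exactly the shift described above.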

Romania with Straight-Line Distances


 Example 6.5.7 (Informed Travel). hSLD (n) = straight-line distance to Bucharest:

Arad      366   Mehadia          241   Bucharest    0    Neamt     234
Craiova   160   Oradea           380   Drobeta      242  Pitesti   100
Eforie    161   Rimnicu Vilcea   193   Fagaras      176  Sibiu     253
Giurgiu    77   Timisoara        329   Hirsova      151  Urziceni   80
Lugoj     244   Iasi             226   Vaslui       199  Zerind    374

[Figure 3.2: A simplified road map of part of Romania, with route distances between cities.]

Michael Kohlhase: Artificial Intelligence 1 160 2025-02-06


Greedy Search: Romania
Arad
366

Arad
366
Sibiu Timisoara Zerind
253 329 374

Arad
366
Sibiu Timisoara Zerind
253 329 374
Arad Fagaras Oradea R. Vilcea

366 176 380 193

Arad
366
Sibiu Timisoara Zerind
253 329 374
Arad Fagaras Oradea R. Vilcea

366 176 380 193

Sibiu Bucharest

253 0

Michael Kohlhase: Artificial Intelligence 1 161 2025-02-06

Let us fortify our intuitions with another example: navigation in a simple maze. Here the states
are the cells in the grid underlying the maze, and the actions move to one of the adjoining
cells. The initial and goal states are the upper left and lower right corners of the grid. To see the
influence of the chosen heuristic (indicated by the number in each cell), we compare the search-
induced goal distance function with a heuristic based on the Manhattan distance. To trace the
greedy search, simply follow the heuristic gradient.
Heuristic Functions in Path Planning

 Example 6.5.8 (The maze solved). We indicate h∗ by giving the goal distance in
each cell:

[Figure: a maze on a 15×5 grid; each free cell is annotated with its goal distance to the goal
cell G in the lower right corner.]

 Example 6.5.9 (Maze Heuristic: The good case). We use the Manhattan
distance to the goal as a heuristic:

[Figure: the same maze; each free cell is annotated with its Manhattan distance to the goal
cell G.]

 Example 6.5.10 (Maze Heuristic: The bad case). We use the Manhattan
distance to the goal as a heuristic again:

[Figure: a different maze with a long dead-end corridor leading towards the goal; each free
cell is annotated with its Manhattan distance to the goal cell G.]

Michael Kohlhase: Artificial Intelligence 1 162 2025-02-06

Not surprisingly, the first maze is searchless, since we are guided by the perfect heuristic. In cases
where there is a choice, this has no influence on the length (or in other cases cost) of the
solution.
In the “good case” example, greedy search performs well, but there is some limited backtracking
needed, for instance when exploring the left lower corner 3×3 area before climbing over the second
wall.
In the “bad case”, greedy search is led down the lower garden path, which has a dead end, and
does not lead to the goal. This suggests that we can construct adversarial examples, i.e.
example mazes where we can force greedy search into arbitrarily bad performance.

Greedy search: Properties

  Completeness       No: can get stuck in infinite loops.
                     Complete in finite state spaces with repeated
                     state checking.
  Time complexity    O(b^m)
  Space complexity   O(b^m)
  Optimality         No

 Example 6.5.11. Greedy search can get stuck going from Iasi to Oradea:
Iasi → Neamt → Iasi → Neamt → · · ·
[Figure 3.2: A simplified road map of part of Romania, with route distances (repeated from above).]

 Worst-case Time: Same as depth first search.

 Worst-case Space: Same as breadth first search. (⇝ repeated state checking)

 But: A good heuristic can give dramatic improvements.

Michael Kohlhase: Artificial Intelligence 1 163 2025-02-06
Remark 6.5.12. Greedy search is similar to UCS. Unlike the latter, the node evaluation function
has nothing to do with the nodes explored so far. This can prevent nodes from being enumerated
systematically as they are in UCS and BFS.
For completeness, we need repeated state checking as the example shows. This enforces complete
enumeration of the state space (provided that it is finite), and thus gives us completeness.

Note that nothing prevents all nodes from being searched in the worst case; e.g. if the heuristic
function gives us the same (low) estimate on all nodes except where the heuristic mis-estimates
the distance to be high. So in the worst case, greedy search is even worse than BFS, where d
(depth of first solution) replaces m.
The search procedure cannot be optimal, since the actual cost of the solution is not considered.
For both completeness and optimality, therefore, it is necessary to take the actual cost of
partial solutions, i.e. the path cost, into account. This way, paths that are known to be expensive
are avoided.

6.5.2 Heuristics and their Properties


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22019.

Heuristic Functions
 Definition 6.5.13. Let Π be a search problem with states S. A heuristic function
(or short heuristic) for Π is a function h : S → R₀⁺ ∪ {∞} so that h(s) = 0 whenever
s is a goal state.

 h(s) is intended as an estimate of the distance between state s and the nearest goal
state.
 Definition 6.5.14. Let Π be a search problem with states S, then the function
h∗ : S → R₀⁺ ∪ {∞}, where h∗(s) is the cost of a cheapest path from s to a goal
state, or ∞ if no such path exists, is called the goal distance function for Π.

 Notes:
 h(s) = 0 on goal states: If your estimator returns “I think it’s still a long way”
on a goal state, then its intelligence is, um . . .

 Return value ∞: To indicate dead ends, from which the goal state can’t be
reached anymore.
 The distance estimate depends only on the state s, not on the node (i.e., the
path we took to reach s).

Michael Kohlhase: Artificial Intelligence 1 164 2025-02-06
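For small, explicitly given search problems one can actually compute the goal distance function h∗ from
Definition 6.5.14 and compare a candidate heuristic h against it. The following Python sketch is
illustrative only (not from the lecture materials); the transitions dictionary encoding and the function
name goal_distance are assumptions. It runs Dijkstra's algorithm backwards from the goal states.

import heapq, itertools

def goal_distance(states, transitions, goals):
    """Compute h*(s) for every state of an explicit search problem.

    transitions maps each state s to a list of (successor, cost) pairs; goals is a
    collection of goal states. States that cannot reach a goal keep h*(s) = infinity,
    the dead-end value from the definition above.
    """
    reverse = {s: [] for s in states}              # reversed transition relation
    for s, succs in transitions.items():
        for t, c in succs:
            reverse[t].append((s, c))
    hstar = {s: float("inf") for s in states}
    tie = itertools.count()
    queue = []
    for g in goals:
        hstar[g] = 0.0
        heapq.heappush(queue, (0.0, next(tie), g))
    while queue:
        d, _, s = heapq.heappop(queue)
        if d > hstar[s]:
            continue                               # outdated queue entry
        for p, c in reverse[s]:
            if d + c < hstar[p]:
                hstar[p] = d + c
                heapq.heappush(queue, (d + c, next(tie), p))
    return hstar

A heuristic h can then be compared state by state against the returned h∗ values to judge how
informative (and how optimistic) it is; of course this is only feasible when the state space is small.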

Where does the word “Heuristic” come from?


 Ancient Greek word ϵυρισκϵιν (=
b “I find”) (aka. ϵυρϵκα!)
 Popularized in modern science by George Polya: “How to solve it” [Pól73]
 Same word often used for “rule of thumb” or “imprecise solution method”.

Michael Kohlhase: Artificial Intelligence 1 165 2025-02-06

Heuristic Functions: The Eternal Trade-Off


 “Distance Estimate”? (h is an arbitrary function in principle)
 In practice, we want it to be accurate (aka: informative), i.e., close to the actual
goal distance.
 We also want it to be fast, i.e., a small overhead for computing h.
 These two wishes are in contradiction!
 Example 6.5.15 (Extreme cases).

 h = 0: no overhead at all, completely un-informative.


 h = h∗ : perfectly accurate, overhead =
b solving the problem in the first place.
 Observation 6.5.16. We need to trade off the accuracy of h against the overhead
for computing it.

Michael Kohlhase: Artificial Intelligence 1 166 2025-02-06

Properties of Heuristic Functions


 Definition 6.5.17. Let Π be a search problem with states S and actions A. We
say that a heuristic h for Π is admissible if h(s) ≤ h∗ (s) for all s ∈ S.
We say that h is consistent if h(s) − h(s′ ) ≤ c(a) for all s ∈ S, a ∈ A, and
s′ ∈ T (s, a).
 In other words . . . :

 h is admissible if it is a lower bound on goal distance.


 h is consistent if, when applying an action a, the heuristic value cannot decrease
by more than the cost of a.

Michael Kohlhase: Artificial Intelligence 1 167 2025-02-06

Properties of Heuristic Functions, ctd.


 Let Π be a search problem, and let h be a heuristic for Π. If h is consistent, then
h is admissible.
 Proof: we prove h(s) ≤ h∗ (s) for all s ∈ S by induction over the length of the cheapest
path to a goal node.
1. base case
1.1. h(s) = 0 by definition of heuristic, so h(s) ≤ h∗ (s) as desired.
2. step case
2.1. We assume that h(s′) ≤ h∗(s′) for all states s′ with a cheapest goal node path
of length n.
2.2. Let s be a state whose cheapest goal path has length n+1 and whose first transition
is o = (s,s′).
2.3. By consistency, we have h(s) − h(s′) ≤ c(o) and thus h(s) ≤ h(s′) + c(o).
2.4. By construction, s′ has a cheapest goal path of length n and thus, by induc-
tion hypothesis, h(s′) ≤ h∗(s′).
2.5. By construction, h∗ (s) = h∗ (s′ ) + c(o).
2.6. Together this gives us h(s) ≤ h∗ (s) as desired.

 Consistency is a sufficient condition for admissibility (easier to check)

Michael Kohlhase: Artificial Intelligence 1 168 2025-02-06

Properties of Heuristic Functions: Examples


 Example 6.5.18. Straight line distance is admissible and consistent by the triangle
inequality.
If you drive 100km, then the straight line distance to Rome can’t decrease by more
than 100km.

 Observation: In practice, admissible heuristics are typically consistent.


 Example 6.5.19 (An admissible, but inconsistent heuristic). When traveling
to Rome, let h(Munich) = 300 and h(Innsbruck) = 100. Both values under-estimate the goal
distance, but driving from Munich to Innsbruck (less than 200 km) decreases h by 200.
 Inadmissible heuristics typically arise as approximations of admissible heuristics
that are too costly to compute. (see later)

Michael Kohlhase: Artificial Intelligence 1 169 2025-02-06
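Since consistency is a local condition on single transitions, it can be checked mechanically on small
explicit problems, and by the lemma above this also certifies admissibility. The following Python sketch
is illustrative (not from the lecture materials); the transitions encoding and the roughly 160 km cost
used in the usage example are assumptions made only for illustration of Example 6.5.19.

def is_consistent(transitions, h, goals=()):
    """Check h(s) - h(s') <= c for every recorded transition, and h(g) = 0 on goals.

    transitions maps each state s to a list of (successor, cost) pairs.
    """
    edges_ok = all(h(s) - h(t) <= c
                   for s, succs in transitions.items() for t, c in succs)
    return edges_ok and all(h(g) == 0 for g in goals)

# Example 6.5.19, with a hypothetical cost of 160 km for the Munich -> Innsbruck leg:
transitions = {"Munich": [("Innsbruck", 160)], "Innsbruck": []}
h = {"Munich": 300, "Innsbruck": 100}.get
print(is_consistent(transitions, h))   # False: h drops by 200 > 160 along one action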

6.5.3 A-Star Search


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22020.

A∗ Search: Evaluation Function


 Idea: Avoid expanding paths that are already expensive. (make use of actual cost)
The simplest way to combine heuristic and path cost is to simply add them.

 Definition 6.5.20. The evaluation function for A∗ search is given by f (n) =


g(n) + h(n), where g(n) is the path cost for n and h(n) is the estimated cost to
the nearest goal from n.
 Thus f (n) is the estimated total cost of the path through n to a goal.

 Definition 6.5.21. Best first search with evaluation function g + h is called A∗


search.

Michael Kohlhase: Artificial Intelligence 1 170 2025-02-06

This works, provided that h does not overestimate the true cost to achieve the goal. In other
words, h must be optimistic wrt. the real cost h∗ . If we are too pessimistic, then non-optimal
solutions have a chance.
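As a counterpart to the greedy sketch above, here is a minimal Python version of A∗ search (illustrative
only, not from the lecture materials; successors(state) is assumed to yield (action, successor, cost)
triples). It orders the fringe by f = g + h and remembers the best known g-value per state for duplicate
pruning.

import heapq, itertools

def astar_search(initial, is_goal, successors, h):
    """A* search: best-first search with evaluation function f(n) = g(n) + h(n).

    Returns (path, cost) for the first goal reached, or (None, inf) if there is none.
    """
    counter = itertools.count()
    fringe = [(h(initial), next(counter), 0.0, initial, [])]   # (f, tie, g, state, path)
    best_g = {initial: 0.0}                                    # cheapest known path cost per state
    while fringe:
        f, _, g, state, path = heapq.heappop(fringe)
        if is_goal(state):
            return path, g
        if g > best_g.get(state, float("inf")):
            continue                                           # stale queue entry
        for action, nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(fringe, (g2 + h(nxt), next(counter), g2, nxt, path + [action]))
    return None, float("inf")

With an admissible h this returns an optimal solution (cf. the optimality argument below); with h = 0
it degenerates into uniform cost search.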

A∗ Search: Optimality
 Theorem 6.5.22. A∗ search with admissible heuristic is optimal.

 Proof: We show that sub-optimal nodes are never expanded by A∗


1. Suppose a suboptimal goal node G has been generated; then we are in the
following situation:

[Diagram: from the start node, an unexpanded node n lies on the path to the optimal goal O,
while the suboptimal goal G has been generated in another branch.]

2. Let n be an unexpanded node on a path to an optimal goal node O, then


f (G) = g(G) since h(G) = 0
g(G) > g(O) since G suboptimal
g(O) = g(n) + h∗ (n) n on optimal path
g(n) + h∗ (n) ≥ g(n) + h(n) since h is admissible
g(n) + h(n) = f (n)
3. Thus, f (G) > f (n) and A∗ never expands G.

Michael Kohlhase: Artificial Intelligence 1 171 2025-02-06

A∗ Search Example

Arad
366=0+366

Arad

Sibiu Timisoara Zerind


393=140+253 447=118+329 449=75+374

Arad

Sibiu Timisoara Zerind


447=118+329 449=75+374
Arad Fagaras Oradea R. Vilcea

646=280+366 415=239+176 671=291+380 413=220+193

Arad

Sibiu Timisoara Zerind


447=118+329 449=75+374
Arad Fagaras Oradea R. Vilcea

646=280+366 415=239+176 671=291+380

Craiova Pitesti Sibiu

526=366+160 417=317+100 553=300+253

Arad

Sibiu Timisoara Zerind


447=118+329 449=75+374
Arad Fagaras Oradea R. Vilcea

646=280+366 671=291+380

Sibiu Bucharest Craiova Pitesti Sibiu

591=338+253 450=450+0 526=366+160 417=317+100 553=300+253

Arad

Sibiu Timisoara Zerind


447=118+329 449=75+374
Arad Fagaras Oradea R. Vilcea

646=280+366 671=291+380

Sibiu Bucharest Craiova Pitesti Sibiu

591=338+253 450=450+0 526=366+160 553=300+253

Bucharest Craiova Sibiu

418=418+0 615=455+160 607=414+193

Michael Kohlhase: Artificial Intelligence 1 172 2025-02-06

To extend our intuitions about informed search algorithms to A∗ -search, we take up the maze
examples from above again. We first show the good maze with Manhattan distance again.

Additional Observations (Not Limited to Path Planning)


 Example 6.5.23 (Greedy best-first search, “good case”).
[Figure: the “good case” maze, each free cell annotated with its Manhattan distance to the
goal cell G; greedy best-first search largely follows the heuristic gradient to the goal.]
We will find a solution with little search.

Michael Kohlhase: Artificial Intelligence 1 173 2025-02-06

To compare it to A∗ -search, here is the same maze but now with the numbers in red for the
evaluation function f where h is the Manhattan distance.

Additional Observations (Not Limited to Path Planning)


 Example 6.5.24 (A∗ (g + h), “good case”).

[Figure: the “good case” maze, each free cell annotated with the value of the evaluation
function f = g + h, where h is the Manhattan distance to the goal cell G.]

 In A∗ with a consistent heuristic, g + h always increases monotonically (h
cannot decrease more than g increases).

 We need more search, in the “right upper half”. This is typical: Greedy best
first search tends to be faster than A∗ .


Michael Kohlhase: Artificial Intelligence 1 174 2025-02-06

Let’s now consider the “bad maze” with Manhattan distance again.

Additional Observations (Not Limited to Path Planning)


 Example 6.5.25 (Greedy best-first search, “bad case”).
[Figure: the “bad case” maze, each free cell annotated with its Manhattan distance to the
goal cell G; the heuristic pulls the search into a long dead-end corridor.]

Search will be mis-guided into the “dead-end street”.

Michael Kohlhase: Artificial Intelligence 1 175 2025-02-06

And we compare it to A∗ -search; again the numbers in red are for the evaluation function f .

Additional Observations (Not Limited to Path Planning)


 Example 6.5.26 (A∗ (g + h), “bad case”).

[Figure: the “bad case” maze, each free cell annotated with the value of f = g + h, where
h is the Manhattan distance to the goal cell G.]

We will search less of the “dead-end street”. Sometimes g + h gives better search
guidance than h. (; A∗ is faster there)

Michael Kohlhase: Artificial Intelligence 1 176 2025-02-06

Finally, we compare that with the goal distance function for the “bad maze”. Here we see that the
lower garden path is under-estimated by the evaluation function f , but still large enough to keep
the search out of it, thanks to the admissibility of the Manhattan distance.

Additional Observations (Not Limited to Path Planning)


 Example 6.5.27 (A∗ (g + h) using h∗ ).
[Figure: the “bad case” maze, each free cell annotated with the value of f = g + h using the
perfect heuristic h = h∗ .]

In A∗ , node values always increase monotonically (with any heuristic). If the heuris-
tic is perfect, they remain constant on optimal paths.

Michael Kohlhase: Artificial Intelligence 1 177 2025-02-06



A∗ search: f -contours
 Intuition: A∗ -search gradually adds “f -contours” (areas of the same f -value) to
the search.

Z N

A I
380 S
F
V
400
T R
L P

H
M U
420 B
D
C E
G

Figure 3.25 Map of Romania showing contours at178f = 380, f = 400,


Michael Kohlhase: Artificial Intelligence 1
and f = 420, with
2025-02-06
Arad as the start state. Nodes inside a given contour have f -costs less than or equal to the
contour value.
A∗ search: Properties
 Properties of A∗ -search:

  Completeness       Yes (unless there are infinitely many
                      nodes n with f (n) ≤ f (0))
  Time complexity    Exponential in [relative error in h × length of
                      solution]
  Space complexity   Same as time (variant of BFS)
  Optimality         Yes

 A∗ -search expands all (some/no) nodes with f (n) < h∗ (n)


 The run-time depends on how well we approximated the real cost h∗ with h.

Michael Kohlhase: Artificial Intelligence 1 179 2025-02-06

6.5.4 Finding Good Heuristics


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/22021.
Since the availability of admissible heuristics is so important for informed search (particularly for
A∗ -search), let us see how such heuristics can be obtained in practice. We will look at an example,
and then derive a general procedure from that.

Admissible heuristics: Example 8-puzzle

[Figure 3.4: A typical instance of the 8-puzzle, showing a start state and the goal state.]

 Example 6.5.28. Let h1 (n) be the number of misplaced tiles in node n. (h1 (S) = 9)

 Example 6.5.29. Let h2 (n) be the total Manhattan distance from the desired goal
location of each tile. (h2 (S) = 3 + 1 + 2 + 2 + 2 + 3 + 2 + 2 + 3 = 20)

 Observation 6.5.30 (Typical search costs). (IDS =̂ iterative deepening search)

  nodes explored   IDS         A∗ (h1 )   A∗ (h2 )
  d = 14           3,473,941   539        113
  d = 24           too many    39,135     1,641

Michael Kohlhase: Artificial Intelligence 1 180 2025-02-06

Actually, the crucial difference between the heuristics h1 and h2 is that – not only in the
example configuration above, but for all configurations – the value of the latter is larger than
that of the former. We will explore this next.

Dominance

 Definition 6.5.31. Let h1 and h2 be two admissible heuristics. We say that h2
dominates h1 if h2 (n) ≥ h1 (n) for all n.

 Theorem 6.5.32. If h2 dominates h1 , then h2 is better for search than h1 .

 Proof sketch: If h2 dominates h1 , then h2 is “closer to h∗ ” than h1 , which means
better search performance.
Michael Kohlhase: Artificial Intelligence 1 181 2025-02-06
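For concreteness, here is a small Python sketch (illustrative, not from the lecture materials) of the two
8-puzzle heuristics, for states encoded as 9-tuples in row-major order with 0 for the blank. The sketch
does not count the blank, which is the most common convention; the concrete counts in the examples
above may follow a slightly different convention.

def h1(state, goal):
    """Misplaced tiles: number of tiles (the blank not counted) off their goal square."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    """Manhattan distance: total grid distance of all tiles from their goal squares."""
    goal_pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
    dist = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        row, col = divmod(i, 3)
        grow, gcol = goal_pos[tile]
        dist += abs(row - grow) + abs(col - gcol)
    return dist

Every misplaced tile is at least one move away from its goal square, so h2(s) ≥ h1(s) for every state s;
this is exactly the dominance of h2 over h1 used in the observation above.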

We now try to generalize these insights into (the beginnings of) a general method for obtaining

admissible heuristics.

Relaxed problems
 Observation: Finding good admissible heuristics is an art!
 Idea: Admissible heuristics can be derived from the exact solution cost of a relaxed
version of the problem.

 Example 6.5.33. If the rules of the 8-puzzle are relaxed so that a tile can move
anywhere, then we get heuristic h1 .
 Example 6.5.34. If the rules are relaxed so that a tile can move to any adjacent
square, then we get heuristic h2 . (Manhattan distance)
 Definition 6.5.35. Let Π := ⟨S , A, T , I , G ⟩ be a search problem, then we call
a search problem P r := ⟨S, Ar , T r , I r , G r ⟩ a relaxed problem (wrt. Π; or simply
relaxation of Π), iff A ⊆ Ar , T ⊆ T r , I ⊆ I r , and G ⊆ G r .
 Lemma 6.5.36. If P r relaxes Π, then every solution for Π is one for P r .
 Key point: The optimal solution cost of a relaxed problem is not greater than the
optimal solution cost of the real problem.

Michael Kohlhase: Artificial Intelligence 1 182 2025-02-06

Relaxation means to remove some of the constraints or requirements of the original problem,
so that a solution becomes easy to find. Then the cost of this easy solution can be used as an
optimistic approximation of the cost of the original problem.

Empirical Performance: A∗ in Path Planning


 Example 6.5.37 (Live Demo vs. Breadth-First Search).

See http://qiao.github.io/PathFinding.js/visual/
 Difference to Breadth-first Search?: That would explore all grid cells in a circle
around the initial state!

Michael Kohlhase: Artificial Intelligence 1 183 2025-02-06



6.6 Local Search


Video Nuggets covering this section can be found at https://fau.tv/clip/id/22050 and
https://fau.tv/clip/id/22051.

Systematic Search vs. Local Search


 Definition 6.6.1. We call a search algorithm systematic, if it considers all states
at some point.
 Example 6.6.2. All tree search algorithms (except pure depth first search) are
systematic. (given reasonable assumptions e.g. about costs.)

 Observation 6.6.3. Systematic search algorithms are complete.


 Observation 6.6.4. In systematic search algorithms there is no limit on the number
of nodes that are kept in memory at any time.
 Alternative: Keep only one (or a few) nodes at a time

 ; no systematic exploration of all options, ; incomplete.

Michael Kohlhase: Artificial Intelligence 1 184 2025-02-06

Local Search Problems


 Idea: Sometimes the path to the solution is irrelevant.

 Example 6.6.5 (8 Queens Problem). Place 8


queens on a chess board, so that no two queens
threaten each other.
 This problem has various solutions (the one on
the right isn’t one of them)
 Definition 6.6.6. A local search algorithm is a
search algorithm that operates on a single state,
the current state (rather than multiple paths).
(advantage: constant space)
 Typically local search algorithms only move to successor of the current state, and
do not retain search paths.

 Applications include: integrated circuit design, factory-floor layout, job-shop schedul-


ing, portfolio management, fleet deployment,. . .

Michael Kohlhase: Artificial Intelligence 1 185 2025-02-06

Local Search: Iterative improvement algorithms


 Definition 6.6.7. The traveling salesman problem (TSP) is to find the shortest trip
through a set of cities such that each city is visited exactly once.

 Idea: Start with any complete tour, perform pairwise exchanges

[Figure: a complete tour and the improved tour obtained by one pairwise exchange of edges.]

 Definition 6.6.8. The n-queens problem is to put n queens on an n × n board such
that no two queens are in the same row, column, or diagonal.

 Idea: Move a queen to reduce the number of conflicts

[Figure: a sequence of n-queens boards in which each move reduces the number of conflicts.]

Michael Kohlhase: Artificial Intelligence 1 186 2025-02-06



Hill-climbing (gradient ascent/descent)


 Idea: Start anywhere and go in the direction of the steepest ascent.
 Definition 6.6.9. Hill climbing (also gradient ascent) is a local search algorithm
that iteratively selects the best successor:
procedure Hill−Climbing (problem) /∗ returns a state that is a local maximum ∗/
local current, neighbor /∗ nodes ∗/
current := Make−Node(Initial−State[problem])
loop
neighbor := <a highest−valued successor of current>
if Value[neighbor] < Value[current] return [current] end if
current := neighbor
end loop
end procedure

 Intuition: Like best first search without memory.


 Works, if solutions are dense and local maxima can be escaped.

Michael Kohlhase: Artificial Intelligence 1 187 2025-02-06

In order to understand the procedure on a more intuitive level, let us consider the following
scenario: We are in a dark landscape (or we are blind), and we want to find the highest hill. The
search procedure above tells us to start our search anywhere, and for every step first feel around,
and then take a step in the direction of the steepest ascent. If we reach a place where the
next step would take us down, we are finished.
Of course, this will only get us into local maxima, and there is no guarantee of getting us into
global ones (remember, we are blind). The solution to this problem is to re-start the search at
random places (we do not have any information to guide the choice), and hope that one of the
random jumps will get us to a slope that leads to a global maximum.
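The following Python sketch (illustrative, not from the lecture materials) mirrors the procedure above
for an objective function that is to be maximized, and adds the random-restart variant just described.
The arguments neighbors, value, and random_state are assumptions about how the problem is encoded.

def hill_climbing(initial, neighbors, value):
    """Hill climbing (gradient ascent): repeatedly move to the best-valued successor.

    neighbors(state) returns the successor states, value(state) is the objective to
    maximize; returns a state that is a local maximum of value.
    """
    current = initial
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):
            return current                 # no uphill step left: local maximum (or plateau)
        current = best

def random_restart_hill_climbing(random_state, neighbors, value, restarts=25):
    """Restart from random states and keep the best local maximum found."""
    return max((hill_climbing(random_state(), neighbors, value) for _ in range(restarts)),
               key=value)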

Example Hill Climbing with 8 Queens


 Idea: Consider h =̂ number of queens that threaten each other.

 Example 6.6.10. An 8-queens state with heuristic cost estimate h = 17,
showing h-values for moving a queen within its column:

[Figure: an 8-queens board; each square is annotated with the h-value that results from
moving that column’s queen to it.]

 Problem: The state space has local
minima. e.g. the board on the right
has h = 1 but every successor has
h > 1.

Michael Kohlhase: Artificial Intelligence 1 188 2025-02-06
Hill-climbing
 Problem: Depending on initial
state, can get stuck on local max-
ima/minima and plateaux.

 “Hill-climbing search is like climbing
Everest in thick fog with amnesia”.

[Figure 4.1: A one-dimensional state-space landscape in which elevation corresponds to the
objective function. The aim is to find the global maximum. Hill-climbing search modifies
the current state to try to improve it. Besides the global maximum, the landscape features a
shoulder, a local maximum, and a “flat” local maximum.]

 Idea: Escape local maxima by allowing some “bad” or random moves.
 Example 6.6.11. local search, simulated annealing, . . .

 Properties: All are incomplete, nonoptimal.


 Sometimes performs well in practice (if (optimal) solutions are dense)

Michael Kohlhase: Artificial Intelligence 1 189 2025-02-06

Recent work on hill climbing algorithms tries to combine complete search with randomization to
escape certain odd phenomena occurring in the statistical distribution of solutions.
Simulated annealing (Idea)

 Definition 6.6.12. Ridges are ascending


successions of local maxima.

 Problem: They are extremely difficult to
navigate for local search algorithms.
 Idea: Escape local maxima by allowing
some “bad” moves, but gradually decrease
their size and frequency.

 Annealing is the process of heating steel and letting it cool gradually to give it time to
grow an optimal crystal structure.

[Figure 4.4: Illustration of why ridges cause difficulties for hill climbing. The grid of states
(dark circles) is superimposed on a ridge rising from left to right, creating a sequence of local
maxima that are not directly connected to each other. From each local maximum, all the
available actions point downhill.]


 Simulated annealing is like shaking a ping pong ball occasionally on a bumpy surface
to free it. (so it does not get stuck)

 Devised by Metropolis et al for physical process modelling [Met+53]


 Widely used in VLSI layout, airline scheduling, etc.

Michael Kohlhase: Artificial Intelligence 1 190 2025-02-06

Simulated annealing (Implementation)


 Definition 6.6.13. The following algorithm is called simulated annealing:
procedure Simulated−Annealing (problem,schedule) /∗ returns a solution state ∗/
local current, next /∗ nodes ∗/
local T /∗ a ‘‘temperature’’ controlling the probability of downward steps ∗/
current := Make−Node(Initial−State[problem])
for t := 1 to ∞
T := schedule[t]
if T = 0 return current end if
next := <a randomly selected successor of current>
∆(E) := Value[next]−Value[current]
if ∆(E) > 0 current := next
else
current := next <only with probability> e^(∆(E)/T)
end if
end for
end procedure

A schedule is a mapping from time to “temperature”.

Michael Kohlhase: Artificial Intelligence 1 191 2025-02-06
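A direct Python transcription of this pseudocode might look as follows (a sketch, not from the lecture
materials). The only liberty taken is that the schedule is passed as a function of the time step t rather
than as a table; neighbors and value are assumptions about the problem encoding.

import math, random

def simulated_annealing(initial, neighbors, value, schedule):
    """Simulated annealing for maximizing value; schedule(t) gives the temperature at time t."""
    current = initial
    t = 1
    while True:
        T = schedule(t)
        if T == 0:
            return current
        nxt = random.choice(neighbors(current))
        delta_e = value(nxt) - value(current)
        # accept uphill moves always, downhill moves only with probability e^(dE/T)
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            current = nxt
        t += 1

def example_schedule(t, T0=1.0, alpha=0.001, limit=10_000):
    """One possible cooling schedule: exponential cooling, cut off after `limit` steps."""
    return 0 if t > limit else T0 * math.exp(-alpha * t)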

Properties of simulated annealing


 At fixed “temperature” T , the state occupation probability reaches the Boltzmann
distribution

      p(x) = α e^(E(x)/kT )

If T is decreased slowly enough ; we always reach the best state x∗ , because

      e^(E(x∗ )/kT ) / e^(E(x)/kT ) = e^((E(x∗ )−E(x))/kT ) ≫ 1

for small T .

 Question: Is this necessarily an interesting guarantee?

Michael Kohlhase: Artificial Intelligence 1 192 2025-02-06



Local beam search


 Definition 6.6.14. Local beam search is a search algorithm that keeps k states
instead of 1 and chooses the top k of all their successors.

 Observation: Local beam search is not the same as k searches run in parallel!
(Searches that find good states recruit other searches to join them)
 Problem: Quite often, all k searches end up on the same local hill!
 Idea: Choose k successors randomly, biased towards good ones. (Observe the
close analogy to natural selection!)

Michael Kohlhase: Artificial Intelligence 1 193 2025-02-06
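A minimal Python sketch of the (purely greedy) variant just defined follows; it is illustrative only and
not from the lecture materials. States are assumed hashable, and random_state, neighbors, and value are
assumptions about the problem encoding.

def local_beam_search(k, random_state, neighbors, value, steps=1000):
    """Greedy local beam search: keep the k best of all successors of the current k states.

    For the stochastic variant suggested above, sample the k survivors with probability
    proportional to value instead of taking the top k.
    """
    beam = [random_state() for _ in range(k)]
    for _ in range(steps):
        pool = {s for state in beam for s in neighbors(state)}
        if not pool:
            break
        beam = sorted(pool, key=value, reverse=True)[:k]
    return max(beam, key=value)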

Genetic algorithms (very briefly)


 Definition 6.6.15. A genetic algorithm is a variant of local beam search that
generates successors by

 randomly modifying states (mutation)


 mixing pairs of states (sexual reproduction or crossover)
to optimize a fitness function. (survival of the fittest)
 Example 6.6.16. Generating successors for 8 queens

  (a) Initial Population   (b) Fitness Function   (c) Selection   (d) Crossover   (e) Mutation
  24748552                 24   31%                32752411        32748552        32748152
  32752411                 23   29%                24748552        24752411        24752411
  24415124                 20   26%                32752411        32752124        32252124
  32543213                 11   14%                24415124        24415411        24415417

[Figure 4.6: The genetic algorithm, illustrated for digit strings representing 8-queens states.
The initial population in (a) is ranked by the fitness function in (b), resulting in pairs for
mating in (c). They produce offspring in (d), which are subject to mutation in (e).]

Michael Kohlhase: Artificial Intelligence 1 194 2025-02-06

Genetic algorithms (continued)


 Problem: Genetic algorithms require states encoded as strings.
 Crossover only helps iff substrings are meaningful components.
 Example 6.6.17 (Evolving 8 Queens). First crossover

[Figure 4.7: The 8-queens states corresponding to the first two parents in Figure 4.6(c) and
the first offspring in Figure 4.6(d). The shaded columns are lost in the crossover step and the
unshaded columns are retained.]

 Note: Genetic algorithms ̸= evolution: e.g., real genes also encode replication
machinery!

Michael Kohlhase: Artificial Intelligence 1 195 2025-02-06
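To make Definition 6.6.15 concrete, here is a small, self-contained Python sketch of a genetic algorithm
for the n-queens problem (illustrative, not from the lecture materials). Individuals are digit tuples as
in Figure 4.6, the fitness function counts non-attacking queen pairs (28 for a solved 8-queens board),
and selection probability is proportional to fitness; the parameter defaults are arbitrary assumptions.

import random

def fitness(individual):
    """Number of non-attacking queen pairs; n*(n-1)/2 means the board is solved."""
    n = len(individual)
    attacking = sum(1 for i in range(n) for j in range(i + 1, n)
                    if individual[i] == individual[j]
                    or abs(individual[i] - individual[j]) == j - i)
    return n * (n - 1) // 2 - attacking

def reproduce(x, y):
    """Single-point crossover of two digit-tuple individuals."""
    c = random.randrange(1, len(x))
    return x[:c] + y[c:]

def mutate(individual):
    """Randomly change the row of one queen."""
    n = len(individual)
    s = list(individual)
    s[random.randrange(n)] = random.randrange(1, n + 1)
    return tuple(s)

def genetic_algorithm(pop_size=20, n=8, generations=1000, mutation_rate=0.1):
    target = n * (n - 1) // 2
    population = [tuple(random.randrange(1, n + 1) for _ in range(n))
                  for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(ind) for ind in population]
        if max(fits) == target:                     # a solution is in the population
            break
        weights = [f + 1 for f in fits]             # selection proportional to fitness
        offspring = []
        for _ in range(pop_size):
            x, y = random.choices(population, weights=weights, k=2)
            child = reproduce(x, y)
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)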
Chapter 7

Adversarial Search for Game Playing

A Video Nugget covering this chapter can be found at https://fau.tv/clip/id/22079.

7.1 Introduction
Video Nuggets covering this section can be found at https://fau.tv/clip/id/22060 and
https://fau.tv/clip/id/22061.

The Problem
 The Problem of Game-Play: cf. ??
 Example 7.1.1.

 Definition 7.1.2. Adversarial search =̂ Game playing against an opponent.

Michael Kohlhase: Artificial Intelligence 1 196 2025-02-06

Why Game Playing?


 What do you think?


 Playing a game well clearly requires a form of “intelligence”.


 Games capture a pure form of competition between opponents.
 Games are abstract and precisely defined, thus very easy to formalize.

 Game playing is one of the oldest sub-areas of AI (ca. 1950).


 The dream of a machine that plays chess is, indeed, much older than AI!

“Schachtürke” (1769) “El Ajedrecista” (1912)

Michael Kohlhase: Artificial Intelligence 1 197 2025-02-06

“Game” Playing? Which Games?


 . . . sorry, we’re not gonna do soccer here.

 Definition 7.1.3 (Restrictions). A game in the sense of AI-1 is one where


 Game states are discrete, the number of game states is finite.
 Finite number of possible moves.
 The game state is fully observable.
 The outcome of each move is deterministic.
 Two players: Max and Min.
 Turn-taking: It’s each player’s turn alternatingly. Max begins.
 Terminal game states have a utility u. Max tries to maximize u, Min tries to
minimize u.
 In that sense, the utility for Min is the exact opposite of the utility for Max
(“zero sum”).
 There are no infinite runs of the game (no matter what moves are chosen, a
terminal state is reached after a finite number of moves).

Michael Kohlhase: Artificial Intelligence 1 198 2025-02-06

An Example Game

 Game states: Positions of pieces.


 Moves: Given by rules.

 Players: white (Max), black (Min).


 Terminal states: checkmate.
 Utility of terminal states, e.g.:
 +100 if black is checkmated.
 0 if stalemate.
 −100 if white is checkmated.

Michael Kohlhase: Artificial Intelligence 1 199 2025-02-06

“Game” Playing? Which Games Not?

 Soccer (sorry guys; not even RoboCup)


 Important types of games that we don’t tackle here:

 Chance. (E.g., backgammon)


 More than two players. (E.g., Halma)
 Hidden information. (E.g., most card games)
 Simultaneous moves. (E.g., Diplomacy)
 Not zero-sum, i.e., outcomes may be beneficial (or detrimental) for both players.
(cf. Game theory: Auctions, elections, economy, politics, . . . )
 Many of these more general game types can be handled by similar/extended algo-
rithms.

Michael Kohlhase: Artificial Intelligence 1 200 2025-02-06

(A Brief Note On) Formalization


 Definition 7.1.4. An adversarial search problem is a search problem ⟨S , A, T , I , G ⟩,
where

1. S = S Max ⊎ S Min ⊎ G and A = AMax ⊎ AMin


2. For a ∈ AMax , if s −a→ s′ then s ∈ S Max and s′ ∈ (S Min ∪ G).
3. For a ∈ AMin , if s −a→ s′ then s ∈ S Min and s′ ∈ (S Max ∪ G).

together with a game utility function u : G → R. (the “score” of the game)


 Definition 7.1.5 (Commonly used terminology).
position =̂ state, move =̂ action, end state =̂ terminal state =̂ goal state.
 Remark: A round of the game – one move Max, one move Min – is often referred
to as a “move”, and individual actions as “half-moves” (we don’t in AI-1)

Michael Kohlhase: Artificial Intelligence 1 201 2025-02-06

Why Games are Hard to Solve: I


 What is a “solution” here?

 Definition 7.1.6. Let Θ be an adversarial search problem, and let X ∈ {Max, Min}.
A strategy for X is a function σ X : S X → AX so that a is applicable to s whenever
σ X (s) = a.
 We don’t know how the opponent will react, and need to prepare for all possibilities.

 Definition 7.1.7. A strategy is called optimal if it yields the best possible utility
for X assuming perfect opponent play (not formalized here).
 Problem: In (almost) all games, computing an optimal strategy is infeasible.
(state/search tree too huge)
 Solution: Compute the next move “on demand”, given the current state instead.

Michael Kohlhase: Artificial Intelligence 1 202 2025-02-06

Why Games are hard to solve II

 Example 7.1.8. Number of reachable states in chess: 10^40.


 Example 7.1.9. Number of reachable states in go: 10^100.

 It’s even worse: Our algorithms here look at search trees (game trees), no
duplicate pruning.
 Example 7.1.10.
 Chess without duplicate pruning: 35^100 ≃ 10^154.
 Go without duplicate pruning: 200^300 ≃ 10^690.

Michael Kohlhase: Artificial Intelligence 1 203 2025-02-06

How To Describe a Game State Space?


 Like for classical search problems, there are three possible ways to describe a game:
blackbox/API description, declarative description, explicit game state space.
 Question: Which ones do humans use?

 Explicit ≈ Hand over a book with all 10^40 moves in chess.


 Blackbox ≈ Give possible chess moves on demand but don’t say how they are
generated.
 Answer: Declarative!
With “game description language” =̂ natural language.

Michael Kohlhase: Artificial Intelligence 1 204 2025-02-06

Specialized vs. General Game Playing


 And which game descriptions do computers use?

 Explicit: Only in illustrations.


 Blackbox/API: The description assumed in this chapter.
 Method of choice for all those game players out there in the market (Chess
computers, video game opponents, you name it).
 Programs designed for, and specialized to, a particular game.

 Human knowledge is key: evaluation functions (see later), opening databases


(chess!!), end game databases.
 Declarative: General game playing, active area of research in AI.
 Generic game description language (GDL), based on logic.
 Solvers are given only “the rules of the game”, no other knowledge/input
whatsoever (cf. ??).
 Regular academic competitions since 2005.

Michael Kohlhase: Artificial Intelligence 1 205 2025-02-06

Our Agenda for This Chapter


 Minimax Search: How to compute an optimal strategy?
 Minimax is the canonical (and easiest to understand) algorithm for solving
games, i.e., computing an optimal strategy.
 Evaluation functions: But what if we don’t have the time/memory to solve the
entire game?
 Given limited time, the best we can do is look ahead as far as we can. Evaluation
functions tell us how to evaluate the leaf states at the cut off.

 Alphabeta search: How to prune unnecessary parts of the tree?


 Often, we can detect early on that a particular action choice cannot be part of
the optimal strategy. We can then stop considering this part of the game tree.
 State of the art: What is the state of affairs, for prominent games, of computer
game playing vs. human experts?
 Just FYI (not part of the technical content of this course).

Michael Kohlhase: Artificial Intelligence 1 206 2025-02-06

7.2 Minimax Search


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22061.

“Minimax”?
 We want to compute an optimal strategy for player “Max”.
 In other words: We are Max, and our opponent is Min.

 Recall: We compute the strategy offline, before the game begins.


During the game, whenever it’s our turn, we just look up the corresponding action.
 Idea: Use tree search using an extension û of the utility function u to inner nodes.
û is computed recursively from u during search:

 Max attempts to maximize û(s) of the terminal states reachable during play.
 Min attempts to minimize û(s).
 The computation alternates between minimization and maximization ; hence “min-
imax”.
Michael Kohlhase: Artificial Intelligence 1 207 2025-02-06
Example Tic-Tac-Toe
 Example 7.2.1. A full game tree for tic-tac-toe:

[Figure 5.1: A (partial) game tree for the game of tic-tac-toe. The top node is the initial
state, and MAX moves first, placing an X in an empty square. We show part of the tree, giving
alternating moves by MIN (O) and MAX (X), until we eventually reach terminal states, which
can be assigned utilities according to the rules of the game.]

 Current player and action marked on the left.

 Last row: terminal positions with their utility.

Michael Kohlhase: Artificial Intelligence 1 208 2025-02-06


Minimax: Outline
 We max, we min, we max, we min . . .
1. Depth first search in game tree, with Max in the root.

2. Apply game utility function to terminal positions.


3. Bottom-up for each inner node n in the search tree, compute the utility û(n) of
n as follows:
 If it’s Max’s turn: Set û(n) to the maximum of the utilities of n’s successor
nodes.
 If it’s Min’s turn: Set û(n) to the minimum of the utilities of n’s successor
nodes.
4. Selecting a move for Max at the root: Choose one move that leads to a successor
node with maximal utility.

Michael Kohlhase: Artificial Intelligence 1 209 2025-02-06

Minimax: Example

Max 3

Min 3 Min 2 Min 2

3 12 8 2 4 6 14 5 2

 Blue numbers: Utility function u applied to terminal positions.


 Red numbers: Utilities of inner nodes, as computed by the minimax algorithm.

Michael Kohlhase: Artificial Intelligence 1 210 2025-02-06

The Minimax Algorithm: Pseudo-Code


 Definition 7.2.2. The minimax algorithm (often just called minimax) is given by
the following functions whose argument is a state s ∈ S Max , in which Max is to
move.
function Minimax−Decision(s) returns an action
v := Max−Value(s)
return an action yielding value v in the previous function call
function Max−Value(s) returns a utility value
if Terminal−Test(s) then return u(s)
v := −∞
for each a ∈ Actions(s) do
v := max(v,Min−Value(ChildState(s,a)))
return v
function Min−Value(s) returns a utility value
if Terminal−Test(s) then return u(s)
v := +∞
for each a ∈ Actions(s) do
v := min(v,Max−Value(ChildState(s,a)))
return v

We call nodes, where Max/Min acts Max-nodes/Min-nodes.

Michael Kohlhase: Artificial Intelligence 1 211 2025-02-06
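To complement the pseudocode, here is a minimal executable sketch of minimax in Python (our own illustration, not part of the original notes). It assumes a hypothetical game object that provides terminal_test, utility, actions, and child_state with the same meaning as in the pseudocode above.

def minimax_decision(s, game):
    # Max moves in s: pick an action leading to a successor of maximal Min-value.
    return max(game.actions(s), key=lambda a: min_value(game.child_state(s, a), game))

def max_value(s, game):
    if game.terminal_test(s):
        return game.utility(s)
    v = float("-inf")
    for a in game.actions(s):
        v = max(v, min_value(game.child_state(s, a), game))
    return v

def min_value(s, game):
    if game.terminal_test(s):
        return game.utility(s)
    v = float("inf")
    for a in game.actions(s):
        v = min(v, max_value(game.child_state(s, a), game))
    return v

Here minimax_decision evaluates each root successor with min_value and picks a maximizing one, which is equivalent to calling Max−Value at the root as in the pseudocode.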

Minimax: Example, Now in Detail

[Figure sequence omitted: a step-by-step run of minimax on the example tree. Max-node utilities are initialized to −∞ and Min-node utilities to ∞; the depth first search then updates them as the leaves are visited: the left Min-node becomes 3 (from leaves 3, 12, 8), the middle Min-node 2 (from leaves 2, 4, 6), the right Min-node 2 (from leaves 14, 5, 2), and the root Max-node ends up with utility 3.]

 So which action for Max is returned?


 Leftmost branch.
 Note: The maximal possible pay-off is higher for the rightmost branch, but as-
suming perfect play of Min, it’s better to go left. (Going right would be “relying on
your opponent to do something stupid”.)

Michael Kohlhase: Artificial Intelligence 1 212 2025-02-06

Minimax, Pro and Contra


 Minimax advantages:
 Minimax is the simplest possible (reasonable) search algorithm for games.

(If any of you sat down, prior to this lecture, to implement a Tic-Tac-Toe player,
chances are you either looked this up on Wikipedia, or invented it in the process.)
 Returns an optimal action, assuming perfect opponent play.
 No matter how the opponent plays, the utility of the terminal state reached
will be at least the value computed for the root.
 If the opponent plays perfectly, exactly that value will be reached.

 There’s no need to re-run minimax for every game state: Run it once, offline
before the game starts. During the actual game, just follow the branches taken
in the tree. Whenever it’s your turn, choose an action maximizing the value of
the successor states.
 Minimax disadvantages: It’s completely infeasible in practice.

 When the search tree is too large, we need to limit the search depth and apply
an evaluation function to the cut off states.

Michael Kohlhase: Artificial Intelligence 1 213 2025-02-06

7.3 Evaluation Functions


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22064.
We now address the problem that minimax is infeasible in practice. As so often, the solution is
to eschew optimal strategies and to approximate them. In this case, instead of a computed utility
function, we estimate one that is easy to compute: the evaluation function.

Evaluation Functions for Minimax


 Problem: Search trees are too big to search through in minimax.

 Solution: We impose a search depth limit (also called horizon) d, and apply an
evaluation function to the cut-off states, i.e. states s with dp(s) = d.
 Definition 7.3.1. An evaluation function f maps game states to numbers:
 f (s) is an estimate of the actual value of s (as would be computed by unlimited-
depth minimax for s).
 If cut-off state is terminal: Just use û instead of f .
 Analogy to heuristic functions (cf. ??): We want f to be both (a) accurate and
(b) fast.

 Another analogy: (a) and (b) are in contradiction ; need to trade-off accuracy
against overhead.
 In typical game playing algorithms today, f is inaccurate but very fast.
(usually no good methods known for computing accurate f )

Michael Kohlhase: Artificial Intelligence 1 214 2025-02-06

Example Revisited: Minimax With Depth Limit d = 2



Max 3

Min 3 Min 2 Min 2

3 12 8 2 4 6 14 5 2

 Blue numbers: evaluation function f , applied to the cut-off states at d = 2.


 Red numbers: utilities of inner nodes, as computed by minimax using f .

Michael Kohlhase: Artificial Intelligence 1 215 2025-02-06

Example Chess

 Evaluation function in chess:

 Material: Pawn 1, Knight 3, Bishop 3, Rook 5,


Queen 9.
 3 points advantage ; safe win.
 Mobility: How many fields do you control?
 King safety, Pawn structure, . . .

 Note how simple this is! (probably is not how


Kasparov evaluates his positions)

Michael Kohlhase: Artificial Intelligence 1 216 2025-02-06

Linear Evaluation Functions


 Problem: How to come up with evaluation functions?

 Definition 7.3.2. A common approach is to use a weighted linear function for f ,


i.e. given a sequence of features f i : S →R and a corresponding sequence of weights
wi ∈ R, f is of the form f (s):=w1 · f 1 (s) + w2 · f 2 (s) + · · · + wn · f n (s)
 Problem: How to obtain these weighted linear functions?
 Weights wi can be learned automatically. (learning agent)
 The features f i , however, have to be designed by human experts.
 Note: Very fast, very simplistic.
 Observation: Can be computed incrementally: In a transition s −a→ s′ , adapt f (s)
to f (s′ ) by considering only those features whose values have changed.

Michael Kohlhase: Artificial Intelligence 1 217 2025-02-06

This assumes that the features (their contribution towards the actual value of the state) are
independent. That’s usually not the case (e.g. the value of a rook depends on the pawn struc-
ture).
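
As a concrete (and deliberately simplistic) illustration of such a weighted linear evaluation function, the following Python sketch scores a chess position by material only, using the piece values from the chess slide above; the board representation (a dictionary from squares to piece letters, upper case for Max) is just an assumption made for this example.

# f(s) = w1*f1(s) + ... + wn*fn(s) with material-count features and the weights 1, 3, 3, 5, 9.
WEIGHTS = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def material_eval(board):
    score = 0
    for piece in board.values():
        value = WEIGHTS.get(piece.upper(), 0)        # kings (and anything unknown) count 0
        score += value if piece.isupper() else -value
    return score

# Example: Max (upper case) is a rook up:
# material_eval({"a1": "R", "e1": "K", "e8": "k"}) == 5

Such a function is very fast and, as observed above, can be updated incrementally when a move only changes a few features.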

The Horizon Problem


 Problem: Critical aspects of the game can be cut off by the horizon.
We call this the horizon problem.
 Example 7.3.3.

 Who’s gonna win here?

 White wins (pawn cannot be prevented from


becoming a queen.)
 Black has a +4 advantage in material, so if
we cut-off here then our evaluation function
will say “100%, black wins”.
 The loss for black is “beyond our horizon” un-
less we search extremely deeply: black can
hold off the end by repeatedly giving check to
white’s king.
Black to move

Michael Kohlhase: Artificial Intelligence 1 218 2025-02-06

So, How Deeply to Search?


 Goal: In given time, search as deeply as possible.
 Problem: Very difficult to predict search running time. (need an anytime
algorithm)

 Solution: Iterative deepening search.


 Search with depth limit d = 1, 2, 3, . . .
 When time is up: return result of deepest completed search.

 Definition 7.3.4 (Better Solution). The quiescent search algorithm uses a dy-
namically adapted search depth d: It searches more deeply in unquiet positions,
where the value of the evaluation function changes a lot in neighboring states.
 Example 7.3.5. In quiescent search for chess:
 piece exchange situations (“you take mine, I take yours”) are very unquiet
 ; Keep searching until the end of the piece exchange is reached.

Michael Kohlhase: Artificial Intelligence 1 219 2025-02-06



7.4 Alpha-Beta Search


We have seen that evaluation functions can overcome the combinatorial explosion induced by
minimax search. But we can do even better: certain parts of the minimax search tree can be safely
ignored, since we can prove that they can only yield sub-optimal results. We discuss the technique of
alphabeta-pruning in detail as an example of such pruning methods in search algorithms.

When We Already Know We Can Do Better Than This

[Figure omitted: a path in the game tree from a Max-node (A) via a Min-node with value n down to a Min-node (B) whose subtree contains a Max-node with value m.]

 Say n > m.
 By choosing to go to the left in search node (A), Max already can get utility of at least n in this part of the game.
 So, if “later on” (further down in the same subtree), in search node (B) we already know that Min can force Max to get value m < n.
 Then Max will play differently in (A), so we will never actually get to (B).

Michael Kohlhase: Artificial Intelligence 1 220 2025-02-06

Alpha Pruning: Basic Idea


 Question: Can we save some work here?

Max 3

Min 3 Min 2 Min 2

3 12 8 2 4 6 14 5 2

Michael Kohlhase: Artificial Intelligence 1 221 2025-02-06



Alpha Pruning: Basic Idea (Continued)


 Answer: Yes! We already know at this point that the middle action won’t be
taken by Max.

Max ≥3

Min 3 Min ≤2 Min

3 12 8 2

 Idea: We can use this to prune the search tree ; better algorithm

Michael Kohlhase: Artificial Intelligence 1 222 2025-02-06

Alpha Pruning
 Definition 7.4.1. For each node n in a minimax search tree, the alpha value α(n)
is the highest Max-node utility that search has encountered on its path from the
root to n.
 Example 7.4.2 (Computing alpha values).

[Figure sequence omitted: computing alpha values on the example tree. All nodes start with α = −∞; once the left Min-node has been evaluated to 3, the root Max-node takes utility 3 and α = 3, and this α is passed down when the middle and right Min-nodes are opened. In the middle Min-node the first leaf already yields utility 2 ≤ α = 3.]

 How to use α?: In a Min-node n, if û(n′ ) ≤ α(n) for one of the successors, then
stop considering n. (pruning out its remaining successors)

Michael Kohlhase: Artificial Intelligence 1 223 2025-02-06

Alpha-Beta Pruning
 Recall:
 What is α: For each search node n, the highest Max-node utility that search
has encountered on its path from the root to n.
 How to use α: In a Min-node n, if one of the successors already has utility
≤ α(n), then stop considering n. (Pruning out its remaining successors)

 Idea: We can use a dual method for Min!


 Definition 7.4.3. For each node n in a minimax search tree, the beta value β(n) is
the lowest Min-node utility that search has encountered on its path from the root
to n.

 How to use β: In a Max-node n, if one of the successors already has utility


≥ β(n), then stop considering n. (pruning out its remaining successors)
 . . . and of course we can use α and β together! ; alphabeta-pruning

Michael Kohlhase: Artificial Intelligence 1 224 2025-02-06

Alpha-Beta Search: Pseudocode


 Definition 7.4.4. The alphabeta search algorithm is given by the following pseu-
docode
function Alpha−Beta−Search (s) returns an action
v := Max−Value(s, −∞, +∞)
return an action yielding value v in the previous function call

function Max−Value(s, α, β) returns a utility value


if Terminal−Test(s) then return u(s)
v:= −∞
for each a ∈ Actions(s) do
v := max(v,Min−Value(ChildState(s,a), α, β))
α := max(α, v)
if v ≥ β then return v /∗ Here: v ≥ β ⇔ α ≥ β ∗/
return v

function Min−Value(s, α, β) returns a utility value


if Terminal−Test(s) then return u(s)
v := +∞
for each a ∈ Actions(s) do
v := min(v,Max−Value(ChildState(s,a), α, β))
β := min(β, v)
if v ≤ α then return v /∗ Here: v ≤ α ⇔ α ≥ β ∗/
return v

 =b Minimax (slide 211) + α/β book-keeping and pruning.

Michael Kohlhase: Artificial Intelligence 1 225 2025-02-06

Note: α only gets assigned a value in Max-nodes, and β only gets assigned a value in
Min-nodes.
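
Continuing the hedged Python sketch given after the minimax pseudocode, here are the same functions with the α/β book-keeping and cut-off tests added (the game interface remains an assumption, not part of the notes).

def alpha_beta_search(s, game):
    best, alpha = None, float("-inf")
    for a in game.actions(s):
        v = ab_min_value(game.child_state(s, a), game, alpha, float("inf"))
        if v > alpha:
            alpha, best = v, a
    return best

def ab_max_value(s, game, alpha, beta):
    if game.terminal_test(s):
        return game.utility(s)
    v = float("-inf")
    for a in game.actions(s):
        v = max(v, ab_min_value(game.child_state(s, a), game, alpha, beta))
        alpha = max(alpha, v)
        if v >= beta:          # beta cut-off: Min above will never let play reach this node
            return v
    return v

def ab_min_value(s, game, alpha, beta):
    if game.terminal_test(s):
        return game.utility(s)
    v = float("inf")
    for a in game.actions(s):
        v = min(v, ab_max_value(game.child_state(s, a), game, alpha, beta))
        beta = min(beta, v)
        if v <= alpha:         # alpha cut-off: Max above will never let play reach this node
            return v
    return v

The root loop threads the best value found so far as α into the successor calls, so pruning already happens among the root's children.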

Alpha-Beta Search: Example


 Notation: v; [α, β]

[Figure sequence omitted: alpha-beta search on the example tree, annotating each node with v; [α, β]. The left Min-node is evaluated to 3, so the root becomes 3; [3, ∞]. In the middle Min-node the leaf 2 makes v ≤ α (interval [3, 2]), so its remaining successors are pruned. The right Min-node is opened with [3, ∞] and evaluated to 2 via the leaves 14, 5, 2; since these arrive in descending order, the cut-off only fires at the last leaf, so nothing is saved there.]

 Note: We could have saved work by choosing the opposite order for the successors of the rightmost Min-node.


Choosing the best moves (for each of Max and Min) first yields more pruning!

Michael Kohlhase: Artificial Intelligence 1 226 2025-02-06

Alpha-Beta Search: Modified Example


 Showing off some actual β pruning:

[Figure sequence omitted: a modified example in which the rightmost Min-node (value 5 after its first leaf, interval [3, 5]) has an inner Max-node as a further successor. That Max-node inherits [3, 5]; its first leaf 14 gives v = 14 ≥ β = 5, so the β cut-off fires and its remaining successors are pruned. The Min-node then takes value 2 from its last leaf, and the root keeps value 3.]

Michael Kohlhase: Artificial Intelligence 1 227 2025-02-06

How Much Pruning Do We Get?


 Choosing the best moves first yields most pruning in alphabeta search.
 The maximizing moves for Max, the minimizing moves for Min.

 Observation: Assuming game tree with branching factor b and depth limit d:
 Minimax would have to search b^d nodes.
 Best case: If we always choose the best moves first, then the search tree is
reduced to b^(d/2) nodes!
 Practice: It is often possible to get very close to the best case by simple move-
ordering methods.
 Example 7.4.5 (Chess).

 Move ordering: Try captures first, then threats, then forward moves, then back-
ward moves.
 From 35^d to 35^(d/2). E.g., if we have the time to search a billion (10^9) nodes, then
minimax looks ahead d = 6 moves, i.e., 3 rounds (white-black) of the game.
Alpha-beta search looks ahead 6 rounds.

Michael Kohlhase: Artificial Intelligence 1 228 2025-02-06

7.5 Monte-Carlo Tree Search (MCTS)


Video Nuggets covering this section can be found at https://fau.tv/clip/id/22259 and
https://fau.tv/clip/id/22262.
We will now come to the most visible game-play program in recent times: the AlphaGo system
for the game of Go, which had been out of reach of the state of the art (and thus of alphabeta
search) until 2016. This challenge was cracked by a different technique, which we will discuss in
this section.

And now . . .
 AlphaGo = Monte Carlo tree search (AI-1) + neural networks (AI-2)

CC-BY-SA: Buster Benson@ https://www.flickr.com/photos/erikbenson/25717574115

Michael Kohlhase: Artificial Intelligence 1 229 2025-02-06

Monte-Carlo Tree Search: Basic Ideas


 Observation: We do not always have good evaluation functions.
 Definition 7.5.1. For Monte Carlo sampling we evaluate actions through sampling.
 When deciding which action to take on game state s:
while time not up do
select action a applicable to s
run a random sample from a until terminal state t
return an a for s with maximal average u(t)

 Definition 7.5.2. For the Monte Carlo tree search algorithm (MCTS) we maintain
a search tree T , the MCTS tree.

while time not up do


apply actions within T to select a leaf state s′
select action a′ applicable to s′ , run random sample from a′
add s′ to T , update averages etc.
return an a for s with maximal average u(t)
When executing a, keep the part of T below a.

 Compared to alphabeta search: no exhaustive enumeration.


 Pro: running time & memory.
 Contra: need good guidance how to select and sample.

Michael Kohlhase: Artificial Intelligence 1 230 2025-02-06

This looks only at a fraction of the search tree, so it is crucial to have good guidance where to go,
i.e. which part of the search tree to look at.
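
The following minimal Python sketch (our own illustration, not from the notes) implements the flat Monte Carlo sampling scheme of Definition 7.5.1; the game interface and the time budget are assumptions.

import random, time

def monte_carlo_decision(s, game, budget_seconds=1.0):
    stats = {a: [0, 0.0] for a in game.actions(s)}       # action -> [samples, summed utility]
    deadline = time.time() + budget_seconds
    while time.time() < deadline:                        # while time not up
        a = random.choice(list(stats))                   # select an action applicable to s
        t = game.child_state(s, a)
        while not game.terminal_test(t):                 # random sample until a terminal state t
            t = game.child_state(t, random.choice(game.actions(t)))
        stats[a][0] += 1
        stats[a][1] += game.utility(t)
    # return an action for s with maximal average u(t)
    return max(stats, key=lambda a: stats[a][1] / stats[a][0] if stats[a][0] else float("-inf"))

The MCTS algorithm of Definition 7.5.2 refines this by keeping the sampled states in a tree and updating the averages along the whole path, as illustrated below.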

Monte-Carlo Sampling: Illustration of Sampling


 Idea: Sample the search tree keeping track of the average utilities.
 Example 7.5.3 (Single-player, for simplicity). (with adversary, distinguish
max/min nodes)

[Figure omitted: a single-player search tree whose terminal states have utilities 100, 70, 50, 30, 40, and 10. Six samples are drawn (middle, left, right, right, left, middle); the root keeps, for each of its three actions, the number of expansions and the average reward, which evolve from 0, 0, 0 / 0, 0, 0 to 2, 2, 2 / 60, 55, 35.]

Michael Kohlhase: Artificial Intelligence 1 231 2025-02-06

The sampling goes middle, left, right, right, left, middle; then it stops and selects the action with the highest average reward (60, i.e. left). After the first sample, when the values in the initial state are updated, each action carries an “expansions” and an “avg. reward” field: a small number of expansions favors exploration (visit parts of the tree rarely visited before – what is out there?), while a high average reward favors exploitation (focus on promising parts of the search tree).

Monte-Carlo Tree Search: Building the Tree


 Idea: We can save work by building the tree as we go along.
 Example 7.5.4 (Redoing the previous example).

[Figure omitted: the same six samples, now incrementally building the MCTS tree by keeping the first state of each sample. Every node in the tree stores expansion counts and average rewards for its applicable actions, and after each sample the averages are updated along the entire path from the sampled leaf back to the root; the root again ends with counts 2, 2, 2 and averages 60, 55, 35.]

Michael Kohlhase: Artificial Intelligence 1 232 2025-02-06

This is the exact same search as on the previous slide, but incrementally building the search tree by always keeping the first state of the sample. The first three iterations (middle, left, right) show the tree extension; note that, like the root node, the nodes added to the tree have expansion and average-reward counters for every applicable action. In the next iteration (right), after the leaf node 30 is found, the averages get updated along the entire path, i.e. not only in the root as before, but also in the nodes along the way. After all six iterations, we select the action left (value 60) as before, but we keep the part of the tree below that action, saving relevant work already done before.

How to Guide the Search in MCTS?


 How to sample?: What exactly is “random”?
 Classical formulation: balance exploitation vs. exploration.

 Exploitation: Prefer moves that have high average already (interesting regions
of state space)
 Exploration: Prefer moves that have not been tried a lot yet (don’t overlook
other, possibly better, options)
 UCT: “Upper Confidence bounds applied to Trees” [KS06].

 Inspired by Multi-Armed Bandit (as in: Casino) problems.


 Basically a formula defining the balance. Very popular (buzzword).
 Recent critics (e.g. [FD14]): Exploitation in search is very different from the
Casino, as the “accumulated rewards” are fictitious (we’re only thinking about
the game, not actually playing and winning/losing all the time).

Michael Kohlhase: Artificial Intelligence 1 233 2025-02-06
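
The slide deliberately leaves the UCT formula abstract. As a concrete reference point (our addition, not spelled out on the slide), the balance is commonly realized by the UCB1 score from the multi-armed bandit literature:

import math

def ucb1(avg_reward, n_action, n_node, c=math.sqrt(2)):
    # exploitation term (average reward) + exploration bonus for rarely tried actions
    if n_action == 0:
        return float("inf")         # untried actions are sampled first
    return avg_reward + c * math.sqrt(math.log(n_node) / n_action)

Here n_node is the number of times the node has been visited, n_action the number of times the action has been tried there, and c a tuning parameter trading off exploration against exploitation.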

AlphaGo: Overview
 Definition 7.5.5 (Neural Networks in AlphaGo).

 Policy networks: Given a state s, output a probability distribution over the


actions applicable in s.
 Value networks: Given a state s, output a number estimating the game value
of s.

 Combination with MCTS:


 Policy networks bias the action choices within the MCTS tree (and hence the
leaf state selection), and bias the random samples.
 Value networks are an additional source of state values in the MCTS tree, along
with the random samples.

 And now in a little more detail

Michael Kohlhase: Artificial Intelligence 1 234 2025-02-06

Neural Networks in AlphaGo


 Neural network training pipeline and architecture:

[Figure omitted: Illustration taken from [Sil+16], Figure 1 – the neural network training pipeline and architecture of AlphaGo.]

 Rollout policy pπ : Simple but fast, ≈ prior work on Go.
 SL policy network pσ : Supervised learning, human-expert data (“learn to choose an expert action”).
 RL policy network pρ : Reinforcement learning, self-play (“learn to win”).
 Value network vθ : Use self-play games with pρ as training data for game-position evaluation vθ (“predict which player will win in this state”).

Michael Kohlhase: Artificial Intelligence 1 235 2025-02-06

Comments on the Figure:

a A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict
human expert moves in a data set of positions. A reinforcement learning (RL) policy network
pρ is initialized to the SL policy network, and is then improved by policy gradient learning to
maximize the outcome (that is, winning more games) against previous versions of the policy
network. A new data set is generated by playing games of self-play with the RL policy network.
Finally, a value network vθ is trained by regression to predict the expected outcome (that is,
whether the current player wins) in positions from the self-play data set.

b Schematic representation of the neural network architecture used in AlphaGo. The policy
network takes a representation of the board position s as its input, passes it through many con-
volutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a
probability distribution pσ (a|s) or pρ (a|s) over legal moves a, represented by a probability map
over the board. The value network similarly uses many convolutional layers with parameters θ,
but outputs a scalar value vθ (s′ ) that predicts the expected outcome in position s′ .

Neural Networks + MCTS in AlphaGo


 Monte Carlo tree search in AlphaGo:

[Figure omitted: Illustration taken from [Sil+16], Figure 3 – Monte Carlo tree search in AlphaGo: selection, expansion, evaluation, and backup.]

 Rollout policy pπ : Action choice in random samples.
 SL policy network pσ : Action choice bias within the MCTS tree (stored as “P ”, gets smaller to “u(P )” with number of visits); descending along with quality Q.
 RL policy network pρ : Not used here (used only to learn vθ ).
 Value network vθ : Used to evaluate leaf states s, in a linear sum with the value returned by a random sample on s.

Michael Kohlhase: Artificial Intelligence 1 236 2025-02-06

Comments on the Figure:

a Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a
bonus u(P ) that depends on a stored prior probability P for that edge.

b The leaf node may be expanded; the new node is processed once by the policy network pσ and
the output probabilities are stored as prior probabilities P for each action.
c At the end of a simulation, the leaf node is evaluated in two ways:

• using the value network vθ ,


• and by running a rollout to the end of the game
with the fast rollout policy p π, then computing the winner with function r.
d Action values Q are updated to track the mean value of all evaluations r(·) and vθ (·) in the
subtree below that action.
AlphaGo, Conclusion?: This is definitely a great achievement!
• “Search + neural networks” looks like a great formula for general problem solving.
• expect to see lots of research on this in the coming decade(s).

• The AlphaGo design is quite intricate (architecture, learning workflow, training data design,
neural network architectures, . . . ).
• How much of this is reusable in/generalizes to other problems?
• Still lots of human expertise in here. Not as much, like in chess, about the game itself. But
rather, in the design of the neural networks + learning architecture.

7.6 State of the Art


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22250.

State of the Art


 Some well-known board games:
 Chess: Up next.
 Othello (Reversi): In 1997, “Logistello” beat the human world champion. Best
computer players now are clearly better than best human players.
 Checkers (Dame): Since 1994, “Chinook” is the official world champion. In
2007, it was shown to be unbeatable: Checkers is solved. (We know the exact
value of, and optimal strategy for, the initial state.)
 Go: In 2016, AlphaGo beat the Grandmaster Lee Sedol, cracking the “holy grail”
of board games. In 2017, “AlphaZero” – a variant of AlphaGo with zero prior
knowledge beat all reigning champion systems in all board games (including
AlphaGo) 100/0 after 24h of self-play.
 Intuition: Board Games are considered a “solved problem” from the AI per-
spective.

Michael Kohlhase: Artificial Intelligence 1 237 2025-02-06

Computer Chess: “Deep Blue” beat Garry Kasparov in 1997



 6 games, final score 3.5 : 2.5.

 Specialized chess hardware, 30 nodes with


16 processors each.
 Alphabeta search plus human knowledge.
(more details in a moment)

 Nowadays, standard PC hardware plays at


world champion level.

Michael Kohlhase: Artificial Intelligence 1 238 2025-02-06

Computer Chess: Famous Quotes

 The chess machine is an ideal one to start with, since (Claude Shannon (1949))
1. the problem is sharply defined both in allowed operations (the moves) and in the
ultimate goal (checkmate),
2. it is neither so simple as to be trivial nor too difficult for satisfactory solution,
3. chess is generally considered to require “thinking” for skilful play, [. . . ]
4. the discrete structure of chess fits well into the digital nature of modern comput-
ers.
 Chess is the drosophila of Artificial Intelligence. (Alexander Kronrod (1965))

Michael Kohlhase: Artificial Intelligence 1 239 2025-02-06

Computer Chess: Another Famous Quote


 In 1965, the Russian mathematician Alexander Kronrod said, “Chess is the Drosophila
of artificial intelligence.”
However, computer chess has developed much as genetics might have if the geneti-
cists had concentrated their efforts starting in 1910 on breeding racing Drosophilae.
We would have some science, but mainly we would have very fast fruit flies. (John
McCarthy (1997))

Michael Kohlhase: Artificial Intelligence 1 240 2025-02-06

7.7 Conclusion
Summary
 Games (2-player turn-taking zero-sum discrete and finite games) can be understood
as a simple extension of classical search problems.
 Each player tries to reach a terminal state with the best possible utility (maximal
vs. minimal).

 Minimax searches the game depth-first, max’ing and min’ing at the respective turns
of each player. It yields perfect play, but takes time O(bd ) where b is the branching
factor and d the search depth.
 Except in trivial games (Tic-Tac-Toe), minimax needs a depth limit and an
evaluation function to estimate the value of the cut-off states.
 Alpha-beta search remembers the best values achieved for each player elsewhere in
the tree already, and prunes out sub-trees that won’t be reached in the game.
 Monte Carlo tree search (MCTS) samples game branches, and averages the findings.
AlphaGo controls this using neural networks: evaluation function (“value network”),
and action filter (“policy network”).

Michael Kohlhase: Artificial Intelligence 1 241 2025-02-06

Suggested Reading:
• Chapter 5: Adversarial Search, Sections 5.1 – 5.4 [RN09].
– Section 5.1 corresponds to my “Introduction”, Section 5.2 corresponds to my “Minimax Search”,
Section 5.3 corresponds to my “Alpha-Beta Search”. I have tried to add some additional clarify-
ing illustrations. RN gives many complementary explanations, nice as additional background
reading.
– Section 5.4 corresponds to my “Evaluation Functions”, but discusses additional aspects re-
lating to narrowing the search and look-up from opening/termination databases. Nice as
additional background reading.
– I suppose a discussion of MCTS and AlphaGo will be added to the next edition . . .
Chapter 8

Constraint Satisfaction Problems

In the last chapters we have studied methods for “general problem solving”, i.e. methods that are
applicable to all problems that are expressible in terms of states and “actions”. It is crucial to
realize that these states were atomic, which makes the algorithms employed (search algorithms)
relatively simple and generic, but does not let them exploit any knowledge we might have about
the internal structure of states.
In this chapter, we will look into algorithms that do just that by progressing to factored states
representations. We will see that this allows for algorithms that are many orders of magnitude
more efficient than search algorithms.
To give an intuition for factored states representations, we present some motivational examples
in ?? and go into detail of the Waltz algorithm, which gave rise to the main ideas of constraint
satisfaction algorithms in ??. ?? and ?? define constraint satisfaction problems formally and use
that to develop a class of backtracking/search based algorithms. The main contribution of the
factored states representations is that we can formulate advanced search heuristics that guide
search based on the structure of the states.

8.1 Constraint Satisfaction Problems: Motivation


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22251.

A (Constraint Satisfaction) Problem


 Example 8.1.1 (Tournament Schedule). Who’s going to play against who, when
and where?


Michael Kohlhase: Artificial Intelligence 1 242 2025-02-06

Constraint Satisfaction Problems (CSPs)


 Standard search problem: state is a “black box” any old data structure that supports
goal test, eval, successor state, . . .

 Definition 8.1.2. A constraint satisfaction problem (CSP) is a triple ⟨V , D, C ⟩


where
1. V is a finite set of variables,
2. D a V -indexed family (Dv )v∈V of domains, and
3. C a set of constraints: for certain subsets {v 1 , . . ., v k } ⊆ V a constraint C {v1 ,...,vk } ⊆ Dv1 × . . . × Dvk .
A variable assignment φ, i.e. a function φ on V with φ(v) ∈ Dv for all v ∈ V , is a solution for the CSP, iff
⟨φ(v 1 ), . . ., φ(v k )⟩ ∈ C {v1 ,...,vk } for all constraints C {v1 ,...,vk } ∈ C.
 Definition 8.1.3. A CSP γ is called satisfiable, iff it has a solution: a total variable
assignment φ that satisfies all constraints.

 Definition 8.1.4. The process of finding solutions to CSPs is called constraint


solving.
 Remark 8.1.5. We are using factored representation for world states now!

 Allows useful general-purpose algorithms with more power than standard tree
search algorithm.

Michael Kohlhase: Artificial Intelligence 1 243 2025-02-06

Another Constraint Satisfaction Problem



 Example 8.1.6 (SuDoKu). Fill the cells with row/column/block-unique digits

 Variables: The 81 cells.


 Domains: Numbers 1, . . . , 9.
 Constraints: Each number only once in each row, column, block.

Michael Kohlhase: Artificial Intelligence 1 244 2025-02-06

CSP Example: Map-Coloring


 Definition 8.1.7. Given a map M , the map coloring problem is to assign colors to
regions in a map so that no adjoining regions have the same color.
 Example 8.1.8 (Map coloring in Australia).

[Figure omitted, cf. [RN09, Figure 6.1]: (a) the principal states and territories of Australia; (b) the map-coloring problem represented as a constraint graph.]

 Variables: WA, NT, Q, NSW, V, SA, T
 Domains: Di = {red, green, blue}
 Constraints: adjacent regions must have different colors, e.g., WA ̸= NT (if the language allows this), or ⟨WA, NT⟩ ∈ {⟨red, green⟩, ⟨red, blue⟩, ⟨green, red⟩, . . . }
 Intuition: solutions map variables to domain values satisfying all constraints, e.g., {WA = red, NT = green, . . . }

Michael Kohlhase: Artificial Intelligence 1 245 2025-02-06

Bundesliga Constraints

 Variables: vAvs.B where A and B are teams, with domains {1, . . . ,34}: For each match, the index of the weekend where it is scheduled.

 (Some) constraints:
 If {A, B} ∩ {C, D} =
̸ ∅: vAvs.B ̸=
vCvs.D (each team only one match
per day).

 If {A, B} = {C, D}: vAvs.B ≤ 17 <


vCvs.D or vCvs.D ≤ 17 < vAvs.B
(each pairing exactly once in each
half-season).

 If A = C: vAvs.B + 1 ̸= vCvs.D
(each team alternates between home
matches and away matches).
 Leading teams of last season meet
near the end of each half-season.

 ...

Michael Kohlhase: Artificial Intelligence 1 246 2025-02-06

How to Solve the Bundesliga Constraints?


 306 nested for-loops (for each of the 306 matches), each ranging from 1 to 306.
Within the innermost loop, test whether the current values are (a) a permutation
and, if so, (b) a legal Bundesliga schedule.

 Estimated running time: End of this universe, and the next couple billion ones
after it . . .
 Directly enumerate all permutations of the numbers 1, . . . , 306, test for each whether
it’s a legal Bundesliga schedule.
 Estimated running time: Maybe only the time span of a few thousand uni-
verses.
 View this as variables/constraints and use backtracking (this chapter)
 Executed running time: About 1 minute.
 How do they actually do it?: Modern computers and CSP methods: fractions
of a second. 19th (20th/21st?) century: Combinatorics and manual work.
 Try it yourself: with an off-the-shelf CSP solver, e.g. Minion [Min]

Michael Kohlhase: Artificial Intelligence 1 247 2025-02-06
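
To give a first taste of the backtracking approach referred to above (it is developed properly later in this chapter), here is a minimal Python sketch that solves the Australia map coloring problem of Example 8.1.8 by naive backtracking; the encoding of the adjacency relation is our own illustration.

NEIGHBORS = {("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"),
             ("SA", "Q"), ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")}
VARIABLES = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
DOMAIN = ["red", "green", "blue"]

def consistent(var, value, assignment):
    # the constraint "adjacent regions have different colors"
    return all(assignment.get(other) != value for other in VARIABLES
               if (var, other) in NEIGHBORS or (other, var) in NEIGHBORS)

def backtrack(assignment=None):
    assignment = assignment or {}
    if len(assignment) == len(VARIABLES):
        return assignment                       # total consistent assignment = solution
    var = next(v for v in VARIABLES if v not in assignment)
    for value in DOMAIN:
        if consistent(var, value, assignment):
            result = backtrack({**assignment, var: value})
            if result is not None:
                return result
    return None                                 # no legal value left: backtrack

print(backtrack())    # e.g. {'WA': 'red', 'NT': 'green', 'SA': 'blue', 'Q': 'red', ...}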

More Constraint Satisfaction Problems



Traveling Tournament Problem Scheduling

Timetabling Radio Frequency Assignment

Michael Kohlhase: Artificial Intelligence 1 248 2025-02-06

1. U.S. Major League Baseball, 30 teams, each 162 games. There’s one crucial additional difficulty,
in comparison to Bundesliga. Which one? Travel is a major issue here!! Hence “Traveling
Tournament Problem” in reference to the TSP.
2. This particular scheduling problem is called “car sequencing”, how to most efficiently get cars
through the available machines when making the final customer configuration (non-standard/flexible/custom
extras).

3. Another common form of scheduling . . .


4. The problem of assigning radio frequencies so that all can operate together without noticeable
interference. Variable domains are available frequencies, constraints take form of |x − y| > δxy ,
where delta depends on the position of x and y as well as the physical environment.

Our Agenda for This Topic


 Our treatment of the topic “Constraint Satisfaction Problems” consists of Chap-
ters 7 and 8. in [RN03]
 This Chapter: Basic definitions and concepts; naïve backtracking search.
 Sets up the framework. Backtracking underlies many successful algorithms for
solving constraint satisfaction problems (and, naturally, we start with the sim-
plest version thereof).
 Next Chapter: Constraint propagation and decomposition methods.
 Constraint propagation reduces the search space of backtracking. Decomposi-
tion methods break the problem into smaller pieces. Both are crucial for efficiency
in practice.

Michael Kohlhase: Artificial Intelligence 1 249 2025-02-06



Our Agenda for This Chapter


 Constraint networks, assignments, consistency, solutions: How are
constraint satisfaction problems defined? What is a solution?
 Get ourselves on firm ground.
 Naïve Backtracking: How does backtracking work? What are its main weak-
nesses?
 Serves to understand the basic workings of this wide-spread algorithm, and to
motivate its enhancements.
 Variable- and Value Ordering: How should we guide backtracking searches?

 Simple methods for making backtracking aware of the structure of the problem,
and thereby reduce search.

Michael Kohlhase: Artificial Intelligence 1 250 2025-02-06

8.2 The Waltz Algorithm


We will now have a detailed look at the problem (and innovative solution) that started the
field of constraint satisfaction problems.
Background:
Adolfo Guzman worked on an algorithm to count the number of simple objects (like children’s
blocks) in a line drawing. David Huffman formalized the problem and limited it to objects in
general position, such that the vertices are always adjacent to three faces and each vertex is
formed from three planes at right angles (trihedral). Furthermore, the drawings could only have
three kinds of lines: object boundary, concave, and convex. Huffman enumerated all possible
configurations of lines around a vertex. This problem was too narrow for real-world situations, so
Waltz generalized it to include cracks, shadows, non-trihedral vertices and light. This resulted in
over 50 different line labels and thousands of different junctions. [ILD]

The Waltz Algorithm


 Remark: One of the earliest examples of applied CSPs.
 Motivation: Interpret line drawings of polyhedra.

 Problem: Are intersections convex or concave? (interpret =b label as such)

 Idea: Adjacent intersections impose constraints on each other. Use CSP to find a
unique set of labelings.

Michael Kohlhase: Artificial Intelligence 1 251 2025-02-06

Waltz Algorithm on Simple Scenes


 Assumptions: All objects
 have no shadows or cracks,
 have only three-faced vertices,
 are in “general position”, i.e. no junctions change with small movements of the
eye.

 Observation 8.2.1. Then each line on the images is one of the following:
 a boundary line (edge of an object) (<) with right hand of arrow denoting “solid”
and left hand denoting “space”
 an interior convex edge (label with “+”)
 an interior concave edge (label with “-”)

Michael Kohlhase: Artificial Intelligence 1 252 2025-02-06

18 Legal Kinds of Junctions


 Observation 8.2.2. There are only 18 “legal” kinds of junctions:

 Idea: given a representation of a diagram


 label each junction in one of these manners (lots of possible ways)

 junctions must be labeled, so that lines are labeled consistently


 Fun Fact: CSP always works perfectly! (early success story for CSP [Wal75])

Michael Kohlhase: Artificial Intelligence 1 253 2025-02-06

Waltz’s Examples
 In his 1972 dissertation [Wal75], David Waltz used the following examples:

Michael Kohlhase: Artificial Intelligence 1 254 2025-02-06

Waltz Algorithm (More Examples): Ambiguous Figures

Michael Kohlhase: Artificial Intelligence 1 255 2025-02-06



Waltz Algorithm (More Examples): Impossible Figures

Michael Kohlhase: Artificial Intelligence 1 256 2025-02-06

8.3 CSP: Towards a Formal Definition


We will now work our way towards a definition of CSPs that is formal enough so that we can
define the concept of a solution. This gives us the necessary grounding to talk about algorithms
later. A Video Nugget covering this section can be found at https://fau.tv/clip/id/22277.

Types of CSPs
 Definition 8.3.1. We call a CSP discrete, iff all of the variables have countable
domains; we have two kinds:
 finite domains (size d ; O(d^n ) solutions)
 e.g., Boolean CSPs (solvability =b Boolean satisfiability ; NP complete)
 infinite domains (e.g. integers, strings, etc.)
 e.g., job scheduling, variables are start/end days for each job
 need a “constraint language”, e.g., StartJob1 + 5 ≤ StartJob3
 linear constraints decidable, nonlinear ones undecidable

 Definition 8.3.2. We call a CSP continuous, iff one domain is uncountable.


 Example 8.3.3. Start/end times for Hubble Telescope observations form a contin-
uous CSP.
 Theorem 8.3.4. Linear constraints solvable in poly time by linear programming
methods.

 Theorem 8.3.5. There cannot be optimal algorithms for nonlinear constraint


systems.

Michael Kohlhase: Artificial Intelligence 1 257 2025-02-06

Types of Constraints
 We classify the constraints by the number of variables they involve.

 Definition 8.3.6. Unary constraints involve a single variable, e.g., SA ̸= green.


 Definition 8.3.7. Binary constraints involve pairs of variables, e.g., SA ̸= WA.
 Definition 8.3.8. Higher-order constraints involve n = 3 or more variables, e.g.,
cryptarithmetic column constraints.
The number n of variables is called the order of the constraint.
 Definition 8.3.9. Preferences (soft constraint) (e.g., red is better than green)
are often representable by a cost for each variable assignment ; constrained opti-
mization problems.

Michael Kohlhase: Artificial Intelligence 1 258 2025-02-06

Non-Binary Constraints, e.g. “Send More Money”


 Example 8.3.10 (Send More Money). A student writes home:

S E N D
+ M O R E
M O N E Y

Puzzle: letters stand for digits, the addition should work out (parents send MONEY€)

 Variables: S, E, N, D, M, O, R, Y , each with domain {0, . . . ,9}.


 Constraints:
1. all variables should have different values: S ̸= E, S ̸= N , . . .
2. first digits are non-zero: S ̸= 0, M ̸= 0.
3. the addition scheme should work out: i.e.
1000 · S + 100 · E + 10 · N + D + 1000 · M + 100 · O + 10 · R + E = 10000 · M +
1000 · O + 100 · N + 10 · E + Y .

 BTW: The solution is S ↦ 9, E ↦ 5, N ↦ 6, D ↦ 7, M ↦ 1, O ↦ 0, R ↦ 8, Y ↦ 2 ; parents send 10652€

 Definition 8.3.11. Problems like the one in ?? are called crypto-arithmetic puzzles.

Michael Kohlhase: Artificial Intelligence 1 259 2025-02-06
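
For illustration only (not part of the notes), the constraints of Example 8.3.10 can be checked by brute force; the following Python sketch enumerates digit assignments with itertools and recovers the solution quoted above.

from itertools import permutations

LETTERS = "SENDMORY"

def value(word, digits):
    return int("".join(str(digits[c]) for c in word))

for perm in permutations(range(10), len(LETTERS)):        # all-different by construction
    digits = dict(zip(LETTERS, perm))
    if digits["S"] == 0 or digits["M"] == 0:              # first digits are non-zero
        continue
    if value("SEND", digits) + value("MORE", digits) == value("MONEY", digits):
        print(digits)    # S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2
        break

Constraint solvers avoid this blind enumeration, but the snippet makes the three constraint groups of the example tangible.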

Encoding Higher-Order Constraints as Binary ones



 Problem: The last constraint is of order 8. (n = 8 variables involved)


 Observation 8.3.12. We can write the addition scheme constraint column wise
using auxiliary variables, i.e. variables that do not “occur” in the original problem.

D + E = Y + 10 · X1
X1 + N + R = E + 10 · X2
X2 + E + O = N + 10 · X3
X3 + S + M = O + 10 · M

S E N D
+ M O R E
M O N E Y

These constraints are of order ≤ 5.


 General Recipe: For n ≥ 3, encode C(v1 , . . . , vn−1 , vn ) as

C(p1 (x), . . . , pn−1 (x), vn ) ∧ v1 = p1 (x) ∧ . . . ∧ vn−1 = pn−1 (x)

 Problem: The problem structure gets hidden. (search algorithms can get
confused)

Michael Kohlhase: Artificial Intelligence 1 260 2025-02-06

Constraint Graph
 Definition 8.3.13. A binary CSP is a CSP where each constraint is unary or binary.
 Observation 8.3.14. A binary CSP forms a graph called the constraint graph
whose nodes are variables, and whose edges represent the constraints.

 Example
8.3.15. Australia as a binary CSP.

[Figure omitted, cf. [RN09, Figure 6.1]: the map of Australia and the corresponding constraint graph with nodes WA, NT, SA, Q, NSW, V, T and edges between adjacent regions.]

 Intuition: General-purpose CSP algorithms use the graph structure to speed up search. (E.g., Tasmania is an independent subproblem!)

Michael Kohlhase: Artificial Intelligence 1 261 2025-02-06

Real-world CSPs

 Example 8.3.16 (Assignment problems). e.g., who teaches what class

 Example 8.3.17 (Timetabling problems). e.g., which class is offered when and
where?
 Example 8.3.18 (Hardware configuration).

 Example 8.3.19 (Spreadsheets).


 Example 8.3.20 (Transportation scheduling).
 Example 8.3.21 (Factory scheduling).
 Example 8.3.22 (Floorplanning).

 Note: many real-world problems involve real-valued variables ; continuous CSPs.

Michael Kohlhase: Artificial Intelligence 1 262 2025-02-06

8.4 Constraint Networks: Formalizing Binary CSPs


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22279.

Constraint Networks (Formalizing binary CSPs)


 Definition 8.4.1. A constraint network is a triple γ := ⟨V , D, C ⟩, where

 V is a finite set of variables,


 D := {Dv | v ∈ V } the set of their domains, and
 C := {C uv ⊆ Du ×Dv | u, v ∈ V and u ̸= v} is a set of constraints with C uv = (C vu )^−1 .

We call the undirected graph ⟨V , {(u,v) ∈ V 2 | C uv ̸= Du × Dv }⟩, the constraint


graph of γ.
 We will talk of CSPs and mean constraint networks.
 Remarks: The mathematical formulation gives us a lot of leverage:
 b possible assignments to variables u and v
C uv ⊆ Du ×Dv =
 Relations are the most general formalization, generally we use symbolic formu-
lations, e.g. “u = v” for the relation C uv = {(a,b) | a = b} or “u ̸= v”.
 We can express a unary constraint C v by restricting the domain of v: Dv := C v .

Michael Kohlhase: Artificial Intelligence 1 263 2025-02-06

Example: SuDoKu as a Constraint Network


 Example 8.4.2 (Formalize SuDoKu). We use the added formality to encode
SuDoKu as a constraint network, not just as a CSP as ??.

 Variables: V = {vij | 1 ≤ i, j ≤ 9}: vij =cell in row i column j.


 Domains For all v ∈ V : Dv = D = {1, . . . ,9}.
 Unary constraint: Cvij = {d} if cell i, j is pre-filled with d.
 (Binary) constraint: C vij vi′ j′ =̂ “vij ̸= vi′ j′ ”, i.e.
C vij vi′ j′ = {(d,d′ ) ∈ D × D | d ̸= d′ }, for: i = i′ (same row), or j = j ′ (same
column), or (⌈i/3⌉,⌈j/3⌉) = (⌈i′ /3⌉,⌈j ′ /3⌉) (same block).

Note that the ideas are still the same as ??, but in constraint networks we have a
language to formulate things precisely.

Michael Kohlhase: Artificial Intelligence 1 264 2025-02-06
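
To see how compactly the formal definition translates into code, here is a minimal Python sketch
(not part of the original notes; the dictionary representation and all names are illustrative
assumptions) that encodes the Australia map coloring network used in later examples as a triple
⟨V, D, C⟩. The later sketches in this chapter reuse this representation.

V = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
colors = {"red", "green", "blue"}
D = {v: set(colors) for v in V}                  # one finite domain per variable

def different(dom_u, dom_v):
    # the binary relation "u != v" as a set of admissible value pairs
    return {(a, b) for a in dom_u for b in dom_v if a != b}

edges = [("WA", "NT"), ("WA", "SA"), ("NT", "SA"), ("NT", "Q"),
         ("SA", "Q"), ("SA", "NSW"), ("SA", "V"), ("Q", "NSW"), ("NSW", "V")]
C = {}
for u, v in edges:
    C[(u, v)] = different(D[u], D[v])              # C_uv
    C[(v, u)] = {(b, a) for (a, b) in C[(u, v)]}   # C_vu is the inverse of C_uv

Storing both orientations of every constraint mirrors the requirement C uv = (C vu )−1 from the
definition and keeps the later inference sketches simple.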

Constraint Networks (Solutions)


 Let γ := ⟨V , D, C ⟩ be a constraint network.
S
 Definition 8.4.3. We call a partial function a : V ⇀ u∈V Du a variable assignment
if a(u) ∈ Du for all u ∈ dom(a).
S
 Definition 8.4.4. Let C := ⟨V , D, C ⟩ be a constraint network and a : V ⇀ v∈V Dv
a variable assignment. We say that a satisfies (otherwise violates) a constraint C uv ,
iff u, v ∈ dom(a) and (a(u),a(v)) ∈ C uv . a is called consistent in C, iff it satisfies
all constraints in C. A value w ∈ Du is legal for a variable u in C, iff {(u,w)} is a
consistent assignment in C. A variable with illegal value under a is called conflicted.
 Example 8.4.5. The empty assignment ϵ is (trivially) consistent in any constraint
network.
 Definition 8.4.6. Let f and g be variable assignments, then we say that f extends
(or is an extension of) g, iff dom(g)⊂dom(f ) and f |dom(g) = g.

 Definition 8.4.7. We call a consistent (total) assignment a solution for γ and γ


itself solvable or satisfiable.

Michael Kohlhase: Artificial Intelligence 1 265 2025-02-06

How it all fits together


 Lemma 8.4.8. Higher-order constraints can be transformed into equi-satisfiable
binary constraints using auxiliary variables.


 Corollary 8.4.9. Any CSP can be represented by a constraint network.
 In other words: The notion of a constraint network is a refinement of a CSP.

 So we will stick to constraint networks in this course.


 Observation 8.4.10. We can view a constraint network as a search problem, if we
take the states as the variable assignments, the actions as assignment extensions,
and the goal states as consistent assignments.

 Idea: We will explore that idea for algorithms that solve constraint networks.

Michael Kohlhase: Artificial Intelligence 1 266 2025-02-06

8.5 CSP as Search


We now follow up on ?? to use search algorithms for solving constraint networks.
The key point of this section is that the factored state representations realized by constraint
networks allow the formulation of very powerful heuristics. A Video Nugget covering this
section can be found at https://fau.tv/clip/id/22319.

Standard search formulation (incremental)


 Idea: Every constraint network induces a single state problem.
 Definition 8.5.1 (Let’s do the math). Given a constraint network γ := ⟨V , D, C ⟩,
then Πγ := ⟨S γ , Aγ , T γ , I γ , G γ ⟩ is called the search problem induced by γ, iff
 States S γ : the variable assignments
 Actions Aγ : extend φ ∈ S γ by a pair x 7→ v not conflicted with φ.
 Transition model T γ (a, φ) = φ ∪ {x 7→ v} (the extended assignment)
 Initial state I γ : the empty assignment ϵ.
 Goal states G γ : the total, consistent assignments
 What has just happened?: We interpret a constraint network γ as a search
problem Πγ . A solution to Πγ induces a solution to γ.

 Idea: We have algorithms for that: e.g. tree search.


 Remark: This is the same for all CSPs!
; fail if no consistent assignments exist (not fixable!)

Michael Kohlhase: Artificial Intelligence 1 267 2025-02-06

Standard search formulation (incremental)


 Example 8.5.2. A search tree for ΠAustralia :

W A = red W A = green W A = blue

W A = red W A = red
N T = green N T = blue

W A = red W A = red
N T = green N T = green
Q = red Q = blue

 Observation: Every solution appears at depth n with n variables.


 Idea: Use depth first search!

 Observation: Path is irrelevant ; can use local search algorithms.


 Branching factor b = (n − ℓ)d at depth ℓ, hence n! · d^n leaves!

Michael Kohlhase: Artificial Intelligence 1 268 2025-02-06

Backtracking Search
 Assignments for different variables are independent!
 e.g. first WA = red then NT = green vs. first NT = green then WA = red
 ; we only need to consider assignments to a single variable at each node
 ; b = d and there are d^n leaves.
 Definition 8.5.3. Depth first search for CSPs with single-variable assignment
extensions actions is called backtracking search.
 Backtracking search is the basic uninformed algorithm for CSPs.

 It can solve the n-queens problem for n ≈ 25.

Michael Kohlhase: Artificial Intelligence 1 269 2025-02-06

Backtracking Search (Implementation)


 Definition 8.5.4. The generic backtracking search algorithm:
procedure Backtracking−Search(csp) returns solution/failure
return Recursive−Backtracking(∅, csp)
procedure Recursive−Backtracking(assignment, csp) returns soln/failure
if assignment is complete then return assignment
var := Select−Unassigned−Variable(Variables[csp], assignment, csp)
foreach value in Order−Domain−Values(var, assignment, csp) do
if value is consistent with assignment given Constraints[csp] then
add {var = value} to assignment
result := Recursive−Backtracking(assignment, csp)
if result ̸= failure then return result
remove {var = value} from assignment
return failure

Michael Kohlhase: Artificial Intelligence 1 270 2025-02-06
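
The pseudocode above can be written down directly in Python. The following sketch is an
illustration under the dictionary representation ⟨V, D, C⟩ introduced earlier (not the official
implementation); variable and value orderings are deliberately left naive here.

def consistent(var, value, assignment, C):
    # value for var is compatible with all constraints to already assigned variables
    return all((value, assignment[other]) in C[(var, other)]
               for other in assignment if (var, other) in C)

def backtracking_search(V, D, C, assignment=None):
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(V):                      # total assignment => solution
        return assignment
    var = next(v for v in V if v not in assignment)    # naive variable ordering
    for value in D[var]:                               # naive value ordering
        if consistent(var, value, assignment, C):
            assignment[var] = value
            result = backtracking_search(V, D, C, assignment)
            if result is not None:
                return result
            del assignment[var]                        # undo and try the next value
    return None                                        # triggers backtracking

Calling backtracking_search(V, D, C) on the Australia network sketched above returns a consistent
coloring, or None if none exists.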

Backtracking in Australia
 Example 8.5.5. We apply backtracking search for a map coloring problem:

Step 1:

Step 2:

Step 3:

Step 4:

Michael Kohlhase: Artificial Intelligence 1 271 2025-02-06

Improving Backtracking Efficiency


 General-purpose methods can give huge gains in speed for backtracking search.
 Answering the following questions well helps find powerful heuristics:
1. Which variable should be assigned next? (i.e. a variable ordering heuristic)
2. In what order should its values be tried? (i.e. a value ordering heuristic)
3. Can we detect inevitable failure early? (for pruning strategies)
4. Can we take advantage of problem structure? (; inference)
 Observation: Questions 1/2 correspond to the missing subroutines
Select−Unassigned−Variable and Order−Domain−Values from ??.

Michael Kohlhase: Artificial Intelligence 1 272 2025-02-06


186 CHAPTER 8. CONSTRAINT SATISFACTION PROBLEMS

Heuristic: Minimum Remaining Values (Which Variable)


 Definition 8.5.6. The minimum remaining values (MRV) heuristic for backtracking
search always chooses the variable with the fewest legal values, i.e. a variable v that
given an initial assignment a minimizes #({d ∈ Dv | a ∪ {v 7→ d} is consistent}).
 Intuition: By choosing a most constrained variable v first, we reduce the branching
factor (number of sub trees generated for v) and thus reduce the size of our search
tree.
 Extreme case: If #({d ∈ Dv | a ∪ {v 7→ d} is consistent}) = 1, then the value
assignment to v is forced by our previous choices.
 Example 8.5.7. In step 3 of ??, there is only one remaining value for SA!

Michael Kohlhase: Artificial Intelligence 1 273 2025-02-06
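
As a concrete sketch (again an illustrative assumption, reusing the consistent helper from the
backtracking sketch above), MRV can serve as the missing Select−Unassigned−Variable subroutine:

def legal_values(var, assignment, D, C):
    # the values of var that are still consistent with the partial assignment
    return [d for d in D[var] if consistent(var, d, assignment, C)]

def select_mrv_variable(V, D, C, assignment):
    unassigned = [v for v in V if v not in assignment]
    # pick a variable with the fewest remaining legal values (ties broken arbitrarily)
    return min(unassigned, key=lambda v: len(legal_values(v, assignment, D, C)))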

Degree Heuristic (Variable Order Tie Breaker)

 Problem: Need a tie-breaker among MRV variables! (there was no preference in
step 1,2)
 Definition 8.5.8. The degree heuristic in backtracking search always chooses a
most constraining variable, i.e. given an initial assignment a always pick a variable
v with #({u ∈ (V \dom(a)) | C uv ∈ C}) maximal.

 By choosing a most constraining variable first, we detect inconsistencies earlier on


and thus reduce the size of our search tree.
 Commonly used strategy combination: From the set of most constrained vari-
able, pick a most constraining variable.
 Example 8.5.9.

Degree heuristic: SA = 5, T = 0, all others 2 or 3.

Michael Kohlhase: Artificial Intelligence 1 274 2025-02-06

Where in ?? does the most constraining variable play a role in the choice? SA (only possible
choice), NT (all choices possible except WA, V, T). Where in the illustration does the most con-
strained variable play a role in the choice? NT (all choices possible except T), Q (only Q and WA
possible).

Least Constraining Value Heuristic (Value Ordering)


 Definition 8.5.10. Given a variable v, the least constraining value heuristic chooses
the least constraining value for v: the one that rules out the fewest values in the
remaining variables, i.e. for a given initial assignment a and a chosen variable v pick a
value d ∈ Dv that minimizes #({e ∈ Du | u ̸∈ dom(a), C uv ∈ C, and (e,d) ̸∈ C uv })

 By choosing the least constraining value first, we increase the chances to not rule
out the solutions below the current node.
 Example 8.5.11.

 Combining these heuristics makes 1000 queens feasible.

Michael Kohlhase: Artificial Intelligence 1 275 2025-02-06
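
A corresponding sketch of Order−Domain−Values with the least constraining value heuristic
(illustrative only; same dictionary representation as before) sorts the values of the chosen
variable by how many options they remove from unassigned neighbours:

def lcv_order(var, assignment, D, C):
    def eliminated(d):
        # how many values d rules out in unassigned variables constrained with var
        return sum(1 for (u, w) in C if u == var and w not in assignment
                     for e in D[w] if (d, e) not in C[(u, w)])
    return sorted(D[var], key=eliminated)      # least constraining values first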

8.6 Conclusion & Preview


Summary & Preview
 Summary of “CSP as Search”:
 Constraint networks γ consist of variables, associated with finite domains, and
constraints which are binary relations specifying permissible value pairs.
 A variable assignment a maps some variables to values. a is consistent if it
complies with all constraints. A consistent total assignment is a solution.
 The constraint satisfaction problem (CSP) consists in finding a solution for a
constraint network. This has numerous applications including, e.g., scheduling
and timetabling.
 Backtracking search assigns variables one by one, pruning inconsistent variable
assignments.
 Variable orderings in backtracking can dramatically reduce the size of the search
tree. Value orderings have this potential (only) in solvable sub trees.
 Up next: Inference and decomposition, for improved efficiency.

Michael Kohlhase: Artificial Intelligence 1 276 2025-02-06

Suggested Reading:

• Chapter 6: Constraint Satisfaction Problems, Sections 6.1 and 6.3, in [RN09].



– Compared to our treatment of the topic “Constraint Satisfaction Problems” (?? and ??),
RN covers much more material, but less formally and in much less detail (in particular, my
slides contain many additional in-depth examples). Nice background/additional reading, can’t
replace the lectures.
– Section 6.1: Similar to our “Introduction” and “Constraint Networks”, less/different examples,
much less detail, more discussion of extensions/variations.
– Section 6.3: Similar to my “Naïve Backtracking” and “Variable- and Value Ordering”, with
less examples and details; contains part of what we cover in ?? (RN does inference first, then
backtracking). Additional discussion of backjumping.
Chapter 9

Constraint Propagation

In this chapter we discuss another idea that is central to symbolic AI as a whole. The first com-
ponent is that with the factored states representations, we need to use a representation language
for (sets of) states. The second component is that instead of state-level search, we can graduate
to representation-level search (inference), which can be much more efficient that state level search
as the respective representation language actions correspond to groups of state-level actions.

9.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22321.

Illustration: Constraint Propagation


 Example 9.1.1. A constraint network γ:

[Figure: the principal states and territories of Australia and the corresponding
constraint graph over WA, NT, SA, Q, NSW, V, T (Figure 6.1 in [RN09])]

 Question: Can we add a constraint without losing any solutions?

 Example 9.1.2. C WAQ := “=”. If WA and Q are assigned different colors, then
NT must be assigned the 3rd color, leaving no color for SA.

 Intuition: Adding constraints without losing solutions
=̂ obtaining an equivalent network with a “tighter description”
; a smaller number of consistent (partial) variable assignments
; more efficient search!

Michael Kohlhase: Artificial Intelligence 1 277 2025-02-06

Illustration: Decomposition

 Example 9.1.3. Constraint network γ:

[Figure: the Australia constraint graph again (Figure 6.1 in [RN09])]

 We can separate this into two independent constraint networks.

 Tasmania is not adjacent to any other state. Thus we can color Australia first, and
assign an arbitrary color to Tasmania afterwards.

 Decomposition methods exploit the structure of the constraint network. They
identify separate parts (sub-networks) whose inter-dependencies are “simple” and
can be handled efficiently.

 Example 9.1.4 (Extreme case). No inter-dependencies at all, as for Tasmania
above.

Michael Kohlhase: Artificial Intelligence 1 278 2025-02-06

Our Agenda for This Chapter


 Constraint propagation: How does inference work in principle? What are relevant
practical aspects?
 Fundamental concepts underlying inference, basic facts about its use.

 Forward checking: What is the simplest instance of inference?


 Gets us started on this subject.
 Arc consistency: How to make inferences between variables whose value is not fixed
yet?

 Details a state of the art inference method.


 Decomposition: Constraint graphs, and two simple cases
 How to capture dependencies in a constraint network? What are “simple cases”?
 Basic results on this subject.

 Cutset conditioning: What if we’re not in a simple case?


 Outlines the most easily understandable technique for decomposition in the gen-
eral case.

Michael Kohlhase: Artificial Intelligence 1 279 2025-02-06

9.2 Constraint Propagation/Inference


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22326.
9.2. CONSTRAINT PROPAGATION/INFERENCE 191

Constraint Propagation/Inference: Basic Facts


 Definition 9.2.1. Constraint propagation (i.e inference in constraint networks)
consists in deducing additional constraints, that follow from the already known
constraints, i.e. that are satisfied in all solutions.
 Example 9.2.2. It’s what you do all the time when playing SuDoKu:

 Formally: Replace γ by an equivalent and strictly tighter constraint network γ ′ .

Michael Kohlhase: Artificial Intelligence 1 280 2025-02-06

Equivalent Constraint Networks


 Definition 9.2.3. We say that two constraint networks γ := ⟨V , D, C ⟩ and γ ′ :=
⟨V , D′ , C ′ ⟩ sharing the same set of variables are equivalent, (write γ ′ ≡γ), if they
have the same solutions.
 Example 9.2.4.

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸= ̸=

v2 red red v3 v2 red red v3


blue blue blue ̸= blue

Are these constraint networks equivalent? No.

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸= ̸=

v2 red red v3 v2 red red v3


blue blue blue = blue

Are these constraint networks equivalent? Yes.

Michael Kohlhase: Artificial Intelligence 1 281 2025-02-06

Tightness
 Definition 9.2.5 (Tightness). Let γ := ⟨V , D, C ⟩ and γ ′ = ⟨V , D′ , C ′ ⟩ be
constraint networks sharing the same set of variables, then γ ′ is tighter than γ,
(write γ ′ ⊑γ), if:
(i) For all v ∈ V : D′ v ⊆ Dv .
(ii) For all u ̸= v ∈ V and C ′ uv ∈ C ′ : either C ′ uv ̸∈ C or C ′ uv ⊆ C uv .

γ ′ is strictly tighter than γ, (written γ ′ <γ), if at least one of these inclusions is


proper.
 Example 9.2.6.

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸= ̸=

v2 red red v3 v2 red red v3


blue blue blue ̸= blue

Here, we do have γ ′ ⊑γ.

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸= ̸=

v2 red red v3 v2 red red v3


blue blue blue = blue

Here, we do have γ ′ ⊑γ.

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸=

v2 red red v3 v2 red red v3


blue blue blue = blue

Here, we do not have γ ′ ⊑γ!.


 Intuition: Strict tightness =̂ γ ′ has the same constraints as γ, plus some.

Michael Kohlhase: Artificial Intelligence 1 282 2025-02-06

Equivalence + Tightness = Inference


 Theorem 9.2.7. Let γ and γ ′ be constraint networks such that γ ′ ≡γ and γ ′ ⊑γ.
Then γ ′ has the same solutions as, but fewer consistent assignments than, γ.
 ; γ ′ is a better encoding of the underlying problem.

 Example 9.2.8. Two equivalent constraint networks (one obviously unsolvable)

v1 v1

γ red red
γ′
blue blue

̸= ̸= ̸= ̸=

v2 red blue v3 v2 red blue v3


=

ϵ cannot be extended to a solution (neither in γ nor in γ ′ because they’re equivalent);


this is obvious (red ̸= blue) in γ ′ , but not in γ.

Michael Kohlhase: Artificial Intelligence 1 283 2025-02-06

How to Use Constraint Propagation in CSP Solvers?


 Simple: Constraint propagation as a pre-process:
 When: Just once before search starts.
 Effect: Little running time overhead, little pruning power. (not considered
here)
 More Advanced: Constraint propagation during search:
 When: At every recursive call of backtracking.
 Effect: Strong pruning power, may have large running time overhead.
 Search vs. Inference: The more complex the inference, the smaller the number
of search nodes, but the larger the running time needed at each node.
 Idea: Encode variable assignments as unary constraints (i.e., for a(v) = d, set the
unary constraint Dv = {d}), so that inference reasons about the network restricted
to the commitments already made in the search.

Michael Kohlhase: Artificial Intelligence 1 284 2025-02-06


194 CHAPTER 9. CONSTRAINT PROPAGATION

Backtracking With Inference


 Definition 9.2.9. The general algorithm for backtracking with inference is
1 function BacktrackingWithInference(γ,a) returns a solution, or ‘‘inconsistent’’
2 if a is inconsistent then return ‘‘inconsistent’’
3 if a is a total assignment then return a
4 γ ′ := a copy of γ /∗ γ ′ = (V γ ′ , Dγ ′ , C γ ′ ) ∗/
5 γ ′ := Inference(γ ′ )
6 if exists v with Dγ ′ v = ∅ then return ‘‘inconsistent’’
7 select some variable v for which a is not defined
8 for each d ∈ copy of Dγ ′ v in some order do
9 a′ := a ∪ {v = d}; Dγ ′ v := {d} /∗ makes a explicit as a constraint ∗/
10 a′′ := BacktrackingWithInference(γ ′ ,a′ )
11 if a′′ ̸= “inconsistent” then return a′′
12 return ‘‘inconsistent’’

 Exactly the same as ??, only line 5 new!


 Inference(): Any procedure delivering a (tighter) equivalent network.
 Inference() typically prunes domains; indicate unsolvability by Dγ ′ v = ∅.
 When backtracking out of a search branch, retract the inferred constraints: these
were dependent on a, the search commitments so far.

Michael Kohlhase: Artificial Intelligence 1 285 2025-02-06
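
The following Python sketch mirrors the algorithm above for the dictionary representation used
in the earlier sketches; the inference argument is any function that tightens the domains and
returns them. All names here are illustrative assumptions, and the consistent helper is reused
from the backtracking sketch.

import copy

def backtracking_with_inference(V, D, C, inference, assignment=None):
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(V):
        return assignment                                   # total, consistent assignment
    domains = inference(copy.deepcopy(D), C, assignment)    # work on a tightened copy of gamma
    if any(len(domains[v]) == 0 for v in V):
        return None                                         # "inconsistent"
    var = next(v for v in V if v not in assignment)
    for value in list(domains[var]):
        if not consistent(var, value, assignment, C):       # keep the assignment consistent
            continue
        new_domains = copy.deepcopy(domains)
        new_domains[var] = {value}                          # make a explicit as a unary constraint
        result = backtracking_with_inference(
            V, new_domains, C, inference, {**assignment, var: value})
        if result is not None:
            return result
    return None

Passing the identity function (lambda D, C, a: D) as inference recovers naive backtracking;
forward checking and arc consistency (sketched below) are drop-in replacements.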

9.3 Forward Checking


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22326.
Forward Checking

 Definition 9.3.1. Forward checking propagates information about illegal values:
Whenever a variable u is assigned by a, delete all values inconsistent with a(u) from
every Dv for all variables v connected with u by a constraint.

 Example 9.3.2. Forward checking in Australia:

[Figure: successive domain tables for WA, NT, Q, NSW, V, SA, T as assignments are made
and forward checking prunes the remaining domains]

 Definition 9.3.3 (Inference, Version 1). Forward checking implemented

function ForwardChecking(γ,a) returns modified γ
for each v where a(v) = d′ is defined do
for each u where a(u) is undefined and C uv ∈ C do
Du := {d ∈ Du | (d,d′ ) ∈ C uv }
return γ

Michael Kohlhase: Artificial Intelligence 1 286 2025-02-06

Note: It’s a bit strange that we start with d′ here; this is to make the link to arc consistency –
coming up next – as obvious as possible (same notation: u and d vs. v and d′ ).
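
A direct transcription into Python (a sketch over the dictionary representation used before; it
returns the pruned domains so that it can serve as the inference argument of the backtracking
sketch above):

def forward_checking(D, C, assignment):
    # for every assigned v with value d', prune incompatible values from every
    # unassigned u that shares a constraint with v
    for v, d_prime in assignment.items():
        for (u, w) in list(C):
            if w == v and u not in assignment:
                D[u] = {d for d in D[u] if (d, d_prime) in C[(u, v)]}
    return D

With this, backtracking_with_inference(V, D, C, forward_checking) realizes “Inference, Version 1”.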

Forward Checking: Discussion


 Definition 9.3.4. An inference procedure is called sound, iff for any input γ the
output γ ′ has the same solutions as γ.

 Lemma 9.3.5. Forward checking is sound.


Proof sketch: Recall here that the assignment a is represented as unary constraints
inside γ.
 Corollary 9.3.6. γ and γ ′ are equivalent.

 Incremental computation: Instead of the first for-loop in ??, use only the inner one
every time a new assignment a(v) = d′ is added.
 Practical Properties:
 Cheap but useful inference method.
 Rarely a good idea to not use forward checking (or a stronger inference method
subsuming it).
 Up next: A stronger inference method (subsuming forward checking).
196 CHAPTER 9. CONSTRAINT PROPAGATION

 Definition 9.3.7. Let p and q be inference procedures, then p subsumes q, if


p(γ)⊑q(γ) for any input γ.

Michael Kohlhase: Artificial Intelligence 1 287 2025-02-06

9.4 Arc Consistency


Video Nuggets covering this section can be found at https://fau.tv/clip/id/22350 and
https://fau.tv/clip/id/22351.

When Forward Checking is Not Good Enough


 Problem: Forward checking makes inferences only from assigned to unassigned
variables.

 Example 9.4.1.

v1 v1 v1

1 1 1
v1 < v2 v1 < v2 v1 < v2

v2 123 1 2 3 v3 v2 23 1 2 3 v3 v2 23 3 v3
v2 < v3 v2 < v 3 v2 < v 3

We could do better here: value 3 for v2 is not consistent with any remaining value
for v3 ; it can be removed!
But forward checking does not catch this.

Michael Kohlhase: Artificial Intelligence 1 288 2025-02-06

Arc Consistency: Definition


 Definition 9.4.2 (Arc Consistency). Let γ := ⟨V , D, C ⟩ be a constraint network.

1. A variable u ∈ V is arc consistent relative to another variable v ∈ V if either


C uv ̸∈ C, or for every value d ∈ Du there exists a value d′ ∈ Dv such that
(d,d′ ) ∈ C uv .
2. The constraint network γ is arc consistent if every variable u ∈ V is arc consistent
relative to every other variable v ∈ V .

The concept of arc consistency concerns both levels.


 Intuition: Arc consistency = b for every domain value and constraint, at least one
value on the other side of the constraint “works”.
 Note the asymmetry between u and v: arc consistency is directed.

Michael Kohlhase: Artificial Intelligence 1 289 2025-02-06


9.4. ARC CONSISTENCY 197

Arc Consistency: Example


 Definition 9.4.3 (Arc Consistency). Let γ := ⟨V , D, C ⟩ be a constraint network.
1. A variable u ∈ V is arc consistent relative to another variable v ∈ V if either
C uv ̸∈ C, or for every value d ∈ Du there exists a value d′ ∈ Dv such that
(d,d′ ) ∈ C uv .
2. The constraint network γ is arc consistent if every variable u ∈ V is arc consistent
relative to every other variable v ∈ V .
The concept of arc consistency concerns both levels.

 Example 9.4.4 (Arc Consistency).

v1 v1 v1

1 1 1
v1 < v2 v1 < v2 v1 < v2

v2 123 1 2 3 v3 v2 23 1 2 3 v3 v2 23 3 v3
v2 < v3 v2 < v 3 v2 < v 3

 Question: On top, middle, is v 3 arc consistent relative to v 2 ?


 Answer: No. For values 1 and 2, Dv2 does not have a value that works.
 Note: Enforcing arc consistency for one variable may lead to further reductions
on another variable!
 Question: And on the right?
 Answer: Yes. (But v 2 is not arc consistent relative to v 3 )

Michael Kohlhase: Artificial Intelligence 1 290 2025-02-06

Arc Consistency: Example


 Definition 9.4.5 (Arc Consistency). Let γ := ⟨V , D, C ⟩ be a constraint network.

1. A variable u ∈ V is arc consistent relative to another variable v ∈ V if either


C uv ̸∈ C, or for every value d ∈ Du there exists a value d′ ∈ Dv such that
(d,d′ ) ∈ C uv .
2. The constraint network γ is arc consistent if every variable u ∈ V is arc consistent
relative to every other variable v ∈ V .

The concept of arc consistency concerns both levels.


 Example 9.4.6. Forward checking in Australia:

[Figure: domain tables for WA, NT, Q, NSW, V, SA, T under forward checking]

Forward checking makes inferences only “from assigned to unassigned” variables.

 Note: SA is not arc consistent relative to NT in the 3rd row.

Michael Kohlhase: Artificial Intelligence 1 291 2025-02-06

Enforcing Arc Consistency: General Remarks


 Inference, version 2: “Enforcing Arc Consistency” = removing domain values
until γ is arc consistent. (Up next)

 Note: Assume we have such an inference method; call it AC(γ).


 Lemma 9.4.7. AC(γ) is sound: guarantees to deliver an equivalent network.
 Proof sketch: If, for d ∈ Du , there does not exist a value d′ ∈ Dv such that
(d,d′ ) ∈ C uv , then u = d cannot be part of any solution.

 Observation 9.4.8. AC(γ) subsumes forward checking: AC(γ)⊑ForwardChecking(γ).


 Proof: Recall from slide 282 that γ ′ ⊑γ means γ ′ is tighter than γ.
1. Forward checking removes d from Du only if there is a constraint C uv such
that Dv = {d′ } (i.e. when v was assigned the value d′ ), and (d,d′ ) ̸∈ C uv .
2. Clearly, enforcing arc consistency of u relative to v removes d from Du as well.

Michael Kohlhase: Artificial Intelligence 1 292 2025-02-06

Enforcing Arc Consistency for One Pair of Variables


 Definition 9.4.9 (Revise). Revise is an algorithm enforcing arc consistency of u
relative to v
function Revise(γ,u,v) returns modified γ
for each d ∈ Du do
if there is no d′ ∈ Dv with (d,d′ ) ∈ C uv then Du := Du \{d}
return γ

 Lemma 9.4.10. If d is maximal domain size in γ and the test “(d,d′ ) ∈ C uv ?” has
time complexity O(1), then the running time of Revise(γ, u, v) is O(d2 ).

 Example 9.4.11. Revise(γ, v 3 , v 2 )



v1 v1

1 1

v1 < v2 v1 < v 2

v2 23 123 v3 v2 23 123 v3
v2 < v3 v2 < v3

v1 v1

1 1

v1 < v2 v1 < v 2

v2 23 123 v3 v2 23 23 v3
v2 < v3 v2 < v3

v1 v1

1 1

v1 < v2 v1 < v 2

v2 23 23 v3 v2 23 3 v3
v2 < v3 v2 < v3

v1

v1 < v2

v2 23 3 v3
v2 < v3

Michael Kohlhase: Artificial Intelligence 1 293 2025-02-06

AC-1: Enforcing Arc Consistency (Version 1)


 Idea: Apply Revise pairwise up to a fixed point.
 Definition 9.4.12. AC-1 enforces arc consistency in constraint networks:
function AC−1(γ) returns modified γ
repeat
changesMade := False
for each constraint C uv do
Revise(γ,u,v) /∗ if Du reduces, set changesMade := True ∗/
Revise(γ,v,u) /∗ if Dv reduces, set changesMade := True ∗/
until changesMade = False
return γ

 Observation: Obviously, this does indeed enforce arc consistency for γ.



 Lemma 9.4.13. If γ has n variables, m constraints, and maximal domain size d,
then the time complexity of AC−1(γ) is O(md2 · nd).
 Proof sketch: O(md2 ) for each inner loop, fixed point reached at the latest once
all nd variable values have been removed.

 Problem: There are redundant computations.


 Question: Do you see what these redundant computations are?
 Redundant computations: u and v are revised even if their domains haven’t
changed since the last time.

 Better algorithm avoiding this: AC 3 (coming up)

Michael Kohlhase: Artificial Intelligence 1 294 2025-02-06

AC-3: Enforcing Arc Consistency (Version 3)


 Idea: Remember the potentially inconsistent variable pairs.
 Definition 9.4.14. AC-3 optimizes AC-1 for enforcing arc consistency.
function AC−3(γ) returns modified γ
M := ∅
for each constraint C uv ∈ C do
M := M ∪ {(u,v), (v,u)}
while M ̸= ∅ do
remove any element (u,v) from M
Revise(γ, u, v)
if Du has changed in the call to Revise then
for each constraint C wu ∈ C where w ̸= v do
M := M ∪ {(w,u)}
return γ

 Question: Why does AC−3(γ) enforce arc consistency?


 Answer: At any time during the while-loop, if (u,v) ̸∈ M then u is arc consistent
relative to v.

 Question: Why only “where w ̸= v”?


 Answer: If w = v is the reason why Du changed, then w is still arc consistent
relative to u: the values just removed from Du did not match any values from Dw
anyway.

Michael Kohlhase: Artificial Intelligence 1 295 2025-02-06
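
The same algorithm as a Python sketch, reusing the revise helper from above. The worklist
handling follows the pseudocode; the concrete data structures are assumptions for illustration.

def ac3(V, D, C):
    M = set(C)                                   # both orientations (u,v), (v,u) are keys of C
    while M:
        (u, v) = M.pop()
        if revise(D, C, u, v):                   # D[u] changed
            if not D[u]:
                return D                         # empty domain: gamma is unsolvable
            M |= {(w, x) for (w, x) in C if x == u and w != v}
    return D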

AC-3: Example
 Example 9.4.15. y div x = 0: y modulo x is 0, i.e., y is divisible by x

v1

25

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 1 ,v 2 )
(v 3 ,v 1 )
(v 1 ,v 3 )

v1

25

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 1 ,v 2 )
(v 3 ,v 1 )
(v 1 ,v 3 )

v1

25

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 1 ,v 2 )
(v 3 ,v 1 )

v1

25

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 1 ,v 2 )

v1

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )

v1

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 3 ,v 1 )

v1

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 25 v3 (v 2 ,v 1 )
(v 3 ,v 1 )

v1

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 2 v3 (v 2 ,v 1 )

v1

v 2 div v 1 = 0 v 3 div v 1 = 0

M
v2 24 2 v3

Michael Kohlhase: Artificial Intelligence 1 296 2025-02-06

AC-3: Runtime
 Theorem 9.4.16 (Runtime of AC-3). Let γ := ⟨V , D, C ⟩ be a constraint network
with m constraints, and maximal domain size d. Then AC − 3(γ) runs in time
O(md3 ).
 Proof: by counting how often Revise is called.
1. Each call to Revise(γ, u, v) takes time O(d2 ) so it suffices to prove that at
most O(md) of these calls are made.
2. The number of calls to Revise(γ, u, v) is the number of iterations of the while-
loop, which is at most the number of insertions into M .
3. Consider any constraint C uv .
4. Two variable pairs corresponding to C uv are inserted in the for-loop. In the
while loop, if a pair corresponding to C uv is inserted into M , then
5. beforehand the domain of either u or v was reduced, which happens at most
2d times.
6. Thus we have O(d) insertions per constraint, and O(md) insertions overall, as
desired.

Michael Kohlhase: Artificial Intelligence 1 297 2025-02-06

9.5 Decomposition: Constraint Graphs, and Three Simple


Cases
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22353.

Reminder: The Big Picture


 Say γ is a constraint network with n variables and maximal domain size d.
 d^n total assignments must be tested in the worst case to solve γ.

 Inference: One method to try to avoid/ameliorate this combinatorial explosion in


practice.
 Often, from an assignment to some variables, we can easily make inferences
regarding other variables.
 Decomposition: Another method to avoid/ameliorate this combinatorial explosion
in practice.
 Often, we can exploit the structure of a network to decompose it into smaller
parts that are easier to solve.
 Question: What is “structure”, and how to “decompose”?

Michael Kohlhase: Artificial Intelligence 1 298 2025-02-06

Problem Structure

 Idea: Tasmania and mainland are “independent subproblems”

[Figure: the Australia map and its constraint graph (Figure 6.1 in [RN09]); T is not
connected to the mainland variables]

 Definition 9.5.1. Independent subproblems are identified as connected components
of constraint graphs.

 Suppose each independent subproblem has c variables out of n total. (d is max
domain size)

 Worst-case solution cost is n div c · d^c (linear in n)

 E.g., n = 80, d = 2, c = 20
 2^80 =̂ 4 billion years at 10 million nodes/sec
 4 · 2^20 =̂ 0.4 seconds at 10 million nodes/sec

Michael Kohlhase: Artificial Intelligence 1 299 2025-02-06

“Decomposition” 1.0: Disconnected Constraint Graphs

 Theorem 9.5.2 (Disconnected Constraint Graphs). Let γ := ⟨V , D, C ⟩ be a
constraint network. Let ai be a solution to each connected component γ i of the
constraint graph of γ. Then a := ∪i ai is a solution to γ.

 Proof:
1. a satisfies all C uv where u and v are inside the same connected component.
2. The latter is the case for all C uv .
3. If two parts of γ are not connected, then they are independent.

 Example 9.5.3. Color Tasmania separately in Australia.

 Example 9.5.4 (Doing the Numbers).
 γ with n = 40 variables, each domain size k = 2. Four separate connected
components each of size 10.
 Reduction of worst-case when using decomposition:
 No decomposition: 2^40. With: 4 · 2^10. Gain: 2^28 ≊ 280.000.000.

 Definition 9.5.5. The process of decomposing a constraint network into components
is called decomposition. There are various decomposition algorithms.

Michael Kohlhase: Artificial Intelligence 1 300 2025-02-06
206 CHAPTER 9. CONSTRAINT PROPAGATION

Tree-structured CSPs

 Definition 9.5.6. We call a CSP tree-structured, iff its constraint graph is acyclic
 Theorem 9.5.7. Tree-structured CSP can be solved in O(nd2 ) time.

 Compare to general CSPs, where worst case time is O(d^n ).


 This property also applies to logical and probabilistic reasoning: an important ex-
ample of the relation between syntactic restrictions and the complexity of reasoning.

Michael Kohlhase: Artificial Intelligence 1 301 2025-02-06

Algorithm for tree-structured CSPs


1. Choose a variable as root, order variables from root to leaves such that every node’s
parent precedes it in the ordering

2. For j from n down to 2, apply RemoveInconsistent(Parent(Xj ), Xj )

3. For j from 1 to n, assign Xj consistently with Parent(Xj )

Michael Kohlhase: Artificial Intelligence 1 302 2025-02-06

Nearly tree-structured CSPs


 Definition 9.5.8. Conditioning: instantiate a variable, prune its neighbors’ do-
mains.

 Example 9.5.9.

 Definition 9.5.10. Cutset conditioning: instantiate (in all ways) a set of variables
such that the remaining constraint graph is a tree.

 Cutset size c ; running time O(d^c (n − c)d2 ), very fast for small c.

Michael Kohlhase: Artificial Intelligence 1 303 2025-02-06

“Decomposition” 2.0: Acyclic Constraint Graphs


 Theorem 9.5.11 (Acyclic Constraint Graphs). Let γ := ⟨V , D, C ⟩ be a con-
straint network with n variables and maximal domain size k, whose constraint graph
is acyclic. Then we can find a solution for γ, or prove γ to be unsatisfiable, in time
O(nk 2 ).
 Proof sketch: See the algorithm on the next slide

 Constraint networks with acyclic constraint graphs can be solved in (low order)
polynomial time.
 Example 9.5.12. Australia is not acyclic. (But see next section)

[Figure: the Australia constraint graph (Figure 6.1 in [RN09])]

 Example 9.5.13 (Doing the Numbers).
 γ with n = 40 variables, each domain size k = 2. Acyclic constraint graph.
 Reduction of worst-case when using decomposition:
 No decomposition: 2^40.
 With decomposition: 40 · 2^2. Gain: 2^32.

Michael Kohlhase: Artificial Intelligence 1 304 2025-02-06
208 CHAPTER 9. CONSTRAINT PROPAGATION

Acyclic Constraint Graphs: How To


 Definition 9.5.14. Algorithm AcyclicCG(γ):
1. Obtain a (directed) tree from γ’s constraint graph, picking an arbitrary variable
v as the root, and directing edges outwards.a
2. Order the variables topologically, i.e., such that each node is ordered before its
children; denote that order by v 1 , . . ., v n .
3. for i := n, n − 1, . . . , 2 do:
(a) Revise(γ, v parent(i) , v i ).
(b) if Dvparent(i) = ∅ then return “inconsistent”
Now, every variable is arc consistent relative to its children.
4. Run BacktrackingWithInference with forward checking, using the variable order
v 1 , . . ., v n .
 Lemma 9.5.15. This algorithm will find a solution without ever having to back-
track!

Michael Kohlhase: Artificial Intelligence 1 305 2025-02-06

a We assume here that γ’s constraint graph is connected. If it is not, do this and the following

for each component separately.
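
A Python sketch of AcyclicCG for a connected acyclic constraint graph, reusing the revise helper
from the arc consistency sketch (the BFS tree construction and the greedy assignment are
illustrative assumptions; for a disconnected graph one would run it per component, as the
footnote says):

from collections import deque

def acyclic_cg(V, D, C, root=None):
    root = root if root is not None else V[0]
    parent, order, seen = {}, [root], {root}
    queue = deque([root])
    while queue:                                   # steps 1+2: rooted tree, topological order
        u = queue.popleft()
        for (a, b) in C:
            if a == u and b not in seen:
                parent[b] = u
                seen.add(b); order.append(b); queue.append(b)
    for v in reversed(order[1:]):                  # step 3: make parents arc consistent rel. children
        revise(D, C, parent[v], v)
        if not D[parent[v]]:
            return None                            # "inconsistent"
    assignment = {}
    for v in order:                                # step 4: assign root-to-leaves, no backtracking
        ok = [d for d in D[v]
              if v == root or (assignment[parent[v]], d) in C[(parent[v], v)]]
        if not ok:
            return None
        assignment[v] = ok[0]
    return assignment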

AcyclicCG(γ): Example
 Example 9.5.16 (AcyclicCG() execution).

v1

123

v1 < v2

v2 123 123 v3
v2 < v3

v1

123

v1 < v2

v2 123 123 v3
v2 < v3

Input network γ.
Step 1: Directed tree for root v 1 .

v1

123

v1 < v2

v2 12 123 v3
v2 < v3

Step 2: Order v 1 , v 2 , v 3 .
v1

v1 < v2

v2 12 123 v3
v2 < v3

Step 3: After Revise(γ, v 2 , v 3 ).


v1

v1 < v2

v2 2 123 v3
v2 < v3

Step 3: After Revise(γ, v 1 , v 2 ).


Step 4: After a(v 1 ) := 1 and forward checking.
v1

v1 < v2

v2 2 3 v3
v2 < v3

Step 4: After a(v 2 ) := 2 and forward checking.


v1

v1 < v2

v2 2 3 v3
v2 < v3

Step 4: After a(v 3 ) := 3 (and forward checking).

Michael Kohlhase: Artificial Intelligence 1 306 2025-02-06

9.6 Cutset Conditioning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22354.
210 CHAPTER 9. CONSTRAINT PROPAGATION

“Almost” Acyclic Constraint Graphs


 Example 9.6.1 (Coloring Australia).

 Cutset Conditioning: Idea:


1. Recursive call of backtracking search on a s.t. the subgraph of the constraint
graph induced by {v ∈ V | a(v) is undefined} is acyclic.
 Then we can solve the remaining sub-problem with AcyclicCG().
2. Choose the variable ordering so that removing the first d variables renders the
constraint graph acyclic.
 Then with (1) we won’t have to search deeper than d . . . !

Michael Kohlhase: Artificial Intelligence 1 307 2025-02-06

“Decomposition” 3.0: Cutset Conditioning


 Definition 9.6.2 (Cutset). Let γ := ⟨V , D, C ⟩ be a constraint network, and
V0 ⊆ V . Then V0 is a cutset for γ if the subgraph of γ’s constraint graph induced
by V \V0 is acyclic. V0 is called optimal if its size is minimal among all cutsets for
γ.
 Definition 9.6.3. The cutset conditioning algorithm computes a solution for γ
from a given cutset V0 for γ:
function CutsetConditioning(γ,V0 ,a) returns a solution, or ‘‘inconsistent’’
γ ′ := a copy of γ; γ ′ := ForwardChecking(γ ′ ,a)
if ex. v with Dγ ′ v = ∅ then return ‘‘inconsistent’’
if ex. v ∈ V0 s.t. a(v) is undefined then select such v
else a′ := AcyclicCG(γ ′ );
if a′ ̸= “inconsistent” then return a ∪ a′ else return ‘‘inconsistent’’
for each d ∈ copy of Dγ ′ v in some order do
a′ := a ∪ {v = d}; Dγ ′ v := {d};
a′′ := CutsetConditioning(γ ′ ,V0 ,a′ )
if a′′ ̸= “inconsistent” then return a′′ else return ‘‘inconsistent’’

 Forward checking is required so that “a ∪ AcyclicCG(γ ′ )” is consistent in γ.


 Observation 9.6.4. Running time is exponential only in #(V0 ), not in #(V )!

 Remark 9.6.5. Finding optimal cutsets is NP hard, but good approximations exist.

Michael Kohlhase: Artificial Intelligence 1 308 2025-02-06
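
As a sketch in Python, reusing forward_checking, acyclic_cg, and copy from the earlier sketches:
branch only over the cutset variables, then hand the remaining network to acyclic_cg. Restricting
the constraints to V \ V0 and assuming the remaining constraint graph is connected are
simplifications made here for illustration.

def cutset_conditioning(V, D, C, V0, assignment=None):
    assignment = {} if assignment is None else assignment
    domains = forward_checking(copy.deepcopy(D), C, assignment)
    if any(not domains[v] for v in V):
        return None                                        # "inconsistent"
    todo = [v for v in V0 if v not in assignment]
    if not todo:                                           # cutset fully assigned: rest is acyclic
        rest = [v for v in V if v not in V0]
        C_rest = {(u, w): r for (u, w), r in C.items() if u in rest and w in rest}
        sub = acyclic_cg(rest, domains, C_rest)
        return None if sub is None else {**assignment, **sub}
    var = todo[0]
    for value in list(domains[var]):                       # branch only over cutset variables
        result = cutset_conditioning(V, domains, C, V0, {**assignment, var: value})
        if result is not None:
            return result
    return None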


9.7. CONSTRAINT PROPAGATION WITH LOCAL SEARCH 211

9.7 Constraint Propagation with Local Search


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22355.

Iterative algorithms for CSPs


 Local search algorithms like hill climbing and simulated annealing typically work
with “complete” states, i.e., all variables are assigned
 To apply to CSPs: allow states with unsatisfied constraints, actions reassign variable
values.

 Variable selection: Randomly select any conflicted variable.


 Value selection by min conflicts heuristic: choose value that violates the fewest
constraints i.e., hill climb with h(n):=total number of violated constraints.

Michael Kohlhase: Artificial Intelligence 1 309 2025-02-06
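
A sketch of min conflicts for the dictionary representation used throughout this chapter (the
random restart, tie breaking, and step bound are illustrative choices, not prescribed by the
slides):

import random

def conflicts(var, value, assignment, C):
    # number of constraints on var violated by giving it this value
    return sum(1 for (u, w) in C
               if u == var and (value, assignment[w]) not in C[(u, w)])

def min_conflicts(V, D, C, max_steps=100000):
    assignment = {v: random.choice(list(D[v])) for v in V}     # random total assignment
    for _ in range(max_steps):
        conflicted = [v for v in V if conflicts(v, assignment[v], assignment, C) > 0]
        if not conflicted:
            return assignment                                  # all constraints satisfied
        var = random.choice(conflicted)                        # random conflicted variable
        assignment[var] = min(D[var],
                              key=lambda d: conflicts(var, d, assignment, C))
    return None                                                # give up after max_steps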

Example: 4-Queens
 States: 4 queens in 4 columns (44 = 256 states)
 Actions: Move queen in column
 Goal state: No conflicts
 Heuristic: h(n) =̂ number of conflicts

Michael Kohlhase: Artificial Intelligence 1 310 2025-02-06

Performance of min-conflicts
 Given a random initial state, can solve n-queens in almost constant time for
arbitrary n with high probability (e.g., n = 10,000,000)
 The same appears to be true for any randomly-generated CSP except in a narrow
range of the ratio R = (number of constraints)/(number of variables)

Michael Kohlhase: Artificial Intelligence 1 311 2025-02-06

9.8 Conclusion & Summary


A Video Nugget covering this section can be found at https://fau.tv/clip/id/22356.

Conclusion & Summary


 γ and γ ′ are equivalent if they have the same solutions. γ ′ is tighter than γ if it is
more constrained.
 Inference tightens γ without losing equivalence, during backtracking search. This
reduces the amount of search needed; that benefit must be traded off against the
running time overhead for making the inferences.

 Forward checking removes values conflicting with an assignment already made.

 Arc consistency removes values that do not comply with any value still available at
the other end of a constraint. This subsumes forward checking.
 The constraint graph captures the dependencies between variables. Separate con-
nected components can be solved independently. Networks with acyclic constraint
graphs can be solved in low order polynomial time.

 A cutset is a subset of variables removing which renders the constraint graph acyclic.
Cutset conditioning backtracks only on such a cutset, and solves a sub-problem with
acyclic constraint graph at each search leaf.

Michael Kohlhase: Artificial Intelligence 1 312 2025-02-06

Topics We Didn’t Cover Here


 Path consistency, k-consistency: Generalizes arc consistency to size k subsets
of variables. Path consistency =̂ 3-consistency.
 Tree decomposition: Instead of instantiating variables until the leaf nodes are
trees, distribute the variables and constraints over sub-CSPs whose connections form
a tree.
 Backjumping: Like backtracking search, but with ability to back up across several
levels (to a previous variable assignment identified to be responsible for failure).


 No-Good Learning: Inferring additional constraints based on information gath-
ered during backtracking search.

 Local search: In space of total (but not necessarily consistent) assignments.


(E.g., 8 queens in ??)
 Tractable CSP: Classes of CSPs that can be solved in P.
 Global Constraints: Constraints over many/all variables, with associated special-
ized inference methods.

 Constraint Optimization Problems (COP): Utility function over solutions, need


an optimal one.

Michael Kohlhase: Artificial Intelligence 1 313 2025-02-06

Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems in [RN09], in particular Sections 6.2, 6.3.2, and
6.5.
– Compared to our treatment of the topic “constraint satisfaction problems” (?? and ??),
RN covers much more material, but less formally and in much less detail (in particular, our
slides contain many additional in-depth examples). Nice background/additional reading, can’t
replace the lectures.
– Section 6.3.2: Somewhat comparable to our “inference” (except that equivalence and tightness
are not made explicit in RN) together with “forward checking”.
– Section 6.2: Similar to our “arc consistency”, less/different examples, much less detail, addi-
tional discussion of path consistency and global constraints.
– Section 6.5: Similar to our “decomposition” and “cutset conditioning”, less/different examples,
much less detail, additional discussion of tree decomposition.
Part III

Knowledge and Inference


A Video Nugget covering this part can be found at https://fau.tv/clip/id/22466.


This part of the course introduces representation languages and inference methods for structured
state representations for agents: In contrast to the atomic and factored state representations from
??, we look at state representations where the relations between objects are not determined by
the problem statement, but can be determined by inference-based methods, where the knowledge
about the environment is represented in a formal language and new knowledge is derived by
transforming expressions of this language.
We look at propositional logic – a rather weak representation language – and first-order logic
– a much stronger one – and study the respective inference procedures. In the end we show that
computation in Prolog is just an inference process as well.
Chapter 10

Propositional Logic & Reasoning,


Part I: Principles

10.1 Introduction: Inference with Structured State Repre-


sentations
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22455.

State Representations in Agents and Algorithms


 Recall: We call a state representation

 atomic, iff it has no internal structure (black box)


 factored, iff each state is characterized by attribute and their values.
 structured, iff the state includes representations of objects, their properties and
relationships.

 Recall: We have used atomic representations in search problems and tree search
algorithms.
 But: We already allowed peeking into state in
 informed search to compute heuristics
 adversarial search ⇝ too many states!
 Recall: We have used factored representations in
 backtracking search for CSPs ; universally useful heuristics
 constraint propagation: inference ; lifting search to the CSP description level.

 Up Next: Inference for structured state representations.

Michael Kohlhase: Artificial Intelligence 1 314 2025-02-06

10.1.1 A Running Example: The Wumpus World


To clarify the concepts and methods for inference with structured state representations, we now
introduce an extended example (the Wumpus world) and the agent model (logic-based agents)
that uses them. We will refer back to both from time to time below.


The Wumpus world is a very simple game modeled after the early text adventure games of the
1960s and 70s, where the player entered a world and was provided with textual information about
percepts and could explore the world via actions. The main difference is that we use it as an agent
environment in this course.

The Wumpus World


Definition 10.1.1. The Wumpus world is a
simple game where an agent explores a cave
with 16 cells that can contain pits, gold, and
the Wumpus with the goal of getting back
out alive with the gold.
The agent cannot observe the locations of
pits, gold, and the Wumpus, but it can perceive
some of their effects in the cell it currently visits.

 Definition 10.1.2 (Actions). The agent can perform the following actions: goForward,
turnRight (by 90◦ ), turnLeft (by 90◦ ), shoot an arrow in the direction you’re facing (you
have exactly one arrow), grab an object in the current cell, leave the cave if you’re in cell
[1, 1].
 Definition 10.1.3 (Initial and Terminal States). Initially, the agent is in cell
[1, 1] facing east. If the agent falls down a pit or meets the live Wumpus, it dies.

 Definition 10.1.4 (Percepts). The agent can experience the following percepts:
stench, breeze, glitter, bump, scream, none.
 Cell adjacent (i.e. north, south, west, east) to Wumpus: stench (else: none).
 Cell adjacent to pit: breeze (else: none).
 Cell that contains gold: glitter (else: none).
 You walk into a wall: bump (else: none).
 Wumpus shot by arrow: scream (else: none).

Michael Kohlhase: Artificial Intelligence 1 315 2025-02-06

The game is complex enough to warrant structured state representations and can easily be extended
to include uncertainty and non-determinism later.
As our focus is on inference processes here, let us see how a human player would reason when
entering the Wumpus world. This can serve as a model for designing our artificial agents.

Reasoning in the Wumpus World


 Example 10.1.5 (Reasoning in the Wumpus World).
As humans we mark cells with the knowledge inferred so far: A: agent, V: visited,
OK: safe, P: pit, W: Wumpus, B: breeze, S: stench, G: gold.

[Figure: the cave grid with the agent's annotations (1) in the initial state, (2) after one step to the right, (3) after going back and up to [1,2]]

 The Wumpus is in [1,3]! How do we know?


 No stench in [2,1], so the stench in [1,2] can only come from [1,3].
 There’s a pit in [3,1]! How do we know?
 No breeze in [1,2], so the breeze in [2,1] can only come from [3,1].
 Note: The agent has more knowledge than just the percepts ⇝ inference!

Michael Kohlhase: Artificial Intelligence 1 316 2025-02-06
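The Wumpus inference above can be made concrete by brute-force model enumeration: go through all candidate Wumpus positions and keep those that are consistent with the stench percepts gathered so far. The following is a minimal Python sketch (all identifiers are our own choices for illustration):

# A minimal sketch: inferring the Wumpus position by enumerating all candidate
# positions that are consistent with the percepts gathered so far.

def adjacent(c1, c2):
    """Cells are adjacent iff they differ by 1 in exactly one coordinate."""
    (x1, y1), (x2, y2) = c1, c2
    return abs(x1 - x2) + abs(y1 - y2) == 1

cells = [(x, y) for x in range(1, 5) for y in range(1, 5)]

# Percepts after visiting [1,1], [2,1], and [1,2]:
stench = {(1, 1): False, (2, 1): False, (1, 2): True}
visited = [(1, 1), (2, 1), (1, 2)]   # the agent survived these cells

def consistent(wumpus):
    """A candidate Wumpus position is consistent iff it explains all stench
    percepts and the agent did not walk into it."""
    if wumpus in visited:
        return False
    return all(stench[c] == adjacent(c, wumpus) for c in stench)

print([w for w in cells if consistent(w)])   # -> [(1, 3)]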

Let us now look into what kind of agent we would need to be successful in the Wumpus world:
it seems reasonable that we should build on a model-based agent and specialize it to structured
state representations and inference.

Agents that Think Rationally


 Problem: But how can we build an agent that can do the necessary inferences?
 Idea: Think Before You Act!
“Thinking” = Inference about knowledge represented using logic.

 Definition 10.1.6. A logic-based agent is a model-based agent that represents the


world state as a logical formula and uses inference to think about the state of the
environment and its own actions. Agent schema:

[Figure: agent schema of a model-based agent (sensors, state, world model, condition-action
rules, actuators); cf. Figure 2.11 in [RN09]]

The formal language of the logical system acts as a world description language.
Agent function:

function KB−AGENT (percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))
  action := ASK(KB, MAKE−ACTION−QUERY(t))
  TELL(KB, MAKE−ACTION−SENTENCE(action,t))
  t := t+1
  return action

Its agent function maintains a knowledge base about the environment, which is
updated with percept descriptions (formalizations of the percepts) and action de-
scriptions. The next action is the result of a suitable inference-based query to the
knowledge base.

Michael Kohlhase: Artificial Intelligence 1 317 2025-02-06
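The agent function above can be transcribed almost literally into code. The following is a minimal Python sketch, assuming stub implementations of the knowledge base and its TELL/ASK interface (class and function names are our own, not a fixed API):

# A minimal sketch of a logic-based (knowledge-based) agent, assuming stub
# implementations of the knowledge base and its TELL/ASK interface.

class KnowledgeBase:
    def __init__(self):
        self.sentences = []
    def tell(self, sentence):          # TELL: add a formula to the KB
        self.sentences.append(sentence)
    def ask(self, query):              # ASK: an inference-based query (stub)
        raise NotImplementedError("plug in an inference procedure here")

def make_percept_sentence(percept, t):
    return ("percept", percept, t)     # formalize the percept at time t

def make_action_query(t):
    return ("what-action?", t)

def make_action_sentence(action, t):
    return ("did", action, t)          # record the chosen action

def kb_agent_step(kb, percept, t):
    """One step of the KB-AGENT function from the slide."""
    kb.tell(make_percept_sentence(percept, t))
    action = kb.ask(make_action_query(t))
    kb.tell(make_action_sentence(action, t))
    return action, t + 1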

10.1.2 Propositional Logic: Preview


We will now give a preview of the concepts and methods in propositional logic based on the
Wumpus world before we formally define them below. The focus here is on the use of PL0 as a
world description language and understanding how inference might work.
We will start off with our preview by looking into the use of PL0 as a world description language
for the Wumpus world. For that we need to fix the language itself (its syntax) and the meaning
of expressions in PL0 (its semantics).

Logic: Basic Concepts (Representing Knowledge)


 We now preview some of the concepts involved in logic so that you have an intuition
for the formal definitions below.
 Definition 10.1.7. Syntax: What are legal formulae A in the logic?
 Example 10.1.8. “W ” and “W ⇒ S”.
(W =b Wumpus is here, S =b it stinks, W ⇒ S =b If W , then S)
 Definition 10.1.9. Semantics: Which formulae A are true?
 Observation: Whether W ⇒ S is true depends on whether W and S are!
 Idea: Capture the state of W and S. . . in a variable assignment.

 Definition 10.1.10. For a variable assignment φ, write φ|=A if A is true in the
Wumpus world described by φ.
 Example 10.1.11. If φ := {W 7→ T, S 7→ F}, then φ|=W but φ̸|=(W ⇒ S).

 Intuition: Knowledge about the state of the world is described by formulae,


interpretations evaluate them in the current world (they should turn out true!)
 Definition 10.1.12. The process of representing a natural language text in the
formal language of a logical system is called formalization.
 Observation: Formalizing a NL text or utterance makes it machine-actionable.
(the ultimate purpose of AI)
 Observation: Formalization is an art/skill, not a science!

Michael Kohlhase: Artificial Intelligence 1 318 2025-02-06

It is critical to understand that while PL0 as a logical system is given once and for all, the agent
designer still has to formalize the situation (here the Wumpus world) in the world description
language (here PL0 ; but we will look at more expressive logical systems below). This formalization
is the seed of the knowledge base, which the logic-based agent can then add to via its percept and
action descriptions, and which also forms the basis of its inferences. We will look at this aspect now.

Logic: Basic Concepts (Reasoning about Knowledge)


 Definition 10.1.13. Entailment: Which B follow from A, written A ⊨ B, meaning
that, for all φ with φ|=A, we have φ|=B? E.g., P ∧ (P ⇒ Q) ⊨ Q.
 Intuition: Entailment =b the ideal outcome of reasoning, everything that we can
possibly conclude, e.g. determine the Wumpus position as soon as we have enough
information.
 Definition 10.1.14. Deduction: Which formulas B can be derived from A using
a set C of inference rules (a calculus), written A⊢C B?
 Example 10.1.15. If C contains the rule

      A    A ⇒ B
     ----------------
           B

  then P , P ⇒ Q⊢C Q
 Intuition: Deduction =b process in an actual computer trying to reason about
entailment. E.g. a mechanical process attempting to determine Wumpus position.

 Critical Insight: Entailment is purely semantical and gives a mathematical founda-


tion of reasoning in PL0 , while Deduction is purely syntactic and can be implemented
well. (but this only helps if they are related)
 Definition 10.1.16. Soundness: whenever A⊢C B, we also have A ⊨ B.

 Definition 10.1.17. Completeness: whenever A ⊨ B, we also have A⊢C B.

Michael Kohlhase: Artificial Intelligence 1 319 2025-02-06
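Both notions can already be made concrete at this point: for propositional formulae, entailment can in principle be decided by checking all variable assignments. A minimal Python sketch (helper names are our own) verifying P ∧ (P ⇒ Q) ⊨ Q:

# A tiny sketch: checking P ∧ (P ⇒ Q) ⊨ Q by going through all assignments.
from itertools import product

def entails(premise, conclusion, variables):
    """premise/conclusion are Python functions from an assignment (dict) to bool."""
    for values in product([True, False], repeat=len(variables)):
        phi = dict(zip(variables, values))
        if premise(phi) and not conclusion(phi):
            return False
    return True

premise    = lambda phi: phi["P"] and ((not phi["P"]) or phi["Q"])  # P ∧ (P ⇒ Q)
conclusion = lambda phi: phi["Q"]
print(entails(premise, conclusion, ["P", "Q"]))   # -> True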

General Problem Solving using Logic


 Idea: Any problem that can be formulated as reasoning in logic can be solved by an
off-the-shelf reasoning tool.
 Very successful using propositional logic and modern SAT solvers! (Propositional
satisfiability testing; ??)

Michael Kohlhase: Artificial Intelligence 1 320 2025-02-06

Propositional Logic and Its Applications


 Propositional logic = canonical form of knowledge + reasoning.

 Syntax: Atomic propositions that can be either true or false, connected by “and,
or, and not”.
 Semantics: Assign value to every proposition, evaluate connectives.
 Applications: Despite its simplicity, widely applied!

 Product configuration (e.g., Mercedes). Check consistency of customized


combinations of components.
 Hardware verification (e.g., Intel, AMD, IBM, Infineon). Check whether a
circuit has a desired property p.
 Software verification: Similar.
 CSP applications: Propositional logic can be (successfully!) used to formulate
and solve constraint satisfaction problems. (see ??)
 ?? gives an example for verification.

Michael Kohlhase: Artificial Intelligence 1 321 2025-02-06

10.1.3 Propositional Logic: Agenda

Our Agenda for This Topic


 This subsection: Basic definitions and concepts; tableaux, resolution.
 Sets up the framework. Resolution is the quintessential reasoning procedure
underlying most successful SAT solvers.
 Next Section (??): The Davis Putnam procedure and clause learning; practical
problem structure.
 State-of-the-art algorithms for reasoning about propositional logic, and an im-
portant observation about how they behave.

Michael Kohlhase: Artificial Intelligence 1 322 2025-02-06

Our Agenda for This Chapter


 Propositional logic: What’s the syntax and semantics? How can we capture de-
duction?

 We study this logic formally.


 Tableaux, Resolution: How can we make deduction mechanizable? What are its
properties?
 Formally introduces the most basic machine-oriented reasoning algorithm.

 Killing a Wumpus: How can we use all this to figure out where the Wumpus is?
 Coming back to our introductory example.

Michael Kohlhase: Artificial Intelligence 1 323 2025-02-06

10.2 Propositional Logic (Syntax/Semantics)


Video Nuggets covering this section can be found at https://fau.tv/clip/id/22457 and
https://fau.tv/clip/id/22458.

We will now develop the formal theory behind the ideas previewed in the last section and use
that as a prototype for the theory of the more expressive logical systems still to come in AI-1. As
PL0 is a very simple logical system, we could cut some corners in the exposition but we will stick
closer to a generalizable theory.

Propositional Logic (Syntax)

 Definition 10.2.1 (Syntax). The formulae of propositional logic (write PL0 ) are
made up from
 propositional variables: V0 := {P , Q, R, P 1 , P 2 , . . .} (countably infinite)
 A propositional signature: constants/constructors called connectives: Σ0 :=
{T , F , ¬, ∨, ∧, ⇒, ⇔, . . .}

We define the set wff0 (V0 ) of well-formed propositional formulae (wffs) as


 propositional variables,
 the logical constants T and F ,
 negations ¬A,
 conjunctions A ∧ B (A and B are called conjuncts),
 disjunctions A ∨ B (A and B are called disjuncts),
 implications A ⇒ B, and
 equivalences (or biimplications) A ⇔ B,

where A, B ∈ wff0 (V0 ) themselves.


 Example 10.2.2. P ∧ Q, P ∨ Q, ¬P ∨ Q ⇔ P ⇒ Q ∈ wff0 (V0 )
 Definition 10.2.3. Propositional formulae without connectives are called atomic
(or atoms) and complex otherwise.

Michael Kohlhase: Artificial Intelligence 1 324 2025-02-06

We can also express the formal language introduced by ?? as a context-free grammar.

Propositional Logic Grammar Overview


 Grammar for Propositional Logic:

propositional variables X ::= V0 = {P , Q, R, . . . , . . .} variables


propositional formulae A ::= X variable
| T |F truth values
| ¬A negation
| A1 ∧ A2 conjunction
| A1 ∨ A2 disjunction
| A1 ⇒ A2 implication
| A1 ⇔ A2 equivalence

Michael Kohlhase: Artificial Intelligence 1 325 2025-02-06

Propositional logic is a very old and widely used logical system. So it should not be surprising
that there are other notations for the connectives than the ones we are using in AI-1. We list the

most important ones here for completeness.

Alternative Notations for Connectives


Here Elsewhere
¬A ∼A A
A∧B A&B A•B A, B
A∨B A+B A|B A;B
A⇒B A→B A⊃B
A⇔B A↔B A≡B
F ⊥ 0
T ⊤ 1

Michael Kohlhase: Artificial Intelligence 1 326 2025-02-06

These notations will not be used in AI-1, but sometimes appear in the literature.
The semantics of PL0 is defined relative to a model, which consists of a universe of discourse and
an interpretation function that we specify now.

Semantics of PL0 (Models)

 Warning: For the official semantics of PL0 we will separate the tasks of giving
meaning to connectives and propositional variables to different mappings.
 This will generalize better to other logical systems. (and thus applications)
 Definition 10.2.4. A model M := ⟨Do , I⟩ for propositional logic consists of

 the universe Do = {T, F}


 the interpretation I that assigns values to essential connectives.
 I(¬) : Do → Do ; T 7→ F, F 7→ T
 I(∧) : Do × Do → Do ; ⟨α, β⟩ 7→ T, iff α = β = T

We call a constant a logical constant, iff its value is fixed by the interpretation.
 Treat the other connectives as abbreviations, e.g. A ∨ B= b ¬(¬A ∧ ¬B) and
A ⇒ B= b ¬A ∨ B, and T =b P ∨ ¬P (only need to treat ¬, ∧ directly)

 Note: PL0 is a single-model logical system with canonical model ⟨Do , I⟩.

Michael Kohlhase: Artificial Intelligence 1 327 2025-02-06

We have a problem in the exposition of the theory here: As PL0 semantics only has a single,
canonical model, we could simplify the exposition by just not mentioning the universe and inter-
pretation function. But we choose to expose both of them in the construction, since other versions
of propositional logic – in particular the system PLnq below – have a choice of models, as they
use a different distribution of the representation among constants and variables.

Semantics of PL0 (Evaluation)



 Problem: The interpretation function I only assigns meaning to connectives.


 Definition 10.2.5. A variable assignment φ : V0 → Do assigns values to proposi-
tional variables.
 Definition 10.2.6. The value function I φ : wff0 (V0 ) → Do assigns values to PL0
formulae. It is recursively defined,
 I φ (P ) = φ(P ) (base case)
 I φ (¬A) = I(¬)(I φ (A)).
 I φ (A ∧ B) = I(∧)(I φ (A), I φ (B)).

 Note: I φ (A ∨ B) = I φ (¬(¬A ∧ ¬B)) is only determined by I φ (A) and I φ (B),


so we think of the defined connectives as logical constants as well.
 Alternative Notation: Write [[A]]φ for I φ (A). (and [[A]], if A is ground)
 Definition 10.2.7. Two formulae A and B are called equivalent, iff I φ (A) =
I φ (B) for all variable assignments φ.

Michael Kohlhase: Artificial Intelligence 1 328 2025-02-06

In particular, an interpretation-less exposition of propositional logic would have elided the homo-
morphic construction of the value function and could have simplified the recursive cases in ?? to
I φ (A ∧ B) = T, iff I φ (A) = T = I φ (B).
But the homomorphic construction via I(∧) is standard for definitions in other logical systems
and thus generalizes better.

Computing Semantics
 Example 10.2.8. Let φ := [T/P 1 ], [F/P 2 ], [T/P 3 ], [F/P 4 ], . . . then

I φ (P 1 ∨ P 2 ∨ ¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 )
= I(∨)(I φ (P 1 ∨ P 2 ), I φ (¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 ))
= I(∨)(I(∨)(I φ (P 1 ), I φ (P 2 )), I(∨)(I φ (¬(¬P 1 ∧ P 2 )), I φ (P 3 ∧ P 4 )))
= I(∨)(I(∨)(φ(P 1 ), φ(P 2 )), I(∨)(I(¬)(I φ (¬P 1 ∧ P 2 )), I(∧)(I φ (P 3 ), I φ (P 4 ))))
= I(∨)(I(∨)(T, F), I(∨)(I(¬)(I(∧)(I φ (¬P 1 ), I φ (P 2 ))), I(∧)(φ(P 3 ), φ(P 4 ))))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(I φ (P 1 )), φ(P 2 ))), I(∧)(T, F)))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(φ(P 1 )), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(T), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(F, F)), F))
= I(∨)(T, I(∨)(I(¬)(F), F))
= I(∨)(T, I(∨)(T, F))
= I(∨)(T, T)
= T

 What a mess!

Michael Kohlhase: Artificial Intelligence 1 329 2025-02-06
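This tedious, purely mechanical calculation is exactly what a computer is good at. The following minimal Python sketch (the tuple encoding of formulae is our own choice) implements the value function Iφ recursively and reproduces the result above:

# A sketch of the value function I_phi: formulae are nested tuples, e.g.
# ('and', A, B), ('or', A, B), ('not', A), or a variable name (our own encoding).

def value(formula, phi):
    if isinstance(formula, str):              # base case: I_phi(P) = phi(P)
        return phi[formula]
    op = formula[0]
    if op == 'not':
        return not value(formula[1], phi)
    if op == 'and':
        return value(formula[1], phi) and value(formula[2], phi)
    if op == 'or':                            # defined via ¬(¬A ∧ ¬B)
        return not (not value(formula[1], phi) and not value(formula[2], phi))
    raise ValueError(f"unknown connective {op}")

# P1 ∨ P2 ∨ ¬(¬P1 ∧ P2) ∨ (P3 ∧ P4) under [T/P1],[F/P2],[T/P3],[F/P4]
f = ('or', ('or', 'P1', 'P2'),
           ('or', ('not', ('and', ('not', 'P1'), 'P2')), ('and', 'P3', 'P4')))
phi = {'P1': True, 'P2': False, 'P3': True, 'P4': False}
print(value(f, phi))    # -> True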

Now we will also review some propositional identities that will be useful later on. Some of them we
have already seen, and some are new. All of them can be proven by simple truth table arguments.

Propositional Identities
 Definition 10.2.9. We have the following identities in propositional logic:

Name for ∧ for ∨


Idempotence φ∧φ=φ φ∨φ=φ
Identity φ∧T =φ φ∨F =φ
Absorption 1 φ∧F =F φ∨T =T
Commutativity φ∧ψ =ψ∧φ φ∨ψ =ψ∨φ
Associativity φ ∧ (ψ ∧ θ) = (φ ∧ ψ) ∧ θ φ ∨ (ψ ∨ θ) = (φ ∨ ψ) ∨ θ
Distributivity φ ∧ (ψ ∨ θ) = φ ∧ ψ ∨ φ ∧ θ φ ∨ ψ ∧ θ = (φ ∨ ψ) ∧ (φ ∨ θ)
Absorption 2 φ ∧ (φ ∨ θ) = φ φ∨φ∧θ =φ
De Morgan rule ¬(φ ∧ ψ) = ¬φ ∨ ¬ψ ¬(φ ∨ ψ) = ¬φ ∧ ¬ψ
double negation ¬¬φ = φ
Definitions φ ⇒ ψ = ¬φ ∨ ψ φ ⇔ ψ = (φ ⇒ ψ) ∧ (ψ ⇒ φ)

 Idea: How about using these as inference component (simplification) to simplify


calculations like the one in ??. (see below)

Michael Kohlhase: Artificial Intelligence 1 330 2025-02-06

We will now use the distribution of values of a propositional formula under all variable assignments
to characterize them semantically. The intuition here is that we want to understand theorems,
examples, counterexamples, and inconsistencies in mathematics and everyday reasoning1 .
The idea is to use the formal language of propositional formulae as a model for mathematical
language. Of course, we cannot express all of mathematics as propositional formulae, but we can
at least study the interplay of mathematical statements (which can be true or false) with the
copula “and”, “or” and “not”.

Semantic Properties of Propositional Formulae


 Definition 10.2.10. Let M := ⟨U, I⟩ be our model, then we call A
 true under φ (φ satisfies A) in M, iff I φ (A) = T, (write M|=φ A)
 false under φ (φ falsifies A) in M, iff I φ (A) = F, (write M̸|=φ A)
 satisfiable in M, iff I φ (A) = T for some assignment φ,
 valid in M, iff M|=φ A for all variable assignments φ,
 falsifiable in M, iff I φ (A) = F for some assignments φ, and
 unsatisfiable in M, iff I φ (A) = F for all assignments φ.
 Example 10.2.11. x ∨ x is satisfiable and falsifiable.
 Example 10.2.12. x ∨ ¬x is valid and x ∧ ¬x is unsatisfiable.
 Note: As PL0 is a single-model logical system, we can elide the reference to the
model and regain the notation φ|=A from the preview for M|=φ A.
 Definition 10.2.13 (Entailment). (aka. logical consequence)
We say that A entails B (write A ⊨ B), iff I φ (B) = T for all φ with I φ (A) = T
(i.e. all assignments that make A true also make B true)

1 Here (and elsewhere) we will use mathematics (and the language of mathematics) as a test tube for under-

standing reasoning, since mathematics has a long history of studying its own reasoning processes and assumptions.

Michael Kohlhase: Artificial Intelligence 1 331 2025-02-06

Let us now see how these semantic properties model mathematical practice.
In mathematics we are interested in assertions that are true in all circumstances. In our model
of mathematics, we use variable assignments to stand for “circumstances”. So we are interested
in propositional formulae which are true under all variable assignments; we call them valid. We
often give examples (or show situations) which make a conjectured formula false; we call such
examples counterexamples, and such assertions falsifiable. We also often give examples for certain
formulae to show that they can indeed be made true (which is not the same as being valid yet);
such assertions we call satisfiable. Finally, if a formula cannot be made true in any circumstances
we call it unsatisfiable; such assertions naturally arise in mathematical practice in the form of
refutation proofs, where we show that an assertion (usually the negation of the theorem we want
to prove) leads to an obviously unsatisfiable conclusion, showing that the negation of the theorem
is unsatisfiable, and thus the theorem valid.

A better mouse-trap: Truth Tables


 Truth tables visualize truth functions:
¬ ∧ ⊤ ⊥ ∨ ⊤ ⊥
⊤ F ⊤ T F ⊤ T T
⊥ T ⊥ F F ⊥ T F

 If we are interested in values for all assignments (e.g z ∧ x ∨ ¬(z ∧ y))

assignments intermediate results full


x y z e1 := z ∧ y e2 := ¬e1 e3 := z ∧ x e3 ∨ e2
F F F F T F T
F F T F T F T
F T F F T F T
F T T T F F F
T F F F T F T
T F T F T T T
T T F F T F T
T T T T F T T

Michael Kohlhase: Artificial Intelligence 1 332 2025-02-06
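Truth tables like this one are easy to generate by enumerating all assignments, and the same enumeration decides the semantic properties defined above. A minimal Python sketch (our own helper names; formulae are given as Python functions):

# A sketch: generate a truth table and classify a formula by enumerating
# all assignments (the formula is given as a Python function over a dict).
from itertools import product

def truth_table(f, variables):
    rows = []
    for values in product([False, True], repeat=len(variables)):
        phi = dict(zip(variables, values))
        rows.append((phi, f(phi)))
    return rows

def classify(f, variables):
    values = [v for _, v in truth_table(f, variables)]
    if all(values):      return "valid"
    if not any(values):  return "unsatisfiable"
    return "satisfiable and falsifiable"

# z ∧ x ∨ ¬(z ∧ y) from the truth table above
f = lambda p: (p["z"] and p["x"]) or not (p["z"] and p["y"])
for phi, v in truth_table(f, ["x", "y", "z"]):
    print(phi, "->", v)
print(classify(f, ["x", "y", "z"]))                      # satisfiable and falsifiable
print(classify(lambda p: p["x"] or not p["x"], ["x"]))   # valid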

Let us finally test our intuitions about propositional logic with a “real-world example”: a logic
puzzle, as you could find it in a Sunday edition of the local newspaper.

Hair Color in Propositional Logic


 There are three persons, Stefan, Nicole, and Jochen.
1. Their hair colors are black, red, or green.
2. Their study subjects are AI, Physics, or Chinese; at least one studies AI.
(a) Persons with red or green hair do not study AI.
(b) Neither the Physics nor the Chinese students have black hair.
(c) Of the two male persons, one studies Physics, and the other studies Chinese.
 Question: Who studies AI?
(A) Stefan (B) Nicole (C) Jochen (D) Nobody

 Answer: You can solve this using PL0 , if we accept bla(S), etc. as propositional variables.
We first express what we know: For every x ∈ {S, N , J} (Stefan, Nicole, Jochen) we have

1. bla(x) ∨ red(x) ∨ gre(x); (note: three formulae)


2. ai(x) ∨ phy(x) ∨ chi(x) and ai(S) ∨ ai(N ) ∨ ai(J)
(a) ai(x) ⇒ ¬red(x) ∧ ¬gre(x).
(b) phy(x) ⇒ ¬bla(x) and chi(x) ⇒ ¬bla(x).
(c) phy(S) ∧ chi(J) ∨ phy(J) ∧ chi(S).

Now, we obtain new knowledge via entailment steps:

3. 1. together with 2.2a entails that ai(x) ⇒ bla(x) for every x ∈ {S, N , J},
4. thus ¬bla(S) ∧ ¬bla(J) by 2.2c and 2.2b and
5. so ¬ai(S) ∧ ¬ai(J) by 3. and 4.
6. With 2. the latter entails ai(N ).

Michael Kohlhase: Artificial Intelligence 1 333 2025-02-06

The example shows that puzzles like that are a bit difficult to solve without writing things down.
But if we formalize the situation in PL0 , then we can solve the puzzle quite handily with inference.
Note that we have been a bit generous with the names of propositional variables; e.g. bla(x),
where x ∈ {S, N , J}, to keep the representation small enough to fit on the slide. This does not
hinder the method in any way.
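The entailment steps above can also be checked mechanically: there are only finitely many relevant assignments, so we can enumerate all combinations of hair colors and study subjects and keep those satisfying the constraints. A minimal Python sketch with our own encoding:

# A brute-force sketch of the hair-color puzzle: enumerate all assignments of
# hair colors and subjects and keep the ones satisfying constraints 1.-2.(c).
from itertools import product

people   = ["S", "N", "J"]            # Stefan, Nicole, Jochen
colors   = ["bla", "red", "gre"]
subjects = ["ai", "phy", "chi"]

solutions = []
for hair in product(colors, repeat=3):
    for subj in product(subjects, repeat=3):
        h = dict(zip(people, hair))
        s = dict(zip(people, subj))
        if "ai" not in s.values():                                          # 2.: at least one studies AI
            continue
        if any(s[x] == "ai" and h[x] != "bla" for x in people):             # 2.(a)
            continue
        if any(s[x] in ("phy", "chi") and h[x] == "bla" for x in people):   # 2.(b)
            continue
        if {s["S"], s["J"]} != {"phy", "chi"}:                              # 2.(c)
            continue
        solutions.append((h, s))

print({s["N"] == "ai" for _, s in solutions})   # -> {True}: Nicole studies AI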

10.3 Inference in Propositional Logics


We have now defined syntax (the language agents can use to represent knowledge) and its
semantics (how expressions of this language relate to the agent’s environment). Theoretically, an
agent could use the entailment relation to derive new knowledge from percepts and the existing
state representation – in the MAKE−PERCEPT−SENTENCE and MAKE−ACTION−SENTENCE
subroutines below. But as we have seen above, this is very tedious. A much better way would
be to have a set of rules that directly act on the state representations.
Let us now look into what kind of agent we would need to be successful in the Wumpus world:
it seems reasonable that we should build on a model-based agent and specialize it to structured
state representations and inference.

Agents that Think Rationally


 Problem: But how can we build an agent that can do the necessary inferences?

 Idea: Think Before You Act!


“Thinking” = Inference about knowledge represented using logic.
 Definition 10.3.1. A logic-based agent is a model-based agent that represents the
world state as a logical formula and uses inference to think about the state of the
environment and its own actions. Agent schema:
[Figure: agent schema of a model-based agent (sensors, state, world model, condition-action
rules, actuators); cf. Figure 2.11 in [RN09]]

The formal language of the logical system acts as a world description language.
Agent function:

function KB−AGENT (percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))
  action := ASK(KB, MAKE−ACTION−QUERY(t))
  TELL(KB, MAKE−ACTION−SENTENCE(action,t))
  t := t+1
  return action

Its agent function maintains a knowledge base about the environment, which is
updated with percept descriptions (formalizations of the percepts) and action de-
scriptions. The next action is the result of a suitable inference-based query to the
knowledge base.

Michael Kohlhase: Artificial Intelligence 1 334 2025-02-06

A Simple Formal System: Prop. Logic with Hilbert-Calculus



 Formulae: Built from propositional variables: P , Q, R. . . and implication: ⇒


 Semantics: I φ (P ) = φ(P ) and I φ (A ⇒ B) = T, iff I φ (A) = F or I φ (B) = T.
 Definition 10.3.2. The Hilbert calculus H0 consists of the inference rules:

     ----------- K          ------------------------------------- S
      P ⇒ Q ⇒ P              (P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R

      A ⇒ B    A                    A
     -------------- MP         ------------ Subst
            B                    [B/X](A)

 Example 10.3.3. A H0 theorem C ⇒ C and its proof


Proof: We show that ∅⊢H0 C ⇒ C
1. (C ⇒ (C ⇒ C) ⇒ C) ⇒ (C ⇒ C ⇒ C) ⇒ C ⇒ C (S with
[C/P ], [C ⇒ C/Q], [C/R])
2. C ⇒ (C ⇒ C) ⇒ C (K with [C/P ], [C ⇒ C/Q])
3. (C ⇒ C ⇒ C) ⇒ C ⇒ C (MP on P.1 and P.2)
4. C ⇒ C ⇒ C (K with [C/P ], [C/Q])
5. C ⇒ C (MP on P.3 and P.4)

Michael Kohlhase: Artificial Intelligence 1 335 2025-02-06

This is indeed a very simple formal system, but it has all the required parts:
• A formal language: expressions built up from variables and implications.
• A semantics: given by the obvious interpretation function
• A calculus: given by the two axioms and the two inference rules.
The calculus gives us a set of rules with which we can derive new formulae from old ones. The
axioms are very simple rules, they allow us to derive these two formulae in any situation. The
proper inference rules are slightly more complicated: we read the formulae above the horizontal
line as assumptions and the (single) formula below as the conclusion. An inference rule allows us
to derive the conclusion, if we have already derived the assumptions.
Now, we can use these inference rules to perform a proof – a sequence of formulae that can be
derived from each other. The representation of the proof in the slide is slightly compactified to fit
onto the slide: We will make it more explicit here. We first start out by deriving the formula

(P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R (10.1)

which we can always do, since we have an axiom for this formula, then we apply the rule Subst,
where A is this result, B is C, and X is the variable P to obtain

(C ⇒ Q ⇒ R) ⇒ (C ⇒ Q) ⇒ C ⇒ R (10.2)

Next we apply the rule Subst to this where B is C ⇒ C and X is the variable Q this time to obtain

(C ⇒ (C ⇒ C) ⇒ R) ⇒ (C ⇒ C ⇒ C) ⇒ C ⇒ R (10.3)

And again, we apply the rule Subst this time, B is C and X is the variable R yielding the first
formula in our proof on the slide. To conserve space, we have combined these three steps into one
in the slide. The next steps are done in exactly the same way.
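Since the proof is completely mechanical, a small program can carry it out. The following minimal Python sketch uses our own tuple encoding of formulae, implements Subst and MP, and replays the five proof steps for C ⇒ C:

# A sketch of the Hilbert calculus H0: formulae are variables (strings) or
# implications ('=>', A, B); Subst and MP are the two proper inference rules.

def imp(a, b):
    return ('=>', a, b)

# The two axioms K and S
K = imp('P', imp('Q', 'P'))
S = imp(imp('P', imp('Q', 'R')), imp(imp('P', 'Q'), imp('P', 'R')))

def subst(formula, x, b):
    """[b/x](formula): replace every occurrence of the variable x by b."""
    if formula == x:
        return b
    if isinstance(formula, tuple):
        return ('=>', subst(formula[1], x, b), subst(formula[2], x, b))
    return formula

def mp(implication, antecedent):
    """From A ⇒ B and A, conclude B."""
    assert implication[0] == '=>' and implication[1] == antecedent
    return implication[2]

C = 'C'
step1 = subst(subst(subst(S, 'P', C), 'Q', imp(C, C)), 'R', C)
step2 = subst(subst(K, 'P', C), 'Q', imp(C, C))
step3 = mp(step1, step2)        # (C ⇒ C ⇒ C) ⇒ C ⇒ C
step4 = subst(subst(K, 'P', C), 'Q', C)
step5 = mp(step3, step4)
print(step5)                    # -> ('=>', 'C', 'C'), i.e. C ⇒ C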
In general, formulae can be used to represent facts about the world as propositions; they have a
semantics that is a mapping of formulae into the real world (propositions are mapped to truth
values.) We have seen two relations on formulae: the entailment relation and the derivation
relation. The first one is defined purely in terms of the semantics, the second one is given by a
calculus, i.e. purely syntactically. Is there any relation between these relations?

Soundness and Completeness


 Definition 10.3.4. Let L := ⟨L, K, ⊨⟩ be a logical system, then we call a calculus
C for L,
 sound (or correct), iff H ⊨ A, whenever H⊢C A, and
 complete, iff H⊢C A, whenever H ⊨ A.

 Goal: Find calculi C, such that ⊢C A iff ⊨ A (provability and validity coincide)
 To TRUTH through PROOF (CALCULEMUS [Leibniz ∼1680])



Michael Kohlhase: Artificial Intelligence 1 336 2025-02-06

Ideally, both relations would be the same; then the calculus would allow us to infer all facts that
can be represented in the given formal language and that are true in the real world, and only
those. In other words, our representation and inference are faithful to the world.
A consequence of this is that we can rely on purely syntactical means to make predictions
about the world. Computers rely on formal representations of the world; if we want to solve a
problem on our computer, we first represent it in the computer (as data structures, which can be
seen as a formal language) and do syntactic manipulations on these structures (a form of calculus).
Now, if the provability relation induced by the calculus and the validity relation coincide (this will
be quite difficult to establish in general), then the solutions of the program will be correct, and we
will find all possible ones. Of course, the logics we have studied so far are very simple, and not
able to express interesting facts about the world, but we will study them as a simple example of
the fundamental problem of computer science: How do the formal representations correlate with
the real world.
Within the world of logics, one can derive new propositions (the conclusions, here: Socrates is
mortal) from given ones (the premises, here: Every human is mortal and Socrates is human). Such
derivations are proofs.
In particular, logics can describe the internal structure of real-life facts; e.g. individual things,
actions, properties. A famous example, which is in fact as old as it appears, is illustrated in the
slide below.

The Miracle of Logic


 Purely formal derivations are true in the real world!

Michael Kohlhase: Artificial Intelligence 1 337 2025-02-06

If a formal system is correct, the conclusions one can prove are true (= hold in the real world)
whenever the premises are true. This is a miraculous fact (think about it!)

10.4 Propositional Natural Deduction Calculus


Video Nuggets covering this section can be found at https://fau.tv/clip/id/22520 and
https://fau.tv/clip/id/22525.

We will now introduce the “natural deduction” calculus for propositional logic. The calculus
was created to model the natural mode of reasoning e.g. in everyday mathematical practice. In
particular, it was intended as a counter-approach to the well-known Hilbert style calculi, which
were mainly used as theoretical devices for studying reasoning in principle, not for modeling
particular reasoning styles. We will introduce natural deduction in two styles/notations, both of
which were invented by Gerhard Gentzen in the 1930s and are very much related. The Natural
Deduction style (ND) uses local hypotheses in proofs for hypothetical reasoning, while the “sequent
style” is a rationalized version and extension of the ND calculus that makes certain meta-proofs
simpler to push through by making the context of local hypotheses explicit in the notation. The
sequent notation also constitutes a more adequate data structure for implementations and user
interfaces.
Rather than using a minimal set of inference rules, we introduce a natural deduction calculus that
provides two/three inference rules for every logical constant, one “introduction rule” (an inference
rule that derives a formula with that logical constant at the head) and one “elimination rule” (an
inference rule that acts on a formula with this head and derives a set of subformulae).

Calculi: Natural Deduction (ND0 ; Gentzen [Gen34])


 Idea: ND0 tries to mimic human argumentation for theorem proving.

 Definition 10.4.1. The propositional natural deduction calculus ND0 has inference
rules for the introduction and elimination of connectives:

Introduction Elimination Axiom


A B A∧B A∧B
ND0 ∧I ND0 ∧El ND0 ∧Er
A∧B A B
ND0 TND
A ∨ ¬A
[A]1

B A⇒B A
ND0 ⇒I 1 ND0 ⇒E
A⇒B B

ND0 ⇒I a proves A ⇒ B by exhibiting an ND0 derivation D (depicted by the double
horizontal lines) of B from the local hypothesis A; ND0 ⇒I a then discharges (gets
rid of A, which can only be used in D) the local hypothesis and concludes A ⇒ B.
This mode of reasoning is called hypothetical reasoning.
 Definition 10.4.2. Given a set H ⊆ wff0 (V0 ) of assumptions and a conclusion C,
we write H⊢ND0 C, iff there is a ND0 derivation tree whose leaves are in H.
 Note: ND0 TND is used only in classical logic. (otherwise
constructive/intuitionistic)

Michael Kohlhase: Artificial Intelligence 1 338 2025-02-06

The most characteristic rule in the natural deduction calculus is the ND0 ⇒I a rule and the hy-
pothetical reasoning it introduces. ND0 ⇒I a corresponds to the mathematical way of proving an
implication A ⇒ B: We assume that A is true and show B from this local hypothesis. When we
can do this, we discharge the assumption and conclude A ⇒ B.
Note that the local hypothesis is discharged by the rule ND0 ⇒I a , i.e. it cannot be used in any
other part of the proof. As the ND0 ⇒I a rules may be nested, we decorate both the rule and the
corresponding local hypothesis with a marker (here the number 1).
Let us now consider an example of hypothetical reasoning in action.

Natural Deduction: Examples


 Example 10.4.3 (Inference with Local Hypotheses).

1
[A ∧ B]1 [A ∧ B]1 [A]
ND0 ∧Er ND0 ∧El 2
B A [B]
ND0 ∧I A
B∧A ND0 ⇒I 2
1
ND0 ⇒I B⇒A
A∧B⇒B∧A ND0 ⇒I 1
A⇒B⇒A

Michael Kohlhase: Artificial Intelligence 1 339 2025-02-06

Here we see hypothetical reasoning with local hypotheses at work. In the left example, we
assume the formula A ∧ B and can use it in the proof until it is discharged by the rule ND0 ⇒I on
the bottom – therefore we decorate the hypothesis and the rule by corresponding numbers (here
the label “1”). Note that the local assumption A ∧ B is local to the proof fragment delineated by the
corresponding (local) hypothesis and the discharging rule, i.e. even if this derivation is only a
fragment of a larger proof, then we cannot use its (local) hypothesis anywhere else.
Note also that we can use as many copies of the local hypothesis as we need; they are all
discharged at the same time.
In the right example we see that local hypotheses can be nested as long as they are kept local.
In particular, we may not use the hypothesis B after the ND0 ⇒I 2 , e.g. to continue with a
ND0 ⇒E.
One of the nice things about the natural deduction calculus is that the deduction theorem is
almost trivial to prove. In a sense, the triviality of the deduction theorem is the central idea of
the calculus and the feature that makes it so natural.

A Deduction Theorem for ND0


 Theorem 10.4.4. H, A⊢ND0 B, iff H⊢ND0 A ⇒ B.
 Proof: We show the two directions separately
1. If H, A⊢ND0 B, then H⊢ND0 A ⇒ B by ND0 ⇒I , and
2. If H⊢ND0 A ⇒ B, then H, A⊢ND0 A ⇒ B by weakening and H, A⊢ND0 B by
ND0 ⇒E.

Michael Kohlhase: Artificial Intelligence 1 340 2025-02-06

Another characteristic of the natural deduction calculus is that it has inference rules (introduction
and elimination rules) for all connectives. So we extend the set of rules from ?? for disjunction,
negation and falsity.

More Rules for Natural Deduction


 Note: ND0 does not try to be minimal, but comfortable to work in!

 Definition 10.4.5. ND0 has the following additional inference rules for the remain-

ing connectives.
1 1
[A] [B]
.. ..
A∨B . .
A B C C
ND0 ∨Il ND0 ∨Ir ND0 ∨E 1
A∨B A∨B C
1 1
[A] [A]
.. ..
. .
C ¬C ND ¬I 1 ¬¬A
0 ND0¬E
¬A A

¬A A F
ND0FI ND0FE
F A

 Again: ND0¬E is used only in classical logic (otherwise


constructive/intuitionistic)

Michael Kohlhase: Artificial Intelligence 1 341 2025-02-06

Natural Deduction in Sequent Calculus Formulation


 Idea: Represent hypotheses explicitly. (lift calculus to judgments)
 Definition 10.4.6. A judgment is a meta-statement about the provability of propo-
sitions.
 Definition 10.4.7. A sequent is a judgment of the form H⊢A about the provability
of the formula A from the set H of hypotheses. We write ⊢A for ∅⊢A.
 Idea: Reformulate ND0 inference rules so that they act on sequents.

 Example 10.4.8.We give the sequent style version of ??:

ND⊢0 Ax ND⊢0 Ax
A ∧ B⊢A ∧ B A ∧ B⊢A ∧ B ND⊢0 Ax
ND⊢0 ∧ Er ND⊢0 ∧ El A, B⊢A
A ∧ B⊢B A ∧ B⊢A ND⊢0 ⇒I
ND⊢0 ∧ I A⊢B ⇒ A
A ∧ B⊢B ∧ A ND⊢0 ⇒I
ND⊢0 ⇒I ⊢A ⇒ B ⇒ A
⊢A ∧ B ⇒ B ∧ A

 Note: Even though the antecedent of a sequent is written like a sequence, it is


actually a set. In particular, we can permute and duplicate members at will.

Michael Kohlhase: Artificial Intelligence 1 342 2025-02-06

Sequent-Style Rules for Natural Deduction



 Definition 10.4.9. The following inference rules make up the propositional sequent
style natural deduction calculus ND⊢0 :

Γ⊢B
ND⊢0 Ax ND⊢0 weaken ND⊢0 TND
Γ, A⊢A Γ, A⊢B Γ⊢A ∨ ¬A

Γ⊢A Γ⊢B Γ⊢A ∧ B Γ⊢A ∧ B


ND⊢0 ∧ I ND⊢0 ∧ El ND⊢0 ∧ Er
Γ⊢A ∧ B Γ⊢A Γ⊢B

Γ⊢A Γ⊢B Γ⊢A ∨ B Γ, A⊢C Γ, B⊢C


ND⊢0 ∨Il ND⊢0 ∨Ir ND⊢0 ∨E
Γ⊢A ∨ B Γ⊢A ∨ B Γ⊢C

Γ, A⊢B Γ⊢A ⇒ B Γ⊢A


ND⊢0 ⇒I ND⊢0 ⇒E
Γ⊢A ⇒ B Γ⊢B

Γ, A⊢F Γ⊢¬¬A
ND⊢0 ¬I ND⊢0 ¬E
Γ⊢¬A Γ⊢A

Γ⊢¬A Γ⊢A Γ⊢F


ND⊢0 F I ND⊢0 F E
Γ⊢F Γ⊢A

Michael Kohlhase: Artificial Intelligence 1 343 2025-02-06

Linearized Notation for (Sequent-Style) ND Proofs


 Definition 10.4.10. Linearized notation for sequent-style ND proofs
1. H1 ⊢ A1 (J 1 )
H1 ⊢A1 H2 ⊢A2
2. H2 ⊢ A2 (J 2 ) corresponds to R
H3 ⊢A3
3. H3 ⊢ A3 (J 3 1, 2)
 Example 10.4.11. We show a linearized version of the ND0 examples ??

ND⊢0 Ax ND⊢0 Ax
A ∧ B⊢A ∧ B A ∧ B⊢A ∧ B ND⊢0 Ax
ND⊢0 ∧ Er ND⊢0 ∧ El A, B⊢A
A ∧ B⊢B A ∧ B⊢A ND⊢0 ⇒I
ND⊢0 ∧ I A⊢B ⇒ A
A ∧ B⊢B ∧ A ND⊢0 ⇒I
ND⊢0 ⇒I ⊢A ⇒ B ⇒ A
⊢A ∧ B ⇒ B ∧ A

# hyp ⊢ f ormula N Djust # hyp ⊢ f ormula N Djust


1. 1 ⊢ A∧B ND⊢0 Ax 1. 1 ⊢ A ND⊢0 Ax
2. 1 ⊢ B ND⊢0 ∧ Er 1 2. 2 ⊢ B ND⊢0 Ax
3. 1 ⊢ A ND⊢0 ∧ El 1 3. 1, 2 ⊢ A ND⊢0 weaken 1, 2
4. 1 ⊢ B∧A ND⊢0 ∧ I 2, 3 4. 1 ⊢ B⇒A ND⊢0 ⇒I 3
5. ⊢ A∧B⇒B∧A ND⊢0 ⇒I 4 5. ⊢ A⇒B⇒A ND⊢0 ⇒I 4

Michael Kohlhase: Artificial Intelligence 1 344 2025-02-06

Each row in the table represents one inference step in the proof. It consists of line number (for
referencing), a formula for the statement, a justification via a ND inference rule (and the rows this
one is derived from), and finally a sequence of row numbers of proof steps that are local hypotheses
in effect for the current row.

10.5 Predicate Logic Without Quantifiers


In the hair-color example we have seen that we are able to model complex situations in PL0 .
The trick of using variables with fancy names like bla(N ) is a bit dubious, and we can already
imagine that it will be difficult to support programmatically unless we make names like bla(N )
into first-class citizens i.e. expressions of the logic language themselves.

Issues with Propositional Logic


 Awkward to write for humans: E.g., to model the Wumpus world we had to
make a copy of the rules for every cell . . .
R1 := ¬S 1,1 ⇒ ¬W 1,1 ∧ ¬W 1,2 ∧ ¬W 2,1
R2 := ¬S 2,1 ⇒ ¬W 1,1 ∧ ¬W 2,1 ∧ ¬W 2,2 ∧ ¬W 3,1
R3 := ¬S 1,2 ⇒ ¬W 1,1 ∧ ¬W 1,2 ∧ ¬W 2,2 ∧ ¬W 1,3
Compared to

Cell adjacent to Wumpus: Stench (else: None)


that is not a very nice description language . . .
 Can we design a more human-like logic?: Yep!

 Idea: Introduce explicit representations for


 individuals, e.g. the wumpus, the gold, numbers, . . .
 functions on individuals, e.g. the cell at i, j, . . .
 relations between them, e.g. being in a cell, being adjacent, . . .

This is essentially the same as PL0 , so we can reuse the calculi. (up next)

Michael Kohlhase: Artificial Intelligence 1 345 2025-02-06
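Copying the rules for every cell is, of course, exactly the kind of work we would delegate to a program. A minimal Python sketch (ad-hoc string notation of our own) that generates the stench rules for all cells of the 4×4 cave:

# A sketch: generating the per-cell stench rules of the Wumpus world
# automatically (formulae are plain strings in our ad-hoc notation).

def neighbors(i, j, n=4):
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in cand if 1 <= a <= n and 1 <= b <= n]

def stench_rule(i, j):
    no_wumpus = " ∧ ".join(f"¬W{a},{b}" for a, b in [(i, j)] + neighbors(i, j))
    return f"¬S{i},{j} ⇒ {no_wumpus}"

rules = [stench_rule(i, j) for i in range(1, 5) for j in range(1, 5)]
print(rules[0])   # ¬S1,1 ⇒ ¬W1,1 ∧ ¬W2,1 ∧ ¬W1,2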

Individuals and their Properties/Relationships


 Observation: We want to talk about individuals like Stefan, Nicole, and Jochen,
their properties, e.g. being blond or studying AI,
and relationships, e.g. that Stefan loves Nicole.

 Idea: Re-use PL0 , but replace propositional variables with something more expres-
sive! (instead of fancy variable name
trick)
 Definition 10.5.1. A first-order signature ⟨Σf , Σp ⟩ consists of
 Σf := ∪k∈N Σfk of function constants, where members of Σfk denote k-ary
functions on individuals,
 Σp := ∪k∈N Σpk of predicate constants, where members of Σpk denote k-ary
relations among individuals,

where Σfk and Σpk are pairwise disjoint, countable sets of symbols for each k ∈ N.
A 0-ary function constant refers to a single individual, therefore we call it an individual
constant.

Michael Kohlhase: Artificial Intelligence 1 346 2025-02-06

A Grammar for PLnq


 Definition 10.5.2. The formulae of PLnq are given by the following grammar

function constants fk ∈ Σfk


predicate constants pk ∈ Σp k
terms t ::= f0 individual constant
| f k (t1 , . . ., tk ) application
formulae A ::= pk (t1 , . . ., tk ) atomic
| ¬A negation
| A1 ∧ A2 conjunction

Michael Kohlhase: Artificial Intelligence 1 347 2025-02-06

PLnq Semantics
 Definition 10.5.3. Domains D0 = {T, F} of truth values and Dι ̸= ∅ of individuals.
 Definition 10.5.4. Interpretation I assigns values to constants, e.g.
 I(¬) : D0 → D0 ; T 7→ F; F 7→ T and I(∧) = . . . (as in PL0 )
 I : Σf0 → Dι (interpret individual constants as individuals)
 I: Σfk → (Dι k → Dι ) (interpret function constants as functions)
 I: Σpk → P(Dι k ) (interpret predicate constants as relations)
 Definition 10.5.5. The value function I assigns values to formulae: (recursively)

 I(f (A1 , . . ., Ak )) := I(f )(I(A1 ), . . . , I(Ak ))


 I(p(A1 , . . ., Ak )) := T, iff ⟨I(A1 ), . . . , I(Ak )⟩ ∈ I(p)
 I(¬A) = I(¬)(I(A)) and I(A ∧ B) = I(∧)(I(A), I(B)) (just as in PL0 )
 Definition 10.5.6. Model: M = ⟨Dι , I⟩ varies in Dι and I.

 Theorem 10.5.7. PLnq is isomorphic to PL0 (interpret atoms as prop. variables)

Michael Kohlhase: Artificial Intelligence 1 348 2025-02-06

All of the definitions above are quite abstract, so we now look at them again using a very concrete –
if somewhat contrived – example: The relevant parts are a universe D with four elements, and an
interpretation that maps the signature into individuals, functions, and predicates over D, which
are given as concrete sets.

A Model for PLnq


 Example 10.5.8. Let L := {a, b, c, d, e, P , Q, R, S}, we set the universe D :=
{♣, ♠, ♡, ♢}, and specify the interpretation function I by setting

 a 7→ ♣, b 7→ ♠, c 7→ ♡, d 7→ ♢, and e 7→ ♢ for constants,



 P 7→ {♣, ♠} and Q 7→ {♠, ♢}, for unary predicate constants.


 R 7→ {⟨♡, ♢⟩, ⟨♢, ♡⟩}, and S 7→ {⟨♢, ♠⟩, ⟨♠, ♣⟩} for binary predicate constants.
 Example 10.5.9 (Computing Meaning in this Model).

 I(R(a, b) ∧ P (c)) = T, iff


 I(R(a, b)) = T and I(P (c)) = T, iff
 ⟨I(a), I(b)⟩ ∈ I(R) and I(c) ∈ I(P ), iff
 ⟨♣, ♠⟩ ∈ {⟨♡, ♢⟩, ⟨♢, ♡⟩} and ♡ ∈ {♣, ♠}

So, I(R(a, b) ∧ P (c)) = F.

Michael Kohlhase: Artificial Intelligence 1 349 2025-02-06

The example above also shows how we can compute meaning in a concrete model: we just
follow the evaluation rules to the letter.
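Following the evaluation rules to the letter is again easy to mechanize. A minimal Python sketch that recomputes the evaluation of R(a, b) ∧ P (c) in the model above (the suits ♣, ♠, ♡, ♢ are represented by strings; all names are our own):

# A sketch: evaluating the PLnq formula R(a,b) ∧ P(c) in the concrete model
# above (individuals are strings, the interpretation is a pair of dicts).

universe = {"club", "spade", "heart", "diamond"}           # stands for ♣, ♠, ♡, ♢

I_const = {"a": "club", "b": "spade", "c": "heart", "d": "diamond", "e": "diamond"}
I_pred = {
    "P": {("club",), ("spade",)},
    "Q": {("spade",), ("diamond",)},
    "R": {("heart", "diamond"), ("diamond", "heart")},
    "S": {("diamond", "spade"), ("spade", "club")},
}

def term_value(t):
    return I_const[t]                       # only individual constants here

def atom_value(p, args):
    return tuple(term_value(t) for t in args) in I_pred[p]

# I(R(a,b) ∧ P(c)) = T iff ⟨I(a),I(b)⟩ ∈ I(R) and I(c) ∈ I(P)
print(atom_value("R", ["a", "b"]) and atom_value("P", ["c"]))   # -> False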
We now come to the central technical result about PLnq : it is essentially the same as propositional
logic (PL0 ). We say that the two logics are isomorphic. Technically, this means that the formulae
of PLnq can be translated to PL0 and there is a corresponding model translation from the models
of PL0 to those of PLnq such that the respective notions of evaluation are mapped to each other.

PLnq and PL0 are Isomorphic


 Observation: For every choice of Σ of signature, the set AΣ of atomic PLnq
formulae is countable, so there is a V Σ ⊆ V0 and a bijection θΣ : AΣ → V Σ .
θΣ can be extended to formulae as PLnq and PL0 share connectives.
 Lemma 10.5.10. For every model M = ⟨Dι , I⟩, there is a variable assignment
φM , such that I φM (A) = I(A).

 Proof sketch: We just define φM (X) := I(θΣ−1 (X))

 Lemma 10.5.11. For every variable assignment ψ : V Σ → {T, F} there is a model


Mψ = ⟨Dψ , I ψ ⟩, such that I ψ (A) = I ψ (A).
 Proof sketch: see next slide

 Corollary 10.5.12. PLnq is isomorphic to PL0 , i.e. the following diagram commutes:

ψ 7→ Mψ
⟨Dψ , I ψ ⟩ V Σ → {T, F}

I ψ () I φM ()
θΣ
PLnq (Σ) PL0 (AΣ )

 Note: This constellation with a language isomorphism and a corresponding model


isomorphism (in converse direction) is typical for a logic isomorphism.

Michael Kohlhase: Artificial Intelligence 1 350 2025-02-06

The practical upshot of the commutative diagram from ?? is that if we have a way of computing
evaluation (or entailment for that matter) in PL0 , then we can “borrow” it for PLnq by composing
it with the language and model translations. In other words, we can reuse calculi and automated

theorem provers from PL0 for PLnq .


But we still have to provide the proof for ??, which we do now.

Valuation and Satisfiability


 Lemma 10.5.13. For every variable assignment ψ : V Σ → {T, F} there is a model
Mψ = ⟨Dψ , I ψ ⟩, such that I ψ (A) = I ψ (A).

 Proof: We construct Mψ = ⟨Dψ , I ψ ⟩ and show that it works as desired.


1. Let Dψ be the set of PLnq terms over Σ, and
ψk
ψ k
 I (f ) : Dι → D ; ⟨A1 , . . ., Ak ⟩ 7→ f (A1 , . . ., Ak ) for f ∈ Σfk
−1
 I (p) := {⟨A1 , . . ., Ak ⟩ | ψ(θ ψ p(A1 , . . ., Ak )) = T} for p ∈ Σ .
ψ p

2. We show I ψ (A) = A for terms A by induction on A


2.1. If A = c, then I ψ (A) = I ψ (c) = c = A
2.2. If A = f (A1 , . . . , An ) then
I ψ (A) = I ψ (f )(I(A1 ), . . . , I(An )) = I ψ (f )(A1 , . . ., Ak ) = A.
3. For a PLnq formula A we show that I ψ (A) = I ψ (A) by induction on A.
3.1. If A = p(A1 , . . ., Ak ), then I ψ (A) = I ψ (p)(I(A1 ), . . . , I(An )) = T, iff
⟨A1 , . . ., Ak ⟩ ∈ I ψ (p), iff ψ(θ−1 ψ A) = T, so I (A) = I ψ (A) as desired.
ψ

3.2. If A = ¬B, then I ψ (A) = T, iff I ψ (B) = F, iff I ψ (B) = I ψ (B), iff
I ψ (A) = I ψ (A).
3.3. If A = B ∧ C then we argue similarly
4. Hence I ψ (A) = I ψ (A) for all PLnq formulae and we have concluded the proof.

Michael Kohlhase: Artificial Intelligence 1 351 2025-02-06

10.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25027.

Summary
 Sometimes, it pays off to think before acting.

 In AI, “thinking” is implemented in terms of reasoning to deduce new knowledge


from a knowledge base represented in a suitable logic.
 Logic prescribes a syntax for formulas, as well as a semantics prescribing which
interpretations satisfy them. A entails B if all interpretations that satisfy A also
satisfy B. Deduction is the process of deriving new entailed formulae.

 Propositional logic formulae are built from atomic propositions, with the connectives
and, or, not.

Michael Kohlhase: Artificial Intelligence 1 352 2025-02-06

Issues with Propositional Logic


 Time: For things that change (e.g., Wumpus moving according to certain rules),
we need time-indexed propositions (like S2,1 t=7 ) to represent validity over time ;
further expansion of the rules.
 Can we design a more human-like logic?: Yep

 Predicate logic: quantification of variables ranging over individuals. (cf. ??


and ??)
 . . . and a whole zoo of logics much more powerful still.
 Note: In applications, propositional CNF encodings are generated by computer
programs. This mitigates (but does not remove!) the inconveniences of propo-
sitional modeling.

Michael Kohlhase: Artificial Intelligence 1 353 2025-02-06

Suggested Reading:

• Chapter 7: Logical Agents, Sections 7.1 – 7.5 [RN09].


– Sections 7.1 and 7.2 roughly correspond to my “Introduction”, Section 7.3 roughly corresponds
to my “Logic (in AI)”, Section 7.4 roughly corresponds to my “Propositional Logic”, Section
7.5 roughly corresponds to my “Resolution” and “Killing a Wumpus”.
– Overall, the content is quite similar. I have tried to add some additional clarifying illustra-
tions. RN gives many complementary explanations, nice as additional background reading.
– I would note that RN’s presentation of resolution seems a bit awkward, and Section 7.5 con-
tains some additional material that is imho not interesting (alternate inference rules, forward
and backward chaining). Horn clauses and unit resolution (also in Section 7.5), on the other
hand, are quite relevant.
Chapter 11

Formal Systems: Syntax, Semantics,


Entailment, and Derivation in
General

We will now take a more abstract view and introduce the necessary prerequisites of abstract rule
systems. We will also take the opportunity to discuss the quality criteria for calculi.

Recap: General Aspects of Propositional Logic


 There are many ways to define Propositional Logic:

 We chose ∧ and ¬ as primitive, and many others as defined.


 We could have used ∨ and ¬ just as well.
 We could even have used only one connective e.g. negated conjunction ↑ or
disjunction ↓ and defined ∧, ∨, and ¬ via ↑ and ↓ respectively.
      ↑  ⊤  ⊥       ↓  ⊤  ⊥       ¬a    a↑a           a↓a
      ⊤  F  T       ⊤  F  F       a∧b   (a↑b)↑(a↑b)   (a↓a)↓(b↓b)
      ⊥  T  T       ⊥  F  T       a∨b   (a↑a)↑(b↑b)   (a↓b)↓(a↓b)

 Observation: The set wff0 (V0 ) of well-formed propositional formulae is a formal


language over the alphabet given by V0 , the connectives, and brackets.

 Recall: We are mostly interested in


 satisfiability i.e. whether M ⊨ A, and
 entailment i.e whether A ⊨ B.
 Observation: In particular, the inductive/compositional nature of wff0 (V0 ) and
I φ : wff0 (V0 ) → D0 are secondary.

 Idea: Concentrate on language, models (M, φ), and satisfiability.

Michael Kohlhase: Artificial Intelligence 1 354 2025-02-06

The notion of a logical system is at the basis of the field of logic. In its most abstract form, a logical
system consists of a formal language, a class of models, and a satisfaction relation between models
and expressions of the formal language. The satisfaction relation tells us when an expression is
deemed true in this model.


Logical Systems
 Definition 11.0.1. A logical system (or simply a logic) is a triple L := ⟨L, K, ⊨⟩,
where the language L is a formal language, the model class K is a set, and ⊨ ⊆ K×L.
Members of L are called formulae of L, members of K models for L, and ⊨ the
satisfaction relation.
 Example 11.0.2 (Propositional Logic). ⟨wff(ΣP L0 , V P L0 ), Ko , |=⟩ is a logical
system, if we define Ko := V0 ⇀ D0 (the set of variable assignments) and φ |= A
iff I φ (A) = T.

 Definition 11.0.3. Let ⟨L, K, ⊨⟩ be a logical system, M ∈ K a model and A ∈ L


a formula. Then we say that A is
 satisfied by M iff M ⊨ A.
 satisfiable iff A is satisfied by some model.
 unsatisfiable iff A is not satisfiable.
 falsified by M iff M ̸⊨ A.
 valid or unfalsifiable (write ⊨ A) iff A is satisfied by every model.
 invalid or falsifiable (write ̸⊨ A) iff A is not valid.

Michael Kohlhase: Artificial Intelligence 1 355 2025-02-06

Let us now turn to the syntactical counterpart of the entailment relation: derivability in a cal-
culus. Again, we take care to define the concepts at the general level of logical systems.
The intuition of a calculus is that it provides a set of syntactic rules that allow to reason by
considering the form of propositions alone. Such rules are called inference rules, and they can be
strung together to derivations — which can alternatively be viewed either as sequences of formulae
where all formulae are justified by prior formulae or as trees of inference rule applications. But we
can also define a calculus in the more general setting of logical systems as an arbitrary relation on
formulae with some general properties. That allows us to abstract away from the homomorphic
setup of logics and calculi and concentrate on the basics.

Derivation Relations and Inference Rules


 Definition 11.0.4. Let L be a formal language, then we call a relation ⊢ ⊆
P(L) × L a derivation relation for L, if

 H ⊢ A, if A ∈ H (⊢ is proof reflexive),
 H ⊢ A and (H′ ∪ {A}) ⊢ B imply (H ∪ H′ ) ⊢ B (⊢ is proof transitive),
 H ⊢ A and H ⊆ H′ imply H′ ⊢ A (⊢ is monotonic or admits weakening).
 Definition 11.0.5. Let L be a formal language, then an inference rule over L is a
decidable n + 1 ary relation on L. Inference rules are traditionally written as
A1 . . . An
N
C
where A1 , . . ., An and C are schemata for words in L and N is a name. The Ai
are called assumptions of N , and C is called its conclusion.

Any n + 1-tuple
a1 . . . an
c
in N is called an application of N and we say that we apply N to a set M of words
with a1 , . . ., an ∈ M to obtain c.
 Definition 11.0.6. An inference rule without assumptions is called an axiom.
 Definition 11.0.7. A calculus (or inference system) is a formal language L equipped
with a set C of inference rules over L.

Michael Kohlhase: Artificial Intelligence 1 356 2025-02-06

By formula schemata we mean representations of sets of formulae: we use boldface uppercase
letters as (meta-)variables for formulae; for instance the formula schema A ⇒ B represents the set
of formulae whose head is ⇒.

Derivations
 Definition 11.0.8.Let L := ⟨L, K, ⊨⟩ be a logical system and C a calculus for L,
then a C-derivation of a formula C ∈ L from a set H ⊆ L of hypotheses (write
H⊢C C) is a sequence A1 , . . ., Am of L-formulae, such that
 Am = C, (derivation culminates in C)
 for all 1 ≤ i ≤ m, either Ai ∈ H, or (hypothesis)
Al 1 . . . Al k
 there is an inference rule in C with lj < i for all j ≤ k. (rule
Ai
application)
We can also see a derivation as a derivation tree, where the Alj are the children of
the node Ai .

 Example 11.0.9.
In the propositional Hilbert calculus H0 we have the derivation P ⊢H0 Q ⇒ P : the
sequence is P ⇒ Q ⇒ P , P , Q ⇒ P and the corresponding derivation tree is

     ----------- K
      P ⇒ Q ⇒ P        P
     -------------------- MP
            Q ⇒ P

Michael Kohlhase: Artificial Intelligence 1 357 2025-02-06

Inference rules are relations on formulae represented by formula schemata (where boldface, up-
percase letters are used as metavariables for formulae). For instance, in ?? the inference rule

      A ⇒ B    A
     --------------
            B

was applied in a situation where the metavariables A and B were instantiated by the
formulae P and Q ⇒ P .
As axioms do not have assumptions, they can be added to a derivation at any time. This is just
what we did with the axioms in ??.

Formal Systems
 Let ⟨L, K, ⊨⟩ be a logical system and C a calculus, then ⊢C is a derivation relation
and thus ⟨L, K, ⊨, ⊢C ⟩ a derivation system.
 Therefore we will sometimes also call ⟨L, C , K, ⊨⟩ a formal system, iff L :=
⟨L, K, ⊨⟩ is a logical system, and C a calculus for L.


 Definition 11.0.10. Let C be a calculus, then a C-derivation ∅⊢C A is called a
proof of A and if one exists (write ⊢C A) then A is called a C-theorem.
Definition 11.0.11. The act of finding a proof for A is called proving A.

 Definition 11.0.12. An inference rule I is called admissible in a calculus C, if the


extension of C by I does not yield new theorems.
 Definition 11.0.13. An inference rule
A1 . . . An
C
is called derivable (or a derived rule) in a calculus C, if there is a C-derivation
A1 , . . ., An ⊢C C.

 Observation 11.0.14. Derivable inference rules are admissible, but not the other
way around.

Michael Kohlhase: Artificial Intelligence 1 358 2025-02-06

The notion of a formal system encapsulates the most general way we can conceptualize a logical
system with a calculus, i.e. a system in which we can do “formal reasoning”.
Chapter 12

Machine-Oriented Calculi for Propositional Logic

A Video Nugget covering this chapter can be found at https://fau.tv/clip/id/22531.

12.1 Test Calculi


Automated Deduction as an Agent Inference Procedure

 Recall: Our knowledge of the cave entails a definite Wumpus position! (slide 316)
 Problem: That was human reasoning, can we build an agent function that does
this?
 Answer: As for constraint networks, we use inference, here resolution/tableaux.

Michael Kohlhase: Artificial Intelligence 1 359 2025-02-06

The following theorem is simple, but will be crucial later on.

Unsatisfiability Theorem
 Theorem 12.1.1 (Unsatisfiability Theorem). H ⊨ A iff H ∪ {¬A} is unsatisfi-
able.

 Proof: We prove both directions separately


1. “⇒”: Say H ⊨ A
1.1. For any φ with φ|=H we have φ|=A and thus φ̸|=(¬A).
2. “⇐”: Say H ∪ {¬A} is unsatisfiable.
2.1. For any φ with φ|=H we have φ̸|=(¬A) and thus φ|=A.
 Observation 12.1.2. Entailment can be tested via satisfiability.

Michael Kohlhase: Artificial Intelligence 1 360 2025-02-06
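To make Observation 12.1.2 concrete, here is a small Python sketch (not part of the original notes) that tests H ⊨ A by searching for a model of H ∪ {¬A} over all valuations; the nested-tuple encoding of formulae and the function names are illustrative assumptions of this sketch.

from itertools import product

# A proposition is a string; complex formulae are nested tuples,
# e.g. ("and", A, B), ("or", A, B), ("not", A), ("implies", A, B).
def evaluate(formula, valuation):
    if isinstance(formula, str):
        return valuation[formula]
    op, *args = formula
    if op == "not":
        return not evaluate(args[0], valuation)
    if op == "and":
        return evaluate(args[0], valuation) and evaluate(args[1], valuation)
    if op == "or":
        return evaluate(args[0], valuation) or evaluate(args[1], valuation)
    if op == "implies":
        return (not evaluate(args[0], valuation)) or evaluate(args[1], valuation)
    raise ValueError(f"unknown connective {op}")

def propositions(formula, acc=None):
    acc = set() if acc is None else acc
    if isinstance(formula, str):
        acc.add(formula)
    else:
        for arg in formula[1:]:
            propositions(arg, acc)
    return acc

def entails(hypotheses, conjecture):
    """H |= A iff H together with ("not", A) is unsatisfiable (Theorem 12.1.1)."""
    formulas = list(hypotheses) + [("not", conjecture)]
    props = sorted(set().union(*(propositions(f) for f in formulas)))
    for values in product([True, False], repeat=len(props)):
        valuation = dict(zip(props, values))
        if all(evaluate(f, valuation) for f in formulas):
            return False          # found a model of H together with ¬A
    return True                   # H together with ¬A is unsatisfiable

# Example: {P, P ⇒ Q} |= Q
print(entails(["P", ("implies", "P", "Q")], "Q"))   # True

Of course this brute-force check is exponential in the number of propositions; the calculi discussed next replace it by syntactic search.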


Test Calculi: A Paradigm for Automating Inference


 Definition 12.1.3. Given a formal system ⟨L, C , K, ⊨⟩, the task of theorem proving
consists in determining whether H⊢C C for a conjecture C ∈ L and hypotheses
H ⊆ L.
 Definition 12.1.4. Automated theorem proving (ATP) is the automation of theo-
rem proving
 Idea: A set H of hypotheses and a conjecture A induce a search problem Π_C^{H⊨A} :=
⟨S , A, T , I , G ⟩, where the states S are sets of formulae, the actions A are the
inference rules from C, the initial state I = H, and the goal states are those with
A ∈ S.
 Problem: ATP as a search problem does not admit good heuristics, since these
need to take the conjecture A into account.

 Idea: Turn the search around – using the unsatisfiability theorem (??).
 Definition 12.1.5. For a given conjecture A and hypotheses H, a test calculus T
tries to derive a refutation H, ¬A⊢T ⊥ instead of H⊢A, where ¬A is unsatisfiable iff
A is valid and ⊥ is an “obviously” unsatisfiable formula.

 Observation: A test calculus C induces a search problem where the initial state is
H ∪ {¬A} and S ∈ S is a goal state iff ⊥ ∈ S.(proximity of ⊥ easier for heuristics)
 Searching for ⊥ admits simple heuristics, e.g. size reduction. (⊥ minimal)

Michael Kohlhase: Artificial Intelligence 1 361 2025-02-06

12.1.1 Normal Forms


Before we can start, we will need to recap some nomenclature on formulae.

Recap: Atoms and Literals


 Definition 12.1.6. A formula is called atomic (or an atom) if it does not contain
logical constants, else it is called complex.
 Definition 12.1.7. Let ⟨L, K, ⊨⟩ be a logical system and A ∈ L, then we call a
pair Aα of a formula and a truth value α ∈ {T, F} a labeled formula. For a set Φ
of formulae we use Φα :={Aα | A ∈ Φ}.
We call a labeled formula AT positive and AF negative.
Definition 12.1.8. Let ⟨L, K, ⊨⟩ be a logical system and Aα a labeled formula.
Then we say that M ∈ K satisfies Aα (written M ⊨ Aα ), iff α = T and M ⊨ A or
α = F and M ⊭ A.

 Definition 12.1.9. Let ⟨L, K, ⊨⟩ be a logical system, A ∈ L atomic, and α ∈
{T, F}, then we call Aα a literal.
 Intuition: To satisfy a formula, we make it “true”. To satisfy a labeled formula
Aα , it must have the truth value α.

 Definition 12.1.10. For a literal Aα , we call the literal Aβ with α ̸= β the


opposite literal (or partner literal).

Michael Kohlhase: Artificial Intelligence 1 362 2025-02-06

The idea about literals is that they are atoms (the simplest formulae) that carry around their
intended truth value.

Alternative Definition: Literals


 Note: Literals are often defined without recurring to labeled formulae:
 Definition 12.1.11. A literal is an atom A (positive literal) or negated atom ¬A
(negative literal). A and ¬A are opposite literals.
 Note: This notion of literal is equivalent to the labeled formulae-notion of literal,
but does not generalize as well to logics with more than two truth values.

Michael Kohlhase: Artificial Intelligence 1 363 2025-02-06

Normal Forms
 There are two quintessential normal forms for propositional formulae: (there are
others as well)
 Definition 12.1.12. A formula is in conjunctive normal form (CNF) if it is T or a
conjunction of disjunctions of literals, i.e. if it is of the form ⋀_{i=1}^{n} ⋁_{j=1}^{m_i} l_{i,j}

 Definition 12.1.13. A formula is in disjunctive normal form (DNF) if it is F or a
disjunction of conjunctions of literals, i.e. if it is of the form ⋁_{i=1}^{n} ⋀_{j=1}^{m_i} l_{i,j}

 Observation 12.1.14. Every formula has equivalent formulae in CNF and DNF.

Michael Kohlhase: Artificial Intelligence 1 364 2025-02-06


12.2 Analytical Tableaux


Video Nuggets covering this section can be found at https://fau.tv/clip/id/23705 and
https://fau.tv/clip/id/23708.

12.2.1 Analytical Tableaux

Test Calculi: Tableaux and Model Generation


 Idea: A tableau calculus is a test calculus that
 analyzes a labeled formulae in a tree to determine satisfiability,
 its branches correspond to valuations (; models).

 Example 12.2.1. Tableau calculi try to construct models for labeled formulae:

Tableau refutation (Validity): ⊨ P ∧ Q ⇒ Q ∧ P

   (P ∧ Q ⇒ Q ∧ P)^F
   (P ∧ Q)^T
   (Q ∧ P)^F
   P^T
   Q^T
   P^F | Q^F
   ⊥   | ⊥

   No model.

Model generation (Satisfiability): ⊨ P ∧ (Q ∨ ¬R) ∧ ¬Q

   (P ∧ (Q ∨ ¬R) ∧ ¬Q)^T
   (P ∧ (Q ∨ ¬R))^T
   ¬Q^T
   Q^F
   P^T
   (Q ∨ ¬R)^T
   Q^T | ¬R^T
   ⊥   | R^F

   Herbrand model {P^T , Q^F , R^F }, i.e. φ := {P ↦ T, Q ↦ F, R ↦ F}

 Idea: Open branches in saturated tableaux yield models.

 Algorithm: Fully expand all possible tableaux, (no rule can be applied)
 Satisfiable, iff there are open branches (correspond to models)

Michael Kohlhase: Artificial Intelligence 1 365 2025-02-06

Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis
on when a formula can be made true (or false). Therefore the formulae are decorated with upper
indices that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is
decorated with the intended truth value T). This tableau uses the same rules as the refutation
tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a
closed branch and an open one. The latter corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.

Analytical Tableaux (Formal Treatment of T0 )


 Idea: A test calculus where
 A labeled formula is analyzed in a tree to determine satisfiability,
 branches correspond to valuations (models)

 Definition 12.2.2. The propositional tableau calculus T0 has two inference rules
per connective (one for each possible label):

   T0∧ : from (A ∧ B)^T add A^T and B^T to the branch
   T0∨ : from (A ∧ B)^F split the branch into A^F | B^F
   T0¬T : from (¬A)^T add A^F
   T0¬F : from (¬A)^F add A^T
   T0⊥ : from A^α and A^β with α ≠ β add ⊥ (the branch closes)

Use rules exhaustively as long as they contribute new material (⇝ termination)

 Definition 12.2.3. We call any tree (where | introduces branches) produced by the T0
inference rules from a set Φ of labeled formulae a tableau for Φ.

 Definition 12.2.4. Call a tableau saturated, iff no rule adds new material and a
branch closed, iff it ends in ⊥, else open. A tableau is closed, iff all of its branches
are.
In analogy to the ⊥ at the end of closed branches, we sometimes decorate open
branches with a 2 symbol.

Michael Kohlhase: Artificial Intelligence 1 366 2025-02-06

These inference rules act on tableaux and have to be read as follows: if the formulae over the line
appear in a tableau branch, then the branch can be extended by the formulae or branches below
the line. There are two rules for each primary connective, and a branch closing rule that adds the
special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 12.2.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth
value α; since all branches are closed (contain contradictions) this is impossible.
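To illustrate how the T0 rules can be mechanized, here is a minimal Python sketch (an illustration of mine, not an implementation from the course): it restricts formulae to ∧ and ¬ as discussed above, represents a branch as a set of labeled formulae, and reports whether every branch closes. The function names and the tuple encoding of formulae are assumptions of this sketch.

# A labeled formula is a pair (formula, label) with label True (T) or False (F).
# A string is an atom; ("not", A) and ("and", A, B) are the complex formulae
# (the other connectives are assumed to have been eliminated beforehand).
def is_atom(f):
    return isinstance(f, str)

def closes(branch):
    """True iff every T0-saturation of this branch contains opposite literals.
    Worked-off formulae are removed from the branch instead of being marked."""
    literals = {(f, a) for (f, a) in branch if is_atom(f)}
    if any((f, not a) in literals for (f, a) in literals):
        return True                                         # T0-bot
    for (f, a) in branch:
        if is_atom(f):
            continue
        rest = branch - {(f, a)}
        if f[0] == "not":                                   # T0-not-T / T0-not-F
            return closes(rest | {(f[1], not a)})
        elif f[0] == "and" and a:                           # T0-and
            return closes(rest | {(f[1], True), (f[2], True)})
        elif f[0] == "and":                                 # T0-or: split the branch
            return closes(rest | {(f[1], False)}) and closes(rest | {(f[2], False)})
        else:
            raise ValueError(f"unexpected connective {f[0]}")
    return False                                            # saturated, open branch

# Prove P ∧ Q ⇒ Q ∧ P valid: encode ⇒ via ¬ and ∧, label it F, expect all branches to close.
impl = ("not", ("and", ("and", "P", "Q"), ("not", ("and", "Q", "P"))))
print(closes(frozenset({(impl, False)})))   # True: the formula is a T0-theorem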

Analytical Tableaux (T0 continued)


 Definition 12.2.6 (T0 -Theorem/Derivability). A is a T0 -theorem (⊢T0 A), iff
there is a closed tableau with AF at the root.
Φ ⊆ wff0 (V0 ) derives A in T0 (Φ⊢T0 A), iff there is a closed tableau starting with AF
and ΦT . The tableau with only a branch of AF and ΦT is called initial for Φ⊢T0 A.

Michael Kohlhase: Artificial Intelligence 1 367 2025-02-06

Definition 12.2.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
We now look at a formulation of propositional logic with fancy variable names. Note that
loves(mary, bill) is just a variable name like P or X, which we have used earlier.

A Valid Real-World Example



 Example 12.2.8. If Mary loves Bill and John loves Mary, then John loves Mary

   (loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary))^F
   ¬(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^F
   (¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^T
   ¬¬(loves(mary, bill) ∧ loves(john, mary))^T
   ¬(loves(mary, bill) ∧ loves(john, mary))^F
   (loves(mary, bill) ∧ loves(john, mary))^T
   ¬loves(john, mary)^T
   loves(mary, bill)^T
   loves(john, mary)^T
   loves(john, mary)^F

This is a closed tableau, so loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary)
is a T0 -theorem.
As we will see, T0 is sound and complete, so

loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary)

is valid.

Michael Kohlhase: Artificial Intelligence 1 368 2025-02-06

We could have used the unsatisfiability theorem (??) here to show that “Mary loves Bill and John
loves Mary” entails “John loves Mary”. But there is a better way to show entailment: we directly
use derivability in T0 .

Deriving Entailment in T0
 Example 12.2.9. Mary loves Bill and John loves Mary together entail that John
loves Mary
   loves(mary, bill)^T
   loves(john, mary)^T
   loves(john, mary)^F

This is a closed tableau, so {loves(mary, bill), loves(john, mary)}⊢T0 loves(john, mary).
Again, as T0 is sound and complete we have

{loves(mary, bill), loves(john, mary)} ⊨ loves(john, mary)

Michael Kohlhase: Artificial Intelligence 1 369 2025-02-06

Note: We can also use the tableau calculus to try and show entailment (and fail). The nice thing
is that from the failed proof we can see what went wrong.

A Falsifiable Real-World Example


 Example 12.2.10. If Mary loves Bill or John loves Mary, then John loves Mary

Try proving the implication (this fails):

   ((loves(mary, bill) ∨ loves(john, mary)) ⇒ loves(john, mary))^F
   ¬(¬¬(loves(mary, bill) ∨ loves(john, mary)) ∧ ¬loves(john, mary))^F
   (¬¬(loves(mary, bill) ∨ loves(john, mary)) ∧ ¬loves(john, mary))^T
   ¬loves(john, mary)^T
   loves(john, mary)^F
   ¬¬(loves(mary, bill) ∨ loves(john, mary))^T
   ¬(loves(mary, bill) ∨ loves(john, mary))^F
   (loves(mary, bill) ∨ loves(john, mary))^T
   loves(mary, bill)^T  |  loves(john, mary)^T

Indeed we can make I φ (loves(mary, bill)) = T but I φ (loves(john, mary)) = F.

Michael Kohlhase: Artificial Intelligence 1 370 2025-02-06

Obviously, the tableau above is saturated, but not closed, so it is not a tableau proof for our initial
entailment conjecture. We have marked the literals on the open branch in green, since they allow us
to read off the conditions of the situation in which the entailment fails to hold. As we intuitively
argued above, this is the situation where Mary loves Bill. In particular, the open branch gives us
a variable assignment (marked in green) that satisfies the initial formula. In this case, Mary loves
Bill, which is a situation where the entailment fails.
Again, the derivability version is much simpler:

Testing for Entailment in T0


 Example 12.2.11. Does Mary loves Bill or John loves Mary entail that John
loves Mary?
   (loves(mary, bill) ∨ loves(john, mary))^T
   loves(john, mary)^F
   loves(mary, bill)^T  |  loves(john, mary)^T

This saturated tableau has an open branch that shows that the interpretation with
I φ (loves(mary, bill)) = T but I φ (loves(john, mary)) = F falsifies the derivability/en-
tailment conjecture.

Michael Kohlhase: Artificial Intelligence 1 371 2025-02-06

We have seen in the examples above that while it is possible to get by with only the connectives
∧ and ¬, it is a bit unnatural and tedious, since we need to eliminate the other connectives first.
In this section, we will make the calculus less frugal by adding rules for the other connectives,
without losing the advantage of dealing with a small calculus, which is good for making statements
about the calculus itself.

12.2.2 Practical Enhancements for Tableaux


The main idea here is to add the new rules as derivable inference rules, i.e. rules that only
abbreviate derivations in the original calculus. Generally, adding derivable inference rules does
not change the derivation relation of the calculus, and is therefore a safe thing to do. In particular,
we will add the following rules to our tableau calculus.
We will convince ourselves that the first rule is derivable, and leave the other ones as an exercise.

Derived Rules of Inference


 Definition 12.2.12. An inference rule

   A1 . . . An
   -----------
        C

is called derivable (or a derived rule) in a calculus C, if there is a C-derivation
A1 , . . ., An ⊢C C.

 Definition 12.2.13. We have the following derivable inference rules in T0 :

   from A^T and (A ⇒ B)^T add B^T
   from (A ⇒ B)^T branch into A^F | B^T
   from (A ⇒ B)^F add A^T and B^F
   from (A ∨ B)^T branch into A^T | B^T
   from (A ∨ B)^F add A^F and B^F
   from (A ⇔ B)^T branch into A^T, B^T | A^F, B^F
   from (A ⇔ B)^F branch into A^T, B^F | A^F, B^T

The branching rule for (A ⇒ B)^T is justified by the T0 -derivation

   (A ⇒ B)^T
   (¬A ∨ B)^T
   ¬(¬¬A ∧ ¬B)^T
   (¬¬A ∧ ¬B)^F
   ¬¬A^F | ¬B^F
   ¬A^T  | B^T
   A^F   |

Michael Kohlhase: Artificial Intelligence 1 372 2025-02-06

With these derived rules, theorem proving becomes quite efficient. For example, the tableau
from ?? takes the following simpler form:

Tableaux with derived Rules (example)


Example 12.2.14.

   (loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary))^F
   (loves(mary, bill) ∧ loves(john, mary))^T
   loves(john, mary)^F
   loves(mary, bill)^T
   loves(john, mary)^T

Michael Kohlhase: Artificial Intelligence 1 373 2025-02-06

12.2.3 Soundness and Termination of Tableaux


As always we need to convince ourselves that the calculus is sound, otherwise, tableau proofs do
not guarantee validity, which we are after. Since we are now in a refutation setting we cannot just
show that the inference rules preserve validity: we care about unsatisfiability (which is the dual
notion to validity), as we want to show the initial labeled formula to be unsatisfiable. Before we
can do this, we have to ask ourselves, what it means to be (un)-satisfiable for a labeled formula
or a tableau.

Soundness (Tableau)

 Idea: A test calculus is refutation sound, iff its inference rules preserve satisfiability
and the goal formulae are unsatisfiable.
 Definition 12.2.15. A labeled formula Aα is valid under φ, iff I φ (A) = α.

 Definition 12.2.16. A tableau T is satisfiable, iff there is a satisfiable branch P


in T , i.e. if the set of formulae on P is satisfiable.
 Lemma 12.2.17. T0 rules transform satisfiable tableaux into satisfiable ones.
 Theorem 12.2.18 (Soundness). T0 is sound, i.e. Φ ⊆ wff0 (V0 ) is valid, if there is
a closed tableau T for ΦF .

 Proof: by contradiction
1. Suppose Φ is falsifiable, i.e. not valid.
2. Then the initial tableau is satisfiable, (ΦF satisfiable)
3. so T is satisfiable, by ??.
4. Thus there is a satisfiable branch (by definition)
5. but all branches are closed (T closed)
 Theorem 12.2.19 (Completeness). T0 is complete, i.e. if Φ ⊆ wff0 (V0 ) is valid,
then there is a closed tableau T for ΦF .
Proof sketch: Proof difficult/interesting; see ??

Michael Kohlhase: Artificial Intelligence 1 374 2025-02-06

Thus we only have to prove ??; this is relatively easy to do. For instance for the first rule: if we
have a tableau that contains (A ∧ B)^T and is satisfiable, then it must have a satisfiable branch.
If (A ∧ B)^T is not on this branch, the tableau extension will not change satisfiability, so we can
assume that it is on the satisfiable branch and thus I φ (A ∧ B) = T for some variable assignment
φ. Thus I φ (A) = T and I φ (B) = T, so after the extension (which adds the formulae A^T and B^T
to the branch), the branch is still satisfiable. The cases for the other rules are similar.
The next result is a very important one, it shows that there is a procedure (the tableau procedure)
that will always terminate and answer the question whether a given propositional formula is valid
or not. This is very important, since other logics (like the often-studied first-order logic) does not
enjoy this property.

 Termination for Tableaux


 Lemma 12.2.20. T0 terminates, i.e. every T0 tableau becomes saturated after
finitely many rule applications.
 Proof: By examining the rules wrt. a measure µ
1. Let us call a labeled formulae Aα worked off in a tableau T , if a T0 rule has already
been applied to it.
2. It is easy to see that applying rules to worked off formulae will only add formulae that
are already present in its branch.
3. Let µ(T ) be the number of connectives in labeled formulae in T that are not worked
off.
4. Then each rule application to a labeled formula in T that is not worked off reduces
µ(T ) by at least one. (inspect the rules)
5. At some point the tableau only contains worked off formulae and literals.
6. Since there are only finitely many literals in T , we can only apply T0⊥ a finite
number of times.

 Corollary 12.2.21. T0 induces a decision procedure for validity in PL0 .

Proof: We combine the results so far


 1. By ?? it is decidable whether ⊢T0 A
2. By soundness (??) and completeness (??), ⊢T0 A iff A is valid.

Michael Kohlhase: Artificial Intelligence 1 375 2025-02-06

Note: The proof above only works for the “base T0 ” because (only) there the rules do not “copy”.
A rule like

   (A ⇔ B)^T
   -------------
   A^T  |  A^F
   B^T  |  B^F

does, and in particular the number of non-worked-off variables below the line is larger than above
the line. For such rules, we would need a more intricate version of µ which – instead of returning
a natural number – returns a more complex object; a multiset of numbers would work here. In
our proof we are just assuming that the defined connectives have already been eliminated. The
tableau calculus basically computes the disjunctive normal form: every branch is a disjunct that
is a conjunction of literals. The method relies on the fact that a DNF is unsatisfiable, iff each
disjunct is, i.e. iff each branch contains a contradiction in form of a pair of opposite literals.

12.3 Resolution for Propositional Logic


12.3.1 Resolution for Propositional Logic
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/23712.
The next calculus is a test calculus based on the conjunctive normal form: the resolution calculus.
In contrast to the tableau method, it does not compute the normal form as it goes along, but has
a pre-processing step that does this and a single inference rule that maintains the normal form.
The goal of this calculus is to derive the empty clause, which is unsatisfiable.

Another Test Calculus: Resolution


 Definition 12.3.1. A clause is a disjunction l1^α1 ∨ . . . ∨ ln^αn of literals. We will use
2 for the “empty” disjunction (no disjuncts) and call it the empty clause. A clause
with exactly one literal is called a unit clause.

 Definition 12.3.2 (Resolution Calculus). The resolution calculus R0 operates on
clause sets via a single inference rule:

   P^T ∨ A    P^F ∨ B
   -------------------  R
        A ∨ B

This rule allows to add the resolvent (the clause below the line) to a clause set which
contains the two clauses above. The literals P^T and P^F are called cut literals.
 Definition 12.3.3 (Resolution Refutation). Let S be a clause set, then we call
an R0 -derivation of 2 from S R0 -refutation and write D : S⊢R0 2.

Michael Kohlhase: Artificial Intelligence 1 376 2025-02-06
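A hedged sketch of how R0 could be mechanized (my own illustration, not an implementation from the notes): clauses are represented as frozensets of (proposition, label) literals, a helper computes all resolvents of two clauses, and a naive saturation loop searches for the empty clause.

from itertools import combinations

# A literal is a pair (proposition, label); a clause is a frozenset of literals;
# the empty frozenset plays the role of the empty clause 2.
def resolvents(c1, c2):
    """All clauses obtainable from c1 and c2 by one application of the rule R."""
    out = set()
    for (p, a) in c1:
        if (p, not a) in c2:                      # cut literals P^T / P^F
            out.add((c1 - {(p, a)}) | (c2 - {(p, not a)}))
    return out

def refutable(clauses):
    """Saturate the clause set under resolution; True iff 2 is derivable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True                   # derived the empty clause
                if r not in clauses:
                    new.add(r)
        if not new:
            return False                          # saturated without the empty clause
        clauses |= new

# P^T ; P^F ∨ Q^T ; Q^F is unsatisfiable:
S = [frozenset({("P", True)}),
     frozenset({("P", False), ("Q", True)}),
     frozenset({("Q", False)})]
print(refutable(S))   # True

Real resolution provers add clause selection strategies and redundancy elimination; this sketch only saturates blindly.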



Clause Normal Form Transformation (A calculus)


 Definition 12.3.4. We will often write a clause set {C 1 , . . ., C n } as C 1 ; . . . ; C n ,
use S ; T for the union of the clause sets S and T , and S ; C for the extension by a
clause C.
 Definition 12.3.5 (Transformation into Clause Normal Form). The CNF trans-
formation calculus CNF0 consists of the following four inference rules on sets of
labeled formulae.
   C ∨ (A ∨ B)^T   reduces to   C ∨ A^T ∨ B^T
   C ∨ (A ∨ B)^F   reduces to   C ∨ A^F ; C ∨ B^F
   C ∨ ¬A^T        reduces to   C ∨ A^F
   C ∨ ¬A^F        reduces to   C ∨ A^T

 Definition 12.3.6. We write CNF0 (Aα ) for the set of all clauses derivable from
Aα via the rules above.

Michael Kohlhase: Artificial Intelligence 1 377 2025-02-06

Note that the C-terms in the definition of the inference rules are necessary, since we assumed that
the assumptions of the inference rule must match full clauses. The C-terms are used with the
convention that they are optional, so that we can also simplify (A ∨ B)^T to A^T ∨ B^T.
Background: The background behind this notation is that A and F ∨ A are equivalent for any
A. That allows us to interpret the missing C-terms in the assumptions as F (the empty disjunction) and thus leave them out.
The clause normal form translation as we have formulated it here is quite frugal; we have left
out rules for the connectives ∧, ⇒, and ⇔, relying on the fact that formulae containing these
connectives can be translated into ones without before CNF transformation.
having a calculus with few inference rules is that we can prove meta properties like soundness and
completeness with less effort (these proofs usually require one case per inference rule). On the
other hand, adding specialized inference rules makes proofs shorter and more readable.
Fortunately, there is a way to have your cake and eat it. Derivable inference rules have the property
that they are formally redundant, since they do not change the expressive power of the calculus.
Therefore we can leave them out when proving meta-properties, but include them when actually
using the calculus.
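As an illustration of the CNF0 idea (including the derived rules for ⇒ shown below), here is a small recursive Python sketch of mine; the nested-tuple representation of formulae and the function name cnf are assumptions, and the recursion replaces the explicit C-context of the rules by a cross-product over the subresults.

# cnf(A, label) returns a set of clauses (frozensets of (proposition, label) literals).
def cnf(formula, label):
    if isinstance(formula, str):                          # literal: a one-element clause
        return {frozenset({(formula, label)})}
    op = formula[0]
    if op == "not":                                       # C ∨ ¬A^α  ~>  flip the label
        return cnf(formula[1], not label)
    if (op, label) in {("or", True), ("and", False)}:     # disjunctive: merge sub-clauses
        left, right = (formula[1], label), (formula[2], label)
    elif (op, label) in {("or", False), ("and", True)}:   # conjunctive: separate clauses
        return cnf(formula[1], label) | cnf(formula[2], label)
    elif op == "implies" and label:                       # C ∨ (A ⇒ B)^T ~> C ∨ A^F ∨ B^T
        left, right = (formula[1], False), (formula[2], True)
    elif op == "implies":                                 # C ∨ (A ⇒ B)^F ~> C ∨ A^T ; C ∨ B^F
        return cnf(formula[1], True) | cnf(formula[2], False)
    else:
        raise ValueError(f"unknown connective {op}")
    return {c1 | c2 for c1 in cnf(*left) for c2 in cnf(*right)}

# Example: the clause normal form of ¬(P ∨ Q) labeled T is the two unit clauses P^F and Q^F.
print(cnf(("not", ("or", "P", "Q")), True))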

Derived Rules of Inference


 Definition 12.3.7. An inference rule

   A1 . . . An
   -----------
        C

is called derivable (or a derived rule) in a calculus C, if there is a C-derivation
A1 , . . ., An ⊢C C.
 Idea: Derived rules make derivations shorter.

 Example 12.3.8. The derived rule

   C ∨ (A ⇒ B)^T
   --------------
   C ∨ A^F ∨ B^T

abbreviates the CNF0 -derivation

   C ∨ (A ⇒ B)^T  ;  C ∨ (¬A ∨ B)^T  ;  C ∨ ¬A^T ∨ B^T  ;  C ∨ A^F ∨ B^T

 Other Derived CNF Rules:

   C ∨ (A ⇒ B)^T   reduces to   C ∨ A^F ∨ B^T
   C ∨ (A ⇒ B)^F   reduces to   C ∨ A^T ; C ∨ B^F
   C ∨ (A ∧ B)^T   reduces to   C ∨ A^T ; C ∨ B^T
   C ∨ (A ∧ B)^F   reduces to   C ∨ A^F ∨ B^F

Michael Kohlhase: Artificial Intelligence 1 378 2025-02-06

With these derivable rules, theorem proving becomes quite efficient. To get a better understanding
of the calculus, we look at an example: we prove an axiom of the Hilbert Calculus we have studied
above.

Example: Proving Axiom S with Resolution


 Example 12.3.9. Clause Normal Form transformation

   ((P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R)^F
   (P ⇒ Q ⇒ R)^T ; ((P ⇒ Q) ⇒ P ⇒ R)^F
   P^F ∨ (Q ⇒ R)^T ; (P ⇒ Q)^T ; (P ⇒ R)^F
   P^F ∨ Q^F ∨ R^T ; P^F ∨ Q^T ; P^T ; R^F

Result: {P^F ∨ Q^F ∨ R^T , P^F ∨ Q^T , P^T , R^F }
 Example 12.3.10. Resolution Proof

1 P F ∨ QF ∨ RT initial
2 P F ∨ QT initial
3 PT initial
4 RF initial
5 P F ∨ QF resolve 1.3 with 4.1
6 QF resolve 5.1 with 3.1
7 PF resolve 2.2 with 6.1
8 2 resolve 7.1 with 3.1

Michael Kohlhase: Artificial Intelligence 1 379 2025-02-06
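As a quick semantic cross-check of the two examples above (a brute-force sketch of mine, not part of the notes): the clause set from Example 12.3.9 has no satisfying valuation, which is exactly what the resolution refutation in Example 12.3.10 establishes syntactically.

from itertools import product

clauses = [                                   # the clause set from Example 12.3.9
    {("P", False), ("Q", False), ("R", True)},
    {("P", False), ("Q", True)},
    {("P", True)},
    {("R", False)},
]
props = ["P", "Q", "R"]
satisfiable = any(
    all(any(val[p] == a for (p, a) in c) for c in clauses)
    for val in (dict(zip(props, vs)) for vs in product([True, False], repeat=3))
)
print(satisfiable)   # False: the clause set is unsatisfiable, matching the resolution proof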

Clause Set Simplification


 Observation: Let ∆ be a clause set, l a literal with l ∈ ∆ (unit clause), and ∆′
be ∆ where

 all clauses l ∨ C have been removed, and
 all clauses l̄ ∨ C, where l̄ is the opposite literal of l, have been shortened to C.

Then ∆ is satisfiable, iff ∆′ is. We call ∆′ the clause set simplification of ∆ wrt. l.
 Corollary 12.3.11. Adding clause set simplification wrt. unit clauses to R0 does
not affect soundness and completeness.
 This is almost always a good idea! (clause set simplification is cheap)

Michael Kohlhase: Artificial Intelligence 1 380 2025-02-06
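A sketch (mine, not from the notes) of clause set simplification wrt. unit clauses, iterated in the way the UP rule of the DPLL procedure below does; the clause representation and function names are assumptions of the sketch.

def simplify(clauses, literal):
    """Clause set simplification wrt. the unit clause {literal}: drop clauses that
    contain the literal, shorten the others by its opposite literal."""
    prop, label = literal
    opposite = (prop, not label)
    return {c - {opposite} for c in clauses if literal not in c}

def unit_propagate(clauses):
    """Repeatedly simplify wrt. unit clauses; return the simplified set and the assignment."""
    clauses, assignment = set(clauses), {}
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None or frozenset() in clauses:
            return clauses, assignment
        (prop, label), = unit
        assignment[prop] = label
        clauses = simplify(clauses, (prop, label))

# On S^T ; Q^F ∨ P^F ; P^T ∨ Q^F ∨ R^F ∨ S^F ; Q^T ∨ S^F ; R^T ∨ S^F
# simplification wrt. unit clauses alone already derives the empty clause.
S = {frozenset({("S", True)}), frozenset({("Q", False), ("P", False)}),
     frozenset({("Q", True), ("S", False)}), frozenset({("R", True), ("S", False)}),
     frozenset({("P", True), ("Q", False), ("R", False), ("S", False)})}
result, assignment = unit_propagate(S)
print(frozenset() in result)   # True: the empty clause appears, so the set is unsatisfiable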



12.3.2 Killing a Wumpus with Propositional Inference


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/23713.
Let us now consider an extended example, where we also address the question of how inference
in PL0 – here resolution – is embedded into the rational agent metaphor we use in AI-1: we come
back to the Wumpus world.

Applying Propositional Inference: Where is the Wumpus?


 Example 12.3.12 (Finding the Wumpus). The situation and what the agent
knows

 What should the agent do next and why?


 One possibility: Convince yourself that the Wumpus is in [1, 3] and shoot it.
 What is the general mechanism here? (for the agent function)

Michael Kohlhase: Artificial Intelligence 1 381 2025-02-06

Before we come to the general mechanism, we will go into how we would “convince ourselves that
the Wumpus is in [1, 3].

Where is the Wumpus? Our Knowledge

 Idea: We formalize the knowledge about the Wumpus world in PL0 and use a test
calculus to check for entailment.
 Simplification: We worry only about the Wumpus and stench:
S i,j ≙ stench in [i, j], W i,j ≙ Wumpus in [i, j].

 Propositions whose value we know: ¬S 1,1 , ¬W 1,1 , ¬S 2,1 , ¬W 2,1 , S 1,2 , ¬W 1,2 .
 Knowledge about the Wumpus and smell:
From Cell adjacent to Wumpus: Stench (else: None), we get

R1 := ¬S 1,1 ⇒ ¬W 1,1 ∧ ¬W 1,2 ∧ ¬W 2,1


R2 := ¬S 2,1 ⇒ ¬W 1,1 ∧ ¬W 2,1 ∧ ¬W 2,2 ∧ ¬W 3,1
R3 := ¬S 1,2 ⇒ ¬W 1,1 ∧ ¬W 1,2 ∧ ¬W 2,2 ∧ ¬W 1,3
R4 := S 1,2 ⇒ (W 1,3 ∨ W 2,2 ∨ W 1,1 )
..
.
 To show:
R1 , R2 , R3 , R4 ⊨ W 1,3 (we will use resolution)

Michael Kohlhase: Artificial Intelligence 1 382 2025-02-06

The first step is to compute the clause normal form of the relevant knowledge.

And Now Using Resolution Conventions


 We obtain the clause set ∆ composed of the following clauses:

 Propositions whose value we know: S 1,1 F , W 1,1 F , S 2,1 F , W 2,1 F , S 1,2 T ,


W 1,2 F
 Knowledge about the Wumpus and smell:
from clauses
R1 S 1,1 T ∨ W 1,1 F , S 1,1 T ∨ W 1,2 F , S 1,1 T ∨ W 2,1 F
R2 S 2,1 T ∨ W 1,1 F , S 2,1 T ∨ W 2,1 F , S 2,1 T ∨ W 2,2 F , S 2,1 T ∨ W 3,1 F
R3 S 1,2 T ∨ W 1,1 F , S 1,2 T ∨ W 1,2 F , S 1,2 T ∨ W 2,2 F , S 1,2 T ∨ W 1,3 F
R4 S 1,2 F ∨ W 1,3 T ∨ W 2,2 T ∨ W 1,1 T
 Negated goal formula: W 1,3 F

Michael Kohlhase: Artificial Intelligence 1 383 2025-02-06

Given this clause normal form, we only need to generate the empty clause via repeated applications
of the resolution rule.

Resolution Proof Killing the Wumpus!


 Example 12.3.13 (Where is the Wumpus). We show a derivation that proves
that he is in (1, 3).
 Assume the Wumpus is not in (1, 3). Then either there’s no stench in (1, 2),
or the Wumpus is in some other neighbor cell of (1, 2).
 Parents: W 1,3 F and S 1,2 F ∨ W 1,3 T ∨ W 2,2 T ∨ W 1,1 T .
 Resolvent: S 1,2 F ∨ W 2,2 T ∨ W 1,1 T .
 There’s a stench in (1, 2), so it must be another neighbor.
 Parents: S 1,2 T and S 1,2 F ∨ W 2,2 T ∨ W 1,1 T .
 Resolvent: W 2,2 T ∨ W 1,1 T .

 We’ve been to (1, 1), and there’s no Wumpus there, so it can’t be (1, 1).
 Parents: W 1,1 F and W 2,2 T ∨ W 1,1 T .
 Resolvent: W 2,2 T .
 There is no stench in (2, 1) so it can’t be (2, 2) either, in contradiction.
 Parents: S 2,1 F and S 2,1 T ∨ W 2,2 F .
 Resolvent: W 2,2 F .
 Parents: W 2,2 F and W 2,2 T .
 Resolvent: 2.

As resolution is sound, we have shown that indeed R1 , R2 , R3 , R4 ⊨ W 1,3 .

Michael Kohlhase: Artificial Intelligence 1 384 2025-02-06
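As a sanity check on the refutation above, the following brute-force sketch (mine, not part of the notes) confirms semantically that the clause set ∆ together with W 1,3 F has no model; the "+"/"-" string encoding of literals is just a convenience of this sketch.

from itertools import product

def clause(spec):
    """Parse e.g. '-S12 +W13' into [('S12', False), ('W13', True)]."""
    return [(tok[1:], tok[0] == "+") for tok in spec.split()]

delta = [clause(s) for s in [
    "-S11", "-W11", "-S21", "-W21", "+S12", "-W12",            # percepts
    "+S11 -W11", "+S11 -W12", "+S11 -W21",                      # R1
    "+S21 -W11", "+S21 -W21", "+S21 -W22", "+S21 -W31",         # R2
    "+S12 -W11", "+S12 -W12", "+S12 -W22", "+S12 -W13",         # R3
    "-S12 +W13 +W22 +W11",                                      # R4
    "-W13",                                                     # negated goal W13^F
]]
props = sorted({p for c in delta for (p, _) in c})
models = (dict(zip(props, vs)) for vs in product([True, False], repeat=len(props)))
print(any(all(any(v[p] == a for (p, a) in c) for c in delta) for v in models))
# False: no model exists, so the percepts and R1-R4 entail W 1,3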

Now that we have seen how we can use propositional inference to derive consequences of the
percepts and world knowledge, let us come back to the question of a general mechanism for agent
functions with propositional inference.

Where does the Conjecture W 1,3 F come from?

 Question: Where did the W 1,3 F come from?


 Observation 12.3.14. We need a general mechanism for making conjectures.

 Idea: Interpret the Wumpus world as a search problem P := ⟨S , A, T , I , G ⟩ where


 the states S are given by the cells (and agent orientation) and
 the actions A by the possible actions of the agent.
Use tree search as the main agent function and a test calculus for testing all dangers
(pits), opportunities (gold) and the Wumpus.
 Example 12.3.15 (Back to the Wumpus). In ??, the agent is in [1, 2], it has
perceived stench, and the possible actions include shoot, and goForward. Evalu-
ating either of these leads to the conjecture W 1,3 . And since W 1,3 is entailed, the
action shoot probably comes out best, heuristically.

 Remark: Analogous to the backtracking with inference algorithm from CSP.

Michael Kohlhase: Artificial Intelligence 1 385 2025-02-06

Admittedly, the search framework from ?? does not quite cover the agent function we have here,
since that assumes that the world is fully observable, which the Wumpus world is emphatically not.
But it already gives us a good impression of what would be needed for the “general mechanism”.

12.4 Conclusion
Summary
 Every propositional formula can be brought into conjunctive normal form (CNF),
which can be identified with a set of clauses.

 The tableau and resolution calculi are deduction procedures based on trying to
derive a contradiction from the negated theorem (a closed tableau or the empty
clause). They are refutation complete, and can be used to prove KB ⊨ A by
showing that KB ∪ {¬A} is unsatisfiable.

Michael Kohlhase: Artificial Intelligence 1 386 2025-02-06

Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
AI-1, but provide one for the calculi introduced so far in ??.
Chapter 13

Propositional Reasoning: SAT Solvers

13.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25019.

Reminder: Our Agenda for Propositional Logic


 ??: Basic definitions and concepts; machine-oriented calculi
 Sets up the framework. Tableaux and resolution are the quintessential reasoning
procedures underlying most successful SAT solvers.

 This chapter: The Davis Putnam procedure and clause learning.


 State-of-the-art algorithms for reasoning about propositional logic, and an im-
portant observation about how they behave.

Michael Kohlhase: Artificial Intelligence 1 387 2025-02-06

SAT: The Propositional Satisfiability Problem


 Definition 13.1.1. The SAT problem (SAT): Given a propositional formula A,
decide whether or not A is satisfiable. We denote the class of all SAT problems
with SAT
 The SAT problem was the first problem proved to be NP-complete!

 A is commonly assumed to be in CNF. This is without loss of generality, because


any A can be transformed into a satisfiability-equivalent CNF formula (cf. ??) in
polynomial time.
 Active research area, annual SAT conference, lots of tools etc. available: http://www.satlive.org/

 Definition 13.1.2. Tools addressing SAT are commonly referred to as SAT solvers.


 Recall: To decide whether KB ⊨ A, decide satisfiability of θ := KB ∪ {¬A}: θ


is unsatisfiable iff KB ⊨ A.
 Consequence: Deduction can be performed using SAT solvers.

Michael Kohlhase: Artificial Intelligence 1 388 2025-02-06

SAT vs. CSP


 Recall: Constraint network ⟨V , D, C ⟩ has variables v ∈ V with finite domains
Dv ∈ D, and binary constraints C uv ∈ C which are relations over u, and v speci-
fying the permissible combined assignments to u and v. One extension is to allow
constraints of higher arity.
 Observation 13.1.3 (SAT: A kind of CSP). SAT can be viewed as a CSP problem
in which all variable domains are Boolean, and the constraints have unbounded arity.
 Theorem 13.1.4 (Encoding CSP as SAT). Given any constraint network C, we
can in low order polynomial time construct a CNF formula A(C) that is satisfiable
iff C is solvable.
 Proof: We design a formula, relying on known transformation to CNF
1. encode multi-XOR for each variable
2. encode each constraint by DNF over relation
3. Running time: O(nd2 +md2 ) where n is the number of variables, d the domain
size, and m the number of constraints.

 Upshot: Anything we can do with CSP, we can (in principle) do with SAT.

Michael Kohlhase: Artificial Intelligence 1 389 2025-02-06
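To make the encoding idea tangible, here is a hedged Python sketch of a direct CSP-to-CNF encoding; it uses conflict clauses for forbidden value pairs, which differs in detail from the DNF-based construction in the proof sketch above, and all names are my own assumptions.

def csp_to_cnf(variables, domains, constraints):
    """Encode a binary constraint network as clauses over propositions
    ('x', v, d), read as 'variable v takes value d'."""
    clauses = []
    for v in variables:
        vals = domains[v]
        clauses.append([(("x", v, d), True) for d in vals])            # at least one value
        clauses += [[(("x", v, d1), False), (("x", v, d2), False)]     # at most one value
                    for i, d1 in enumerate(vals) for d2 in vals[i + 1:]]
    for (u, v), allowed in constraints.items():                        # forbidden pairs become conflict clauses
        clauses += [[(("x", u, a), False), (("x", v, b), False)]
                    for a in domains[u] for b in domains[v] if (a, b) not in allowed]
    return clauses

# Toy example: two variables over {0, 1} that must take different values.
cnf = csp_to_cnf(["u", "v"], {"u": [0, 1], "v": [0, 1]},
                 {("u", "v"): {(0, 1), (1, 0)}})
print(len(cnf), "clauses")   # 6 clauses

The clause count stays polynomial in the number of variables, the domain size, and the number of constraints, in line with the running time stated in the proof.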

Example Application: Hardware Verification


 Example 13.1.5 (Hardware Verification).
 Counter, repeatedly from c = 0 to c = 2.

 2 bits x1 and x0 ; c = 2 ∗ x1 + x0 .
 (FF ≙ Flip-Flop, D ≙ Data IN, CLK ≙ Clock)
 To Verify: If c < 3 in current clock cycle,
then c < 3 in next clock cycle.

 Step 1: Encode into propositional logic.


 Propositions: x1 , x0 ; and y 1 , y 0 (value in next cycle).
 Transition relation: y 1 ⇔ x0 ; y 0 ⇔ ¬(x1 ∨ x0 ).
 Initial state: ¬(x1 ∧ x0 ).
 Error property: y 1 ∧ y 0 .
 Step 2: Transform to CNF, encode as a clause set ∆.

 Clauses: y 1 F ∨ x0 T , y 1 T ∨ x0 F , y 0 T ∨ x1 T ∨ x0 T , y 0 F ∨ x1 F , y 0 F ∨ x0 F , x1 F ∨ x0 F ,
y1 T , y0 T .
 Step 3: Call a SAT solver (up next).

Michael Kohlhase: Artificial Intelligence 1 390 2025-02-06
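Step 3 can be illustrated without a real SAT solver: the following brute-force sketch (mine, not part of the notes) checks the conjunction from Step 1 over all 16 assignments; an actual verification flow would hand the clause set of Step 2 to a SAT solver instead.

from itertools import product

def error_reachable(x1, x0, y1, y0):
    """Transition relation, the current-cycle constraint c < 3, and the error c >= 3 next."""
    transition = (y1 == x0) and (y0 == (not (x1 or x0)))
    current_ok = not (x1 and x0)          # c < 3 in the current clock cycle
    error_next = y1 and y0                # c >= 3 in the next clock cycle
    return transition and current_ok and error_next

# The property "c < 3 is preserved" holds iff this formula is unsatisfiable.
print(any(error_reachable(*bits) for bits in product([False, True], repeat=4)))
# False: no assignment reaches the error, so the counter never leaves c < 3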

Our Agenda for This Chapter


 The Davis-Putnam (Logemann-Loveland) Procedure: How to systematically
test satisfiability?

 The quintessential SAT solving procedure, DPLL.


 DPLL is (A Restricted Form of) Resolution: How does this relate to what we
did in the last chapter?
 mathematical understanding of DPLL.

 Why Did Unit Propagation Yield a Conflict?: How can we analyze which
mistakes were made in “dead” search branches?
 Knowledge is power, see next.
 Clause Learning: How can we learn from our mistakes?

 One of the key concepts, perhaps the key concept, underlying the success of
SAT.
 Phase Transitions – Where the Really Hard Problems Are: Are all formulas
“hard” to solve?
 The answer is “no”. And in some cases we can figure out exactly when they
are/aren’t hard to solve.

Michael Kohlhase: Artificial Intelligence 1 391 2025-02-06

13.2 The Davis-Putnam (Logemann-Loveland) Procedure


A Video Nugget covering this section can be found at https://fau.tv/clip/id/25026.

The DPLL Procedure


 Definition 13.2.1. The Davis Putnam procedure (DPLL) is a SAT solver called
on a clause set ∆ and the empty assignment ϵ. It interleaves unit propagation (UP)
and splitting:
function DPLL(∆,I) returns a partial assignment I, or ‘‘unsatisfiable’’
/∗ Unit Propagation (UP) Rule: ∗/
∆′ := a copy of ∆; I ′ := I
while ∆′ contains a unit clause C = P α do
extend I ′ with [α/P ], clause−set simplify ∆′
/∗ Termination Test: ∗/
if 2 ∈ ∆′ then return ‘‘unsatisfiable’’

if ∆′ = {} then return I ′
/∗ Splitting Rule: ∗/
select some proposition P for which I ′ is not defined
I ′′ := I ′ extended with one truth value for P ; ∆′′ := a copy of ∆′ ; simplify ∆′′
if I ′′′ := DPLL(∆′′ ,I ′′ ) ̸= ‘‘unsatisfiable’’ then return I ′′′
I ′′ := I ′ extended with the other truth value for P ; ∆′′ := ∆′ ; simplify ∆′′
return DPLL(∆′′ ,I ′′ )

 In practice, of course one uses flags etc. instead of “copy”.

Michael Kohlhase: Artificial Intelligence 1 392 2025-02-06
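Here is a direct Python transcription of the procedure above (a sketch of mine; the clause representation, the use of dicts for partial assignments, and the naive variable selection are assumptions, and real implementations avoid copying, as noted above).

def dpll(clauses, assignment):
    """Clauses are frozensets of (proposition, label) literals; assignment is the partial assignment I."""
    clauses, assignment = set(clauses), dict(assignment)
    # Unit Propagation (UP) rule
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        (p, a), = unit
        assignment[p] = a
        clauses = {c - {(p, not a)} for c in clauses if (p, a) not in c}
    # Termination test
    if frozenset() in clauses:
        return "unsatisfiable"
    if not clauses:
        return assignment
    # Splitting rule
    p = next(p for c in clauses for (p, _) in c if p not in assignment)
    for a in (True, False):
        simplified = {c - {(p, not a)} for c in clauses if (p, a) not in c}
        result = dpll(simplified, {**assignment, p: a})
        if result != "unsatisfiable":
            return result
    return "unsatisfiable"

# Example 13.2.2: P^T ∨ Q^T ∨ R^F ; P^F ∨ Q^F ; R^T ; P^T ∨ Q^F
delta = [frozenset({("P", True), ("Q", True), ("R", False)}),
         frozenset({("P", False), ("Q", False)}),
         frozenset({("R", True)}),
         frozenset({("P", True), ("Q", False)})]
print(dpll(delta, {}))   # e.g. {'R': True, 'P': True, 'Q': False}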

DPLL: Example (Vanilla1)

 Example 13.2.2 (UP and Splitting). Let ∆ := P^T ∨ Q^T ∨ R^F ; P^F ∨ Q^F ; R^T ; P^T ∨ Q^F
1. UP Rule: R ↦ T, leaving P^T ∨ Q^T ; P^F ∨ Q^F ; P^T ∨ Q^F
2. Splitting Rule:
   2a. P ↦ F, leaving Q^T ; Q^F             2b. P ↦ T, leaving Q^F
   3a. UP Rule: Q ↦ T, deriving 2,          3b. UP Rule: Q ↦ F, clause set empty,
       returning “unsatisfiable”                returning “R ↦ T, P ↦ T, Q ↦ F”

Michael Kohlhase: Artificial Intelligence 1 393 2025-02-06

DPLL: Example (Vanilla2)


 Observation: Sometimes UP is all we need.
 Example 13.2.3. Let ∆ := QF ∨ P F ; P T ∨ QF ∨ RF ∨ S F ; QT ∨ S F ; RT ∨ S F ; S T
1. UP Rule: S 7→ T
QF ∨ P F ; P T ∨ QF ∨ RF ; QT ; RT

2. UP Rule: Q 7→ T
P F ; P T ∨ RF ; RT
3. UP Rule: R 7→ T
PF ; PT

4. UP Rule: P 7→ T
2

Michael Kohlhase: Artificial Intelligence 1 394 2025-02-06

DPLL: Example (Redundance1)



 Example 13.2.4. We introduce some nasty redundance to make DPLL slow.


∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F

   The search tree splits on P first; under P ↦ T it must still split on X 1 , . . ., X n
   (because of Θ) and then on Q, and in every one of these 2^n branches unit propagation
   on R immediately produces the empty clause 2. So DPLL repeats the same small
   refutation of ∆ exponentially often before the branch P ↦ F can be explored.

Michael Kohlhase: Artificial Intelligence 1 395 2025-02-06

Properties of DPLL
 Unsatisfiable case: What can we say if “unsatisfiable” is returned?
 In this case, we know that ∆ is unsatisfiable: Unit propagation is sound, in the
sense that it does not reduce the set of solutions.

 Satisfiable case: What can we say when a partial interpretation I is returned?


 Any extension of I to a complete interpretation satisfies ∆. (By construction,
I suffices to satisfy all clauses.)
 Déjà Vu, Anybody?

 DPLL ≙ backtracking with inference, where inference ≙ unit propagation.
 Unit propagation is sound: It does not reduce the set of solutions.
 Running time is exponential in worst case, good variable/value selection strate-
gies required.

Michael Kohlhase: Artificial Intelligence 1 396 2025-02-06

13.3 DPLL ≙ (A Restricted Form of) Resolution
A Video Nugget covering this section can be found at https://fau.tv/clip/id/27022.
In the last slide we have discussed the semantic properties of the DPLL procedure: DPLL
is (refutation) sound and complete. Note that this is a theoretical result in the sense that it holds
for the algorithm as specified, but that does not mean that a particular implementation of DPLL
might not contain bugs that affect soundness and completeness.
In the satisfiable case, DPLL returns a satisfying variable assignment, which we can check (in
low-order polynomial time) but in the unsatisfiable case, it just reports on the fact that it has tried
all branches and found nothing. This is clearly unsatisfactory, and we will address this situation
now by presenting a way that DPLL can output a resolution proof in the unsatisfiable case.

UP ≙ Unit Resolution
 Observation: The unit propagation (UP) rule corresponds to a calculus:
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/

 Definition 13.3.1 (Unit Resolution). Unit resolution (UR) is the test calculus
consisting of the following inference rule:

   C ∨ P^α     P^β     (α ≠ β)
   ----------------------------  UR
            C

 Unit propagation ≙ resolution restricted to cases where one parent is a unit clause.
 Observation 13.3.2 (Soundness). UR is refutation sound. (since resolution is)
 Observation 13.3.3 (Completeness). UR is not refutation complete (alone).

 Example 13.3.4. P T ∨ QT ; P T ∨ QF ; P F ∨ QT ; P F ∨ QF is unsatisfiable but UR


cannot derive the empty clause 2.
 UR makes only limited inferences, as long as there are unit clauses. It does not
guarantee to infer everything that can be inferred.

Michael Kohlhase: Artificial Intelligence 1 397 2025-02-06

DPLL vs. Resolution


 Definition 13.3.5. We define the number of decisions of a DPLL run as the total
number of times a truth value was set by either unit propagation or splitting.

 Theorem 13.3.6. If DPLL returns “unsatisfiable” on ∆, then ∆⊢R0 2 with a


resolution proof whose length is at most the number of decisions.
 Proof: Consider first DPLL without UP
1. Consider any leaf node N , for proposition X, both of whose truth values directly
result in a clause C that has become empty.
2. Then for X = F the respective clause C must contain X T ; and for X = T the
respective clause C must contain X F . Thus we can resolve these two clauses
to a clause C(N ) that does not contain X.
3. C(N ) can contain only the negations of the decision literals l1 , . . ., lk above N .
Remove N from the tree, then iterate the argument. Once the tree is empty,
we have derived the empty clause.
4. Unit propagation can be simulated via applications of the splitting rule, choos-
ing a proposition that is constrained by a unit clause: One of the two truth
values then immediately yields an empty clause.

Michael Kohlhase: Artificial Intelligence 1 398 2025-02-06



DPLL vs. Resolution: Example (Vanilla2)


 Observation: The proof of ?? is constructive, so we can use it as a method to
read of a resolution proof from a DPLL trace.
 Example 13.3.7. We follow the steps in the proof of ?? for ∆ := QF ∨ P F ; P T ∨
QF ∨ RF ∨ S F ; QT ∨ S F ; RT ∨ S F ; S T

DPLL (without UP; each leaf annotated with the clause that became empty there): the
procedure splits on S, Q, R, and P in this order; the failing leaves are annotated with
S^T , Q^T ∨ S^F , R^T ∨ S^F , P^T ∨ Q^F ∨ R^F ∨ S^F , and Q^F ∨ P^F .

Resolution proof read off that DPLL tree (bottom-up):

   Q^F ∨ P^F and P^T ∨ Q^F ∨ R^F ∨ S^F resolve to Q^F ∨ R^F ∨ S^F
   Q^F ∨ R^F ∨ S^F and R^T ∨ S^F resolve to Q^F ∨ S^F
   Q^F ∨ S^F and Q^T ∨ S^F resolve to S^F
   S^F and S^T resolve to 2

 Intuition: From a (top-down) DPLL tree, we generate a (bottom-up) resolution


proof.

Michael Kohlhase: Artificial Intelligence 1 399 2025-02-06

For reference, we give the full proof here.


Theorem 13.3.8. If DPLL returns “unsatisfiable” on a clause set ∆, then ∆⊢R0 2 with a R0 -
derivation whose length is at most the number of decisions.
Proof: Consider first DPLL with no unit propagation.
1. If the search tree is not empty, then there exists a leaf node N , i.e., a node associated to
proposition X so that, for each value of X, the partial assignment directly results in an empty
clause.
2. Denote the parent decisions of N by L1 , . . ., Lk , where Li is a literal for proposition X i and
the search node containing X i is N i .
3. Denote the empty clause for X by C(N, X), and denote the empty clause for X F by C(N, X F ).
4. For each x ∈ {X T , X F } we have the following properties:
(a) C(N, x) contains the opposite literal of x; and
(b) apart from that literal, C(N, x) ⊆ {L1 , . . . , Lk }.
Due to (a), we can resolve C(N, X) with C(N, X F ); denote the outcome clause by C(N ).
5. We obviously have that (1) C(N ) ⊆ {L1 , . . . , Lk }.
6. The proof now proceeds by removing N from the search tree and attaching C(N ) at the Lk
branch of N k , in the role of C(N k , Lk ) as above. Then we select the next leaf node N ′ and
iterate the argument; once the tree is empty, by (1) we have derived the empty clause. What
we need to show is that, in each step of this iteration, we preserve the properties (a) and (b)
for all leaf nodes. Since we did not change anything in other parts of the tree, the only node
we need to show this for is N ′ := N k .
7. Due to (1), we have (b) for N k . But we do not necessarily have (a): C(N ) ⊆ {L1 , . . . , Lk },
but there are cases where Lk ̸∈ C(N ) (e.g., if X k is not contained in any clause and thus
branching over it was completely unnecessary). If so, however, we can simply remove N k and
all its descendants from the tree as well. We attach C(N ) at the L(k−1) branch of N (k−1) |,
in the role of C(N (k−1) , L(k−1) ). If L(k−1) ∈ C(N ) then we have (a) for N ′ := N (k−1) and
can stop. If L(k−1) F ̸∈ C(N ), then we remove N (k−1) and so forth, until either we stop
with (a), or have removed N 1 and thus must already have derived the empty clause (because
C(N ) ⊆ {L1 , . . . , Lk }\{L1 , . . . , Lk }).
8. Unit propagation can be simulated via applications of the splitting rule, choosing a proposi-
tion that is constrained by a unit clause: One of the two truth values then immediately yields
an empty clause.

DPLL vs. Resolution: Discussion


 So What?: The theorem we just proved helps to understand DPLL:
DPLL is an efficient practical method for conducting resolution proofs.
 In fact: DPLL ≙ tree resolution.

 Definition 13.3.9. In a tree resolution, each derived clause C is used only once
(at its parent).
 Problem: The same C must be derived anew every time it is used!
 This is a fundamental weakness: There are inputs ∆ whose shortest tree reso-
lution proof is exponentially longer than their shortest (general) resolution proof.
 Intuitively: DPLL makes the same mistakes over and over again.
 Idea: DPLL should learn from its mistakes on one search branch, and apply the
learned knowledge to other branches.

 To the rescue: clause learning (up next)

Michael Kohlhase: Artificial Intelligence 1 400 2025-02-06

Excursion: Practical SAT solvers use a technique called CDCL that analyzes failure and learns
from that in terms of inferred clauses. Unfortunately, we cannot cover this in AI-1; see ??.

13.4 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25090.

Summary
 SAT solvers decide satisfiability of CNF formulas. This can be used for deduction,
and is highly successful as a general problem solving technique (e.g., in verification).
 DPLL ≙ backtracking with inference performed by unit propagation (UP), which
iteratively instantiates unit clauses and simplifies the formula.
 DPLL proofs of unsatisfiability correspond to a restricted form of resolution. The
restriction forces DPLL to “make the same mistakes over again”.
 Implication graphs capture how UP derives conflicts. Their analysis enables us to
do clause learning. DPLL with clause learning is called CDCL. It corresponds to full
resolution, not “making the same mistakes over again”.


 CDCL is state of the art in applications, routinely solving formulas with millions of
propositions.

 For particular random formula distributions, typical problem hardness is characterized
by phase transitions.

Michael Kohlhase: Artificial Intelligence 1 401 2025-02-06

State of the Art in SAT


 SAT competitions:
 Since beginning of the 90s http://www.satcompetition.org/
 random vs. industrial vs. handcrafted benchmarks.
 Largest industrial instances: > 1.000.000 propositions.

 State of the art is CDCL:


 Vastly superior on handcrafted and industrial benchmarks.
 Key techniques: clause learning! Also: Efficient implementation (UP!), good
branching heuristics, random restarts, portfolios.

 What about local search?:


 Better on random instances.
 No “dramatic” progress in last decade.
 Parameters are difficult to adjust.

Michael Kohlhase: Artificial Intelligence 1 402 2025-02-06

But – What About Local Search for SAT?


 There’s a wealth of research on local search for SAT, e.g.:

 Definition 13.4.1. The GSAT algorithm: (output: a satisfying truth assignment of ∆, if found)

function GSAT(∆, MaxFlips, MaxTries)
   for i := 1 to MaxTries
      I := a randomly−generated truth assignment
      for j := 1 to MaxFlips
         if I satisfies ∆ then return I
         X := a proposition reversing whose truth assignment gives
              the largest increase in the number of satisfied clauses
         I := I with the truth assignment of X reversed
      end for
   end for
   return ‘‘no satisfying assignment found’’

 Local search is not as successful in SAT applications, and the underlying ideas are
very similar to those presented in ??. (Not covered here)

Michael Kohlhase: Artificial Intelligence 1 403 2025-02-06
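For comparison with the pseudocode, here is a runnable Python sketch of GSAT (mine, not from the notes; the clause representation, the fixed random seed, and the scoring helper are assumptions of the sketch).

import random

def gsat(clauses, max_flips, max_tries, rng=random.Random(0)):
    """GSAT sketch; clauses are lists of (proposition, label) literals."""
    props = sorted({p for c in clauses for (p, _) in c})
    def satisfied(assignment):
        return sum(any(assignment[p] == a for (p, a) in c) for c in clauses)
    for _ in range(max_tries):
        assignment = {p: rng.choice([True, False]) for p in props}
        for _ in range(max_flips):
            if satisfied(assignment) == len(clauses):
                return assignment
            # flip the proposition whose reversal satisfies the most clauses
            def score(p):
                return satisfied({**assignment, p: not assignment[p]})
            best = max(props, key=score)
            assignment[best] = not assignment[best]
    return "no satisfying assignment found"

# Example: the satisfiable clause set P^T ∨ Q^T ; P^F ∨ Q^F
print(gsat([[("P", True), ("Q", True)], [("P", False), ("Q", False)]], 10, 5))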

Topics We Didn’t Cover Here


 Variable/value selection heuristics: A whole zoo is out there.

 Implementation techniques: One of the most intensely researched subjects. Fa-


mous “watched literals” technique for UP had huge practical impact.
 Local search: In space of all truth value assignments. GSAT (slide 403) had huge
impact at the time (1992), caused huge amount of follow-up work. Less intensely
researched since clause learning hit the scene in the late 90s.

 Portfolios: How to combine several SAT solvers efficiently?


 Random restarts: Tackling heavy-tailed runtime distributions.
 Tractable SAT: Polynomial-time sub-classes (most prominent: 2-SAT, Horn for-
mulas).

 MaxSAT: Assign weight to each clause, maximize weight of satisfied clauses (=


optimization version of SAT).
 Resolution special cases: There’s a universe in between unit resolution and full
resolution: trade off inference vs. search.

 Proof complexity: Can one resolution special case X simulate another one Y
polynomially? Or is there an exponential separation (example families where X is
exponentially less efficient than Y )?

Michael Kohlhase: Artificial Intelligence 1 404 2025-02-06

Suggested Reading:

• Chapter 7: Logical Agents, Section 7.6.1 [RN09].


– Here, RN describe DPLL, i.e., basically what I cover under “The Davis-Putnam (Logemann-
Loveland) Procedure”.
– That’s the only thing they cover of this Chapter’s material. (And they even mark it as “can
be skimmed on first reading”.)
– This does not do the state of the art in SAT any justice.
• Chapter 7: Logical Agents, Sections 7.6.2, 7.6.3, and 7.7 [RN09].
– Sections 7.6.2 and 7.6.3 say a few words on local search for SAT, which I recommend as
additional background reading. Section 7.7 describes in quite some detail how to build an
agent using propositional logic to take decisions; nice background reading as well.
Chapter 14

First-Order Predicate Logic

14.1 Motivation: A more Expressive Language


A Video Nugget covering this section can be found at https://fau.tv/clip/id/25091.

Let’s Talk About Blocks, Baby . . .

 Question: What do you see here?

   A    D    B    E    C

 You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
 And now: Say it in propositional logic!
 Answer: “isRedA”,“isRedB”, . . . , “onTableA”, “onTableB”, . . . , “isBlockA”, . . .

 Wait a sec!: Why don’t we just say, e.g., “AllBlocksAreRed” and “isBlockA”?
 Problem: Could we conclude that A is red? (No)
These statements are atomic (just strings); their inner structure (“all blocks”, “is a
block”) is not captured.

 Idea: Predicate Logic (PL1 ) extends propositional logic with the ability to explicitly
speak about objects and their properties.
 How?: Variables ranging over objects, predicates describing object properties, . . .
 Example 14.1.1. “∀x.block(x) ⇒ red(x)”; “block(A)”

Michael Kohlhase: Artificial Intelligence 1 405 2025-02-06

Let’s Talk About the Wumpus Instead?


Percepts: [Stench, Breeze, Glitter , Bump, Scream]


 Cell adjacent to Wumpus: Stench (else: None).

 Cell adjacent to Pit: Breeze (else: None).



 Cell that contains gold: Glitter (else: None).
 You walk into a wall: Bump (else: None).

 Wumpus shot by arrow: Scream (else: None).

 Say, in propositional logic: “Cell adjacent to Wumpus: Stench.”


 W 1,1 ⇒ S 1,2 ∧ S 2,1
 W 1,2 ⇒ S 2,2 ∧ S 1,1 ∧ S 1,3
 W 1,3 ⇒ S 2,3 ∧ S 1,2 ∧ S 1,4
 ...

 Note: Even when we can describe the problem suitably, for the desired reasoning,
the propositional formulation typically is way too large to write (by hand).
 PL1 solution: “∀x.Wumpus(x) ⇒ (∀y.adj(x, y) ⇒ stench(y))”

Michael Kohlhase: Artificial Intelligence 1 406 2025-02-06

Blocks/Wumpus, Who Cares? Let’s Talk About Numbers!


 Even worse!
 Example 14.1.2 (Integers). A limited vocabulary to talk about these
 The objects: {1, 2, 3, . . . }.
 Predicate 1: “even(x)” should be true iff x is even.
 Predicate 2: “eq(x, y)” should be true iff x = y.
 Function: succ(x) maps x to x + 1.
 Old problem: Say, in propositional logic, that “1 + 1 = 2”.
 Inner structure of vocabulary is ignored (cf. “AllBlocksAreRed”).
 PL1 solution: “eq(succ(1), 2)”.
 New Problem: Say, in propositional logic, “if x is even, so is x + 2”.
 It is impossible to speak about infinite sets of objects!
 PL1 solution: “∀x.even(x) ⇒ even(succ(succ(x)))”.

Michael Kohlhase: Artificial Intelligence 1 407 2025-02-06

Now We’re Talking



 Example 14.1.3.

∀n.gt(n, 2) ⇒ ¬(∃a, b, c.eq(plus(pow(a, n), pow(b, n)), pow(c, n)))

Read: For all n > 2, there are no a, b, c, such that a^n + b^n = c^n (Fermat’s last
theorem)
 Theorem proving in PL1: Arbitrary theorems, in principle.
 Fermat’s last theorem is of course infeasible, but interesting theorems can and
have been proved automatically.
 See http://en.wikipedia.org/wiki/Automated_theorem_proving.
 Note: Need to axiomatize “Plus”, “PowerOf”, “Equals”. See http://en.wikipedia.
org/wiki/Peano_axioms

Michael Kohlhase: Artificial Intelligence 1 408 2025-02-06

What Are the Practical Relevance/Applications?


 . . . even asking this question is a sacrilege:

 (Quotes from Wikipedia)


 “In Europe, logic was first developed by Aristotle. Aristotelian logic became
widely accepted in science and mathematics.”
 “The development of logic since Frege, Russell, and Wittgenstein had a profound
influence on the practice of philosophy and the perceived nature of philosophical
problems, and Philosophy of mathematics.”
 “During the later medieval period, major efforts were made to show that Aris-
totle’s ideas were compatible with Christian faith.”
 (In other words: the church maintained for a long time that Aristotle’s ideas were
incompatible with Christian faith.)

Michael Kohlhase: Artificial Intelligence 1 409 2025-02-06

What Are the Practical Relevance/Applications?


 You’re asking it anyhow:

 Logic programming. Prolog et al.


 Databases. Deductive databases where elements of logic allow to conclude
additional facts. Logic is tied deeply with database theory.
 Semantic technology. Mega-trend since > a decade. Use PL1 fragments to
annotate data sets, facilitating their use and analysis.
 Prominent PL1 fragment: Web Ontology Language OWL.
 Prominent data set: The WWW. (semantic web)
 Assorted quotes on Semantic Web and OWL:

 The brain of humanity.


 The Semantic Web will never work.
 A TRULY meaningful way of interacting with the Web may finally be here:
the Semantic Web. The idea was proposed 10 years ago. A triumvirate of
internet heavyweights – Google, Twitter, and Facebook – are making it real.

Michael Kohlhase: Artificial Intelligence 1 410 2025-02-06

(A Few) Semantic Technology Applications

Examples (shown as screenshots on the slide): web queries, Jeopardy (IBM Watson), context-aware apps, healthcare.

Michael Kohlhase: Artificial Intelligence 1 411 2025-02-06

Our Agenda for This Topic


 This Chapter: Basic definitions and concepts; normal forms.
 Sets up the framework and basic operations.
 Syntax: How to write PL1 formulas? (Obviously required)
 Semantics: What is the meaning of PL1 formulas? (Obviously required.)
 Normal Forms: What are the basic normal forms, and how to obtain them?
(Needed for algorithms, which are defined on these normal forms.)
 Next Chapter: Compilation to propositional reasoning; unification; lifted resolu-
tion/tableau.

 Algorithmic principles for reasoning about predicate logic.

Michael Kohlhase: Artificial Intelligence 1 412 2025-02-06



14.2 First-Order Logic


A Video Nugget covering this section can be found at https://fau.tv/clip/id/25093.
First-order logic is the most widely used formal system for modelling knowledge and inference
processes. It strikes a very good bargain in the trade-off between expressivity and conceptual
and computational complexity. To many people first-order logic is “the logic”, i.e. the only logic
worth considering; its applications range from the foundations of mathematics to natural language
semantics.

First-Order Predicate Logic (PL1 )

 Coverage: We can talk about (All humans are mortal)


 individual things and denote them by variables or constants
 properties of individuals, (e.g. being human or mortal)
 relations of individuals, (e.g. sibling_of relationship)
 functions on individuals, (e.g. the father_of function)
We can also state the existence of an individual with a certain property, or the
universality of a property.
 But we cannot state assertions like

 There is a surjective function from the natural numbers into the reals.
 First-Order Predicate Logic has many good properties (complete calculi,
compactness, unitary, linear unification,. . . )
 But too weak for formalizing: (at least directly)

 natural numbers, torsion groups, calculus, . . .


 generalized quantifiers (most, few,. . . )

Michael Kohlhase: Artificial Intelligence 1 413 2025-02-06

14.2.1 First-Order Logic: Syntax and Semantics


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/25094.
The syntax and semantics of first-order logic are systematically organized in two distinct layers: one
for truth values (like in propositional logic) and one for individuals (the new, distinctive feature
of first-order logic).
The first step of defining a formal language is to specify the alphabet, here the first-order signatures
and their components.

PL1 Syntax (Signature and Variables)

 Definition 14.2.1. First-order logic (PL1 ), is a formal system extensively used in


mathematics, philosophy, linguistics, and computer science. It combines proposi-
tional logic with the ability to quantify over individuals.
 PL1 talks about two kinds of objects: (so we have two kinds of symbols)

 truth values by reusing PL0



 individuals, e.g. numbers, foxes, Pokémon,. . .


 Definition 14.2.2. A first-order signature consists of (all disjoint; k ∈ N)
 connectives: Σ0 = {T , F , ¬, ∨, ∧, ⇒, ⇔, . . .} (functions on truth values)
 function constants: Σ^f_k = {f, g, h, . . .} (k-ary functions on individuals)
 predicate constants: Σ^p_k = {p, q, r, . . .} (k-ary relations among individuals)
 (Skolem constants: Σ^sk_k = {f^1_k, f^2_k, . . .}) (witness constructors; countably ∞)
 We take Σ1 to be all of these together: Σ1 := Σ^f ∪ Σ^p ∪ Σ^sk and define
Σ := Σ1 ∪ Σ0.
 Definition 14.2.3. We assume a set of individual variables: Vι := {X, Y , Z, . . .}.
(countably ∞)

Michael Kohlhase: Artificial Intelligence 1 414 2025-02-06

We make the deliberate, but non-standard design choice here to include Skolem constants into
the signature from the start. These are used in inference systems to give names to objects and
construct witnesses. Other than the fact that they are usually introduced by need, they work
exactly like regular constants, which makes the inclusion rather painless. As we can never predict
how many Skolem constants we are going to need, we give ourselves countably infinitely many for
every arity. Our supply of individual variables is countably infinite for the same reason.
The formulae of first-order logic are built up from the signature and variables as terms (to represent
individuals) and propositions (to represent truth values). The latter include the connectives from
PL0, but also quantifiers.

PL1 Syntax (Formulae)

 Definition 14.2.4. Terms: A ∈ wff ι (Σ1 , Vι ) (denote individuals)


 Vι ⊆ wff ι (Σ1 , Vι ),
 if f ∈ Σ^f_k and Ai ∈ wff ι (Σ1 , Vι ) for i ≤ k, then f (A1 , . . ., Ak ) ∈ wff ι (Σ1 , Vι ).

 Definition 14.2.5. First-order propositions: A ∈ wff o (Σ1 , Vι ): (denote truth


values)
 if p ∈ Σ^p_k and Ai ∈ wff ι (Σ1 , Vι ) for i ≤ k, then p(A1 , . . ., Ak ) ∈ wff o (Σ1 , Vι ),
 if A, B ∈ wff o (Σ1 , Vι ) and X ∈ Vι , then T , A ∧ B, ¬A, ∀X.A ∈ wff o (Σ1 , Vι ).
∀ is a binding operator called the universal quantifier.
 Definition 14.2.6. We define the connectives F , ∨, ⇒, ⇔ via the abbreviations
A ∨ B:=¬(¬A ∧ ¬B), A ⇒ B:=¬A ∨ B, A ⇔ B:=(A ⇒ B) ∧ (B ⇒ A), and
F := ¬T . We will use them like the primary connectives ∧ and ¬
 Definition 14.2.7. We use ∃X.A as an abbreviation for ¬(∀X.¬A). ∃ is a binding
operator called the existential quantifier.
 Definition 14.2.8. Call formulae without connectives or quantifiers atomic else
complex.

Michael Kohlhase: Artificial Intelligence 1 415 2025-02-06

Note: We only need e.g. conjunction, negation, and the universal quantifier; all other logical
constants can be defined from them (as we will see when we have fixed their interpretations).
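To make the two-layered grammar concrete, the following OCaml sketch renders terms and propositions as datatypes; the constructor names, the use of strings for signature symbols and variables, and the helper functions are illustrative choices, not part of the definitions above:

(* Terms (wff_ι): variables and applications of function constants. *)
type term =
  | Var of string                  (* individual variable, e.g. "X" *)
  | Fn  of string * term list      (* k-ary function constant applied to k terms *)

(* Propositions (wff_o): atoms plus the primary connectives T, ∧, ¬, ∀. *)
type form =
  | True
  | Pred   of string * term list   (* k-ary predicate constant applied to k terms *)
  | Neg    of form
  | And    of form * form
  | Forall of string * form        (* binds the named variable in its scope *)

(* The remaining connectives and ∃ as abbreviations (Definitions 14.2.6 and 14.2.7). *)
let falsum        = Neg True
let disj a b      = Neg (And (Neg a, Neg b))
let implies a b   = disj (Neg a) b
let exists x a    = Neg (Forall (x, Neg a))

(* e.g. "all blocks are red": ∀X.block(X) ⇒ red(X) *)
let all_blocks_red =
  Forall ("X", implies (Pred ("block", [Var "X"])) (Pred ("red", [Var "X"])))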

Alternative Notations for Quantifiers

Here      Elsewhere
∀x.A      ⋀x.A    (x)A
∃x.A      ⋁x.A

Michael Kohlhase: Artificial Intelligence 1 416 2025-02-06

The introduction of quantifiers to first-order logic brings a new phenomenon: variables that are
under the scope of a quantifier will behave very differently from the ones that are not. Therefore
we build up a vocabulary that distinguishes the two.

Free and Bound Variables


 Definition 14.2.9. We call an occurrence of a variable X bound in a formula A
(otherwise free), iff it occurs in a sub-formula ∀X.B of A.
For a formula A, we will use BVar(A) (and free(A)) for the set of bound (free)
variables of A, i.e. variables that have a free/bound occurrence in A.
 Definition 14.2.10. We define the set free(A) of free variables of a formula A:

free(X) := {X}
free(f(A1, . . ., An)) := ⋃_{1≤i≤n} free(Ai)
free(p(A1, . . ., An)) := ⋃_{1≤i≤n} free(Ai)
free(¬A) := free(A)
free(A ∧ B) := free(A) ∪ free(B)
free(∀X.A) := free(A)\{X}

 Definition 14.2.11. We call a formula A closed or ground, iff free(A) = ∅. We
call a closed proposition a sentence, and denote the set of all ground terms with
cwff ι (Σι ) and the set of sentences with cwff o (Σι ).
 Axiom 14.2.12. Bound variables can be renamed, i.e. any subterm ∀X.B of a
formula A can be replaced by A′ := (∀Y .B′ ), where B′ arises from B by replacing
all X ∈ free(B) with a new variable Y that does not occur in A. We call A′ an
alphabetical variant of A – and the other way around too.

Michael Kohlhase: Artificial Intelligence 1 417 2025-02-06

We will be mainly interested in (sets of) sentences – i.e. closed propositions – as the representations
of meaningful statements about individuals. Indeed, we will see below that free variables do
not give us additional expressivity, since they behave like constants and could be replaced by them in all
situations, except the recursive definition of quantified formulae. Indeed in all situations where
variables occur freely, they have the character of metavariables, i.e. syntactic placeholders that
can be instantiated with terms when needed in a calculus.
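Definition 14.2.10 can be transcribed directly as a recursive function. The following OCaml sketch uses a set of variable names; the datatype is repeated from the sketch above so that the fragment stands alone, and the names are again illustrative:

module VarSet = Set.Make (String)

type term = Var of string | Fn of string * term list
type form =
  | True | Pred of string * term list
  | Neg of form | And of form * form | Forall of string * form

let rec free_term = function
  | Var x -> VarSet.singleton x
  | Fn (_, args) ->
      List.fold_left (fun s a -> VarSet.union s (free_term a)) VarSet.empty args

let rec free = function
  | True -> VarSet.empty
  | Pred (_, args) ->
      List.fold_left (fun s a -> VarSet.union s (free_term a)) VarSet.empty args
  | Neg a -> free a
  | And (a, b) -> VarSet.union (free a) (free b)
  | Forall (x, a) -> VarSet.remove x (free a)    (* ∀ binds x *)

(* A formula is closed (a sentence) iff it has no free variables. *)
let closed a = VarSet.is_empty (free a)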
The semantics of first-order logic is a Tarski-style set-theoretic semantics where the atomic syn-
tactic entities are interpreted by mapping them into a well-understood structure, a first-order
universe that is just an arbitrary set.

Semantics of PL1 (Models)

 Definition 14.2.13. We inherit the domain D0 = {T, F} of truth values from PL0
and assume an arbitrary domain Dι ̸= ∅ of individuals. (this choice is a parameter
to the semantics)
 Definition 14.2.14. An interpretation I assigns values to constants, e.g.
 I(¬) : D0 → D0 with T ↦ F, F ↦ T, and I(∧) = . . . (as in PL0)
 I : Σ^f_k → Dι^k → Dι (interpret function symbols as arbitrary functions)
 I : Σ^p_k → P(Dι^k) (interpret predicates as arbitrary relations)
 Definition 14.2.15. A variable assignment φ : Vι → Dι maps variables into the
domain.

 Definition 14.2.16. A model M = ⟨Dι , I⟩ of PL1 consists of a domain Dι and


an interpretation I.

Michael Kohlhase: Artificial Intelligence 1 418 2025-02-06

We do not have to make the domain of truth values part of the model, since it is always the same;
we determine the model by choosing a domain and an interpretation function.
Given a first-order model, we can define the evaluation function as a homomorphism over the
construction of formulae.

Semantics of PL1 (Evaluation)


 Definition 14.2.17. Given a model ⟨D, I⟩, the value function I φ is recursively
defined: (two parts: terms & propositions)

 I φ : wff ι (Σ1 , Vι ) → Dι assigns values to terms.


 I φ (X) := φ(X) and
 I φ (f (A1 , . . ., Ak )) := I(f )(I φ (A1 ), . . ., I φ (Ak ))

 I φ : wff o (Σ1 , Vι ) → D0 assigns values to formulae:


 I φ (T ) = I(T ) = T,
 I φ (¬A) = I(¬)(I φ (A))
 I φ (A ∧ B) = I(∧)(I φ (A), I φ (B)) (just as in PL0 )
 I φ (p(A1 , . . ., Ak )) := T, iff ⟨I φ (A1 ), . . ., I φ (Ak )⟩ ∈ I(p)

 I φ (∀X.A) := T, iff I φ,[a/X] (A) = T for all a ∈ Dι .

 Definition 14.2.18 (Assignment Extension). Let φ be a variable assignment


into D and a ∈ D, then φ,[a/X] is called the extension of φ with [a/X] and is
defined as {(Y ,a) ∈ φ | Y ̸= X} ∪ {(X,a)}: φ,[a/X] coincides with φ off X, and
gives the result a there.

Michael Kohlhase: Artificial Intelligence 1 419 2025-02-06

The only new (and interesting) case in this definition is the quantifier case, there we define the
value of a quantified formula by the value of its scope – but with an extension of the incoming
variable assignment. Note that by passing to the scope A of ∀x.A, the occurrences of the variable
x in A that were bound in ∀x.A become free and are amenable to evaluation by the variable
assignment ψ := φ,[a/X]. Note that as an extension of φ, the assignment ψ supplies exactly the
right value for x in A. This variability of the variable assignment in the definition of the value
function justifies the somewhat complex setup of first-order evaluation, where we have the (static)
interpretation function for the symbols from the signature and the (dynamic) variable assignment
for the variables.
Note furthermore that the value I φ (∃x.A) of ∃x.A, which we have defined to be ¬(∀x.¬A), is
true, iff it is not the case that I ψ (¬A) = T, i.e. I ψ (A) = F, for all a ∈ Dι and ψ := φ,[a/X]. This is
the case, iff I ψ (A) = T for some a ∈ Dι . So our definition of the existential quantifier yields the
appropriate semantics.

Semantics Computation: Example


 Example 14.2.19. We define an instance of first-order logic:
 Signature: Let Σ^f_0 := {j, m}, Σ^f_1 := {f}, and Σ^p_2 := {o}
 Universe: Dι := {J, M }
 Interpretation: I(j) := J, I(m) := M , I(f )(J) := M , I(f )(M ) := M , and
I(o) := {(M ,J)}.
Then ∀X.o(f (X), X) is a sentence and with ψ := φ,[a/X] for a ∈ Dι we have

I φ (∀X.o(f (X), X)) = T iff I ψ (o(f (X), X)) = T for all a ∈ Dι


iff (I ψ (f (X)),I ψ (X)) ∈ I(o) for all a ∈ {J, M }
iff (I(f )(I ψ (X)),ψ(X)) ∈ {(M ,J)} for all a ∈ {J, M }
iff (I(f )(ψ(X)),a) = (M ,J) for all a ∈ {J, M }
iff I(f )(a) = M and a = J for all a ∈ {J, M }

But a ̸= J for a = M , so I φ (∀X.o(f (X), X)) = F in the model ⟨Dι , I⟩.

Michael Kohlhase: Artificial Intelligence 1 420 2025-02-06
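The computation above can be replayed mechanically. The following OCaml sketch fixes the finite model of Example 14.2.19 and implements the value function of Definition 14.2.17 for it; the representation (strings for symbols, a function for the variable assignment) is an illustrative choice:

type term = Var of string | Fn of string * term list
type form =
  | True | Pred of string * term list
  | Neg of form | And of form * form | Forall of string * form

type ind = J | M                                  (* the domain D_ι *)
let domain = [ J; M ]

(* interpretation of the function constants j, m, f *)
let int_fn f args =
  match f, args with
  | "j", [] -> J
  | "m", [] -> M
  | "f", [ _ ] -> M                               (* I(f)(J) = I(f)(M) = M *)
  | _ -> failwith "unknown function constant"

(* interpretation of the predicate constant o: I(o) = {(M, J)} *)
let int_pred p args =
  match p, args with
  | "o", [ x; y ] -> x = M && y = J
  | _ -> failwith "unknown predicate constant"

(* the value function I_φ, with φ a variable assignment V_ι → D_ι *)
let rec eval_term phi = function
  | Var x -> phi x
  | Fn (f, args) -> int_fn f (List.map (eval_term phi) args)

let rec eval phi = function
  | True -> true
  | Pred (p, args) -> int_pred p (List.map (eval_term phi) args)
  | Neg a -> not (eval phi a)
  | And (a, b) -> eval phi a && eval phi b
  | Forall (x, a) ->                              (* check the extension φ,[d/x] for every d *)
      List.for_all (fun d -> eval (fun y -> if y = x then d else phi y) a) domain

let () =
  let phi _ = J in                                (* the initial assignment is irrelevant here *)
  let sentence = Forall ("X", Pred ("o", [ Fn ("f", [ Var "X" ]); Var "X" ])) in
  assert (eval phi sentence = false)              (* agrees with the computation on the slide *)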

14.2.2 First-Order Substitutions


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/25156.
We will now turn our attention to substitutions, special formula-to-formula mappings that
operationalize the intuition that (individual) variables stand for arbitrary terms.

Substitutions on Terms
 Intuition: If B is a term and X is a variable, then we denote the result of
systematically replacing all occurrences of X in a term A by B with [B/X](A).
 Problem: What about [Z/Y ], [Y /X](X), is that Y or Z?

 Folklore: [Z/Y ], [Y /X](X) = Y , but [Z/Y ]([Y /X](X)) = Z of course.


(Parallel application)
 Definition 14.2.20. Let wfe(Σ, V) be an expression language, then we call σ : V →
wfe(Σ, V) a substitution, iff the support supp(σ):={X | (X,A) ∈ σ, X ̸= A} of σ
is finite. We denote the empty substitution with ϵ.

 Definition 14.2.21 (Substitution Application). We define substitution applica-


tion by
 σ(c) = c for c ∈ Σ
 σ(X) = A, iff X ∈ V and (X,A) ∈ σ.
 σ(f (A1 , . . ., An )) = f (σ(A1 ), . . ., σ(An )),
 σ(∀X.A) = ∀X.σ−X (A). (∃ analogous)
 Example 14.2.22. [a/x], [f (b)/y], [a/z] instantiates g(x, y, h(z)) to g(a, f (b), h(a)).

Michael Kohlhase: Artificial Intelligence 1 421 2025-02-06
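As a small illustration, substitution application on terms (Definition 14.2.21 without the quantifier case, which is only needed once we substitute into propositions) can be coded directly; the OCaml map from variable names to terms is an illustrative representation:

type term = Var of string | Fn of string * term list

module Subst = Map.Make (String)

(* apply σ to a term; variables outside the support are mapped to themselves *)
let rec apply sigma = function
  | Var x -> (match Subst.find_opt x sigma with Some t -> t | None -> Var x)
  | Fn (f, args) -> Fn (f, List.map (apply sigma) args)

(* Example 14.2.22: [a/x],[f(b)/y],[a/z] applied to g(x, y, h(z)) gives g(a, f(b), h(a)) *)
let sigma =
  Subst.(empty
         |> add "x" (Fn ("a", []))
         |> add "y" (Fn ("f", [ Fn ("b", []) ]))
         |> add "z" (Fn ("a", [])))

let _ = apply sigma (Fn ("g", [ Var "x"; Var "y"; Fn ("h", [ Var "z" ]) ]))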

The extension of a substitution is an important operation, which you will run into from time
to time. Given a substitution σ, a variable x, and an expression A, σ,[A/x] extends σ with a
new value for x. The intuition is that the values right of the comma overwrite the pairs in the
substitution on the left, which already has a value for x, even though the representation of σ may
not show it.

Substitution Extension
 Definition 14.2.23 (Substitution Extension). Let σ be a substitution, then we
denote the extension of σ with [A/X] by σ,[A/X] and define it as {(Y ,B) ∈
σ | Y ̸= X} ∪ {(X,A)}: σ,[A/X] coincides with σ off X, and gives the result A
there.
 Note: If σ is a substitution, then σ,[A/X] is also a substitution.
 We also need the dual operation: removing a variable from the support:

 Definition 14.2.24. We can discharge a variable X from a substitution σ by


setting σ−X :=σ,[X/X].

Michael Kohlhase: Artificial Intelligence 1 422 2025-02-06

Note that the use of the comma notation for substitutions defined in ?? is consistent with sub-
stitution extension. We can view a substitution [a/x], [f (b)/y] as the extension of the empty
substitution (the identity function on variables) by [f (b)/y] and then by [a/x]. Note furthermore,
that substitution extension is not commutative in general.
For first-order substitutions we need to extend the substitutions defined on terms to act on propo-
sitions. This is technically more involved, since we have to take care of bound variables.

Substitutions on Propositions
 Problem: We want to extend substitutions to propositions, in particular to quan-
tified formulae: What is σ(∀X.A)?

 Idea: σ should not instantiate bound variables. ([A/X](∀X.B) = ∀A.B′


ill-formed)
 Definition 14.2.25. σ(∀X.A) := (∀X.σ−X (A)).
 Problem: This can lead to variable capture: [f (X)/Y ](∀X.p(X, Y )) would eval-
uate to ∀X.p(X, f (X)), where the second occurrence of X is bound after instanti-
ation, whereas it was free before. Solution: Rename away the bound variable X
in ∀X.p(X, Y ) before applying the substitution.
 Definition 14.2.26 (Capture-Avoiding Substitution Application). Let σ be a
substitution, A a formula, and A′ an alphabetic variant of A, such that intro(σ) ∩
BVar(A) = ∅. Then we define capture-avoiding substitution application via
σ(A) := σ(A′ ).

Michael Kohlhase: Artificial Intelligence 1 423 2025-02-06

We now introduce a central tool for reasoning about the semantics of substitutions: the “sub-
stitution value Lemma”, which relates the process of instantiation to (semantic) evaluation. This
result will be the motor of all soundness proofs on axioms and inference rules acting on variables
via substitutions. In fact, any logic with variables and substitutions will have (to have) some form
of a substitution value Lemma to get the meta-theory going, so it is usually the first target in any
development of such a logic. We establish the substitution-value Lemma for first-order logic in
two steps, first on terms, where it is very simple, and then on propositions.

Substitution Value Lemma for Terms


 Lemma 14.2.27. Let A and B be terms, then I φ ([B/X]A) = I ψ (A), where
ψ = φ, [I φ (B)/X].

 Proof: by induction on the depth of A:


1. depth=0 Then A is a variable (say Y ), or constant, so we have three cases
1.1. A = Y = X
1.1.1. then I φ ([B/X](A)) = I φ ([B/X](X)) = I φ (B) = ψ(X) = I ψ (X) =
I ψ (A).
1.2. A = Y ̸= X
1.2.1. then I φ ([B/X](A)) = I φ ([B/X](Y )) = I φ (Y ) = φ(Y ) = ψ(Y ) =
I ψ (Y ) = I ψ (A).
1.3. A is a constant
1.3.1. Analogous to the preceding case (Y ̸= X).
1.4. This completes the base case (depth = 0).
2. depth> 0
2.1. then A = f (A1 , . . ., An ) and we have

I φ ([B/X](A)) = I(f )(I φ ([B/X](A1 )), . . ., I φ ([B/X](An )))


= I(f )(I ψ (A1 ), . . ., I ψ (An ))
= I ψ (A).

by induction hypothesis
2.2. This completes the induction step, and we have proven the assertion.

Michael Kohlhase: Artificial Intelligence 1 424 2025-02-06

Substitution Value Lemma for Propositions


 Lemma 14.2.28. I φ ([B/X](A)) = I ψ (A), where ψ = φ,[I φ (B)/X].
 Proof: by induction on the number n of connectives and quantifiers in A:

1. n = 0
1.1. then A is an atomic proposition, and we can argue like in the induction
step of the substitution value lemma for terms.
2. n > 0 and A = ¬B or A = C ◦ D
2.1. Here we argue like in the induction step of the term lemma as well.
3. n > 0 and A = ∀Y .C where (WLOG) X ̸= Y (otherwise rename)
3.1. then I ψ (A) = I ψ (∀Y .C) = T, iff I ψ,[a/Y ] (C) = T for all a ∈ Dι .
3.2. But I ψ,[a/Y ] (C) = I φ,[a/Y ] ([B/X](C)) = T, by induction hypothesis.
3.3. So I ψ (A) = I φ (∀Y .[B/X](C)) = I φ ([B/X](∀Y .C)) = I φ ([B/X](A))

Michael Kohlhase: Artificial Intelligence 1 425 2025-02-06

To understand the proof fully, you should think about where the WLOG (it stands for “without
loss of generality”) comes from.

14.3 First-Order Natural Deduction


A Video Nugget covering this section can be found at https://fau.tv/clip/id/25157.
In this section, we will introduce the first-order natural deduction calculus. Recall from ??
that natural deduction calculi have introduction and elimination rules for every logical constant (the
connectives in PL0 ). Recall furthermore that we had two styles/notations for the calculus, the
classical ND calculus and the sequent-style notation. These principles will be carried over to
natural deduction in PL1 .
This allows us to introduce the calculi in two stages, first for the (propositional) connectives
and then extend this to a calculus for first-order logic by adding rules for the quantifiers. In
particular, we can define the first-order calculi simply by adding (introduction and elimination)
rules for the (universal and existential) quantifiers to the calculus ND0 defined in ??.
To obtain a first-order calculus, we have to extend ND0 with (introduction and elimination) rules
for the quantifiers.

First-Order Natural Deduction (ND1 ; Gentzen [Gen34])


 Rules for connectives just as always
 Definition 14.3.1 (New Quantifier Rules). The first-order natural deduction
calculus ND1 extends ND0 by the following four rules:

    A                                ∀X.A
 ──────── ND1∀I (*)               ──────────── ND1∀E
  ∀X.A                             [B/X](A)

                                               [[c/X](A)]^1
                                                    ⋮
 [B/X](A)                          ∃X.A             C        c ∈ Σ^sk_0 new
 ────────── ND1∃I                 ──────────────────────────────────────── ND1∃E^1
  ∃X.A                                              C

where (*) means that A does not depend on any hypothesis in which X is free.

Michael Kohlhase: Artificial Intelligence 1 426 2025-02-06

The intuition behind the rule ND1 ∀I is that a formula A with a (free) variable X can be generalized
to ∀X.A, if X stands for an arbitrary object, i.e. there are no restricting assumptions about X.
The ND1 ∀E rule is just a substitution rule that allows to instantiate arbitrary terms B for X
in A. The ND1 ∃I rule says if we have a witness B for X in A (i.e. a concrete term B that
makes A true), then we can existentially close A. The ND1 ∃E rule corresponds to the common
mathematical practice, where we give objects we know exist a new name c and continue the proof
by reasoning about this concrete object c. Anything we can prove from the assumption [c/X](A)
we can prove outright if ∃X.A is known.
Now we reformulate the classical formulation of the calculus of natural deduction as a
sequent calculus by lifting it to the “judgments level” as we did for propositional logic. We only
need to provide new quantifier rules.

First-Order Natural Deduction in Sequent Formulation


 Rules for connectives from ND⊢0
 Definition 14.3.2 (New Quantifier Rules). The inference rules of the first-order
sequent calculus ND⊢1 consist of those from ND⊢0 plus the following quantifier rules:

 Γ⊢A    X ∉ free(Γ)                       Γ⊢∀X.A
 ──────────────────── ND⊢1∀I             ───────────── ND⊢1∀E
      Γ⊢∀X.A                             Γ⊢[B/X](A)

 Γ⊢[B/X](A)                Γ⊢∃X.A    Γ, [c/X](A)⊢C    c ∈ Σ^sk_0 new
 ─────────── ND⊢1∃I       ─────────────────────────────────────────── ND⊢1∃E
   Γ⊢∃X.A                                  Γ⊢C

Michael Kohlhase: Artificial Intelligence 1 427 2025-02-06

Natural Deduction with Equality


 Definition 14.3.3 (First-Order Logic with Equality). We extend PL1 with a
new logical constant for equality = ∈ Σ^p_2 and fix its interpretation to I(=) :=
{(x,x) | x ∈ Dι }. We call the extended logic first-order logic with equality (PL1= )

 We now extend natural deduction as well.


 Definition 14.3.4. For the calculus of natural deduction with equality (ND=1) we
add the following two rules to ND1 to deal with equality:

 ─────── =I              A = B    C[A]_p
  A = A                 ───────────────── =E
                             [B/p]C

where C[A]_p means that the formula C has a subterm A at position p and [B/p]C is the
result of replacing that subterm with B.

 In many ways equivalence behaves like equality; we will use the following rules in
ND1.
 Definition 14.3.5. ⇔I is derivable and ⇔E is admissible in ND1:

 ─────── ⇔I              A ⇔ B    C[A]_p
  A ⇔ A                 ───────────────── ⇔E
                             [B/p]C

Michael Kohlhase: Artificial Intelligence 1 428 2025-02-06

Again, we have two rules that follow the introduction/elimination pattern of natural deduction
calculi.
Definition 14.3.6. We have the canonical sequent rules that correspond to them: =I, =E, ⇔I,
and ⇔E
To make sure that we understand the constructions here, let us get back to the “replacement at
position” operation used in the equality rules.

Positions in Formulae
 Idea: Formulae are (naturally) trees, so we can use tree positions to talk about
subformulae
 Definition 14.3.7. A position p is a tuple of natural numbers that in each node
of an expression (tree) specifies into which child to descend. For an expression A
we denote the subexpression at p with A|p .
We will sometimes write an expression C as C[A]_p to indicate that C has the subex-
pression A at position p.
If C[A]_p and A is atomic, then we speak of an occurrence of A in C.
 Definition 14.3.8. Let p be a position, then [A/p]C is the expression obtained
from C by replacing the subexpression at p by A.

 Example 14.3.9 (Schematically). [Tree diagrams: C with subexpression A = C|_p at position p,
and [B/p]C with B at position p.]

Michael Kohlhase: Artificial Intelligence 1 429 2025-02-06

The operation of replacing a subformula at position p is quite different from e.g. (first-order)
substitutions:
• We are replacing subformulae with subformulae instead of instantiating variables with terms.
• Substitutions replace all occurrences of a variable in a formula, whereas formula replacement
only affects the (one) subformula at position p.
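A small OCaml sketch of positions on the term language shows how subterm access and replacement work; the names and the restriction to terms are illustrative simplifications:

type term = Var of string | Fn of string * term list

(* C|_p : the subexpression of t at position p (a list of 1-based child indices) *)
let rec subterm_at t p =
  match t, p with
  | _, [] -> t
  | Fn (_, args), i :: rest -> subterm_at (List.nth args (i - 1)) rest
  | Var _, _ -> failwith "position does not exist"

(* [a/p]C : replace the subexpression of t at position p by a *)
let rec replace_at t p a =
  match t, p with
  | _, [] -> a
  | Fn (f, args), i :: rest ->
      Fn (f, List.mapi (fun j arg -> if j = i - 1 then replace_at arg rest a else arg) args)
  | Var _, _ -> failwith "position does not exist"

(* e.g. in f(x, g(y)) the subterm at position [2; 1] is y; replacing it by z gives f(x, g(z)) *)
let _ = replace_at (Fn ("f", [ Var "x"; Fn ("g", [ Var "y" ]) ])) [ 2; 1 ] (Var "z")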
We conclude this section with an extended example: the proof of a classical mathematical result
in the natural deduction calculus with equality. This shows us that we can derive strong properties
about complex situations (here the real numbers; an uncountably infinite set of numbers).

ND=1 Example: √2 is Irrational

 We can do real mathematics with ND=1:

 Theorem 14.3.10. √2 is irrational.
Proof: We prove the assertion by contradiction
1. Assume that √2 is rational.
2. Then there are numbers p and q such that √2 = p/q.

3. So we know 2q² = p².
4. But 2q² has an odd number of prime factors while p² has an even number.
5. This is a contradiction (since they are equal), so we have proven the assertion.

Michael Kohlhase: Artificial Intelligence 1 430 2025-02-06

If we want to formalize this into ND1 , we have to write down all the assertions in the proof steps
in PL1 syntax and come up with justifications for them in terms of ND1 inference rules. The next
two slides show such a proof, where we write prime(n) to denote that n is prime, use #(n) for the number
of prime factors of a number n, and write irr(r) if r is irrational.

ND=1 Example: √2 is Irrational (the Proof)

 #   hyp   formula                                        NDjust
 1         ∀n, m.¬(2n + 1) = (2m)                         lemma
 2         ∀n, m.#(n^m) = m#(n)                           lemma
 3         ∀n, p.prime(p) ⇒ #(pn) = (#(n) + 1)            lemma
 4         ∀x.irr(x) ⇔ ¬(∃p, q.x = p/q)                   definition
 5         irr(√2) ⇔ ¬(∃p, q.√2 = p/q)                    ND⊢1∀E(4)
 6   6     ¬irr(√2)                                       ND⊢0 Ax
 7   6     ¬¬(∃p, q.√2 = p/q)                             ⇔E(6, 5)
 8   6     ∃p, q.√2 = p/q                                 ND⊢0¬E(7)
 9   6,9   √2 = p/q                                       ND⊢0 Ax
 10  6,9   2q² = p²                                       arith(9)
 11  6,9   #(p²) = 2#(p)                                  ND⊢1∀E^2(2)
 12  6,9   prime(2) ⇒ #(2q²) = (#(q²) + 1)                ND⊢1∀E^2(3)

Michael Kohlhase: Artificial Intelligence 1 431 2025-02-06

Lines 6 and 9 are local hypotheses for the proof (they only have an implicit counterpart in the
inference rules as defined above). Finally we have abbreviated the arithmetic simplification of line
9 with the justification “arith” to avoid having to formalize elementary arithmetic.

ND=1 Example: √2 is Irrational (the Proof continued)

 13         prime(2)                                      lemma
 14  6,9    #(2q²) = #(q²) + 1                            ND0⇒E(13, 12)
 15  6,9    #(q²) = 2#(q)                                 ND1∀E^2(2)
 16  6,9    #(2q²) = 2#(q) + 1                            =E(14, 15)
 17         #(p²) = #(p²)                                 =I
 18  6,9    #(2q²) = #(p²)                                =E(17, 10)
 19  6,9    2#(q) + 1 = #(p²)                             =E(18, 16)
 20  6,9    2#(q) + 1 = 2#(p)                             =E(19, 11)
 21  6,9    ¬(2#(q) + 1) = (2#(p))                        ND1∀E^2(1)
 22  6,9    F                                             ND0 FI(20, 21)
 23  6      F                                             ND1∃E^6(22)
 24         ¬¬irr(√2)                                     ND0¬I^6(23)
 25         irr(√2)                                       ND0¬E^2(23)

Michael Kohlhase: Artificial Intelligence 1 432 2025-02-06

We observe that the ND1 proof is much more detailed, and needs quite a few Lemmata about
# to go through. Furthermore, we have added a definition of irrationality (and treat definitional
equality via the equality rules). Apart from these artefacts of formalization, the two representations
of proofs correspond to each other very directly.

14.4 Conclusion
Summary (Predicate Logic)
 First-order logic allows us to speak explicitly about objects and their properties. It is
thus a more natural and compact representation language than propositional logic;
it also enables us to speak about infinite sets of objects.
 Logic has thousands of years of history. A major current application in AI is semantic
technology. (up soon)

 First-order logic (PL1) allows universal and existential quantification over
individuals.
 A PL1 model consists of a universe Dι and a function I mapping individual con-
stants/predicate constants/function constants to elements/relations/functions on
Dι .
 First-order natural deduction is a sound and complete calculus for PL1 intended
and optimized for human understanding.

Michael Kohlhase: Artificial Intelligence 1 433 2025-02-06

Applications for ND1 (and extensions)

 Recap: We can express mathematical theorems in PL1 and prove them in ND1 .
 Problem: These proofs can be huge (giga-steps), how can we trust them?

 Definition 14.4.1. A proof checker for a calculus C is a program that reads (a


formal representation) of a C-proof P and performs proof-checking, i.e. it checks
whether all rule applications in P are (syntactically) correct.
 Remark: Proof-checking goes step-by-step, so proof checkers run in linear time.

 Idea: If the logic can express (safety)-properties of programs, we can use proof
checkers for formal program verification. (there are extensions of PL1 that can)
 Problem: These proofs can be humongous, how can humans write them?
 Idea: Automate proof construction via

 lemma/theorem libraries that collect useful intermediate results


 tactics =̂ subroutines that construct recurring sub-proofs
 calls to automated theorem prover (ATP) (next chapter)

Proof checkers that do any/all of these are called proof assistants.


 Definition 14.4.2. Formal methods are logic-based techniques for the specification,
development, analysis, and verification of software and hardware.

 Formal methods is a major (industrial) application of AI/logic technology.

Michael Kohlhase: Artificial Intelligence 1 434 2025-02-06

Suggested Reading:
• Chapter 8: First-Order Logic, Sections 8.1 and 8.2 in [RN09]
– A less formal account of what I cover in “Syntax” and “Semantics”. Contains different exam-
ples, and complementary explanations. Nice as additional background reading.

• Sections 8.3 and 8.4 provide additional material on using PL1, and on modeling in PL1, that I
don’t cover in this lecture. Nice reading, not required for exam.
• Chapter 9: Inference in First-Order Logic, Section 9.5.1 in [RN09]
– A very brief (2 pages) description of what I cover in “Normal Forms”. Much less formal; I
couldn’t find where (if at all) RN cover transformation into prenex normal form. Can serve
as additional reading, can’t replace the lecture.
• Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this
in AI-1, but provide one for the calculi introduced so far in??.
Chapter 15

Automated Theorem Proving in First-Order Logic

In this chapter, we take up the machine-oriented calculi for propositional logic from ?? and extend
them to the first-order case. While this has been relatively easy for the natural deduction calculus
– we only had to introduce the notion of substitutions for the elimination rule for the universal
quantifier – we have to work much harder here to make the calculi effective for implementation.

15.1 First-Order Inference with Tableaux


15.1.1 First-Order Tableau Calculi
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/25156.

Test Calculi: Tableaux and Model Generation


 Idea: A tableau calculus is a test calculus that
 analyzes labeled formulae in a tree to determine satisfiability,
 its branches correspond to valuations (⇝ models).
 Example 15.1.1. Tableau calculi try to construct models for labeled formulae:

Tableau refutation (Validity): ⊨ P ∧ Q ⇒ Q ∧ P

 (P ∧ Q ⇒ Q ∧ P)^F
 (P ∧ Q)^T
 (Q ∧ P)^F
 P^T
 Q^T
  P^F  |  Q^F
  ⊥    |  ⊥
 No model

Model generation (Satisfiability): ⊨ P ∧ (Q ∨ ¬R) ∧ ¬Q

 (P ∧ (Q ∨ ¬R) ∧ ¬Q)^T
 (P ∧ (Q ∨ ¬R))^T
 ¬Q^T
 Q^F
 P^T
 (Q ∨ ¬R)^T
  Q^T  |  ¬R^T
  ⊥    |  R^F
 Herbrand model {P^T, Q^F, R^F}
 φ := {P ↦ T, Q ↦ F, R ↦ F}

 Idea: Open branches in saturated tableaux yield models.


 Algorithm: Fully expand all possible tableaux, (no rule can be applied)


 Satisfiable, iff there are open branches (correspond to models)

Michael Kohlhase: Artificial Intelligence 1 435 2025-02-06

Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis
on when a formula can be made true (or false). Therefore the formulae are decorated with upper
indices that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is
decorated with the intended truth value T). This tableau uses the same rules as the refutation
tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a
closed branch and an open one. The latter corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.

Analytical Tableaux (Formal Treatment of T0 )


 Idea: A test calculus where
 A labeled formula is analyzed in a tree to determine satisfiability,
 branches correspond to valuations (models)
 Definition 15.1.2. The propositional tableau calculus T0 has two inference rules
per connective (one for each possible label):

 (A ∧ B)^T          (A ∧ B)^F           ¬A^T          ¬A^F          A^α   A^β   α ≠ β
 ────────── T0∧     ──────────── T0∨    ────── T0¬T   ────── T0¬F   ─────────────────── T0⊥
    A^T              A^F | B^F            A^F           A^T                  ⊥
    B^T

Use rules exhaustively as long as they contribute new material (⇝ termination)

 Definition 15.1.3. We call any tree (T0∨ introduces branches) produced by the T0


inference rules from a set Φ of labeled formulae a tableau for Φ.
 Definition 15.1.4. Call a tableau saturated, iff no rule adds new material and a
branch closed, iff it ends in ⊥, else open. A tableau is closed, iff all of its branches
are.
In analogy to the ⊥ at the end of closed branches, we sometimes decorate open
branches with a 2 symbol.

Michael Kohlhase: Artificial Intelligence 1 436 2025-02-06

These inference rules act on tableaux and have to be read as follows: if the formulae over the line
appear in a tableau branch, then the branch can be extended by the formulae or branches below
the line. There are two rules for each primary connective, and a branch closing rule that adds the
special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 15.1.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .

The saturated tableau represents a full case analysis of what is necessary to give A the truth
value α; since all branches are closed (contain contradictions) this is impossible.

Analytical Tableaux (T0 continued)


 Definition 15.1.6 (T0 -Theorem/Derivability). A is a T0 -theorem (⊢T0 A), iff
there is a closed tableau with AF at the root.
Φ ⊆ wff0 (V0 ) derives A in T0 (Φ⊢T0 A), iff there is a closed tableau starting with AF
and ΦT . The tableau with only a branch of AF and ΦT is called initial for Φ⊢T0 A.

Michael Kohlhase: Artificial Intelligence 1 437 2025-02-06

Definition 15.1.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
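For the ∧/¬ fragment this case analysis is easy to mechanize. The following OCaml sketch keeps, per branch, a worklist of labeled formulae and the atoms decided so far; T0∨ is the branching case and T0⊥ closes a branch when an atom is forced to both truth values. All names and the representation are illustrative choices:

type form = Atom of string | Neg of form | And of form * form

(* is there an open saturated branch for the given labeled formulae? *)
let rec open_branch lits = function
  | [] -> true                                             (* saturated and open *)
  | (Atom a, v) :: rest ->
      (match List.assoc_opt a lits with
       | Some v' when v' <> v -> false                     (* T0⊥: branch closes *)
       | Some _ -> open_branch lits rest
       | None -> open_branch ((a, v) :: lits) rest)
  | (Neg a, v) :: rest -> open_branch lits ((a, not v) :: rest)        (* T0¬T, T0¬F *)
  | (And (a, b), true) :: rest ->
      open_branch lits ((a, true) :: (b, true) :: rest)                (* T0∧ *)
  | (And (a, b), false) :: rest ->                                     (* T0∨: two branches *)
      open_branch lits ((a, false) :: rest)
      || open_branch lits ((b, false) :: rest)

(* A is a T0-theorem iff the tableau for A^F closes, i.e. no branch stays open. *)
let theorem a = not (open_branch [] [ (a, false) ])

(* Example 15.1.1 (left): P ∧ Q ⇒ Q ∧ P, with ∨ and ⇒ as the usual abbreviations *)
let disj a b = Neg (And (Neg a, Neg b))
let imp a b = disj (Neg a) b
let () =
  let p, q = Atom "P", Atom "Q" in
  assert (theorem (imp (And (p, q)) (And (q, p))))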
We will now extend the propositional tableau techniques to first-order logic. We only have to add
two new rules for the universal quantifier (in positive and negative polarity).

First-Order Standard Tableaux (T1 )


 Definition 15.1.8. The standard tableau calculus (T1 ) extends T0 (propositional
tableau calculus) with the following quantifier rules:
 (∀X.A)^T    C ∈ cwff ι (Σι )              (∀X.A)^F    c ∈ Σ^sk_0 new
 ────────────────────────────── T1∀        ─────────────────────────── T1∃
       ([C/X](A))^T                              ([c/X](A))^F

 Problem: The rule T1 ∀ displays a case of “don’t know indeterminism”: to find a


refutation we have to guess a formula C from the (usually infinite) set cwff ι (Σι ).
For proof search, this means that we have to systematically try all, so T1 ∀ is infinitely
branching in general.

Michael Kohlhase: Artificial Intelligence 1 438 2025-02-06

The rule T1∀ operationalizes the intuition that a universally quantified formula is true, iff all
of the instances of the scope are. To understand the T1∃ rule, we have to keep in mind that
∃X.A abbreviates ¬(∀X.¬A), so that we have to read (∀X.A)^F existentially — i.e. as (∃X.¬A)^T,
stating that there is an object with property ¬A. In this situation, we can simply give this
object a name: c, which we take from our (infinite) set of witness constants Σ^sk_0, which we have
given ourselves expressly for this purpose when we defined first-order syntax. In other words
([c/X](¬A))^T = ([c/X](A))^F holds, and this is just the conclusion of the T1∃ rule.
Note that the T1 ∀ rule is computationally extremely inefficient: we have to guess an (i.e. in a
search setting to systematically consider all) instance C ∈ wff ι (Σι , Vι ) for X. This makes the rule
infinitely branching.

In the next calculus we will try to remedy the computational inefficiency of the T1 ∀ rule. We do
this by delaying the choice in the universal rule.

Free variable Tableaux (T1f )

 Definition 15.1.9. The free variable tableau calculus (T1f ) extends T0 (proposi-
tional tableau calculus) with the quantifier rules:

 (∀X.A)^T    Y new              (∀X.A)^F    free(∀X.A) = {X^1, . . ., X^k}    f ∈ Σ^sk_k new
 ──────────────────── T1f∀      ──────────────────────────────────────────────────────────── T1f∃
    ([Y/X](A))^T                        ([f(X^1, . . ., X^k)/X](A))^F

and generalizes its cut rule T0⊥ to:

 A^α    B^β    α ≠ β    σ(A) = σ(B)
 ──────────────────────────────────── T1f⊥
              ⊥ : σ

T1f⊥ instantiates the whole tableau by σ.

 Advantage: No guessing necessary in T1f ∀-rule!


 New Problem: find suitable substitution (most general unifier) (later)

Michael Kohlhase: Artificial Intelligence 1 439 2025-02-06

Metavariables: Instead of guessing a concrete instance for the universally quantified variable
as in the T1 ∀ rule, T1f ∀ instantiates it with a new metavariable Y , which will be instantiated by
need in the course of the derivation.
Skolem terms as witnesses: The introduction of metavariables makes it necessary to extend
the treatment of witnesses in the existential rule. Intuitively, we cannot simply invent a new name,
since the meaning of the body A may contain metavariables introduced by the T1f∀ rule. As we
do not know their values yet, the witness for the existential statement in the antecedent of the
T1f∃ rule needs to depend on them. So we represent the witness by a witness term, concretely by applying a
Skolem function to the metavariables in A.
Instantiating Metavariables: Finally, the T1f⊥ rule completes the treatment of metavariables,
it allows to instantiate the whole tableau in a way that the current branch closes. This leaves us
with the problem of finding substitutions that make two terms equal.

Free variable Tableaux (T1f ): Derivable Rules

 Definition 15.1.10. Derivable quantifier rules in T1f:

 (∃X.A)^T    free(∃X.A) = {X^1, . . ., X^k}    f ∈ Σ^sk_k new
 ─────────────────────────────────────────────────────────────
          ([f(X^1, . . ., X^k)/X](A))^T

 (∃X.A)^F    Y new
 ───────────────────
    ([Y/X](A))^F

Michael Kohlhase: Artificial Intelligence 1 440 2025-02-06



Tableau Reasons about Blocks

 Example 15.1.11 (Reasoning about Blocks). Returning to slide 405:
 Question: What do you see here?
 [Picture: blocks A, D, B, E, C on a table]
 You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
 Can we prove red(A) from ∀x.block(x) ⇒ red(x) and block(A)?
 And now: Say it in propositional logic!

 (∀X.block(X) ⇒ red(X))^T
 block(A)^T
 red(A)^F
 (block(Y) ⇒ red(Y))^T
  block(Y)^F    |    red(A)^T
  ⊥ : [A/Y]     |    ⊥

Michael Kohlhase: Artificial Intelligence 1 441 2025-02-06

15.1.2 First-Order Unification


Video Nuggets covering this subsection can be found at https://fau.tv/clip/id/26810 and
https://fau.tv/clip/id/26811.
We will now look into the problem of finding a substitution σ that makes two terms equal (we
say it unifies them) in more detail. The presentation of the unification algorithm we give here
is “transformation-based”; this has been a very influential way to treat certain algorithms in theoret-
ical computer science.
A transformation-based view of algorithms: The “transformation-based” view of algorithms
divides two concerns in presenting and reasoning about algorithms according to Kowalski’s slogan
[Kow97]
algorithm = logic + control
The computational paradigm highlighted by this quote is that (many) algorithms can be thought
of as manipulating representations of the problem at hand and transforming them into a form
that makes it simple to read off solutions. Given this, we can simplify thinking and reasoning
about such algorithms by separating out their “logical” part, which is concerned with
how the problem representations can be manipulated in principle, from the “control” part, which
is concerned with questions about when to apply which transformations.
It turns out that many questions about the algorithms can already be answered on the “logic”
level, and that the “logical” analysis of the algorithm can already give strong hints as to how to
optimize control.
In fact we will only concern ourselves with the “logical” analysis of unification here.
The first step towards a theory of unification is to take a closer look at the problem itself. A first
set of examples show that we have multiple solutions to the problem of finding substitutions that
make two terms equal. But we also see that these are related in a systematic way.

Unification (Definitions)
 Definition 15.1.12. For given terms A and B, unification is the problem of finding
a substitution σ, such that σ(A) = σ(B).
 Notation: We write term pairs as A=?B e.g. f (X)=?f (g(Y )).

 Definition 15.1.13. Solutions (e.g. [g(a)/X], [a/Y ], [g(g(a))/X], [g(a)/Y ], or


[g(Z)/X], [Z/Y ]) are called unifiers, U(A=?B) := {σ | σ(A) = σ(B)}.
 Idea: Find representatives in U(A=?B), that generate the set of solutions.

 Definition 15.1.14. Let σ and θ be substitutions and W ⊆ Vι , we say that


a substitution σ is more general than θ (on W ; write σ≤θ[W ]), iff there is a
substitution ρ, such that θ=ρ ◦ σ[W ], where σ=ρ[W ], iff σ(X) = ρ(X) for all
X ∈ W.
 Definition 15.1.15. σ is called a most general unifier (mgu) of A and B, iff it is
minimal in U(A=?B) wrt. ≤[(free(A) ∪ free(B))].

Michael Kohlhase: Artificial Intelligence 1 442 2025-02-06

The idea behind a most general unifier is that all other unifiers can be obtained from it by (further)
instantiation. In an automated theorem proving setting, this means that using most general
unifiers is the least committed choice — any other choice of unifiers (that would be necessary for
completeness) can later be obtained by other substitutions.
Note that there is a subtlety in the definition of the ordering on substitutions: we only compare
on a subset of the variables. The reason for this is that we have defined substitutions to be total
on (the infinite set of) variables for flexibility, but in the applications (see the definition of most
general unifiers), we are only interested in a subset of variables: the ones that occur in the initial
problem formulation. Intuitively, we do not care what the unifiers do off that set. If we did not
have the restriction to the set W of variables, the ordering relation on substitutions would become
much too fine-grained to be useful (i.e. to guarantee unique most general unifiers in our case).
Now that we have defined the problem, we can turn to the unification algorithm itself. We
will define it in a way that is very similar to logic programming: we first define a calculus that
generates “solved forms” (formulae from which we can read off the solution) and reason about
control later. In this case we will reason that control does not matter.

Unification Problems (=̂ Equational Systems)
 Idea: Unification is equation solving.
 Definition 15.1.16. We call a formula A1=?B1 ∧ . . . ∧ An=?Bn an unification
problem iff Ai , Bi ∈ wff ι (Σι , Vι ).
 Note: We consider unification problems as sets of equations (∧ is ACI), and
equations as two-element multisets (=? is C).

 Definition 15.1.17. A substitution is called a unifier for a unification problem E


(and E is then called unifiable), iff it is a (simultaneous) unifier for all pairs in E.

Michael Kohlhase: Artificial Intelligence 1 443 2025-02-06

In principle, unification problems are sets of equations, which we write as conjunctions, since all of
them have to be solved for finding a unifier. Note that it is not a problem for the “logical view” that
the representation as conjunctions induces an order, since we know that conjunction is associative,
commutative and idempotent, i.e. that conjuncts do not have an intrinsic order or multiplicity,
if we consider two equational problems as equal, if they are equivalent as propositional formulae.
In the same way, we will abstract from the order in equations, since we know that the equality
relation is symmetric. Of course we would have to deal with this somehow in the implementation
(typically, we would implement equational problems as lists of pairs), but that belongs into the
“control” aspect of the algorithm, which we are abstracting from at the moment.

Solved forms and Most General Unifiers


 Definition 15.1.18. We call a pair X=?A solved in a unification problem E, iff
X is a variable, E = X=?A ∧ E′, and X ∉ (free(A) ∪ free(E′)). We call a unification
problem E a solved form, iff all its pairs are solved.

 Lemma 15.1.19. Solved forms are of the form X^1=?B1 ∧ . . . ∧ X^n=?Bn where
the X^i are distinct and X^i ∉ free(Bj).
 Definition 15.1.20. Any substitution σ = [B1 /X 1 ], . . . ,[Bn /X n ] induces a solved
unification problem E σ :=(X 1=?B1 ∧ . . . ∧ X n=?Bn ).
 Lemma 15.1.21. If E = X 1=?B1 ∧ . . . ∧ X n=?Bn is a solved form, then E has
the unique most general unifier σ E :=[B1 /X 1 ], . . . ,[Bn /X n ].
 Proof: Let θ ∈ U(E)
1. then θ(X i ) = θ(Bi ) = θ ◦ σ E (X i )
2. and thus θ=θ ◦ σ E [supp(σ)].

 Note: We can rename the introduced variables in most general unifiers!

Michael Kohlhase: Artificial Intelligence 1 444 2025-02-06

It is essential to our “logical” analysis of the unification algorithm that we arrive at unification
problems whose unifiers we can read off easily. Solved forms serve that need perfectly as ??
shows.
Given the idea that unification problems can be expressed as formulae, we can express the algo-
rithm in three simple rules that transform unification problems into solved forms (or unsolvable
ones).

Unification Algorithm
 Definition 15.1.22. The inference system U consists of the following rules:

 E ∧ f(A1, . . ., An)=?f(B1, . . ., Bn)                E ∧ A=?A
 ──────────────────────────────────────── Udec        ─────────── Utriv
 E ∧ A1=?B1 ∧ . . . ∧ An=?Bn                               E

 E ∧ X=?A    X ∉ free(A)    X ∈ free(E)
 ──────────────────────────────────────── Uelim
 [A/X](E) ∧ X=?A

 Lemma 15.1.23. U is correct: E⊢U F implies U(F) ⊆ U(E).


 Lemma 15.1.24. U is complete: E⊢U F implies U(E) ⊆ U(F).

 Lemma 15.1.25. U is confluent: the order of derivations does not matter.


 Corollary 15.1.26. First-order unification is unitary: i.e. most general unifiers are
unique up to renaming of introduced variables.
 Proof sketch: U is trivially branching.

Michael Kohlhase: Artificial Intelligence 1 445 2025-02-06

The decomposition rule Udec is completely straightforward, but note that it transforms one unifi-
cation pair into multiple argument pairs; this is the reason, why we have to directly use unification

problems with multiple pairs in U.


Note furthermore, that we could have restricted the Utriv rule to variable-variable pairs, since
for any other pair, we can decompose until only variables are left. Here we observe, that constant-
constant pairs can be decomposed with the Udec rule in the somewhat degenerate case without
arguments.
Finally, we observe that the first of the two variable conditions in Uelim (the “occurs-in-check”)
makes sure that we only apply the transformation to unifiable unification problems, whereas the
second one is a termination condition that prevents the rule to be applied twice.
The notion of completeness and correctness is a bit different than that for calculi that we compare
to the entailment relation. We can think of the “logical system of unifiability” with the model class
of sets of substitutions, where a set satisfies an equational problem E, iff all of its members are
unifiers. This view induces the soundness and completeness notions presented above.
The three meta-properties above are relatively trivial, but somewhat tedious to prove, so we leave
the proofs as an exercise to the reader.
We now fortify our intuition about the unification calculus by two examples. Note that we only
need to pursue one possible U derivation since we have confluence.

Unification Examples
 Example 15.1.27. Two similar unification problems:

f(g(X, X), h(a)) =? f(g(a, Z), h(Z))
 Udec:   g(X, X)=?g(a, Z) ∧ h(a)=?h(Z)
 Udec:   X=?a ∧ X=?Z ∧ h(a)=?h(Z)
 Udec:   X=?a ∧ X=?Z ∧ a=?Z
 Uelim:  X=?a ∧ a=?Z ∧ a=?Z
 Uelim:  X=?a ∧ Z=?a ∧ a=?a
 Utriv:  X=?a ∧ Z=?a
 MGU: [a/X], [a/Z]

f(g(X, X), h(a)) =? f(g(b, Z), h(Z))
 Udec:   g(X, X)=?g(b, Z) ∧ h(a)=?h(Z)
 Udec:   X=?b ∧ X=?Z ∧ h(a)=?h(Z)
 Udec:   X=?b ∧ X=?Z ∧ a=?Z
 Uelim:  X=?b ∧ b=?Z ∧ a=?Z
 Uelim:  X=?b ∧ Z=?b ∧ a=?b
 a=?b is not unifiable

Michael Kohlhase: Artificial Intelligence 1 446 2025-02-06
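The derivations above can also be run as a program. The following OCaml sketch is a recursive variant of the transformation system U: it works off the unification pairs one at a time and applies each solved binding to the remaining problem as in Uelim. The list-of-bindings representation of the result and all names are illustrative choices:

type term = Var of string | Fn of string * term list

let rec subst x t = function                      (* apply the single binding [t/x] *)
  | Var y -> if y = x then t else Var y
  | Fn (f, args) -> Fn (f, List.map (subst x t) args)

let rec occurs x = function
  | Var y -> x = y
  | Fn (_, args) -> List.exists (occurs x) args

(* solve a list of pairs, threading the bindings found so far *)
let rec solve pairs theta =
  match pairs with
  | [] -> Some theta
  | (s, t) :: rest when s = t -> solve rest theta                     (* Utriv *)
  | (Fn (f, ss), Fn (g, ts)) :: rest ->
      if f = g && List.length ss = List.length ts
      then solve (List.combine ss ts @ rest) theta                    (* Udec *)
      else None                                                       (* clash *)
  | (Var x, t) :: rest | (t, Var x) :: rest ->
      if occurs x t then None                                         (* occurs-in-check *)
      else
        let rest' = List.map (fun (a, b) -> (subst x t a, subst x t b)) rest in
        solve rest' ((x, t) :: List.map (fun (y, s) -> (y, subst x t s)) theta)

let unify s t = solve [ (s, t) ] []

(* Example 15.1.27: *)
let x, z, a, b = Var "X", Var "Z", Fn ("a", []), Fn ("b", [])
let s1 = Fn ("f", [ Fn ("g", [ x; x ]); Fn ("h", [ a ]) ])
let t1 = Fn ("f", [ Fn ("g", [ a; z ]); Fn ("h", [ z ]) ])
let t2 = Fn ("f", [ Fn ("g", [ b; z ]); Fn ("h", [ z ]) ])
(* unify s1 t1 yields the mgu [a/X],[a/Z]; unify s1 t2 yields None *)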

We will now convince ourselves that there cannot be any infinite sequences of transformations in
U. Termination is an important property for an algorithm.
The proof we present here is very typical for termination proofs. We map unification problems
into a partially ordered set ⟨S, ≺⟩ where we know that there cannot be any infinitely descending
sequences (we think of this as measuring the unification problems). Then we show that all trans-
formations in U strictly decrease the measure of the unification problems and argue that if there
were an infinite transformation in U, then there would be an infinite descending chain in S, which
contradicts our choice of ⟨S, ≺⟩.
The crucial step in coming up with such proofs is finding the right partially ordered set.
Fortunately, there are some tools we can make use of. We know that ⟨N, <⟩ is terminating, and
there are some ways of lifting component orderings to complex structures. For instance it is well-
known that the lexicographic ordering lifts a terminating ordering to a terminating ordering on
finite dimensional Cartesian spaces. We show a similar, but less known construction with multisets
for our proof.

Unification (Termination)
 Definition 15.1.28. Let S and T be multisets and ≤ a partial ordering on S ∪ T.
Then we define S ≺m T, iff S = C ⊎ S′ and T = C ⊎ {t}, where s ≤ t for all s ∈ S′.
We call ≺m the multiset ordering induced by ≤.
 Definition 15.1.29. We call a variable X solved in an unification problem E, iff E
contains a solved pair X=?A.

 Lemma 15.1.30. If ≺ is linear/terminating on S, then ≺m is linear/terminating


on P(S).
 Lemma 15.1.31. U is terminating. (any U-derivation is finite)
 Proof: We prove termination by mapping U transformation into a Noetherian space.
1. Let µ(E):=⟨n, N ⟩, where
 n is the number of unsolved variables in E

 N is the multiset of term depths in E


2. The lexicographic order ≺ on pairs µ(E) is decreased by all inference rules.
2.1. Udec and Utriv decrease the multiset of term depths without increasing
the unsolved variables.
2.2. Uelim decreases the number of unsolved variables (by one), but may in-
crease term depths.

Michael Kohlhase: Artificial Intelligence 1 447 2025-02-06

But it is very simple to create terminating calculi, e.g. by having no inference rules. So there
is one more step to go to turn the termination result into a decidability result: we must make sure
that we have enough inference rules so that any unification problem is transformed into solved
form if it is unifiable.

First-Order Unification is Decidable


 Definition 15.1.32. We call an equational problem E U-reducible, iff there is a
U-step E⊢U F from E.
 Lemma 15.1.33. If E is unifiable but not solved, then it is U-reducible.

 Proof: We assume that E is unifiable but unsolved and show the U rule that applies.
1. There is an unsolved pair A=?B in E = E′ ∧ A=?B.
we have two cases
2. A, B ̸∈ Vι
2.1. then A = f (A1 . . . An ) and B = f (B1 . . . Bn ), and thus Udec is appli-
cable
3. A = X ∈ free(E)
3.1. then Uelim (if B ̸= X) or Utriv (if B = X) is applicable.
 Corollary 15.1.34. First-order unification is decidable in PL1 .
Proof:
 1. U-irreducible unification problems can be reached in finite time by ??.
2. They are either solved or unsolvable by ??, so they provide the answer.

Michael Kohlhase: Artificial Intelligence 1 448 2025-02-06

15.1.3 Efficient Unification


Now that we have seen the basic ingredients of an unification algorithm, let us as always consider
complexity and efficiency issues.
We start with a look at the complexity of unification and – somewhat surprisingly – find expo-
nential time/space complexity based simply on the fact that the results – the unifiers – can be
exponentially large.

Complexity of Unification
 Observation: Naive implementations of unification are exponential in time and
space.

 Example 15.1.35. Consider the terms

sn = f (f (x0 , x0 ), f (f (x1 , x1 ), f (. . . , f (xn−1 , xn−1 )) . . .))


tn = f (x1 , f (x2 , f (x3 , f (· · · , xn ) · · · )))

 The most general unifier of sn and tn is


σ n := [f (x0 , x0 )/x1 ], [f (f (x0 , x0 ), f (x0 , x0 ))/x2 ], [f (f (f (x0 , x0 ), f (x0 , x0 )), f (f (x0 , x0 ), f (x0 , x0 )))/x3 ], . . .
 It contains Σ_{i=1}^{n} 2^i = 2^{n+1} − 2 occurrences of the variable x0. (exponential)
 Problem: The variable x0 has been copied too often.

 Idea: Find a term representation that re-uses subterms.

Michael Kohlhase: Artificial Intelligence 1 449 2025-02-06
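The blowup can be checked with a few lines of OCaml; the encoding (binary trees over a single variable x0) is an illustrative simplification of the terms in Example 15.1.35:

type tree = X0 | Node of tree * tree

(* σ_n binds x_k to a complete binary tree of f-applications with 2^k leaves x0 *)
let rec binding k = if k = 0 then X0 else let t = binding (k - 1) in Node (t, t)

let rec count_x0 = function X0 -> 1 | Node (a, b) -> count_x0 a + count_x0 b

(* total number of x0-occurrences in σ_n = Σ_{k=1..n} 2^k = 2^(n+1) − 2 *)
let total n =
  List.fold_left (fun acc k -> acc + count_x0 (binding k)) 0 (List.init n (fun i -> i + 1))

(* total 10 = 2046 = 2^11 − 2 *)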

Indeed, the only way to escape this combinatorial explosion is to find representations of substitu-
tions that are more space efficient.

Directed Acyclic Graphs (DAGs) for Terms


 Recall: Terms in first-order logic are essentially trees.

 Concrete Idea: Use directed acyclic graphs for representing terms:


 variables may only occur once in the DAG.
 subterms can be referenced multiply. (subterm sharing)
 we can even represent multiple terms in a common DAG

 Observation 15.1.36. Terms can be transformed into DAGs in linear time.


 Example 15.1.37. Continuing from ?? . . . s3 , t3 , and σ 3 (s3 ) as DAGs:

[DAG diagrams for s3, t3, and σ3(t3): each variable occurs only once, and shared subterms are referenced multiply.]

In general: sn , tn , and σ n (sn ) only need space in O(n). (just count)

Michael Kohlhase: Artificial Intelligence 1 450 2025-02-06

If we look at the unification algorithm from ?? and the considerations in the termination proof
(??) with a particular focus on the role of copying, we easily find the culprit for the exponential
blowup: Uelim, which applies solved pairs as substitutions.

DAG Unification Algorithm


 Observation: In U, the Uelim rule applies solved pairs ⇝ subterm duplication.

 Idea: Replace Uelim and the notion of solved forms by something better.


 Definition 15.1.38. We say that X 1=?B1 ∧ . . . ∧ X n=?Bn is a DAG solved form,
iff the X i are distinct and X i ̸∈ free(Bj ) for i ≤ j.
 Definition 15.1.39. The inference system DU contains rules Udec and Utriv from
U plus the following:

 E ∧ X=?A ∧ X=?B    A, B ∉ Vι    |A| ≤ |B|
 ─────────────────────────────────────────── DUmerge
 E ∧ X=?A ∧ A=?B

 E ∧ X=?Y    X ≠ Y    X, Y ∈ free(E)
 ───────────────────────────────────── DUevar
 [Y/X](E) ∧ X=?Y
where |A| is the number of symbols in A.

 The analysis for U applies mutatis mutandis.

Michael Kohlhase: Artificial Intelligence 1 451 2025-02-06

We will now turn the ideas we have developed in the last couple of slides into a usable func-
tional algorithm. The starting point is treating terms as DAGs. Then we try to conduct the
transformation into solved form without adding new nodes.

Unification by DAG-chase
 Idea: Extend the Input-DAGs by edges that represent unifiers.
 Definition 15.1.40. Write n.a, if a is the symbol of node n.

 (standard) auxiliary procedures: (all constant or linear time in DAGs)


 find(n) follows the path from n and returns the end node.

 union(n, m) adds an edge between n and m.


 occur(n, m) determines whether n.x occurs in the DAG with root m.

Michael Kohlhase: Artificial Intelligence 1 452 2025-02-06

Algorithm dag−unify
 Input: symmetric pairs of nodes in DAGs
fun dag−unify(n,n) = true                        (* identical nodes: nothing to do *)
| dag−unify(n.x,m) =                             (* variable node n.x against m *)
    if occur(n,m) then false                     (* occurs-check: x inside m ⇝ fail *)
    else (union(n,m); true)                      (* bind x by linking n to m *)
| dag−unify(n.f,m.g) =                           (* two function nodes *)
    if g != f then false                         (* head symbol clash *)
    else forall (i,j) => dag−unify(find(i),find(j)) (chld m,chld n)
end

 Observation 15.1.41. dag−unify uses linear space, since no new nodes are created,
and at most one link per variable.
 Problem: dag−unify still uses exponential time.

 Example 15.1.42. Consider terms f(sn, f(t′n, xn)) and f(tn, f(s′n, yn)), where s′n =
[yi/xi](sn) and t′n = [yi/xi](tn).
dag−unify needs exponentially many recursive calls to unify the nodes xn and yn.
(they are unified after n calls, but checking needs the time)

Michael Kohlhase: Artificial Intelligence 1 453 2025-02-06

Algorithm uf−unify
 Recall: dag−unify still uses exponential time.
 Idea: Also bind the function nodes, if the arguments are unified.
uf−unify(n.f ,m.g) =
if g!=f then false
else union(n,m);
forall (i,j) => uf−unify(find(i),find(j)) (chld m,chld n)
end

 This only needs linearly many recursive calls as it directly returns with true or makes
a node inaccessible for find.
 Linearly many calls to linear procedures give quadratic running time.

 Remark: There are versions of uf−unify that are linear in time and space, but for
most purposes, our algorithm suffices.

Michael Kohlhase: Artificial Intelligence 1 454 2025-02-06
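The following Python sketch (again hypothetical and only illustrative, not the course implementation)
spells out uf−unify on term DAGs: nodes carry a link pointer, find follows links, union adds them,
and function nodes are bound as soon as their arguments are scheduled for unification.

# Hypothetical sketch: unification on term DAGs with union/find links.
def mk(symbol, children=None, var=False):
    return {"symbol": symbol, "children": children or [], "var": var, "link": None}

def find(n):                                # follow links to the class representative
    while n["link"] is not None:
        n = n["link"]
    return n

def union(n, m):                            # bind representative n to m
    n["link"] = m
    return True

def occur(n, m):                            # occurs check: does variable node n occur below m?
    m = find(m)
    return n is m or any(occur(n, c) for c in m["children"])

def uf_unify(n, m):
    n, m = find(n), find(m)
    if n is m:
        return True
    if n["var"]:
        return False if occur(n, m) else union(n, m)
    if m["var"]:
        return uf_unify(m, n)
    if n["symbol"] != m["symbol"] or len(n["children"]) != len(m["children"]):
        return False
    union(n, m)                             # also bind the function nodes (the uf-unify idea)
    return all(uf_unify(find(a), find(b))
               for a, b in zip(n["children"], m["children"]))

X, Y, a, b = mk("X", var=True), mk("Y", var=True), mk("a"), mk("b")
print(uf_unify(mk("f", [X, a]), mk("f", [b, Y])))   # True: X is linked to b, Y to a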



15.1.4 Implementing First-Order Tableaux


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/26797.
We now come to some issues (and clarifications) pertaining to implementing proof search for free
variable tableaux. They all have to do with the – often overlooked – fact that T1f⊥ instantiates
the whole tableau.
The first question one may ask for implementation is whether we expect a terminating proof
search; after all, T0 terminated. We will see that the situation for T1f is different.

Termination and Multiplicity in Tableaux


 Recall: In T0 , all rules only needed to be applied once.
; T0 terminates and thus induces a decision procedure for PL0 .
 Observation 15.1.43. All T1f rules except T1f ∀ only need to be applied once.

 Example 15.1.44. A tableau proof for (p(a) ∨ p(b)) ⇒ (∃x.p(x)).

Start, close left branch:

   ((p(a) ∨ p(b)) ⇒ (∃x.p(x)))^F
   (p(a) ∨ p(b))^T
   (∃x.p(x))^F
   (∀x.¬p(x))^T
   ¬p(y)^T
   p(y)^F
   p(a)^T  |  p(b)^T
   ⊥ : [a/y]

Use T1f∀ again (right branch): after the substitution [a/y] the left branch is closed, and the
right branch continues with a fresh instance:

   p(b)^T
   ¬p(z)^T
   p(z)^F
   ⊥ : [b/z]

After we have used up p(y)^F by applying [a/y] in T1f⊥, we have to get a new instance p(z)^F
via T1f∀.

 Definition 15.1.45. Let T be a tableau for A, and a positive occurrence of ∀x.B


in A, then we call the number of applications of T1f ∀ to ∀x.B its multiplicity.
 Observation 15.1.46. Given a prescribed multiplicity for each positive ∀, satura-
tion with T1f terminates.

 Proof sketch: All T1f rules reduce the number of connectives and negative ∀ or the
multiplicity of positive ∀.
 Theorem 15.1.47. T1f is only complete with unbounded multiplicities.

 Proof sketch: Replace p(a) ∨ p(b) with p(a1 ) ∨ . . . ∨ p(an ) in ??.


 Remark: Otherwise validity in PL1 would be decidable.
 Implementation: We need an iterative multiplicity deepening process.

Michael Kohlhase: Artificial Intelligence 1 455 2025-02-06
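In code, the iterative multiplicity deepening could look like the following sketch (purely
illustrative; saturate_and_close is an assumed procedure that saturates the tableau with at most
mu instances per positive ∀ and reports whether all branches close).

# Hypothetical sketch: iterative deepening over the multiplicity bound.
def prove(formula, max_multiplicity=10):
    for mu in range(1, max_multiplicity + 1):
        if saturate_and_close(formula, mu):   # assumed helper, see lead-in
            return True                       # proof found with multiplicity mu
    return False                              # give up (PL1 validity is undecidable)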

The other thing we need to realize is that there may be multiple ways we can use T1f⊥ to close a
branch in a tableau, and – as T1f⊥ instantiates the whole tableau and not just the branch itself –
this choice matters.

Treating T1f⊥

 Recall: The T1f⊥ rule instantiates the whole tableau.

 Problem: There may be more than one T1f⊥ opportunity on a branch.

 Example 15.1.48. Choosing which matters – this tableau does not close!
   (∃x.(p(a) ∧ p(b) ⇒ p(x)) ∧ (q(b) ⇒ q(x)))^F
   ((p(a) ∧ p(b) ⇒ p(y)) ∧ (q(b) ⇒ q(y)))^F
   (p(a) ∧ p(b) ⇒ p(y))^F  |  (q(b) ⇒ q(y))^F
   p(a)^T                  |  q(b)^T
   p(b)^T                  |  q(y)^F
   p(y)^F                  |
   ⊥ : [a/y]               |

choosing the other T1f⊥ in the left branch allows closure.

 Idea: Two ways of systematic proof search in T1f :

 backtracking search over T1f⊥ opportunities


 saturate without T1f⊥ and find spanning matings (next slide)

Michael Kohlhase: Artificial Intelligence 1 456 2025-02-06

The method of spanning matings follows the intuition that if we do not have good information
on how to decide which pair of opposite literals on a branch to use in T1f⊥, we delay the choice by
initially disregarding the rule altogether during saturation and then – in a later phase – looking
for a configuration of cuts that have a joint overall unifier. The big advantage of this is that we
only need to know that such a unifier exists; we do not need to compute or apply it, which would
lead to the exponential blow-up we have seen above.

Spanning Matings for T1f⊥

 Observation 15.1.49. T1f without T1f⊥ is terminating and confluent for given
multiplicities.
 Idea: Saturate without T1f⊥ and treat all cuts at the same time (later).
 Definition 15.1.50.
Let T be a T1f tableau, then we call a unification problem E := A1=?B1 ∧ . . . ∧
An=?Bn a mating for T , iff Ai^T and Bi^F occur in the same branch in T .
We say that E is a spanning mating, if E is unifiable and every branch B of T
contains Ai^T and Bi^F for some i.
 Theorem 15.1.51. A T1f -tableau with a spanning mating induces a closed T1
tableau.

 Proof sketch: Just apply the unifier of the spanning mating.



 Idea: Existence is sufficient, we do not need to compute the unifier.


 Implementation: Saturate without T1f⊥, backtracking search for spanning mat-
ings with DU, adding pairs incrementally.

Michael Kohlhase: Artificial Intelligence 1 457 2025-02-06
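The backtracking search for a spanning mating can be sketched as follows (hypothetical Python,
illustrative only; unifiable is an assumed helper that checks whether a list of literal pairs has
a joint unifier, e.g. with DU).

# Hypothetical sketch: backtracking search for a spanning mating.
# branches: one list of signed literals (atom, sign) per branch of the saturated tableau.
def spanning_mating(branches, chosen=None):
    chosen = chosen or []
    if len(chosen) == len(branches):           # every branch already has a cut: spanning
        return chosen
    branch = branches[len(chosen)]
    for (a, sa) in branch:
        for (b, sb) in branch:
            if sa and not sb:                  # a positive and a negative literal on this branch
                candidate = chosen + [(a, b)]
                if unifiable(candidate):       # incremental joint unifiability check (assumed)
                    result = spanning_mating(branches, candidate)
                    if result is not None:
                        return result          # spanning mating found
    return None                                # backtrack; maybe increase the multiplicity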

Excursion: Now that we understand basic unification theory, we can come to the meta-theoretical
properties of the tableau calculus. We delegate this discussion to ??.

15.2 First-Order Resolution


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26817.

First-Order Resolution (and CNF)


 Definition 15.2.1. The first-order CNF calculus CNF1 is given by the inference
rules of CNF0 extended by the following quantifier rules:
   (∀X.A)^T ∨ C    Z ̸∈ (free(A) ∪ free(C))
   ----------------------------------------
            ([Z/X](A))^T ∨ C

   (∀X.A)^F ∨ C    {X1, . . ., Xk} = free(∀X.A)    f ∈ Σ^sk_k new
   --------------------------------------------------------------
            ([f(X1, . . ., Xk)/X](A))^F ∨ C
the first-order CNF CNF1 (Φ) of Φ is the set of all clauses that can be derived from
Φ.

 Definition 15.2.2 (First-Order Resolution Calculus). The First-order resolution


calculus (R1 ) is a test calculus that manipulates formulae in conjunctive normal
form. R1 has two inference rules:

   A^T ∨ C    B^F ∨ D    σ = mgu(A, B)            A^α ∨ B^α ∨ C    σ = mgu(A, B)
   ------------------------------------           ------------------------------
            (σ(C)) ∨ (σ(D))                              (σ(A))^α ∨ (σ(C))

Michael Kohlhase: Artificial Intelligence 1 458 2025-02-06
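To illustrate the first inference rule, here is a hypothetical Python sketch (not from the notes)
of one resolution step; literals are (sign, atom) pairs, atoms are nested tuples, and mgu(a, b) is
an assumed helper returning a substitution dictionary or None.

# Hypothetical sketch of the binary resolution rule of R1.
def subst(sigma, atom):
    if isinstance(atom, str):                       # variable or constant symbol
        return subst(sigma, sigma[atom]) if atom in sigma else atom
    return tuple(subst(sigma, a) for a in atom)     # application f(t1,...,tn) as a tuple

def resolve(c1, c2):                                # c1, c2: frozensets of (sign, atom) literals
    resolvents = []
    for (s1, a1) in c1:
        for (s2, a2) in c2:
            if s1 != s2:                            # one positive, one negative occurrence
                sigma = mgu(a1, a2)                 # assumed helper, see lead-in
                if sigma is not None:
                    rest = (c1 - {(s1, a1)}) | (c2 - {(s2, a2)})
                    resolvents.append(frozenset((s, subst(sigma, a)) for (s, a) in rest))
    return resolvents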

First-Order CNF – Derived Rules


 Definition 15.2.3. The following inference rules are derivable from the ones above
via (∃X.A) = ¬(∀X.¬A):
   (∃X.A)^T ∨ C    {X1, . . ., Xk} = free(∃X.A)    f ∈ Σ^sk_k new
   --------------------------------------------------------------
            ([f(X1, . . ., Xk)/X](A))^T ∨ C

   (∃X.A)^F ∨ C    Z ̸∈ (free(A) ∪ free(C))
   ----------------------------------------
            ([Z/X](A))^F ∨ C

Michael Kohlhase: Artificial Intelligence 1 459 2025-02-06



Excursion: Again, we relegate the meta-theoretical properties of the first-order resolution calculus
to ??.

15.2.1 Resolution Examples

Col. West, a Criminal?


 Example 15.2.4. From [RN09]
The law says it is a crime for an American to sell weapons to hostile nations.
The country Nono, an enemy of America, has some missiles, and all of its
missiles were sold to it by Colonel West, who is American.
Prove that Col. West is a criminal.
 Remark: Modern resolution theorem provers prove this in less than 50ms.

 Problem: That is only true, if we only give the theorem prover exactly the right
laws and background knowledge. If we give it all of them, it drowns in the combi-
natorial explosion.
 Let us build a resolution proof for the claim above.
 But first we must translate the situation into first-order logic clauses.

 Convention: In what follows, for better readability we will sometimes write impli-
cations P ∧ Q ∧ R ⇒ S instead of clauses P^F ∨ Q^F ∨ R^F ∨ S^T.

Michael Kohlhase: Artificial Intelligence 1 460 2025-02-06

Col. West, a Criminal?


 It is a crime for an American to sell weapons to hostile nations:
Clause: ami(X1 ) ∧ weap(Y1 ) ∧ sell(X1 , Y1 , Z1 ) ∧ host(Z1 ) ⇒ crook(X1 )
 Nono has some missiles: ∃X.own(NN, X) ∧ mle(X)
Clauses: own(NN, c)^T and mle(c)^T (c is a Skolem constant)
 All of Nono’s missiles were sold to it by Colonel West.
Clause: mle(X2 ) ∧ own(NN, X2 ) ⇒ sell(West, X2 , NN)
 Missiles are weapons:
Clause: mle(X3 ) ⇒ weap(X3 )
 An enemy of America counts as “hostile”:
Clause: enmy(X4 , USA) ⇒ host(X4 )

 West is an American:
Clause: ami(West)
 The country Nono is an enemy of America:
enmy(NN, USA)

Michael Kohlhase: Artificial Intelligence 1 461 2025-02-06



Col. West, a Criminal! PL1 Resolution Proof

The proof starts from the negated goal crook(West)^F and the clauses from the previous slide
(written here with the full predicate names used in the proof):

 1. Resolve crook(West)^F with ami(X1)^F ∨ weapon(Y1)^F ∨ sell(X1,Y1,Z1)^F ∨ hostile(Z1)^F ∨ crook(X1)^T
    under [West/X1]:  ami(West)^F ∨ weapon(Y1)^F ∨ sell(West,Y1,Z1)^F ∨ hostile(Z1)^F
 2. Resolve with ami(West)^T:  weapon(Y1)^F ∨ sell(West,Y1,Z1)^F ∨ hostile(Z1)^F
 3. Resolve with missile(X3)^F ∨ weapon(X3)^T under [Y1/X3]:
    missile(Y1)^F ∨ sell(West,Y1,Z1)^F ∨ hostile(Z1)^F
 4. Resolve with missile(c)^T under [c/Y1]:  sell(West,c,Z1)^F ∨ hostile(Z1)^F
 5. Resolve with missile(X2)^F ∨ own(NoNo,X2)^F ∨ sell(West,X2,NoNo)^T under [c/X2] and [NoNo/Z1]:
    missile(c)^F ∨ own(NoNo,c)^F ∨ hostile(NoNo)^F
 6. Resolve with missile(c)^T:  own(NoNo,c)^F ∨ hostile(NoNo)^F
 7. Resolve with own(NoNo,c)^T:  hostile(NoNo)^F
 8. Resolve with enemy(X4,USA)^F ∨ hostile(X4)^T under [NoNo/X4]:  enemy(NoNo,USA)^F
 9. Resolve with enemy(NoNo,USA)^T:  the empty clause.

Michael Kohlhase: Artificial Intelligence 1 462 2025-02-06

Curiosity Killed the Cat?


 Example 15.2.5. From [RN09]

Everyone who loves all animals is loved by someone.


Anyone who kills an animal is loved by noone.
Jack loves all animals.
Cats are animals.
Either Jack or curiosity killed the cat (whose name is “Garfield”).

Prove that curiosity killed the cat.

Michael Kohlhase: Artificial Intelligence 1 463 2025-02-06

Curiosity Killed the Cat? Clauses


 Everyone who loves all animals is loved by someone:
∀X.(∀Y.animal(Y) ⇒ love(X, Y)) ⇒ (∃Z.love(Z, X))
Clauses: animal(f(X1))^T ∨ love(g(X1), X1)^T and love(X2, f(X2))^F ∨ love(g(X2), X2)^T
 Anyone who kills an animal is loved by no one:
∀X.(∃Y.animal(Y) ∧ kill(X, Y)) ⇒ (∀Z.¬love(Z, X))
Clause: animal(Y3)^F ∨ kill(X3, Y3)^F ∨ love(Z3, X3)^F

 Jack loves all animals:
Clause: animal(X4)^F ∨ love(jack, X4)^T
 Cats are animals:
Clause: cat(X5)^F ∨ animal(X5)^T

 Either Jack or curiosity killed the cat (whose name is “Garfield”):
Clauses: kill(jack, garf)^T ∨ kill(curiosity, garf)^T and cat(garf)^T

Michael Kohlhase: Artificial Intelligence 1 464 2025-02-06

Curiosity Killed the Cat! PL1 Resolution Proof

The proof uses the abbreviations anl for animal, curty for curiosity, and garf for Garfield;
the negated goal is kill(curty, garf)^F.

 1. Resolve cat(garf)^T with cat(X5)^F ∨ anl(X5)^T under [garf/X5]:  anl(garf)^T
 2. Resolve with anl(Y3)^F ∨ kill(X3, Y3)^F ∨ love(Z3, X3)^F under [garf/Y3]:
    kill(X3, garf)^F ∨ love(Z3, X3)^F
 3. Resolve kill(jack, garf)^T ∨ kill(curty, garf)^T with the goal kill(curty, garf)^F:
    kill(jack, garf)^T
 4. Resolve the clause from step 2 with kill(jack, garf)^T under [jack/X3]:  love(Z3, jack)^F
 5. Resolve love(X2, f(X2))^F ∨ love(g(X2), X2)^T with anl(X4)^F ∨ love(jack, X4)^T
    under [jack/X2] and [f(jack)/X4]:  love(g(jack), jack)^T ∨ anl(f(jack))^F
 6. Resolve with anl(f(X1))^T ∨ love(g(X1), X1)^T under [jack/X1] (and factoring):
    love(g(jack), jack)^T
 7. Resolve with love(Z3, jack)^F under [g(jack)/Z3]:  the empty clause.

Michael Kohlhase: Artificial Intelligence 1 465 2025-02-06

Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
the course, but provide one for the calculi introduced so far in ??.

15.3 Logic Programming as Resolution Theorem Proving


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26820.
To understand Prolog better, we can interpret the language of Prolog as resolution in PL1 .

We know all this already


 Goals, goal sets, rules, and facts are just clauses. (called Horn clauses)

 Observation 15.3.1 (Rule). H:−B1,. . .,Bn. corresponds to H^T ∨ B1^F ∨ . . . ∨ Bn^F
(head the only positive literal)

 Observation 15.3.2 (Goal set). ?− G1,. . .,Gn. corresponds to G1^F ∨ . . . ∨ Gn^F

 Observation 15.3.3 (Fact). F. corresponds to the unit clause F^T.

 Definition 15.3.4. A Horn clause is a clause with at most one positive literal.
 Recall: Backchaining as search:
 state = tuple of goals; goal state = empty list (of goals).
 next(⟨G, R1 , . . ., Rl ⟩) := ⟨σ(B 1 ), . . ., σ(B m ), σ(R1 ), . . ., σ(Rl )⟩ if there is a
rule H:−B 1 ,. . ., B m . and a substitution σ with σ(H) = σ(G).
 Note: Backchaining becomes resolution

      P^T ∨ A    P^F ∨ B
      ------------------
             A ∨ B

positive, unit-resulting hyperresolution (PURR)

Michael Kohlhase: Artificial Intelligence 1 466 2025-02-06
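The backchaining search described above can be sketched in Python as follows (hypothetical and
simplified: unify and apply are assumed helpers for substitutions as dictionaries, and we gloss
over renaming rule variables apart).

# Hypothetical sketch: backchaining over Horn clauses as search on goal lists.
# A rule is (head, [body atoms]); facts have an empty body; a program is a list of rules.
def backchain(goals, program, sigma=None):
    sigma = sigma or {}
    if not goals:
        return sigma                                  # empty goal list: success, answer substitution
    goal, rest = goals[0], goals[1:]
    for head, body in program:
        tau = unify(apply(sigma, goal), head)         # assumed helpers, see lead-in
        if tau is not None:
            result = backchain(list(body) + rest, program, {**sigma, **tau})
            if result is not None:
                return result
    return None                                       # no rule applies: fail and backtrack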

This observation helps us understand Prolog better, and use implementation techniques from
automated theorem proving.

PROLOG (Horn Logic)


 Definition 15.3.5. A clause is called a Horn clause, iff it contains at most one
positive literal, i.e. if it is of the form B1^F ∨ . . . ∨ Bn^F ∨ A^T – i.e. A:−B1,. . .,Bn.
in Prolog notation.
 Rule clause: general case, e.g. fallible(X) :− human(X).
 Fact clause: no negative literals, e.g. human(sokrates).
 Program: set of rule and fact clauses.
 Query: no positive literals: e.g. ?− fallible(X),greek(X).

 Definition 15.3.6. Horn logic is the formal system whose language is the set of
Horn clauses together with the calculus H given by MP, ∧I, and Subst.
 Definition 15.3.7. A logic program P entails a query Q with answer substitution
σ, iff there is a H derivation D of Q from P and σ is the combined substitution of
the Subst instances in D.

Michael Kohlhase: Artificial Intelligence 1 467 2025-02-06

PROLOG: Our Example


 Program:
human(leibniz).
human(sokrates).
greek(sokrates).
fallible(X):−human(X).

 Example 15.3.8 (Query). ?− fallible(X),greek(X).


 Answer substitution: [sokrates/X]

Michael Kohlhase: Artificial Intelligence 1 468 2025-02-06

To gain an intuition for this quite abstract definition let us consider a concrete knowledge base
about cars. Instead of writing down everything we know about cars, we only write down that cars
are motor vehicles with four wheels and that a particular object c has a motor and four wheels. We
can see that the fact that c is a car can be derived from this. Given our definition of a knowledge
base as the deductive closure of the facts and rules explicitly written down, the assertion that c is
a car is in the induced knowledge base, which is what we are after.

Knowledge Base (Example)


 Example 15.3.9. car(c). is in the knowledge base generated by
has_motor(c).
has_wheels(c,4).
car(X):− has_motor(X),has_wheels(X,4).

   m(c)    w(c, 4)                    m(x) ∧ w(x, 4) ⇒ car(x)
   ---------------- ∧I                ------------------------ Subst
   m(c) ∧ w(c, 4)                     m(c) ∧ w(c, 4) ⇒ car(c)
   ----------------------------------------------------------- MP
                              car(c)

Michael Kohlhase: Artificial Intelligence 1 469 2025-02-06

In this very simple example car(c) is about the only fact we can derive, but in general, knowledge
bases can be infinite (we will see examples below).

Why Only Horn Clauses?


 General clauses of the form A1,. . .,An :− B1,. . .,Bn.

 e.g. greek(sokrates),greek(perikles)
 Question: Are there fallible greeks?
 Indefinite answer: Yes, Perikles or Sokrates
 Warning: how about Sokrates and Perikles?

 e.g. greek(sokrates),roman(sokrates):−.
 Query: Are there fallible greeks?
 Answer: Yes, Sokrates, if he is not a roman
 Is this abduction?????

Michael Kohlhase: Artificial Intelligence 1 470 2025-02-06

Three Principal Modes of Inference


 Definition 15.3.10. Deduction =̂ knowledge extension

 Example 15.3.11.    rains ⇒ wet_street    rains
                     ----------------------------- D
                               wet_street

 Definition 15.3.12. Abduction =̂ explanation

 Example 15.3.13.    rains ⇒ wet_street    wet_street
                     --------------------------------- A
                                   rains

 Definition 15.3.14. Induction =̂ learning general rules from examples

 Example 15.3.15.    wet_street    rains
                     -------------------- I
                     rains ⇒ wet_street

Michael Kohlhase: Artificial Intelligence 1 471 2025-02-06

15.4 Summary: ATP in First-Order Logic

Summary: ATP in First-Order Logic


 The propositional calculi for ATP can be extended to first-order logic by adding
quantifier rules.
 The rule for the universal quantifier can be made efficient by introducing metavari-
ables that postpone the decision for instances.
 We have to extend the witness constants in the rules for existential quantifiers
to Skolem functions.
 The cut rules can be used to instantiate the metavariables by unification.
These ideas are enough to build a tableau calculus for first-order logic.
 Unification is an efficient decision procedure for finding substitutions that make first-
order terms (syntactically) equal.
 In prenex normal form, all quantifiers are up front. In Skolem normal form, addi-
tionally there are no existential quantifiers. In clause normal form, additionally the
formula is in CNF.

 Any PL1 formula can efficiently be brought into a satisfiability-equivalent clause


normal form.
 This allows first-order resolution.

Michael Kohlhase: Artificial Intelligence 1 472 2025-02-06


Chapter 16

Knowledge Representation and the


Semantic Web

The field of “Knowledge Representation” is usually taken to be an area in Artificial Intelligence


that studies the representation of knowledge in formal systems and how to leverage inference
techniques to generate new knowledge items from existing ones. Note that this definition
coincides with what we know as logical systems in this course. This is the view taken by
the subfield of “description logics”, but restricted to the case, where the logical systems have an
entailment relation to ensure applicability. This chapter is organized as follows. We will first
give a general introduction to the concepts of knowledge representation using semantic networks
– an early and very intuitive approach to knowledge representation – as an object-to-think-with.
In ?? we introduce the principles and services of logic-based knowledge-representation using a
non-standard interpretation of propositional logic as the basis, this gives us a formal account of
the taxonomic part of semantic networks. In ?? we introduce the logic ALC that adds relations
(called “roles”) and restricted quantification and thus gives us the full expressive power of semantic
networks. Thus ALC can be seen as a prototype description logic. In ?? we show how description
logics are applied as the basis of the “semantic web”.

16.1 Introduction to Knowledge Representation


A Video Nugget covering the introduction to knowledge representation can be found at https:
//fau.tv/clip/id/27279.
Before we start into the development of description logics, we set the stage by looking into
alternatives for knowledge representation.

16.1.1 Knowledge & Representation


To approach the question of knowledge representation, we first have to ask ourselves, what
knowledge might be. This is a difficult question that has kept philosophers occupied for millennia.
We will not answer this question in this course, but only allude to and discuss some aspects that
are relevant to our cause of knowledge representation.

What is knowledge? Why Representation?


 Lots/all of (academic) disciplines deal with knowledge!
 According to Probst/Raub/Romhardt [PRR97]


 For the purposes of this course: Knowledge is the information necessary to


support intelligent reasoning!

   representation    can be used to determine
   set of words      whether a word is admissible
   list of words     the rank of a word
   a lexicon         translation and/or grammatical function
   structure         function

Michael Kohlhase: Artificial Intelligence 1 473 2025-02-06

According to an influential view of [PRR97], knowledge appears in layers. Starting with a character
set that defines a set of glyphs, we can add syntax that turns mere strings into data. Adding context
information gives information, and finally, relating the information to other information allows us
to draw conclusions, turning information into knowledge.
Note that we already have aspects of representation and function in the diagram at the top of the
slide. In this, the additional functionality added in the successive layers gives the representations
more and more functions, until we reach the knowledge level, where the function is given by infer-
encing. In the second example, we can see that representations determine possible functions.
Let us now strengthen our intuition about knowledge by contrasting knowledge representations
from “regular” data structures in computation.

Knowledge Representation vs. Data Structures


 Idea: Representation as structure and function.

 the representation determines the content theory (what is the data?)


 the function determines the process model (what do we do with the data?)
 Question: Why do we use the term “knowledge representation” rather than
 data structures? (sets, lists, ... above)
 information representation? (it is information)
 Answer: No good reason other than AI practice, with the intuition that
 data is simple and general (supports many algorithms)
 knowledge is complex (has distinguished process model)

Michael Kohlhase: Artificial Intelligence 1 474 2025-02-06

As knowledge is such a central notion in artificial intelligence, it is not surprising that there are
multiple approaches to dealing with it. We will only deal with the first one and leave the others
to self-study.

Some Paradigms for Knowledge Representation in AI/NLP

 GOFAI (good old-fashioned AI)


 symbolic knowledge representation, process model based on heuristic search
 Statistical, corpus-based approaches.

 symbolic representation, process model based on machine learning


 knowledge is divided into symbolic- and statistical (search) knowledge
 The connectionist approach
 sub-symbolic representation, process model based on primitive processing ele-
ments (nodes) and weighted links
 knowledge is only present in activation patterns, etc.

Michael Kohlhase: Artificial Intelligence 1 475 2025-02-06

When assessing the relative strengths of the respective approaches, we should evaluate them with
respect to a pre-determined set of criteria.

KR Approaches/Evaluation Criteria
 Definition 16.1.1. The evaluation criteria for knowledge representation approaches
are:
 Expressive adequacy: What can be represented, what distinctions are supported.
 Reasoning efficiency: Can the representation support processing that generates
results in acceptable speed?
 Primitives: What are the primitive elements of representation, are they intuitive,
cognitively adequate?
 Meta representation: Knowledge about knowledge
 Completeness: The problems of reasoning with knowledge that is known to be
incomplete.

Michael Kohlhase: Artificial Intelligence 1 476 2025-02-06

16.1.2 Semantic Networks


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27280.
To get a feeling for early knowledge representation approaches from which description logics
developed, we take a look at “semantic networks” and contrast them to logical approaches.
Semantic networks are a very simple way of arranging knowledge about objects and concepts and
their relationships in a graph.

Semantic Networks [CQ69]


 Definition 16.1.2. A semantic network is a directed graph for representing knowl-
edge:

 nodes represent objects and concepts (classes of objects)


(e.g. John (object) and bird (concept))
 edges (called links) represent relations between these (isa, father_of,
belongs_to)

 Example 16.1.3. A semantic network for birds and persons:

   [Figure: a graph with nodes bird, robin, wings, Jack, John, Mary, and Person, and labeled
   links robin −isa→ bird, Jack −inst→ robin, John −inst→ Person, Mary −inst→ Person,
   bird −has_part→ wings, John −owner_of→ Jack, and John −loves→ Mary.]

 Problem: How do we derive new information from such a network?


 Idea: Encode taxonomic information about objects and concepts in special links
(“isa” and “inst”) and specify property inheritance along them in the process model.

Michael Kohlhase: Artificial Intelligence 1 477 2025-02-06

Even though the network in ?? is very intuitive (we immediately understand the concepts de-
picted), it is unclear how we (and more importantly a machine that does not associate meaning
with the labels of the nodes and edges) can draw inferences from the “knowledge” represented.

Deriving Knowledge Implicit in Semantic Networks


 Observation 16.1.4. There is more knowledge in a semantic network than is
explicitly written down.
 Example 16.1.5. In the network below, we “know” that robins have wings and in
particular, Jack has wings.

   [Figure: the semantic network from Example 16.1.3 again.]

 Idea: Links labeled with “isa” and “inst” are special: they propagate properties
encoded by other links.
 Definition 16.1.6. We call links labeled by
 “isa” an inclusion or isa link (inclusion of concepts)
 “inst” instance or inst link (concept membership)

Michael Kohlhase: Artificial Intelligence 1 478 2025-02-06

We now make the idea of “propagating properties” rigorous by defining the notion of derived
relations, i.e. the relations that are left implicit in the network, but can be added without changing
its meaning.

Deriving Knowledge Semantic Networks


 Definition 16.1.7 (Inference in Semantic Networks). We call all link labels
except “inst” and “isa” in a semantic network relations.
Let N be a semantic network and R a relation in N such that A −isa→ B −R→ C or
A −inst→ B −R→ C, then we can derive a relation A −R→ C in N .
The process of deriving new concepts and relations from existing ones is called
inference, and concepts/relations that are only available via inference are called
implicit (in a semantic network).
 Intuition: Derived relations represent knowledge that is implicit in the network;
they could be added, but usually are not to avoid clutter.

 Example 16.1.8. Derived relations in ??

   [Figure: the network from Example 16.1.3 with the derived links added: robin −has_part→ wings
   (via robin −isa→ bird) and Jack −has_part→ wings (via Jack −inst→ robin).]

 Slogan: Get out more knowledge from a semantic network than you put in.

Michael Kohlhase: Artificial Intelligence 1 479 2025-02-06
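The derivation rule above is easy to implement; the following hypothetical Python sketch (not from
the notes) computes the derived relations of a semantic network given as a set of labeled edges.

# Hypothetical sketch: deriving implicit relations in a semantic network.
# The network is a set of triples (source, label, target); "isa" and "inst" are special labels.
def derived_relations(network):
    derived = set(network)
    changed = True
    while changed:                                    # iterate to a fixed point
        changed = False
        for (a, l, b) in list(derived):
            if l in ("isa", "inst"):                  # propagate relations along isa/inst links
                for (b2, r, c) in list(derived):
                    if b2 == b and r not in ("isa", "inst") and (a, r, c) not in derived:
                        derived.add((a, r, c))
                        changed = True
    return derived - set(network)                     # only the newly derived triples

net = {("robin", "isa", "bird"), ("Jack", "inst", "robin"),
       ("bird", "has_part", "wings"), ("John", "loves", "Mary")}
print(derived_relations(net))
# yields ('robin', 'has_part', 'wings') and ('Jack', 'has_part', 'wings')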

Note that ?? does not quite allow us to derive that Jack is a bird (did you spot that “isa” is not a
relation that can be inferred?), even though we know it is true in the world. This shows us that
inference in semantic networks has to be very carefully defined and may not be “complete”, i.e.
there are things that are true in the real world that our inference procedure does not capture.
Dually, if we are not careful, then the inference procedure might derive properties that are not
true in the real world even if all the properties explicitly put into the network are. We call such
an inference procedure unsound or incorrect.
These are two general phenomena we have to keep an eye on.
Another problem is that semantic networks (e.g. in ??) confuse two kinds of entities: individuals
(represented by proper names like John and Jack) and concepts (nouns like robin and bird). Even
though the isa and inst link already acknowledge this distinction, the “has_part” and “loves”
relations are at different levels entirely, but not distinguished in the networks.

Terminologies and Assertions


 Remark 16.1.9. We should distinguish concepts from objects.
 Definition 16.1.10. We call the subgraph of a semantic network N spanned by the
isa links and relations between concepts the terminology (or TBox, or the famous
Isa Hierarchy) and the subgraph spanned by the inst links and relations between
objects, the assertions (together the ABox) of N .
 Example 16.1.11. In this semantic network we keep objects and concepts apart nota-
tionally:

   [Figure: a semantic network split into a TBox and an ABox. The TBox contains the concepts
   animal, amoeba, higher animal, tiger, and elephant with isa links between them and relations
   such as can (move), has_part (legs, head), pattern (striped), and color (gray); the ABox
   contains the objects Roy, Rex, and Clyde with inst links into these concepts and relations
   such as eat.]

In particular we have objects “Rex”, “Roy”, and “Clyde”, which have (derived) rela-
tions (e.g. Clyde is gray).

Michael Kohlhase: Artificial Intelligence 1 480 2025-02-06

But there are severe shortcomings of semantic networks: the suggestive shape and node names
give (humans) a false sense of meaning, and the inference rules are only given in the process model
(the implementation of the semantic network processing system).
This makes it very difficult to assess the strength of the inference system and make assertions
e.g. about completeness.

Limitations of Semantic Networks


 What is the meaning of a link?
 link labels are very suggestive (misleading for humans)
 meaning of link types defined in the process model (no denotational semantics)
 Problem: No distinction of optional and defining traits!

 Example 16.1.12. Consider a robin that has lost its wings in an accident:

   [Figure: two copies of the network bird −has_part→ wings, robin −isa→ bird. In the left one
   jack −inst→ robin inherits the wings; in the right one joe −inst→ robin, and a “cancel” link
   is attached to block the inherited has_part relation.]

“Cancel-links” have been proposed, but their status and process model are debatable.

Michael Kohlhase: Artificial Intelligence 1 481 2025-02-06

To alleviate the perceived drawbacks of semantic networks, we can contemplate another notation
that is more linear and thus more easily implemented: function/argument notation.

Another Notation for Semantic Networks


 Definition 16.1.13. Function/argument notation for semantic networks
 interprets nodes as arguments (reification to individuals)
 interprets links as functions (predicates actually)
 Example 16.1.14.

   [Figure: the semantic network from Example 16.1.3 next to its function/argument form:]

      isa(robin,bird)
      haspart(bird,wings)
      inst(Jack,robin)
      owner_of(John, robin)
      loves(John,Mary)

 Evaluation:
+ linear notation (equivalent, but better to implement on a computer)
+ easy to give process model by deduction (e.g. in Prolog)
– worse locality properties (networks are associative)

Michael Kohlhase: Artificial Intelligence 1 482 2025-02-06

Indeed the function/argument notation is the immediate idea how one would naturally represent
semantic networks for implementation.
This notation has been also characterized as subject/predicate/object triples, alluding to simple
(English) sentences. This will play a role in the “semantic web” later.
Building on the function/argument notation from above, we can now give a formal semantics for
semantic network: we translate them into first-order logic and use the semantics of that.

A Denotational Semantics for Semantic Networks


 Observation: If we handle isa and inst links specially in function/argument nota-
tion
   [Figure: the semantic network from Example 16.1.3 next to its translation:]

      robin ⊆ bird
      haspart(bird,wings)
      Jack ∈ robin
      owner_of(John, Jack)
      loves(John,Mary)
it looks like first-order logic, if we take
 a ∈ S to mean S(a) for an object a and a concept S.
 A ⊆ B to mean ∀X.A(X) ⇒ B(X) for concepts A and B
 R(A, B) to mean ∀X.A(X) ⇒ (∃Y .B(Y ) ∧ R(X, Y )) for a relation R.
 Idea: Take first-order deduction as process model (gives inheritance for free)

Michael Kohlhase: Artificial Intelligence 1 483 2025-02-06

Indeed, the semantics induced by the translation to first-order logic, gives the intuitive meaning to
the semantic networks. Note that this only holds only for the features of semantic networks that
are representable in this way, e.g. the “cancel links” shown above are not (and that is a feature,
not a bug).
But even more importantly, the translation to first-order logic gives a first process model: we
can use first-order inference to compute the set of inferences that can be drawn from a semantic
network.
Before we go on, let us have a look at an important application of knowledge representation
technologies: the semantic web.

16.1.3 The Semantic Web


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27281.
We will now define the term semantic web and discuss the pertinent ideas involved. There are two
central ones that we will cover here:
• Information and data come in different levels of explicitness; this is usually visualized by a
“ladder” of information.
• if information is sufficiently machine-understandable, then we can automate drawing conclu-
sions.

The Semantic Web


 Definition 16.1.15. The semantic web is the result of including semantic content
in web pages with the aim of converting the WWW into a machine-understandable
“web of data”, where inference-based services can add value to the ecosystem.
 Idea: Move web content up the ladder, use inference to make connections.

 Example 16.1.16. Information not explicitly represented (in one place)


Query: Who was US president when Barack Obama was born?
Google: . . . BIRTH DATE: August 04, 1961. . .
Query: Who was US president in 1961?
Google: President: Dwight D. Eisenhower [. . . ] John F. Kennedy (starting Jan. 20.)

Humans understand the text and combine the information to get the answer. Ma-
chines need more than just text ; semantic web technology.

Michael Kohlhase: Artificial Intelligence 1 484 2025-02-06

The term “semantic web” was coined by Tim Berners Lee in analogy to semantic networks, only
applied to the world wide web. And as for semantic networks, we want inference processes
that allow us to recover information that is not explicitly represented from the network (here the
world-wide web).
To see that problems have to be solved, to arrive at the semantic web, we will now look at a
concrete example about the “semantics” in web pages. Here is one that looks typical enough.

What is the Information a User sees?


 Example 16.1.17. Take the following web-site with a conference announcement

WWW2002
The eleventh International World Wide Web Conference
Sheraton Waikiki Hotel
Honolulu, Hawaii, USA

7-11 May 2002

Registered participants coming from


Australia, Canada, Chile Denmark, France, Germany, Ghana, Hong Kong, In-
dia,
Ireland, Italy, Japan, Malta, New Zealand, The Netherlands, Norway,
Singapore, Switzerland, the United Kingdom, the United States, Vietnam, Zaire

On the 7th May Honolulu will provide the backdrop of the eleventh
International World Wide Web Conference.

Speakers confirmed
Tim Berners-Lee: Tim is the well known inventor of the Web,
Ian Foster: Ian is the pioneer of the Grid, the next generation internet.

Michael Kohlhase: Artificial Intelligence 1 485 2025-02-06

But as for semantic networks, what you as a human can see (“understand” really) is deceptive, so
let us obfuscate the document to confuse your “semantic processor”. This gives an impression of
what the computer “sees”.

What the machine sees


 Example 16.1.18. Here is what the machine “sees” from the conference announce-
ment:
WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉
S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕
H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA
7↖∞∞M⊣†∈′′∈

R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉

O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨
I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙

S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊⇔
I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔↙

Michael Kohlhase: Artificial Intelligence 1 486 2025-02-06

Obviously, there is not much the computer understands, and as a consequence, there is not a lot
the computer can support the reader with. So we have to “help” the computer by providing some
meaning. Conventional wisdom is that we add some semantic/functional markup. Here we pick
XML without loss of generality, and characterize some fragments of text e.g. as dates.

Solution: XML markup with “meaningful” Tags


 Example 16.1.19. Let’s annotate (parts of) the meaning via XML markup
<title>WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉</title>
<place>S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA</place>
<date>7↖∞∞M⊣†∈′′∈</date>
<participants>R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
</participants>
<introduction>O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇↖
\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙</introduction>
<program>S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
<speaker>T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊</speaker>
<speaker>I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔<speaker>
</program>

Michael Kohlhase: Artificial Intelligence 1 487 2025-02-06

But does this really help? Is conventional wisdom correct?

What can we do with this?


 Example 16.1.20. Consider the following fragments:

ℜ⊔⟩⊔↕⌉⊤WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉ℜ∝⊔⟩⊔↕⌉⊤
ℜ√↕⊣⌋⌉⊤S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USAℜ∝√↕⊣⌋⌉⊤
ℜ⌈⊣⊔⌉⊤7↖∞∞M⊣†∈′′∈ℜ∝⌈⊣⊔⌉⊤

Given the markup above, a machine agent can

 parse 7∞∞M⊣†∈′′∈ as the date range 7-11 May 2002 and add this to the user’s calendar,
 parse S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA as a destination and find flights.
 But: do not be deceived by your ability to understand English!

Michael Kohlhase: Artificial Intelligence 1 488 2025-02-06

To understand what a machine can understand we have to obfuscate the markup as well, since it
does not carry any intrinsic meaning to the machine either.

What the machine sees of the XML


 Example 16.1.21. Here is what the machine sees of the XML
<title>WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉</⊔⟩⊔↕⌉>

<√↕⊣⌋⌉>S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA</√↕⊣⌋⌉>
<⌈⊣⊔⌉>7↖∞∞M⊣†∈′′∈</⌈⊣⊔⌉>
<√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
</√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >
<⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣↖
⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙</⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>
<√∇≀}∇⊣⇕>S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
<∫√⌉⊣∥⌉∇>T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊</∫√⌉⊣∥⌉∇>
<∫√⌉⊣∥⌉∇>I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔<∫√⌉⊣∥⌉∇>
</√∇≀}∇⊣⇕>

Michael Kohlhase: Artificial Intelligence 1 489 2025-02-06

So we have not really gained much with the markup either; we really have to give meaning to the
markup as well, and this is where techniques from the semantic web come into play.
To understand how we can make the web more semantic, let us first take stock of the current status
of (markup on) the web. It is well-known that world-wide-web is a hypertext, where multimedia
documents (text, images, videos, etc. and their fragments) are connected by hyperlinks. As we
have seen, all of these are largely opaque (non-understandable), so we end up with the following
situation (from the viewpoint of a machine).

The Current Web


 Resources: identified by URIs, untyped.

 Links: href, src, . . . limited, non-descriptive.

 User: Exciting world – semantics of the resources, however, gleaned from content.

 Machine: Very little information available – significance of the links only evident
from the context around the anchor.

Michael Kohlhase: Artificial Intelligence 1 490 2025-02-06

Let us now contrast this with the envisioned semantic web.

The Semantic Web



 Resources: Globally identified by URIs or locally scoped (blank), extensible, relational.

 Links: Identified by URIs, extensible, relational.

 User: Even more exciting world, richer user experience.

 Machine: More processable information is available (Data Web).

 Computers and people: Work, learn and exchange knowledge effectively.

Michael Kohlhase: Artificial Intelligence 1 491 2025-02-06

Essentially, to make the web more machine-processable, we need to classify the resources by the
concepts they represent and give the links a meaning in a way, that we can do inference with that.
The ideas presented here gave rise to a set of technologies jointly called the “semantic web”, which
we will now summarize before we return to our logical investigations of knowledge representation
techniques.

Towards a “Machine-Actionable Web”


 Recall: We need external agreement on meaning of annotation tags.
 Idea: standardize them in a community process (e.g. DIN or ISO)
 Problem: Inflexible, Limited number of things can be expressed

 Better: Use ontologies to specify meaning of annotations


 Ontologies provide a vocabulary of terms
 New terms can be formed by combining existing ones
 Meaning (semantics) of such terms is formally specified
 Can also specify relationships between terms in multiple ontologies
 Inference with annotations and ontologies (get out more than you put in!)
 Standardize annotations in RDF [KC04] or RDFa [Her+13b] and ontologies on
OWL [OWL09]
 Harvest RDF and RDFa in to a triplestore or OWL reasoner.
 Query that for implied knowledge (e.g. chaining multiple facts from Wikipedia)
SPARQL: Who was US President when Barack Obama was Born?
DBPedia: John F. Kennedy (was president in August 1961)

Michael Kohlhase: Artificial Intelligence 1 492 2025-02-06



16.1.4 Other Knowledge Representation Approaches


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27282.
Now that we know what semantic networks mean, let us look at a couple of other approaches
that were influential for the development of knowledge representation. We will just mention them
for reference here, but not cover them in any depth.

Frame Notation as Logic with Locality

 Predicate Logic: (where is the locality?)


catch_22 ∈ catch_object There is an instance of catching
catcher(catch_22, jack_2) Jack did the catching
caught(catch_22, ball_5) He caught a certain ball

 Definition 16.1.22. Frames (group everything around the object)


(catch_object catch_22
(catcher jack_2)
(caught ball_5))

+ Once you have decided on a frame, all the information is local


+ easy to define schemes for concept (aka. types in feature structures)
– how to determine frame, when to choose frame (log/chair)

Michael Kohlhase: Artificial Intelligence 1 493 2025-02-06

KR involving Time (Scripts [Shank ’77])


 Idea: Organize typical event sequences, actors and props into representation.

 Definition 16.1.23. A script is a structured representation describing a stereotyped
sequence of events in a particular context. Structurally, scripts are very much like
frames, except the values that fill the slots must be ordered.

 Example 16.1.24. Getting your hair cut (at a beauty parlor):

   make appointment → go into beauty parlor → tell receptionist you’re here
   → Beautician cuts hair → pay → happy (big tip) or unhappy (small tip)

  props, actors as “script variables”
  events in a (generalized) sequence
  use script material for
   anaphora, bridging references
   default common ground
   to fill in missing material into situations

Michael Kohlhase: Artificial Intelligence 1 494 2025-02-06



Other Representation Formats (not covered)

 Procedural Representations (production systems)


 Analogical representations (interesting but not here)
 Iconic representations (interesting but very difficult to formalize)

 If you are interested, come see me off-line

Michael Kohlhase: Artificial Intelligence 1 495 2025-02-06

16.2 Logic-Based Knowledge Representation


A Video Nugget covering this section can be found at https://fau.tv/clip/id/27297.
We now turn to knowledge representation approaches that are based on some kind of logical
system. These have the advantage that we know exactly what we are doing: as they are based
on symbolic representations and declaratively given inference calculi as process models, we can
inspect them thoroughly and even prove facts about them.

Logic-Based Knowledge Representation


 Logic (and related formalisms) have a well-defined semantics
 explicitly (gives more understanding than statistical/neural methods)
 transparently (symbolic methods are monotonic)
 systematically (we can prove theorems about our systems)

 Problems with logic-based approaches


 Where does the world knowledge come from? (Ontology problem)
 How to guide search induced by logical calculi (combinatorial explosion)
 One possible answer: description logics. (next couple of times)

Michael Kohlhase: Artificial Intelligence 1 496 2025-02-06

But of course logic-based approaches have big drawbacks as well. The first is that we have to obtain
the symbolic representations of knowledge to do anything – a non-trivial challenge, since most
knowledge does not exist in this form in the wild. To obtain it, some agent has to experience the
world, pass it through its cognitive apparatus, conceptualize the phenomena involved, systematize
them sufficiently to form symbols, and then represent those in the respective formalism at hand.
The second drawback is that the process models induced by logic-based approaches (inference
with calculi) are quite intractable. We will see that all inferences can be played back to satisfiability
tests in the underlying logical system, which are exponential at best, and undecidable or even
incomplete at worst.
Therefore a major thrust in logic-based knowledge representation is to investigate logical sys-
tems that are expressive enough to be able to represent most knowledge, but still have a decidable
– and maybe even tractable in practice – satisfiability problem. Such logics are called “description
logics”. We will study the basics of such logical systems and their inference procedures in the
following.

16.2.1 Propositional Logic as a Set Description Language


Before we look at “real” description logics in ??, we will make a “dry run” with a logic we
already understand: propositional logic, which we re-interpret as a set description
language by giving it a new, non-standard semantics. This allows us to already preview most of
the inference procedures and knowledge services of knowledge representation systems in the next
subsection.
To establish propositional logic as a set description language, we use a different interpretation than
usual. We interpret propositional variables as names of sets and the connectives as set operations,
which is why we give them a different – more suggestive – syntax.

Propositional Logic as Set Description Language

 Idea: Use propositional logic as a set description language: (variant


syntax/semantics)
 Definition 16.2.1. Let PL0DL be given by the following grammar for the PL0DL
concepts. (formulae)

L ::= C | ⊤ | ⊥ | L‾ | L ⊓ L | L ⊔ L | L ⊑ L | L ≡ L

i.e. PL0DL formed from

  atomic formulae (=̂ propositional variables)
  concept intersection (⊓) (=̂ conjunction ∧)
  concept complement (·‾) (=̂ negation ¬)
  concept union (⊔), subsumption (⊑), and equivalence (≡) defined from these
   (=̂ ∨, ⇒, and ⇔)

 Definition 16.2.2 (Formal Semantics). Let D be a given set (called the domain
of discourse) and φ : V0 → P(D), then we define

  [[P]] := φ(P), (remember φ(P) ⊆ D).
  [[A ⊓ B]] := [[A]] ∩ [[B]] and [[A‾]] := D\[[A]] . . .

We call this construction the set description semantics of PL0 .


 Note: ⟨PL0DL , S, [ ·]]⟩, where S is the class of possible domains forms a logical
system.

Michael Kohlhase: Artificial Intelligence 1 497 2025-02-06
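To make the set description semantics concrete, here is a small hypothetical Python sketch (not
part of the notes) that evaluates PL0DL concepts over a given domain D and assignment φ; a
concept axiom holds iff it evaluates to all of D.

# Hypothetical sketch: evaluating PL0_DL concepts as subsets of a domain D.
# Concepts as nested tuples: ("var", p), ("top",), ("bot",), ("not", C),
# ("and", C1, C2), ("or", C1, C2), ("sub", C1, C2) for C1 ⊑ C2, ("equiv", C1, C2).
def interp(concept, D, phi):
    tag = concept[0]
    if tag == "var":
        return phi[concept[1]]                        # phi maps concept names to subsets of D
    if tag == "top":
        return set(D)
    if tag == "bot":
        return set()
    if tag == "not":
        return set(D) - interp(concept[1], D, phi)    # concept complement
    a, b = interp(concept[1], D, phi), interp(concept[2], D, phi)
    if tag == "and":
        return a & b
    if tag == "or":
        return a | b
    if tag == "sub":
        return (set(D) - a) | b                       # A ⊑ B behaves like complement(A) ⊔ B
    if tag == "equiv":
        return ((set(D) - a) | b) & ((set(D) - b) | a)

D = {"john", "mary", "rex"}
phi = {"child": {"john", "mary"}, "son": {"john"}, "daughter": {"mary"}}
axiom = ("sub", ("var", "son"), ("var", "child"))     # son ⊑ child
print(interp(axiom, D, phi) == set(D))                # True: the axiom holds in this domain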

The main use of the set-theoretic semantics for PL0 is that we can use it to give meaning to concept
axioms, which we use to describe the “world”.

Concept Axioms

 Observation: Set-theoretic semantics of ‘true’ and ‘false’: ⊤ := φ ⊔ φ‾ and ⊥ := φ ⊓ φ‾, so

   [[⊤]] = [[p]] ∪ [[p‾]] = [[p]] ∪ D\[[p]] = D      Analogously: [[⊥]] = ∅

 Idea: Use logical axioms to describe the world (axioms restrict the class of
admissible domain structures)


 Definition 16.2.3. A concept axiom is a PL0DL formula A that is assumed to be
true in the world.

 Definition 16.2.4 (Set-Theoretic Semantics of Axioms). A is true in domain


of discourse D iff [ A]] = D.
 Example 16.2.5. A world with three concepts and no concept axioms

   [Figure: the three concepts son, daughter, and child shown as arbitrarily overlapping subsets
   of the domain of discourse – without concept axioms nothing constrains their relationship.]

Michael Kohlhase: Artificial Intelligence 1 498 2025-02-06

Concept axioms are used to restrict the set of admissible domains to the intended ones. In our
situation, we require them to be true – as usual – which here means that they denote the whole
domain D.
Let us fortify our intuition about concept axioms with a simple example about the sibling relation.
We give four concept axioms and study their effect on the admissible models by looking at the
respective Venn diagrams. In the end we see that in all admissible models, the denotations of the
concepts son and daughter are disjoint, and child is the union of the two – just as intended.

Effects of Axioms to Siblings


 Example 16.2.6. We can use concept axioms to describe the world from ??.

   Axiom                       Semantics
   son ⊑ child                 holds iff (D\[[son]]) ∪ [[child]] = D, i.e. iff [[son]] ⊆ [[child]]
   daughter ⊑ child            holds iff (D\[[daughter]]) ∪ [[child]] = D, i.e. iff [[daughter]] ⊆ [[child]]
   (son ⊓ daughter)‾           holds iff [[son]] ∩ [[daughter]] = ∅
   child ⊑ son ⊔ daughter      holds iff [[child]] ⊆ [[son]] ∪ [[daughter]]

   [The Venn diagrams accompanying the table show how the axioms successively constrain the sets
   of sons, daughters, and children until sons and daughters are disjoint and together exhaust
   the children.]

Michael Kohlhase: Artificial Intelligence 1 499 2025-02-06

The set-theoretic semantics introduced above is compatible with the regular semantics of proposi-
tional logic, therefore we have the same propositional identities. Their validity can be established
directly from the settings in ??.

Propositional Identities
   Name       for ⊓                                 for ⊔
   Idempot.   φ ⊓ φ = φ                             φ ⊔ φ = φ
   Identity   φ ⊓ ⊤ = φ                             φ ⊔ ⊥ = φ
   Absorpt.   φ ⊔ ⊤ = ⊤                             φ ⊓ ⊥ = ⊥
   Commut.    φ ⊓ ψ = ψ ⊓ φ                         φ ⊔ ψ = ψ ⊔ φ
   Assoc.     φ ⊓ (ψ ⊓ θ) = (φ ⊓ ψ) ⊓ θ             φ ⊔ (ψ ⊔ θ) = (φ ⊔ ψ) ⊔ θ
   Distrib.   φ ⊓ (ψ ⊔ θ) = (φ ⊓ ψ) ⊔ (φ ⊓ θ)       φ ⊔ (ψ ⊓ θ) = (φ ⊔ ψ) ⊓ (φ ⊔ θ)
   Absorpt.   φ ⊓ (φ ⊔ θ) = φ                       φ ⊔ (φ‾ ⊓ θ) = φ ⊔ θ
   Morgan     (φ ⊓ ψ)‾ = φ‾ ⊔ ψ‾                    (φ ⊔ ψ)‾ = φ‾ ⊓ ψ‾
   dneg       (φ‾)‾ = φ

Michael Kohlhase: Artificial Intelligence 1 500 2025-02-06

There is another way we can approach the set description interpretation of propositional logic: by
translation into a logic that can express knowledge about sets – first-order logic.

Set-Theoretic Semantics and Predicate Logic


 Definition 16.2.7. Translation into PL1 (borrow semantics from that)

 recursively add argument variable x


 change back ⊓, ⊔, ⊑, ≡ to ∧, ∨, ⇒, ⇔
 universal closure for x at formula level.

   Definition                                  Comment
   p^fo(x)         := p(x)
   (A‾)^fo(x)      := ¬A^fo(x)
   (A ⊓ B)^fo(x)   := A^fo(x) ∧ B^fo(x)        ∧ vs. ⊓
   (A ⊔ B)^fo(x)   := A^fo(x) ∨ B^fo(x)        ∨ vs. ⊔
   (A ⊑ B)^fo(x)   := A^fo(x) ⇒ B^fo(x)        ⇒ vs. ⊑
   (A = B)^fo(x)   := A^fo(x) ⇔ B^fo(x)        ⇔ vs. =
   A^fo            := (∀x.A^fo(x))             for formulae

Michael Kohlhase: Artificial Intelligence 1 501 2025-02-06
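The fo(x) translation can be written down directly; the following hypothetical Python sketch
(illustrative only) produces the first-order translation as a string, using the same tuple
representation of concepts as in the sketch above.

# Hypothetical sketch: the fo(x) translation of PL0_DL concepts into first-order formulas.
def fo(concept, x="x"):
    tag = concept[0]
    if tag == "var":
        return f"{concept[1]}({x})"
    if tag == "not":
        return f"¬{fo(concept[1], x)}"
    ops = {"and": "∧", "or": "∨", "sub": "⇒", "equiv": "⇔"}
    return f"({fo(concept[1], x)} {ops[tag]} {fo(concept[2], x)})"

def fo_formula(concept, x="x"):
    return f"∀{x}.{fo(concept, x)}"                   # universal closure at formula level

print(fo_formula(("sub", ("var", "son"), ("var", "child"))))
# ∀x.(son(x) ⇒ child(x))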

Normally, we embed PL0 into PL1 by mapping propositional variables to atomic first-order
propositions and the connectives to themselves. The purpose of this embedding is to “talk about
truth/falsity of assertions”. For “talking about sets” we use a non-standard embedding:
propositional variables in PL0 are mapped to first-order predicates, and the connectives to
corresponding set operations. This uses the convention that a set S is represented by a unary
predicate pS (its characteristic predicate), and set membership a ∈ S as pS (a).

   [Figure (in the margin): the non-standard embedding φ‾ maps PL0 into a decidable fragment of
   the (undecidable) logic PL1 via Xo ↦ pα→o, ∧ ↦ ⊓, and ¬ ↦ ·‾.]

Translation Examples
 Example 16.2.8. We translate the concept axioms from ?? to fortify our intuition:
   (son ⊑ child)^fo            = ∀x.son(x) ⇒ child(x)
   (daughter ⊑ child)^fo       = ∀x.daughter(x) ⇒ child(x)
   ((son ⊓ daughter)‾)^fo      = ∀x.¬(son(x) ∧ daughter(x))
   (child ⊑ son ⊔ daughter)^fo = ∀x.child(x) ⇒ (son(x) ∨ daughter(x))

 What are the advantages of translation to PL1 ?

 theoretically: A better understanding of the semantics


 computationally: Description Logic Framework, but NOTHING for PL0
 we can follow this pattern for richer description logics.
 many tests are decidable for PL0, but not for PL1. (Description Logics?)

Michael Kohlhase: Artificial Intelligence 1 502 2025-02-06

16.2.2 Ontologies and Description Logics


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27298.
We have seen how sets of concept axioms can be used to describe the “world” by restricting the set
of admissible models. We want to call such concept descriptions “ontologies” – formal descriptions
of (classes of) objects and their relations.

Ontologies aka. “World Descriptions”


 Definition 16.2.9 (Classical). An ontology is a representation of the types, prop-
erties, and interrelationships of the entities that really or fundamentally exist for a
particular domain of discourse.
 Remark: ?? is very general, and depends on what we mean by “representation”,
“entities”, “types”, and “interrelationships”.
This may be a feature, and not a bug, since we can use the same intuitions across
a variety of representations.
 Definition 16.2.10. An ontology consists of a formal system ⟨L, C , K, ⊨⟩ with
concept axioms (expressed in L) about
 individuals: concrete entities in a domain of discourse,
 concepts: particular collections of individuals that share properties and aspects
– the instances of the concept, and
 relations: ways in which individuals can be related to one another.
 Example 16.2.11. Semantic networks are ontologies. (relatively informal)

 Example 16.2.12. PL0DL is an ontology format. (formal, but relatively weak)

 Example 16.2.13. PL1 is an ontology format as well. (formal, expressive)



Michael Kohlhase: Artificial Intelligence 1 503 2025-02-06

As we will see, the situation for PL0DL is typical for formal ontologies (even though it only offers
concepts), so we state the general description logic paradigm for ontologies. The important idea
is that having a formal system as an ontology format allows us to capture, study, and implement
ontological inference.

The Description Logic Paradigm

 Idea: Build a whole family of logics for describing sets and their relations. (tailor
their expressivity and computational properties)
 Definition 16.2.14. A description logic is a formal system for talking about col-
lections of objects and their relations that is at least as expressive as PL0 with
set-theoretic semantics and offers individuals and relations.
A description logic has the following four components:
  a formal language L with logical constants ⊓, ·‾, ⊔, ⊑, and ≡,
  a set-theoretic semantics ⟨D, [[·]]⟩,
  a translation into first-order logic that is compatible with ⟨D, [[·]]⟩, and
  a calculus for L that induces a decision procedure for L-satisfiability.

   [Figure: the description logic DL sits between PL0 and a decidable fragment of the (undecidable)
   logic PL1: φ‾ maps PL0 into DL (X ∈ V0 ↦ C, ∧ ↦ ⊓, ¬ ↦ ·‾), and ψ‾ maps DL into PL1
   (C ↦ p ∈ Σp1, ⊓ ↦ ∩, ·‾ ↦ D\·).]

 Definition 16.2.15. Given a description logic D, a D ontology consists of

 a terminology (or TBox): concepts and roles and a set of concept axioms that
describe them, and
 assertions (or ABox): a set of individuals and statements about concept mem-
bership and role relationships for them.

Michael Kohlhase: Artificial Intelligence 1 504 2025-02-06

For convenience we add concept definitions as a mechanism for defining new concepts from old
ones. The so-defined concepts inherit the properties from the concepts they are defined from.

TBoxes in Description Logics


 Let D be a description logic with concepts C.
 Definition 16.2.16. A concept definition is a pair c=C, where c is a new concept
name and C ∈ C is a D-formula.
 Example 16.2.17. We can define mother=woman ⊓ has_child.

 Definition 16.2.18. A concept definition c=C is called recursive, iff c occurs in


C.
 Definition 16.2.19. A TBox is a finite set of concept definitions and concept
axioms. It is called acyclic, iff it does not contain recursive definitions.

 Definition 16.2.20. A formula A is called normalized wrt. a TBox T , iff it does


not contain concepts defined in T . (convenient)
 Definition 16.2.21 (Algorithm). (for arbitrary DLs)
Input: A formula A and a TBox T .

 While [A contains concept c and T a concept definition c=C]


 substitute c by C in A.
 Lemma 16.2.22. This algorithm terminates for acyclic TBoxes, but results can be
exponentially large.

Michael Kohlhase: Artificial Intelligence 1 505 2025-02-06
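The normalization algorithm of Definition 16.2.21 is easily turned into code; here is a hypothetical
Python sketch (not from the notes) over the same tuple representation of concepts.

# Hypothetical sketch: normalizing a concept wrt. an acyclic TBox by expanding definitions.
# The TBox is a dict mapping defined concept names to concepts (nested tuples).
def contains_defined(concept, tbox):
    if concept[0] == "var":
        return concept[1] in tbox
    return any(contains_defined(c, tbox) for c in concept[1:])

def expand_once(concept, tbox):
    if concept[0] == "var":
        return tbox.get(concept[1], concept)          # substitute the definition if there is one
    return (concept[0],) + tuple(expand_once(c, tbox) for c in concept[1:])

def normalize(concept, tbox):
    while contains_defined(concept, tbox):            # terminates for acyclic TBoxes,
        concept = expand_once(concept, tbox)          # but results can grow exponentially
    return concept

tbox = {"mother": ("and", ("var", "woman"), ("var", "has_child"))}
print(normalize(("var", "mother"), tbox))
# ('and', ('var', 'woman'), ('var', 'has_child'))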

As PL0DL does not offer any guidance on this, we will leave the discussion of ABoxes to ?? when
we have introduced our first proper description logic ALC.

16.2.3 Description Logics and Inference


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27299.
Now that we have established the description logic paradigm, we will have a look at the
inference services that can be offered on this basis.
Before we go into details of particular description logics, we must ask ourselves what kind of
inference support we would want for building systems that support knowledge workers in building,
maintaining and using ontologies. An example of such a system is the Protégé system [Pro], which
can serve for guiding our intuition.

Kinds of Inference in Description Logics


 Definition 16.2.23. Ontology systems employ three main reasoning services:

 Consistency test: is a concept definition satisfiable?


 Subsumption test: does a concept subsume another?
 Instance test: is an individual an example of a concept?
 Problem: decidability, complexity, algorithm

Michael Kohlhase: Artificial Intelligence 1 506 2025-02-06

We will now go through these inference-based tests separately.

The consistency test checks for concepts that do not/cannot have instances. We want to avoid such
concepts in our ontologies, since they clutter the namespace without contributing anything meaningful.

Consistency Test
 Definition 16.2.24. We call a concept C consistent, iff there is no concept A
such that both C ⊑ A and C ⊑ ¬A hold.
 Or equivalently:

 Definition 16.2.25. A concept C is called inconsistent, iff [ C]] = ∅ for all D.


 Example 16.2.26 (T-Box in PL0DL).

   man           = person ⊓ has_Y      (person with y-chromosome)
   woman         = person ⊓ ¬has_Y     (person without y-chromosome)
   hermaphrodite = man ⊓ woman         (man and woman)

This specification is inconsistent, i.e. [[hermaphrodite]] = ∅ for all D.

 Algorithm: Satisfiability test (usually NP complete)


we know how to do this, e.g. tableaux, resolution, DPLL in PL0DL .

Michael Kohlhase: Artificial Intelligence 1 507 2025-02-06

Even though consistency in our example seems trivial, large ontologies can make machine support
necessary. This is even more true for ontologies that change over time. Say that an ontology
initially has the concept definitions woman=person⊓long_hair and man=person⊓bearded, and then
is modernized to a more biologically correct state. In the initial version the concept hermaphrodite
is consistent, but becomes inconsistent after the renovation; the authors of the renovation should
be made aware of this by the system.
The subsumption test determines whether the sets denoted by two concepts are in a subset relation.
The main justification for this is that humans tend to be aware of concept subsumption, and tend
to think in taxonomic hierarchies. To cater to this, the subsumption test is useful.

Subsumption Test
 Example 16.2.27. In this case trivial:

   axiom                           entailed subsumption relation
   man   = person ⊓ has_Y          man ⊑ person
   woman = person ⊓ ¬has_Y         woman ⊑ person

 Definition 16.2.28. A subsumes B (modulo a set A of concept axioms), iff


[ B]] ⊆ [ A]] for all interpretations D that satisfy A.

 Observation: Equivalently, A subsumes B iff the concept B ⊑ A (i.e. ¬B ⊔ A) is equivalent to ⊤ in all models of A.


 Reduction to consistency test: (need to implement only one)
In PL0 , A ⇒ (A ⇒ B) is valid iff A ∧ A ∧ ¬B is inconsistent.
 In our example: The concept person subsumes woman and man.
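
As a minimal illustration of this reduction, the following Python sketch assumes a consistency test consistent(C, axioms) for concepts is already available (the function name and its string-based signature are hypothetical here).

def subsumes(A: str, B: str, axioms: list[str]) -> bool:
    """A subsumes B (modulo the axioms) iff B ⊓ ¬A is inconsistent."""
    return not consistent(f"({B}) ⊓ ¬({A})", axioms)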

Michael Kohlhase: Artificial Intelligence 1 508 2025-02-06

The good news is that we can reduce the subsumption test to the consistency test, so we can
re-use our existing implementation.
The main user-visible service of the subsumption test is to compute the actual taxonomy induced
by an ontology.

Classification
 The subsumption relation among all concepts (subsumption graph)
 Visualization of the subsumption graph for inspection (plausibility)
 Definition 16.2.29. Classification is the computation of the subsumption graph.

 Example 16.2.30. (not always so trivial)

object

person

man woman student professor child

male_student female_student boy girl
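
Operationally, classification can be obtained by running the subsumption test on all pairs of concept names; the following Python sketch reuses the hypothetical subsumes function from above.

from itertools import permutations

def classify(concepts: list[str], axioms: list[str]) -> set[tuple[str, str]]:
    """Return the subsumption graph as a set of edges (subconcept, superconcept)."""
    return {(B, A) for A, B in permutations(concepts, 2) if subsumes(A, B, axioms)}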

Michael Kohlhase: Artificial Intelligence 1 509 2025-02-06

Instance Test: Inferring Concept Membership


 Definition 16.2.31. An instance test computes whether given an ontology an
individual is a member of a given concept.
 Remark: This is not something we can do in PL0DL , which is a TBox-only system.
PL1 (where concepts are predicate constants and assertions are atoms) suffices.
 Example 16.2.32. If we define a concept “mother” as “woman who has a child”,
and have the assertions “Mary is a woman” and “Jesus is a child of Mary”, then we
can infer that “Mary” is a “Mother”, e.g. in the ND1 :

∀x.m(x) ⇔ w(x) ∧ (∃y.hc(x, y)), w(M ), hc(M , J)⊢ND1 m(M )

 Remark: This only works in the presence of concept definitions, not in a purely
descriptive framework like semantic networks:
[Figure: a semantic network with a TBox part (amoeba and higher animal isa animal, which can
move; higher animal has_part legs and head; tiger and elephant isa higher animal; tiger: pattern
striped, elephant: color gray) and an ABox part (Roy and Rex inst tiger, Clyde inst elephant).]

Michael Kohlhase: Artificial Intelligence 1 510 2025-02-06

If we take stock of what we have developed so far, then we can see PL0DL as a rational reconstruction
of semantic networks restricted to the “isa” relation. We relegate the “instance” relation to ??.
This reconstruction can now be used as a basis on which we can extend the expressivity and
inference procedures without running into problems.

16.3 A simple Description Logic: ALC


In this section, we instantiate the description-logic paradigm further with the prototypical logic
ALC, which we will introduce now.

16.3.1 Basic ALC: Concepts, Roles, and Quantification


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27300.
In this subsection, we instantiate the description-logic paradigm with the prototypical logic
ALC, which we will develop now.

Motivation for ALC (Prototype Description Logic)

 Propositional logic (PL0 ) is not expressive enough!


 Example 16.3.1. “mothers are women that have a child”

 Reason: There are no quantifiers in PL0 (existential (∃) and universal (∀))
 Idea: Use first-order predicate logic (PL1 )

∀x.mother(x) ⇔ woman(x) ∧ (∃y.has_child(x, y))

 Problem: Complex algorithms, non-termination (PL1 is too expressive)


 Idea: Try to travel the middle ground
More expressive than PL0 (quantifiers) but weaker than PL1 . (still tractable)
 Technique: Allow only “restricted quantification”, where quantified variables only
range over values that can be reached via a binary relation like has_child.

Michael Kohlhase: Artificial Intelligence 1 511 2025-02-06

ALC extends the concept operators of PL0DL with binary relations (called “roles” in ALC). This
gives ALC the expressive power we had for the basic semantic networks from ??.

Syntax of ALC
 Definition 16.3.2 (Concepts). (aka. “predicates” in PL1 or “propositional
variables” in PL0DL )
Concepts in DLs represent collections of objects.

 . . . like classes in OOP.


 Definition 16.3.3 (Special Concepts). The top concept ⊤ (for “true” or “all”)
and the bottom concept ⊥ (for “false” or “none”).
 Example 16.3.4. person, woman, man, mother, professor, student, car, BMW,
computer, computer program, heart attack risk, furniture, table, leg of a chair, . . .
 Definition 16.3.5. Roles represent binary relations (like in PL1 )
 Example 16.3.6. has_child, has_son, has_daughter, loves, hates, gives_course,
executes_computer_program, has_leg_of_table, has_wheel, has_motor, . . .

 Definition 16.3.7 (Grammar). The formulae of ALC are given by the following
grammar: FALC ::= C | ⊤ | ⊥ | ¬FALC | FALC ⊓ FALC | FALC ⊔ FALC | ∃R.FALC | ∀R.FALC

Michael Kohlhase: Artificial Intelligence 1 512 2025-02-06

ALC restricts the quantification to range over all individuals reachable as role successors. The distinction
between universal and existential quantifiers clarifies an implicit ambiguity in semantic networks.

Syntax of ALC: Examples


 Example 16.3.8. person ⊓ ∃has_child.student
   ≙ the set of persons that have a child which is a student
   ≙ parents of students

 Example 16.3.9. person ⊓ ∃has_child.∃has_child.student
   ≙ grandparents of students

 Example 16.3.10. person ⊓ ∃has_child.∃has_child.(student ⊔ teacher)
   ≙ grandparents of students or teachers

 Example 16.3.11. person ⊓ ∀has_child.student
   ≙ parents whose children are all students

 Example 16.3.12. person ⊓ ∀has_child.∃has_child.student
   ≙ grandparents, whose children all have at least one child that is a student


Michael Kohlhase: Artificial Intelligence 1 513 2025-02-06

More ALC Examples

 Example 16.3.13. car ⊓ ∃has_part.¬(∃made_in.EU)
   ≙ cars that have at least one part that has not been made in the EU

 Example 16.3.14. student ⊓ ∀audits_course.graduatelevelcourse
   ≙ students that only audit graduate level courses

 Example 16.3.15. house ⊓ ∀has_parking.off_street ≙ houses with off-street parking

 Note: p ⊑ q can still be used as an abbreviation for ¬p ⊔ q.

 Example 16.3.16. student ⊓ ∀audits_course.(∃has_tutorial.⊤ ⊑ ∀has_TA.woman)
   ≙ students that only audit courses that either have no tutorial or tutorials that are
     TAed by women


Michael Kohlhase: Artificial Intelligence 1 514 2025-02-06

As before we allow concept definitions so that we can express new concepts from old ones, and
obtain more concise descriptions.

ALC Concept Definitions


 Idea: Define new concepts from known ones.
 Definition 16.3.17. A concept definition is a pair consisting of a new concept
name (the definiendum) and an ALC formula (the definiens). Concepts that are not
definienda are called primitive.


 We extend the ALC grammar from ?? by the production

CDALC ::=C = FALC

 Example 16.3.18.

   Definition                                                              rec?
   man          = person ⊓ ∃has_chrom.Y_chrom                              -
   woman        = person ⊓ ∀has_chrom.¬Y_chrom                             -
   mother       = woman ⊓ ∃has_child.person                                -
   father       = man ⊓ ∃has_child.person                                  -
   grandparent  = person ⊓ ∃has_child.(mother ⊔ father)                    -
   german       = person ⊓ ∃has_parents.german                             +
   number_list  = empty_list ⊔ ∃is_first.number ⊓ ∃is_rest.number_list     +

Michael Kohlhase: Artificial Intelligence 1 515 2025-02-06

As before, we can normalize a TBox by definition expansion if it is acyclic. With the introduction
of roles and quantification, concept definitions in ALC have a more “interesting” way to be cyclic
as ?? shows.

TBox Normalization in ALC


 Definition 16.3.19. We call an ALC formula φ normalized wrt. a set of concept
definitions, iff all concepts occurring in φ are primitive.
 Definition 16.3.20. Given a set D of concept definitions, normalization is the
process of replacing in an ALC formula φ all occurrences of definienda in D with
their definientia.
 Example 16.3.21 (Normalizing grandparent).
   grandparent
   7→ person ⊓ ∃has_child.(mother ⊔ father)
   7→ person ⊓ ∃has_child.((woman ⊓ ∃has_child.person) ⊔ (man ⊓ ∃has_child.person))
   7→ person ⊓ ∃has_child.((person ⊓ ∀has_chrom.¬Y_chrom ⊓ ∃has_child.person) ⊔ (person ⊓ ∃has_chrom.Y_chrom ⊓ ∃has_child.person))

 Observation 16.3.22. Normalization results can be exponential. (contain


redundancies)
 Observation 16.3.23. Normalization need not terminate on cyclic TBoxes.
 Example 16.3.24.

german 7→ person ⊓ ∃has_parents.german


7→ person ⊓ ∃has_parents.(person ⊓ ∃has_parents.german)
7→ . . .

Michael Kohlhase: Artificial Intelligence 1 516 2025-02-06

Now that we have motivated and fixed the syntax of ALC, we will give it a formal semantics.
The semantics of ALC is an extension of the set-theoretic semantics for PL0 , thus the interpretation
[[·]] assigns subsets of the domain of discourse to concepts and binary relations over the domain
of discourse to roles.

Semantics of ALC
 ALC semantics is an extension of the set-semantics of propositional logic.
 Definition 16.3.25. A model for ALC is a pair ⟨D, [[·]]⟩, where D is a non-empty
set called the domain of discourse and [[·]] a mapping called the interpretation, such
that

   Op.   formula semantics
         [[c]] ⊆ D,  [[⊤]] = D,  [[⊥]] = ∅,  [[r]] ⊆ D × D
   ¬     [[¬φ]]    = D \ [[φ]]
   ⊓     [[φ ⊓ ψ]] = [[φ]] ∩ [[ψ]]
   ⊔     [[φ ⊔ ψ]] = [[φ]] ∪ [[ψ]]
   ∃R.   [[∃R.φ]]  = {x ∈ D | ∃y.⟨x, y⟩ ∈ [[R]] and y ∈ [[φ]]}
   ∀R.   [[∀R.φ]]  = {x ∈ D | ∀y.if ⟨x, y⟩ ∈ [[R]] then y ∈ [[φ]]}

 Alternatively we can define the semantics of ALC by translation into PL1 .

 Definition 16.3.26. The translation of ALC into PL1 extends the one from ?? by
the following quantifier rules:
   (∀R.φ)fo(x) := (∀y.R(x, y) ⇒ φfo(y))        (∃R.φ)fo(x) := (∃y.R(x, y) ∧ φfo(y))

 Observation 16.3.27. The set-theoretic semantics from ?? and the “semantics-


by-translation” from ?? induce the same notion of satisfiability.
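
To make the set-theoretic semantics concrete, here is a small Python sketch that evaluates ALC concepts over a finite model; the tuple-based representation of concepts ("atom", "not", "and", "or", "exists", "forall") is an illustrative assumption, not part of the lecture notes.

def interp(concept, domain, C, R):
    """Evaluate an ALC concept over a finite model (domain, concept map C, role map R)."""
    tag = concept[0]
    if tag == "atom":   return C[concept[1]]
    if tag == "not":    return domain - interp(concept[1], domain, C, R)
    if tag == "and":    return interp(concept[1], domain, C, R) & interp(concept[2], domain, C, R)
    if tag == "or":     return interp(concept[1], domain, C, R) | interp(concept[2], domain, C, R)
    if tag == "exists":
        return {x for x in domain
                if any((x, y) in R[concept[1]] and y in interp(concept[2], domain, C, R) for y in domain)}
    if tag == "forall":
        return {x for x in domain
                if all((x, y) not in R[concept[1]] or y in interp(concept[2], domain, C, R) for y in domain)}
    raise ValueError(tag)

# person ⊓ ∃has_child.student over a three-element domain
D = {"a", "b", "c"}
C = {"person": {"a", "b"}, "student": {"c"}}
R = {"has_child": {("a", "c")}}
print(interp(("and", ("atom", "person"), ("exists", "has_child", ("atom", "student"))), D, C, R))   # {'a'}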

Michael Kohlhase: Artificial Intelligence 1 517 2025-02-06

We can now use the ALC identities above to establish a useful normal form for ALC. This will
play a role in the inference procedures we study next.
The following identities will be useful later on. They can be proven directly with the settings from
??; we carry this out for one of them below.

ALC Identities
   1. ¬(∃R.φ) = ∀R.¬φ                        3. ¬(∀R.φ) = ∃R.¬φ
   2. ∀R.(φ ⊓ ψ) = ∀R.φ ⊓ ∀R.ψ               4. ∃R.(φ ⊔ ψ) = ∃R.φ ⊔ ∃R.ψ

 Proof of 1:

   [[¬(∃R.φ)]] = D \ [[∃R.φ]] = D \ {x ∈ D | ∃y.(⟨x, y⟩ ∈ [[R]]) and (y ∈ [[φ]])}
              = {x ∈ D | not ∃y.(⟨x, y⟩ ∈ [[R]]) and (y ∈ [[φ]])}
              = {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∉ [[φ]])}
              = {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∈ (D \ [[φ]]))}
              = {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∈ [[¬φ]])}
              = [[∀R.¬φ]]

Michael Kohlhase: Artificial Intelligence 1 518 2025-02-06

The form of the identities (interchanging quantification with connectives) is reminiscent of identities
in PL1 ; this is no coincidence as the “semantics by translation” of ?? shows.

Negation Normal Form


 Definition 16.3.28 (NNF). An ALC formula is in negation normal form (NNF),
iff the complement (¬) is only applied to primitive concepts.
 Use the ALC identities as rules to compute it. (in linear time)
 Example 16.3.29.

   example                             by rule
   ¬(∃R.(∀S.e ⊓ ¬(∀S.d)))
   7→ ∀R.¬(∀S.e ⊓ ¬(∀S.d))             ¬(∃R.φ) 7→ ∀R.¬φ
   7→ ∀R.(¬(∀S.e) ⊔ ¬¬(∀S.d))          ¬(φ ⊓ ψ) 7→ ¬φ ⊔ ¬ψ
   7→ ∀R.(∃S.¬e ⊔ ¬¬(∀S.d))            ¬(∀R.φ) 7→ ∃R.¬φ
   7→ ∀R.(∃S.¬e ⊔ ∀S.d)                ¬¬φ 7→ φ
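
A direct implementation of these rewrite rules runs in linear time; the following Python sketch (using the same tuple representation of concepts as the semantics sketch above, again an assumption for illustration) pushes complements inward.

def nnf(c):
    """Push complements inward until they are only applied to primitive concepts."""
    if c[0] == "not":
        d = c[1]
        if d[0] == "atom":   return c                                      # already primitive
        if d[0] == "not":    return nnf(d[1])                              # ¬¬φ ↦ φ
        if d[0] == "and":    return ("or",  nnf(("not", d[1])), nnf(("not", d[2])))
        if d[0] == "or":     return ("and", nnf(("not", d[1])), nnf(("not", d[2])))
        if d[0] == "exists": return ("forall", d[1], nnf(("not", d[2])))   # ¬∃R.φ ↦ ∀R.¬φ
        if d[0] == "forall": return ("exists", d[1], nnf(("not", d[2])))   # ¬∀R.φ ↦ ∃R.¬φ
    if c[0] in ("and", "or"):        return (c[0], nnf(c[1]), nnf(c[2]))
    if c[0] in ("exists", "forall"): return (c[0], c[1], nnf(c[2]))
    return c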

Michael Kohlhase: Artificial Intelligence 1 519 2025-02-06

Finally, we extend ALC with an ABox component. This mainly means that we define two new
assertions in ALC and specify their semantics and PL1 translation.

ALC with Assertions about Individuals


 Definition 16.3.30. We define the ABox assertions for ALC:
 a:φ (a is a φ)
 a R b (a stands in relation R to b)

These assertions make up the ABox in ALC.


 Definition 16.3.31. Let ⟨D, [[·]]⟩ be a model for ALC, then we define
 [ a:φ]] = T, iff [ a]] ∈ [ φ]], and
 [ a R b]] = T, iff ( [ a]] , [ b]] ) ∈ [ R]].

 Definition 16.3.32. We extend the PL1 translation of ALC to ALC assertions:


 (a:φ)fo := φfo(a), and
 (a R b)fo := R(a, b).

Michael Kohlhase: Artificial Intelligence 1 520 2025-02-06

If we take stock of what we have developed so far, then we can see ALC as a rational reconstruction
of semantic networks restricted to the “isa” and “instance” relations – which are the only
ones that can really be given a denotational and operational semantics.

16.3.2 Inference for ALC


Video Nuggets covering this subsection can be found at https://fau.tv/clip/id/27301 and
https://fau.tv/clip/id/27302.

In this subsection we make good on the motivation from ?? that description logics enjoy tractable
inference procedures: We present a tableau calculus for ALC, show that it is a decision procedure,
and study its complexity.

TALC : A Tableau-Calculus for ALC


 Recap Tableaux: A tableau calculus develops an initial tableau in a tree-formed
scheme using tableau extension rules.
A saturated tableau (no rules applicable) constitutes a refutation, if all branches are
closed (end in ⊥).
 Definition 16.3.33. The ALC tableau calculus TALC acts on assertions:
 x:φ (x inhabits concept φ)
 xRy (x and y are in relation R)

where φ is a normalized ALC concept in negation normal form with the following
rules:
   T⊥: from x:c and x:¬c derive ⊥
   T⊓: from x:φ ⊓ ψ derive x:φ and x:ψ
   T⊔: from x:φ ⊔ ψ branch into x:φ and x:ψ
   T∀: from x:∀R.φ and x R y derive y:φ
   T∃: from x:∃R.φ derive x R y and y:φ (y new)

 To test consistency of a concept φ, normalize φ to ψ, initialize the tableau with
x:ψ, and saturate. Open branches ⇝ consistent. (x arbitrary)

Michael Kohlhase: Artificial Intelligence 1 521 2025-02-06

In contrast to the tableau calculi for theorem proving we have studied earlier, TALC is run
in “model generation mode”. Instead of initializing the tableau with the axioms and the negated
conjecture and hoping that all branches will close, we initialize the TALC tableau with axioms and
the “membership-conjecture” that a given concept φ is satisfiable – i.e. φ has a member x – and
hope for branches that are open, i.e. that make the conjecture true (and at the same time give a
model).
Let us now work through two very simple examples; one unsatisfiable, and a satisfiable one.

TALC Examples
 Example 16.3.34 (Tableau Proofs). We have two similar conjectures about
children.
 x:∀has_child.man ⊓ ∃has_child.¬man (all sons, but a daughter)

   x:∀has_child.man ⊓ ∃has_child.¬man   initial
   x:∀has_child.man                     T⊓
   x:∃has_child.¬man                    T⊓
   x has_child y                        T∃
   y:¬man                               T∃
   y:man                                T∀
   ⊥                                    T⊥
   inconsistent
 x:∀has_child.man ⊓ ∃has_child.man (only sons, and at least one)

x:∀has_child.man ⊓ ∃has_child.man initial


x:∀has_child.man T⊓
x:∃has_child.man T⊓
x has_child y T∃
y:man T∃
open
This tableau shows a model: there are two persons, x and y. y is the only child
of x, y is a man.

Michael Kohlhase: Artificial Intelligence 1 522 2025-02-06

Another example: this one is more complex, but the concept is satisfiable.

Another TALC Example

 Example 16.3.35. ∀has_child.(ugrad ⊔ grad) ⊓ ∃has_child.¬ugrad is satisfiable.

 Let’s try it on the board
 Tableau proof for the notes
   1  x:∀has_child.(ugrad ⊔ grad) ⊓ ∃has_child.¬ugrad   initial
   2  x:∀has_child.(ugrad ⊔ grad)                       T⊓
   3  x:∃has_child.¬ugrad                               T⊓
   4  x has_child y                                     T∃
   5  y:¬ugrad                                          T∃
   6  y:ugrad ⊔ grad                                    T∀
   7  y:ugrad   |   y:grad                              T⊔
   8  ⊥         |   open
The left branch is closed, the right one represents a model: y is a child of x, y
is a graduate student, x has exactly one child: y.

Michael Kohlhase: Artificial Intelligence 1 523 2025-02-06

After we got an intuition about TALC , we can now study the properties of the calculus to determine
that it is a decision procedure for ALC.

Properties of Tableau Calculi


 We study the following properties of a tableau calculus C:
 Termination: there are no infinite sequences of inference rule applications.
 Soundness: If φ is satisfiable, then C terminates with an open branch.
 Completeness: If φ is unsatisfiable, then C terminates and all branches are
closed.
 complexity of the algorithm (time and space complexity).

 Additionally, we are interested in the complexity of satisfiability itself (as a


benchmark)

Michael Kohlhase: Artificial Intelligence 1 524 2025-02-06

The soundness result for TALC is as usual: we start with a model of x:φ and show that a TALC
tableau must have an open branch.

Correctness
 Lemma 16.3.36. If φ satisfiable, then TALC terminates on x:φ with open branch.
 Proof: Let M := ⟨D, [[·]]⟩ be a model for φ and w ∈ [[φ]].
   1. We define [[x]] := w and
        M|=(x:ψ)  iff  [[x]] ∈ [[ψ]]
        M|=x R y  iff  ⟨[[x]], [[y]]⟩ ∈ [[R]]
        M|=S      iff  M|=c for all c ∈ S
   2. This gives us M|=(x:φ) (base case)
3. If the branch is satisfiable, then either
 no rule applicable to leaf, (open branch)
 or rule applicable and one new branch satisfiable. (inductive case: next)
4. There must be an open branch. (by termination)

Michael Kohlhase: Artificial Intelligence 1 525 2025-02-06

We complete the proof by looking at all the TALC inference rules in turn.

Case analysis on the rules


T⊓ applies then M|=(x:φ ⊓ ψ), i.e. [ x]] ∈ [ φ ⊓ ψ]]
so [ x]] ∈ [ φ]] and [ x]] ∈ [ ψ]], thus M|=(x:φ) and M|=(x:ψ).
T⊔ applies then M|=(x:φ ⊔ ψ), i.e [ x]] ∈ [ φ ⊔ ψ]]
so [ x]] ∈ [ φ]] or [ x]] ∈ [ ψ]], thus M|=(x:φ) or M|=(x:ψ),
wlog. M|=(x:φ).
T∀ applies then M|=(x:∀R.φ) and M|=x R y, i.e. [ x]] ∈ [ ∀R.φ]] and ⟨x, y⟩ ∈ [ R]], so
[ y]] ∈ [ φ]]
T∃ applies then M|=(x:∃R.φ), i.e [ x]] ∈ [ ∃R.φ]],
so there is a v ∈ D with ⟨ [ x]] , v⟩ ∈ [ R]] and v ∈ [ φ]].
We define [ y]] := v, then M|=x R y and M|=(y:φ)

Michael Kohlhase: Artificial Intelligence 1 526 2025-02-06

For the completeness result for TALC we have to start with an open tableau branch and construct a
model that satisfies all judgments in the branch. We proceed by building a Herbrand model, whose
domain consists of all the individuals mentioned in the branch and which interprets all concepts
and roles as specified in the branch. Not surprisingly, the model thus constructed satisfies (all
judgments on) the branch.

Completeness of the Tableau Calculus


 Lemma 16.3.37. Open saturated tableau branches for φ induce models for φ.
 Proof: construct a model for the branch and verify for φ
1. Let B be an open, saturated branch

 we define

D : = {x | x:ψ ∈ B or z R x ∈ B}
[ c]] : = {x | x:c ∈ B}
[ R]] : = {⟨x, y⟩ | x R y ∈ B}

well-defined since never x:c, x:¬c ∈ B (otherwise T⊥ applies)

M satisfies all assertions x:c, x:¬c, and x R y, (by construction)
2. M|=(y:ψ), for all y:ψ ∈ B (on k = size(ψ) next slide)
3. M|=(x:φ).

Michael Kohlhase: Artificial Intelligence 1 527 2025-02-06

We complete the proof by looking at all the TALC inference rules in turn.

Case Analysis for Induction

case y:ψ = y:ψ 1 ⊓ ψ 2 Then {y:ψ 1 , y:ψ 2 } ⊆ B (T⊓ -rule, saturation)


so M|=(y:ψ 1 ) and M|=(y:ψ 2 ) and M|=(y:ψ 1 ⊓ ψ 2 ) (IH, Definition)
case y:ψ = y:ψ 1 ⊔ ψ 2 Then y:ψ 1 ∈ B or y:ψ 2 ∈ B (T⊔ , saturation)
so M|=(y:ψ 1 ) or M|=(y:ψ 2 ) and M|=(y:ψ 1 ⊔ ψ 2 ) (IH, Definition)

case y:ψ = y:∃R.θ then {y R z, z:θ} ⊆ B (z new variable) (T∃ -rules, saturation)
so M|=(z:θ) and M|=y R z, thus M|=(y:∃R.θ). (IH, Definition)
case y:ψ = y:∀R.θ Let ⟨[[y]], v⟩ ∈ [[R]] for some v ∈ D
then v = z for some variable z with y R z ∈ B (construction of [ R]])
So z:θ ∈ B and M|=(z:θ). (T∀ -rule, saturation, Def)
As v was arbitrary we have M|=(y:∀R.θ).

Michael Kohlhase: Artificial Intelligence 1 528 2025-02-06

Termination
 Theorem 16.3.38. TALC terminates.
 To prove termination of a tableau algorithm, find a well-founded measure (function)
that is decreased by all rules
   T⊥: from x:c and x:¬c derive ⊥
   T⊓: from x:φ ⊓ ψ derive x:φ and x:ψ
   T⊔: from x:φ ⊔ ψ branch into x:φ and x:ψ
   T∀: from x:∀R.φ and x R y derive y:φ
   T∃: from x:∃R.φ derive x R y and y:φ (y new)

 Proof: Sketch (full proof very technical)


1. Any rule except T∀ can only be applied once to x:ψ.
2. Rule T∀ is applicable to x:∀R.ψ at most as often as there are R-successors of x.
(those y with x R y above)

3. The R-successors are generated by x:∃R.θ above, (number bounded by size of


input formula)
4. Every rule application to x:ψ generates constraints z:ψ ′ , where ψ ′ a proper
sub-formula of ψ.

Michael Kohlhase: Artificial Intelligence 1 529 2025-02-06

We can turn the termination result into a worst-case complexity result by examining the sizes of
branches.

Complexity of TALC
 Idea: Work off tableau branches one after the other. (branch size ≙ space
complexity)

 Observation 16.3.39. The size of the branches is polynomial in the size of the
input formula:

branchsize = #(input formulae) + #(∃-formulae) · #(∀-formulae)

 Proof sketch: Re-examine the termination proof and count: the first summand
comes from ??, the second one from ?? and ??
 Theorem 16.3.40. The satisfiability problem for ALC is in PSPACE.
 Theorem 16.3.41. The satisfiability problem for ALC is PSPACE-Complete.

 Proof sketch: Reduce a PSPACE-complete problem to ALC-satisfiability


 Theorem 16.3.42 (Time Complexity). The ALC satisfiability problem is in
EXPTIME.
 Proof sketch: There can be exponentially many branches (already for PL0 )

Michael Kohlhase: Artificial Intelligence 1 530 2025-02-06

In summary, the theoretical complexity of ALC is the same as that for PL0 , but in practice ALC is
much more expressive. So this is a clear win.
But the description of the tableau algorithm TALC is still quite abstract, so we look at an exemplary
implementation in a functional programming language.

The functional Algorithm for ALC

 Observation: (leads to a better treatment for ∃)


 the T∃ -rule generates the constraints x R y and y:ψ from x:∃R.ψ
 this triggers the T∀ -rule for x:∀R.θi , which generate y:θ1 , . . . , y:θn
 for y we have y:ψ and y:θ1 , . . . , y:θn . (do all of this in a single step)
 we are only interested in non-emptiness, not in particular witnesses (leave them
out)
 Definition 16.3.43. The functional algorithm for TALC is

consistent(S) =
  if {c, ¬c} ⊆ S then false
  elif ‘φ ⊓ ψ’ ∈ S and (‘φ’ ̸∈ S or ‘ψ’ ̸∈ S)
    then consistent(S ∪ {φ, ψ})
  elif ‘φ ⊔ ψ’ ∈ S and {φ, ψ} ∩ S = ∅
    then consistent(S ∪ {φ}) or consistent(S ∪ {ψ})
  elif forall ‘∃R.ψ’ ∈ S
    consistent({ψ} ∪ {θ | ‘∀R.θ’ ∈ S})
  else true

 Relatively simple to implement. (good implementations optimized)


 But: This is restricted to ALC. (extension to other DL difficult)
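
The following Python sketch is one possible rendering of this functional algorithm, again using the tuple representation of concepts from above (an assumption for illustration); it decides satisfiability of a set of normalized ALC concepts in negation normal form.

def consistent(S: frozenset) -> bool:
    """Functional TALC satisfiability test for a set S of ALC concepts in NNF."""
    # clash: some primitive concept occurs together with its complement
    if any(("not", c) in S for c in S if c[0] == "atom"):
        return False
    for c in S:
        if c[0] == "and" and not (c[1] in S and c[2] in S):
            return consistent(S | {c[1], c[2]})
    for c in S:
        if c[0] == "or" and c[1] not in S and c[2] not in S:
            return consistent(S | {c[1]}) or consistent(S | {c[2]})
    # for every ∃R.ψ, test its implied role successor against all ∀R.θ constraints
    return all(
        consistent(frozenset([c[2]]) | {d[2] for d in S if d[0] == "forall" and d[1] == c[1]})
        for c in S if c[0] == "exists"
    )

# person ⊓ ∃has_child.¬person ⊓ ∀has_child.person is inconsistent:
S = frozenset([("atom", "person"),
               ("exists", "has_child", ("not", ("atom", "person"))),
               ("forall", "has_child", ("atom", "person"))])
print(consistent(S))   # False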

Michael Kohlhase: Artificial Intelligence 1 531 2025-02-06

Note that we have (so far) only considered an empty TBox: we have initialized the tableau
with a normalized concept; so we did not need to include the concept definitions. To cover “real”
ontologies, we need to consider the case of concept axioms as well.
We now extend TALC with concept axioms. The key idea here is to realize that the concept axioms
apply to all individuals. As the individuals are generated by the T∃ rule, we can simply extend
that rule to apply all the concept axioms to the newly introduced individual.

Extending the Tableau Algorithm by Concept Axioms


 concept axioms, e.g. child ⊑ son ⊔ daughter cannot be handled in TALC yet.

 Idea: Whenever a new variable y is introduced (by T∃ -rule) add the information
that axioms hold for y.
 Initialize tableau with {x:φ} ∪ CA (CA : = set of concept axioms)
 New rule for ∃ (instead of T∃), where CA = {α1 , . . ., αn }:

   T∃CA: from x:∃R.φ derive x R y, y:φ, y:α1 , . . . , y:αn   (y new)

 Problem: CA := {∃R.c} and start tableau with x:d (non-termination)

Michael Kohlhase: Artificial Intelligence 1 532 2025-02-06

The problem of this approach is that it spoils termination, since we cannot control the number of
rule applications by (fixed) properties of the input formulae. The example shows this very nicely.
We only sketch a path towards a solution.

Non-Termination of TALC with Concept Axioms

 Problem: CA := {∃R.c} and start tableau with x:d. (non-termination)



   x:d         start
   x:∃R.c      in CA
   x R y1      T∃
   y1:c        T∃
   y1:∃R.c     T∃CA
   y1 R y2     T∃
   y2:c        T∃
   y2:∃R.c     T∃CA
   ...

 Solution: Loop-Check:
 Instead of a new variable y take an old variable z, if we can guarantee that
   whatever holds for y already holds for z.
 We can only do this, iff the T∀-rule has been exhaustively applied.

 Theorem 16.3.44. The consistency problem of ALC with concept axioms is decid-
able.
Proof sketch: TALC with a suitable loop check terminates.

Michael Kohlhase: Artificial Intelligence 1 533 2025-02-06

16.3.3 ABoxes, Instance Testing, and ALC


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/27303.
Now that we have a decision procedure for ALC with concept axioms, we can take the final step to
the general case of inference in description logics: we add an ABox with assertional axioms that
describe the individuals.
We will now extend the description logic ALC with assertions that can express concept member-
ship.

 Instance Test: Concept Membership


 Definition 16.3.45. An instance test computes whether given an ALC ontology an
individual is a member of a given concept.
 Example 16.3.46 (An Ontology).

   TBox (terminological Box)       ABox (assertional Box, data base)
   woman = person ⊓ ¬has_Y         tony:person   Tony is a person
   man   = person ⊓ has_Y          tony:has_Y    Tony has a y-chromosome

This entails: tony:man (Tony is a man).


 Problem: Can we compute this?

Michael Kohlhase: Artificial Intelligence 1 534 2025-02-06

If we combine classification with the instance test, then we get the full picture of how concepts
and individuals relate to each other. We see that we get the full expressivity of semantic networks
in ALC.

Realization
 Definition 16.3.47. Realization is the computation of all instance relations be-
tween ABox objects and TBox concepts.
 Observation: It is sufficient to remember the lowest concepts in the subsumption
graph. (rest by subsumption)

object

person

man woman student professor child

male_student female_student girl boy

Tony Terry Timmy

 Example 16.3.48. If tony:male_student is known, we do not need tony:man.

Michael Kohlhase: Artificial Intelligence 1 535 2025-02-06

Let us now get an intuition on what kinds of interactions exist between the various parts of an ontology.

ABox Inference in ALC: Phenomena


 There are different kinds of interactions between TBox and ABox in ALC and in
description logics in general.
 Example 16.3.49.

   property                             example
   internally inconsistent ABox         ABox: tony:student, tony:¬student
   inconsistent with a TBox             TBox: ¬(student ⊓ prof)
                                        ABox: tony:student, tony:prof
   implicit info that is not explicit   ABox: tony:∀has_grad.genius, tony has_grad mary
                                        |= mary:genius
   information that can be combined     TBox: happy_prof = prof ⊓ ∀has_grad.genius
   with TBox info                       ABox: tony:happy_prof, tony has_grad mary
                                        |= mary:genius

Michael Kohlhase: Artificial Intelligence 1 536 2025-02-06

Again, we ask ourselves whether all of these are computable.


Fortunately, it is very simple to add assertions to TALC . In fact, we do not have to change anything,
as the judgments used in the tableau are already of the form of ABox assertion.

Tableau-based Instance Test and Realization


 Query: Do the ABox and TBox together entail a:φ? (a ∈ φ?)

 Algorithm: Test a:φ for consistency with ABox and TBox. (use our tableau
algorithm)
 Necessary changes: (no big deal)
 Normalize ABox wrt. TBox. (definition expansion)
 Initialize the tableau with ABox in NNF. (so it can be used)

 Example 16.3.50. Add mary:¬genius to determine whether ABox, TBox |= mary:genius.

   TBox: happy_prof = prof ⊓ ∀has_grad.genius
   ABox: tony:happy_prof, tony has_grad mary

   tony:prof ⊓ ∀has_grad.genius    TBox
   tony has_grad mary              ABox
   mary:¬genius                    Query
   tony:prof                       T⊓
   tony:∀has_grad.genius           T⊓
   mary:genius                     T∀
   ⊥                               T⊥

 Note: The instance test is the base for realization. (remember?)

 Idea: Extend to more complex ABox queries. (e.g. give me all instances of φ)

Michael Kohlhase: Artificial Intelligence 1 537 2025-02-06

This completes our investigation of inference for ALC. We summarize that ALC is a logic-based on-
tology language where the inference problems are all decidable/computable via TALC . But of course,
while we have reached the expressivity of basic semantic networks, there are still things that we
cannot express in ALC, so we will try to extend ALC without losing decidability/computability.

16.4 Description Logics and the Semantic Web


A Video Nugget covering this section can be found at https://fau.tv/clip/id/27289.
In this section we discuss how we can apply description logics in the real world, in particular,
as a conceptual and algorithmic basis of the semantic web, which tries to transform the World
Wide Web from a human-understandable web of multimedia documents into a “web of machine-
understandable data”. In this context, “machine-understandable” means that machines can draw
inferences from data they have access to. Note that the discussion in this digression is not a
full-blown introduction to RDF and OWL, we leave that to [SR14; Her+13a; Hit+12] and the
respective W3C recommendations. Instead we introduce the ideas behind the mappings from a
perspective of the description logics we have discussed above.
The most important component of the semantic web is a standardized language that can represent
“data” about information on the Web in a machine-oriented way.

Resource Description Framework


 Definition 16.4.1. The Resource Description Framework (RDF) is a framework for
describing resources on the web. It is an XML vocabulary developed by the W3C.

 Note: RDF is designed to be read and understood by computers, not to be


displayed to people. (it shows)
 Example 16.4.2. RDF can be used for describing (all “objects on the WWW”)
 properties for shopping items, such as price and availability
 time schedules for web events
 information about web pages (content, author, created and modified date)
 content and rating for web pictures
 content for search engines
 electronic libraries

Michael Kohlhase: Artificial Intelligence 1 538 2025-02-06

Note that all these examples have in common that they are about “objects on the Web”, which is
an aspect we will come to now.
“Objects on the Web” are traditionally called “resources”, rather than defining them by their
intrinsic properties – which would be ambitious and prone to change – we take an external property
to define them: everything that has a URI is a web resource. This has repercussions on the design
of RDF.

Resources and URIs


 RDF describes resources with properties and property values.
 RDF uses Web identifiers (URIs) to identify resources.

 Definition 16.4.3. A resource is anything that can have a URI, such as
http://www.fau.de.
 Definition 16.4.4. A property is a resource that has a name, such as author
or homepage, and a property value is the value of a property, such as Michael
Kohlhase or http://kwarc.info/kohlhase. (a property value can be another
resource)
 Definition 16.4.5. A RDF statement s (also known as a triple) consists of a
resource (the subject of s), a property (the predicate of s), and a property value
(the object of s). A set of RDF triples is called an RDF graph.

 Example 16.4.6. Statements: [This slide]subj has been [author]pred ed by [Michael


Kohlhase]obj

Michael Kohlhase: Artificial Intelligence 1 539 2025-02-06

The crucial observation here is that if we map “subjects” and “objects” to “individuals”, and
“predicates” to “relations”, the RDF triples are just relational ABox statements of description
logics. As a consequence, the techniques we developed apply.
Note: Actually, a RDF graph is technically a labeled multigraph, which allows multiple edges
between any two nodes (the resources) and where nodes and edges are labeled by URIs.
We now come to the concrete syntax of RDF. This is a relatively conventional XML syntax that
combines RDF statements with a common subject into a single “description” of that resource.

XML Syntax for RDF


 RDF is a concrete XML vocabulary for writing statements
 Example 16.4.7. The following RDF document could describe the slides as a
resource
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22−rdf−syntax−ns#"
xmlns:dc= "http://purl.org/dc/elements/1.1/">
<rdf:Description about="https://.../CompLog/kr/en/rdf.tex">
<dc:creator>Michael Kohlhase</dc:creator>
<dc:source>http://www.w3schools.com/rdf</dc:source>
</rdf:Description>
</rdf:RDF>

This RDF document makes two statements:


 The subject of both is given in the about attribute of the rdf:Description element
 The predicates are given by the element names of its children
 The objects are given in the elements as URIs or literal content.
 Intuitively: RDF is a web-scalable way to write down ABox information.
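
For processing such documents programmatically, an RDF library like rdflib (assuming it is available) can parse the XML serialization into a set of triples; the sketch below is illustrative, and the file name is hypothetical.

from rdflib import Graph

g = Graph()
g.parse("slides-metadata.rdf", format="xml")   # hypothetical file containing the RDF/XML above

# iterate over the ABox-like (subject, predicate, object) triples
for subj, pred, obj in g:
    print(subj, pred, obj)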

Michael Kohlhase: Artificial Intelligence 1 540 2025-02-06

Note that XML namespaces play a crucial role in using elements to encode the predicate URIs.
Recall that an element name is a qualified name that consists of a namespace URI and a proper
element name (without a colon character). Concatenating them gives a URI; in our example the
predicate URI induced by the dc:creator element is http://purl.org/dc/elements/1.1/creator.
Note that as URIs go RDF URIs do not have to be URLs, but this one is and it references (is
redirected to) the relevant part of the Dublin Core elements specification [DCM12].
RDF was deliberately designed as a standoff markup format, where URIs are used to annotate
web resources by pointing to them, so that it can be used to give information about web resources
without having to change them. But this also creates maintenance problems, since web resources
may change or be deleted without warning.
RDFa gives authors a way to embed RDF triples into web resources and make keeping RDF
statements about them more in sync.

RDFa as an Inline RDF Markup Format

 Problem: RDF is a standoff markup format (annotate by URIs pointing into


other files)
Definition 16.4.8. RDFa (RDF annotations) is a markup scheme for inline anno-
tation (as XML attributes) of RDF triples.

 Example 16.4.9.
<div xmlns:dc="http://purl.org/dc/elements/1.1/" id="address">
<h2 about="#address" property="dc:title">RDF as an Inline RDF Markup Format</h2>
<h3 about="#address" property="dc:creator">Michael Kohlhase</h3>
<em about="#address" property="dc:date" datatype="xsd:date"
content="2009−11−11">November 11., 2009</em>
</div>

[Figure: the induced RDF graph – the resource https://svn.kwarc.info/.../CompLog/kr/slides/rdfa.tex
has http://purl.org/dc/elements/1.1/title “RDF as an Inline RDF Markup Format”,
http://purl.org/dc/elements/1.1/date 2009−11−11 (xsd:date), and
http://purl.org/dc/elements/1.1/creator Michael Kohlhase.]

Michael Kohlhase: Artificial Intelligence 1 541 2025-02-06

In the example above, the about and property attributes are reserved by RDFa and specify the
subject and predicate of the RDF statement. The object consists of the body of the element,
unless otherwise specified e.g. by the content and datatype attributes for literals content.
Let us now come back to the fact that RDF is just an XML syntax for ABox statements.

RDF as an ABox Language for the Semantic Web


 Idea: RDF triples are ABox entries h R s or h:φ.
 Example 16.4.10. h is the resource for Ian Horrocks, s is the resource for Ulrike
Sattler, R is the relation “hasColleague”, and φ is the class foaf:Person
<rdf:Description about="some.uri/person/ian_horrocks">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<hasColleague resource="some.uri/person/uli_sattler"/>
</rdf:Description>
 Idea: Now, we need a similar language for TBoxes (based on ALC)

Michael Kohlhase: Artificial Intelligence 1 542 2025-02-06

In this situation, we want a standardized representation language for TBox information; OWL
does just that: it standardizes a set of knowledge representation primitives and specifies a variety
of concrete syntaxes for them. OWL is designed to be compatible with RDF, so that the two
together can form an ontology language for the web.

OWL as an Ontology Language for the Semantic Web


 Task: Complement RDF (ABox) with a TBox language.
 Idea: Make use of resources that are values in rdf:type. (called Classes)
 Definition 16.4.11. OWL (the ontology web language) is a language for encoding
TBox information about RDF classes.

 Example 16.4.12 (A concept definition for “Mother”). Mother=Woman ⊓


Parent is represented as

   XML Syntax:
     <EquivalentClasses>
       <Class IRI="Mother"/>
       <ObjectIntersectionOf>
         <Class IRI="Woman"/>
         <Class IRI="Parent"/>
       </ObjectIntersectionOf>
     </EquivalentClasses>

   Functional Syntax:
     EquivalentClasses(
       :Mother
       ObjectIntersectionOf(
         :Woman
         :Parent
       )
     )

Michael Kohlhase: Artificial Intelligence 1 543 2025-02-06

But there are also other syntaxes in regular use. We show the functional syntax which is inspired
by the mathematical notation of relations.

Extended OWL Example in Functional Syntax

 Example 16.4.13. The semantic network from ?? can be expressed in OWL (in
functional syntax)

[Figure: the semantic network from ?? – robin isa bird, bird has_part wings; Jack inst robin,
John and Mary inst person, John owner_of Jack, John loves Mary.]

ClassAssertion (:Jack :robin)


ClassAssertion(:John :person)
ClassAssertion (:Mary :person)
ObjectPropertyAssertion(:loves :John :Mary)
ObjectPropertyAssertion(:owner :John :Jack)
SubClassOf(:robin :bird)
SubClassOf (:bird ObjectSomeValuesFrom(:hasPart :wing))

 ClassAssertion formalizes the “inst” relation,


 ObjectPropertyAssertion formalizes relations,
 SubClassOf formalizes the “isa” relation,
 for the “has_part” relation, we have to specify that all birds have a part that
is a wing or equivalently the class of birds is a subclass of all objects that
have some wing.

Michael Kohlhase: Artificial Intelligence 1 544 2025-02-06

We have introduced the ideas behind using description logics as the basis of a “machine-oriented
web of data”. While the first OWL specification (2004) had three sublanguages “OWL Lite”, “OWL
DL” and “OWL Full”, of which only the middle was based on description logics, with the OWL2
Recommendation from 2009, the foundation in description logics was nearly universally accepted.
The semantic web hype is by now nearly over; the technology has reached the “plateau of
productivity” with many applications being pursued in academia and industry. We will not go
into these, but briefly introduce one of the tools that make this work.

SPARQL an RDF Query language


 Definition 16.4.14. SPARQL, the “SPARQL Protocol and RDF Query Language”
is an RDF query language, able to retrieve and manipulate data stored in RDF.
The SPARQL language was standardized by the World Wide Web Consortium in
2008 [PS08].
 SPARQL is pronounced like the word “sparkle”.
 Definition 16.4.15. A system is called a SPARQL endpoint, iff it answers SPARQL
queries.
 Example 16.4.16. Query for person names and their e-mails from a triplestore
with FOAF data.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
?person a foaf:Person.
?person foaf:name ?name.
?person foaf:mbox ?email.
}
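
To run such a query from a program, one can post it to a SPARQL endpoint; the following Python sketch uses the SPARQLWrapper package (assuming it is installed) against a hypothetical endpoint URL.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")   # hypothetical SPARQL endpoint
sparql.setQuery("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?email
    WHERE { ?person a foaf:Person. ?person foaf:name ?name. ?person foaf:mbox ?email. }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["email"]["value"])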

Michael Kohlhase: Artificial Intelligence 1 545 2025-02-06

SPARQL end-points can be used to build interesting applications, if fed with the appropriate data.
An interesting – and by now paradigmatic – example is the DBPedia project, which builds a large
ontology by analyzing Wikipedia fact boxes. These are in a standard HTML form which can be
analyzed e.g. by regular expressions, and their entries are essentially already in triple form: The
subject is the Wikipedia page they are on, the predicate is the key, and the object is either the
URI on the object value (if it carries a link) or the value itself.

SPARQL Applications: DBPedia


 Typical Application: DBPedia screen-scrapes
Wikipedia fact boxes for RDF triples and uses SPARQL
for querying the induced triplestore.
 Example 16.4.17 (DBPedia Query). People who
were born in Erlangen before 1900
(http://dbpedia.org/snorql)
SELECT ?name ?birth ?death ?person WHERE {
?person dbo:birthPlace :Erlangen .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900−01−01"^^xsd:date) .
}
ORDER BY ?name

 The answers include Emmy Noether and Georg Simon


Ohm.

Michael Kohlhase: Artificial Intelligence 1 546 2025-02-06

A more complex DBPedia Query


 Demo: DBPedia http://dbpedia.org/snorql/
Query: Soccer players born in a country with more than 10 M inhabitants, who play
as goalie in a club that has a stadium with more than 30.000 seats.
Answer: computed by DBPedia from a SPARQL query

Michael Kohlhase: Artificial Intelligence 1 547 2025-02-06

We conclude our survey of the semantic web technology stack with the notion of a triplestore,
which refers to the database component, which stores vast collections of ABox triples.

Triple Stores: the Semantic Web Databases


 Definition 16.4.18. A triplestore or RDF store is a purpose-built database for
the storage of RDF graphs and the retrieval of RDF triples, usually through variants of
SPARQL.

 Common triplestores include


 Virtuoso: https://virtuoso.openlinksw.com/ (used in DBpedia)
 GraphDB: http://graphdb.ontotext.com/ (often used in WissKI)
 blazegraph: https://blazegraph.com/ (open source; used in WikiData)

 Definition 16.4.19. A description logic reasoner implements reasoning services
based on a satisfiability test for description logics.
 Common description logic reasoners include
 FACT++: http://owl.man.ac.uk/factplusplus/
 HermiT: http://www.hermit-reasoner.com/

 Intuition: Triplestores concentrate on querying very large ABoxes with partial


consideration of the TBox, while DL reasoners concentrate on the full set of ontology
inference services, but fail on large ABoxes.

Michael Kohlhase: Artificial Intelligence 1 548 2025-02-06


Part IV

Planning & Acting


This part covers the AI subfield of “planning”, i.e. search-based problem solving with a structured
representation language for environment state and actions — in planning, the focus is on the latter.
We first introduce the framework of planning (structured representation languages for problems
and actions) and then present algorithms and complexity results. Finally, we lift some of the
simplifying assumptions – deterministic, fully observable environments – we made in the previous
parts of the course.
Chapter 17

Planning I: Framework

Reminder: Classical Search Problems


 Example 17.0.1 (Solitaire as a Search Problem).

 States: Card positions (e.g. position_Jspades=Qhearts).


 Actions: Card moves (e.g. move_Jspades_Qhearts_freecell4).
 Initial state: Start configuration.
 Goal states: All cards “home”.
 Solutions: Card moves solving this game.

Michael Kohlhase: Artificial Intelligence 1 549 2025-02-06

Planning
 Ambition: Write one program that can solve all classical search problems.
 Idea: For CSP, going from “state/action-level search” to “problem-description level
search” did the trick.
 Definition 17.0.2. Let Π be a search problem (see ??)

 The blackbox description of Π is an API providing functionality allowing to


construct the state space: InitialState(), GoalTest(s), . . .


 “Specifying the problem” ≙ programming the API.
 The declarative description of Π comes in a problem description language. This
allows to implement the API, and much more.
 “Specifying the problem” ≙ writing a problem description.

 Here, “problem description language” ≙ planning language. (up next)
 But Wait: Didn’t we do this already in the last chapter with logics? (For the
Wumpus?)

Michael Kohlhase: Artificial Intelligence 1 550 2025-02-06

17.1 Logic-Based Planning


Before we go into the planning framework and its particular methods, let us see what we would
do with the methods from ?? if we were to develop a “logic-based language” for describing states
and actions. We will use the Wumpus world from ?? as a running example.

Fluents: Time-Dependent Knowledge in Planning


 Recall from ??: We can represent the Wumpus rules in logical systems.
(propositional/first-order/ALC)
 Use inference systems to deduce new world knowledge from percepts and actions.
 Problem: Representing (changing) percepts immediately leads to contradictions!
 Example 17.1.1. If the agent moves and a cell with a draft (a perceived breeze)
is followed by one without.
 Obvious Idea: Make representations of percepts time-dependent
 Example 17.1.2. Dt for t ∈ N for PL0 and draft(t) in PL1 and PLnq .
 Definition 17.1.3. We use the word fluent to refer to an aspect of the world that
changes, all others we call atemporal.

Michael Kohlhase: Artificial Intelligence 1 551 2025-02-06

Let us recall the agent-based setting we were using for the inference procedures from ??. We will
elaborate this further in this section.

Recap: Logic-Based Agents


 Recall: A model-based agent uses inference to model the environment, percepts,
and actions.
[Figure 2.11: a model-based reflex agent – sensors feed “what the world is like now”, which together
with “how the world evolves”, “what my actions do”, and condition-action rules determines “what
action I should do now” for the actuators.]

function KB−AGENT (percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))
  action := ASK(KB, MAKE−ACTION−QUERY(t))
  TELL(KB, MAKE−ACTION−SENTENCE(action,t))
  t := t+1
  return action

 Still Unspecified: (up next)
 MAKE−PERCEPT−SENTENCE: the effects of percepts.
 MAKE−ACTION−QUERY: what is the best next action?
 MAKE−ACTION−SENTENCE: the effects of that action.

In particular, we will look at the effect of time/change. (neglected so far)

Michael Kohlhase: Artificial Intelligence 1 552 2025-02-06
Now that we have the notion of fluents to represent the percepts at a given time point, let us try
to model how they influence the agent’s world model.

Fluents: Modeling the Agent’s Sensors


 Idea: Relate percept fluents to atemporal cell attributes.
 Example 17.1.4. E.g., if the agent perceives a draft at time t, when it is in cell
[x, y], then there must be a breeze there:

∀t, x, y.Ag@(t, x, y) ⇒ (draft(t) ⇔ breeze(x, y))

 Axioms like these model the agent’s sensors – here that they are totally reliable:
there is a breeze, iff the agent feels a draft.
 Definition 17.1.5. We call axioms that describe the agent’s sensors sensor axioms.

 Problem: Where do fluents like Ag@(t, x, y) come from?

Michael Kohlhase: Artificial Intelligence 1 553 2025-02-06

You may have noticed that for the sensor axioms we have only used first-order logic. There is a
general story to tell here: If we have finite domains (as we do in the Wumpus cave) we can always
“compile first-order logic into propositional logic”; if domains are infinite, we usually cannot.
We will develop this here before we go on with the Wumpus models.

Digression: Fluents and Finite Temporal Domains


 Observation: Fluents like ∀t, x, y.Ag@(t, x, y) ⇒ (draft(t) ⇔ breeze(x, y)) from
?? are best represented in first-order logic. In PL0 and PLnq we would have to use
concrete instances like Ag@(7, 2, 1) ⇒ (draft(7) ⇔ breeze(2, 1)) for all suitable t,
x, and y.
 Problem: Unless we restrict ourselves to finite domains and an end time tend
we have infinitely many axioms. Even then, formalization in PL0 and PLnq is very
tedious.

 Solution: Formalize in first-order logic and then compile down:


1. enumerate ranges of bound variables, instantiate body, (; PLnq )
2. translate PLnq atoms to propositional variables. (; PL0 )
 In Practice: The choice of domain, end time, and logic is up to agent designer,
weighing expressivity vs. efficiency of inference.

 WLOG: We will use PL1 in the following. (easier to read)
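
A minimal Python sketch of this compilation step might look as follows; the choice of fluent, the domain sizes, and the naming scheme for the propositional variables are illustrative assumptions.

from itertools import product

# Ground the sensor axiom  ∀t,x,y. Ag@(t,x,y) ⇒ (draft(t) ⇔ breeze(x,y))
# over finite ranges for t, x, y; each ground atom becomes a propositional variable.
def ground_sensor_axiom(t_end: int, width: int, height: int) -> list[str]:
    axioms = []
    for t, x, y in product(range(t_end + 1), range(1, width + 1), range(1, height + 1)):
        axioms.append(f"Ag_at_{t}_{x}_{y} => (draft_{t} <=> breeze_{x}_{y})")
    return axioms

print(len(ground_sensor_axiom(10, 4, 4)))   # 11 * 16 = 176 propositional axioms
print(ground_sensor_axiom(10, 4, 4)[0])     # Ag_at_0_1_1 => (draft_0 <=> breeze_1_1)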

Michael Kohlhase: Artificial Intelligence 1 554 2025-02-06

We now continue with our logic-based agent models: next we focus on effect axioms to model the
effects of an agent’s actions.

Fluents: Effect Axioms for the Transition Model


 Problem: Where do fluents like Ag@(t, x, y) come from?
 Thus: We also need fluents to keep track of the agent’s actions. (The transition
model of the underlying search problem).
 Idea: We also use fluents for the representation of actions.

 Example 17.1.6. The action of “going forward” at time t is captured by the fluent
forw(t).
 Definition 17.1.7. Effect axioms describe how the environment changes under an
agent’s actions.
 Example 17.1.8. If the agent is in cell [1, 1] facing east at time 0 and goes forward,
she is in cell [2, 1] and no longer in [1, 1]:

Ag@(0, 1, 1) ∧ faceeast(0) ∧ forw(0) ⇒ Ag@(1, 2, 1) ∧ ¬Ag@(1, 1, 1)

Generally: (barring exceptions for domain border cells)

∀t, x, y.Ag@(t, x, y)∧faceeast(t)∧forw(t)⇒Ag@(t+1, x+1, y)∧¬Ag@(t+1, x, y)

This compiles down to 16 · tend PLnq /PL0 axioms.

Michael Kohlhase: Artificial Intelligence 1 555 2025-02-06

Unfortunately, the percept fluents, sensor axioms, and effect axioms are not enough, as we will
show in ??. We will see that this is a more general problem – the famous frame problem that
needs to be considered whenever we deal with change in environments.

Frames and Frame Axioms


 Problem: Effect axioms are not enough.
 Example 17.1.9. Say that the agent has an arrow at time 0, and then moves
forward into [2, 1], perceives a glitter, and knows that the Wumpus is ahead.
To evaluate the action shoot(1) the corresponding effect axiom needs to know
havarrow(1), but cannot prove it from havarrow(0).
Problem: The information of having an arrow has been lost in the move forward.
 Definition 17.1.10. The frame problem describes that for a representation of
actions we need to formalize their effects on the aspects they change, but also their
non-effect on the static frame of reference.
 Partial Solution: (there are many many more; some better)
Frame axioms formalize that particular fluents are invariant under a given action.
 Problem: For an agent with n actions and an environment with m fluents, we
need O(nm) frame axioms.
Representing and reasoning with them easily drowns out the sensor and transition
models.

Michael Kohlhase: Artificial Intelligence 1 556 2025-02-06

We conclude our discussion with a relatively complete implementation of a logic-based Wumpus


agent, building on the schema from slide 552.

A Hybrid Agent for the Wumpus World


 Example 17.1.11 (A Hybrid Agent). This agent uses
 logic inference for sensor and transition modeling,
 special code and A∗ for action selection & route planning.

function HYBRID−WUMPUS−AGENT(percept) returns an action


inputs: percept, a list, [stench,breeze,glitter,bump,scream]
persistent: KB, a knowledge base, initially the atemporal
"wumpus physics"
t, a counter, initially 0, indicating time
plan, an action sequence, initially empty
TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))

then some special code for action selection, and then (up next)
action := POP(plan)
TELL(KB, MAKE−ACTION−SENTENCE(action,t))
t := t + 1
return action

So far, not much new over our original version.

Michael Kohlhase: Artificial Intelligence 1 557 2025-02-06



Now look at the “special code” we have promised.

A Hybrid Agent: Custom Action Selection


 Example 17.1.12 (A Hybrid Agent (continued)). So that we can plan the best
strategy:
TELL(KB, the temporal "physics" sentences for time t)
saf e := {[x, y] | ASK(KB,OK(t, x, y))=T}
if ASK(KB,glitter(t)) = T then
plan := [grab] + PLAN−ROUTE(current,{[1, 1]},saf e) + [exit]
if plan is empty then
unvisited := {[x, y] | ASK(KB,Ag@(t′ , x, y))=F for all t′ ≤ t}
plan := PLAN−ROUTE(current,unvisited ∪ saf e,saf e)
if plan is empty and ASK(KB,havarrow(t)) = T then
possible_wumpus := {[x, y] | ASK(KB,¬wumpus(t, x, y)) = F}
plan := PLAN−SHOT(current,possible_wumpus,saf e)
if plan is empty then // no choice but to take a risk
not_unsaf e := {[x, y] | ASK(KB,¬OK(t, x, y)) = F}
plan := PLAN−ROUTE(current,unvisited ∪ not_unsaf e,saf e)
if plan is empty then
plan := PLAN−ROUTE(current,{[1, 1]},saf e) + [exit]

Note that OK, wumpus, and glitter are fluents, since the Wumpus might have died
or the gold might have been grabbed.

Michael Kohlhase: Artificial Intelligence 1 558 2025-02-06

And finally the route planning part of the code. This is essentially just A∗ search.

A Hybrid Agent: Custom Action Selection

 Example 17.1.13 (Action Selection). And the code for PLAN−ROUTE


(PLAN−SHOT similar)
function PLAN−ROUTE(curr,goals,allowed) returns an action sequence
inputs: curr, the agent’s current position
goals, a set of squares;
try to plan a route to one of them
allowed, a set of squares that can form part of the route
problem := ROUTE−PROBLEM(curr,goals,allowed)
return A∗ (problem)

 Evaluation: Even though this works for the Wumpus world, it is not the “universal,
logic-based problem solver” we dreamed of!

 Planning tries to solve this with another representation of actions. (up next)

Michael Kohlhase: Artificial Intelligence 1 559 2025-02-06

17.2 Planning: Introduction


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26892.

How does a planning language describe a problem?


 Definition 17.2.1. A planning language is a way of describing the components of
a search problem via formulae of a logical system. In particular the
 states (vs. blackbox: data structures). (E.g.: predicate Eq(., .).)
 initial state I (vs. data structures). (E.g.: Eq(x, 1).)
 goal states G (vs. a goal test). (E.g.: Eq(x, 2).)
 set A of actions in terms of preconditions and effects (vs. functions returning
applicable actions and successor states). (E.g.: “increment x: pre Eq(x, 1), eff
Eq(x, 2) ∧ ¬Eq(x, 1)”.)
A logical description of all of these is called a planning task.

 Definition 17.2.2. Solution (plan) =̂ sequence of actions from A, transforming I
into a state that satisfies G. (E.g.: “increment x”.)
The process of finding a plan given a planning task is called planning.

Michael Kohlhase: Artificial Intelligence 1 560 2025-02-06

Planning Language Overview


 Disclaimer: Planning languages go way beyond classical search problems. There
are variants for inaccessible, stochastic, dynamic, continuous, and multi-agent set-
tings.
 We focus on classical search for simplicity (and practical relevance).
 For a comprehensive overview, see [GNT04].

Michael Kohlhase: Artificial Intelligence 1 561 2025-02-06

Application: Natural Language Generation


[Figure: two tree-adjoining grammar derivations; elementary trees for “the”, “white”,
“rabbit”, and “sleeps” with semantic annotations {sleep(e,r1)}, {rabbit(r1)}, {white(r1)},
combined into a derived tree for “the white rabbit sleeps”.]

 Input: Tree-adjoining grammar, intended meaning.


 Output: Sentence expressing that meaning.
Michael Kohlhase: Artificial Intelligence 1 562 2025-02-06

Application: Business Process Templates at SAP


[Figure: SAP business process template for a customer quote (CQ), involving the activities
Create CQ, Submit CQ, Check CQ Completeness, Check CQ Consistency, Decide CQ Approval
(approval necessary / not necessary), Mark CQ as Accepted, Create Follow-Up for CQ,
Check CQ Approval Status, Archive CQ.]

 Input: SAP-scale model of behavior of activities on Business Objects, process endpoint.

 Output: Process template leading to this point.
point.
Michael Kohlhase: Artificial Intelligence 1 563 2025-02-06


Application: Automatic Hacking

[Figure: attack scenario on a company network, shown in several stages — an attacker on the
Internet, behind a router and firewall a DMZ with a web server and an application server, and
an internal network with workstations, a DB server, and sensitive users.]

 Input: Network configuration, location of sensible data.


 Output: Sequence of exploits giving access to that data.

Michael Kohlhase: Artificial Intelligence 1 564 2025-02-06

Reminder: General Problem Solving, Pros and Cons


 Powerful: In some applications, generality is absolutely necessary. (E.g. SAP)

 Quick: Rapid prototyping: 10s lines of problem description vs. 1000s lines of C++
code. (E.g. language generation)
 Flexible: Adapt/maintain the description. (E.g. network security)
 Intelligent: Determines automatically how to solve a complex problem efficiently!
(The ultimate goal, no?!)

 Efficiency loss: Without any domain-specific knowledge about chess, you don’t
beat Kasparov . . .
 Trade-off between “automatic and general” vs. “manual work but efficient”.
 Research Question: How to make fully automatic algorithms efficient?

Michael Kohlhase: Artificial Intelligence 1 565 2025-02-06



Search vs. planning


 Consider the task get milk, bananas, and a cordless drill.
 Standard search algorithms seem to fail miserably:

After-the-fact heuristic/goal test inadequate

Michael Kohlhase: Artificial Intelligence 1 566 2025-02-06

Search vs. planning (cont.)


 Planning systems do the following:
1. open up action and goal representation to allow selection
2. divide-and-conquer by subgoaling
 relax requirement for sequential construction of solutions

Search Planning
States Lisp data structures Logical sentences
Actions Lisp code Preconditions/outcomes
Goal Lisp code Logical sentence (conjunction)
Plan Sequence from S0 Constraints on actions

Michael Kohlhase: Artificial Intelligence 1 567 2025-02-06

Reminder: Greedy Best-First Search and A∗

 Recall: Our heuristic search algorithms (duplicate pruning omitted for simplicity)
function Greedy_Best−First_Search (problem)
 returns a solution, or failure
 n := node with n.state=problem.InitialState
 frontier := priority queue ordered by ascending h, initially [n]
 loop do
   if Empty?(frontier) then return failure
   n := Pop(frontier)
   if problem.GoalTest(n.state) then return Solution(n)
   for each action a in problem.Actions(n.state) do
     n′ := ChildNode(problem,n,a)
     Insert(n′, h(n′), frontier)

For A∗
 order frontier by g + h instead of h (line 4)
 insert g(n′) + h(n′) instead of h(n′) to frontier (last line)

 Is greedy best-first search optimal? No ; satisficing planning.


 Is A∗ optimal? Yes, but only if h is admissible ; optimal planning, with such h.

Michael Kohlhase: Artificial Intelligence 1 568 2025-02-06
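For concreteness, the following Python sketch implements the pseudocode above (duplicate pruning again omitted); the problem interface (initial_state, goal_test, actions, result, cost) is our own naming convention, not a fixed API:

import heapq, itertools

def best_first_search(problem, h, use_astar=False):
    # frontier entries: (priority, tie-breaker, (state, path, g))
    counter = itertools.count()
    s0 = problem.initial_state
    frontier = [(h(s0), next(counter), (s0, [], 0))]
    while frontier:                                    # empty frontier -> failure
        _, _, (state, path, g) = heapq.heappop(frontier)
        if problem.goal_test(state):
            return path                                # Solution(n): list of actions
        for a in problem.actions(state):
            s2 = problem.result(state, a)
            g2 = g + problem.cost(state, a)
            prio = g2 + h(s2) if use_astar else h(s2)  # order by g+h for A*, by h for greedy
            heapq.heappush(frontier, (prio, next(counter), (s2, path + [a], g2)))
    return None                                        # failure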

ps. “Making Fully Automatic Algorithms Efficient”


 Example 17.2.3.
 n blocks, 1 hand.
 A single action either takes a block with the hand or puts a
block we’re holding onto some other block/the table.

blocks   states             blocks   states
1        1                  9        4596553
2        3                  10       58941091
3        13                 11       824073141
4        73                 12       12470162233
5        501                13       202976401213
6        4051               14       3535017524403
7        37633              15       65573803186921
8        394353             16       1290434218669921

 Observation 17.2.4. State spaces typically are huge even for simple problems.

 In other words: Even solving “simple problems” automatically (without help from
a human) requires a form of intelligence.
 With blind search, even the largest super computer in the world won’t scale beyond
20 blocks!

Michael Kohlhase: Artificial Intelligence 1 569 2025-02-06

Algorithmic Problems in Planning


 Definition 17.2.5. We speak of satisficing planning if

Input: A planning task Π.


Output: A plan for Π, or “unsolvable” if no plan for Π exists.
and of optimal planning if
Input: A planning task Π.
Output: An optimal plan for Π, or “unsolvable” if no plan for Π exists.
 The techniques successful for either one of these are almost disjoint. And satisficing
planning is much more efficient in practice.
 Definition 17.2.6. Programs solving these problems are called (optimal) planner,
planning system, or planning tool.

Michael Kohlhase: Artificial Intelligence 1 570 2025-02-06

Our Agenda for This Topic


 Now: Background, planning languages, complexity.
 Sets up the framework. Computational complexity is essential to distinguish
different algorithmic problems, and for the design of heuristic functions. (see
next)
 Next: How to automatically generate a heuristic function, given planning language
input?

 Focussing on heuristic search as the solution method, this is the main question
that needs to be answered.

Michael Kohlhase: Artificial Intelligence 1 571 2025-02-06

Our Agenda for This Chapter


1. The History of Planning: How did this come about?
 Gives you some background, and motivates our choice to focus on heuristic
search.
2. The STRIPS Planning Formalism: Which concrete planning formalism will we be
using?
 Lays the framework we’ll be looking at.

3. The PDDL Language: What do the input files for off-the-shelf planning software
look like?
 So you can actually play around with such software. (Exercises!)
4. Planning Complexity: How complex is planning?

 The price of generality is complexity, and here’s what that “price” is, exactly.

Michael Kohlhase: Artificial Intelligence 1 572 2025-02-06



17.3 The History of Planning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26894.

Planning History: In the Beginning . . .


 In the beginning: Man invented Robots:
 “Planning” as in “the making of plans by an autonomous robot”.
 Shakey the Robot (Full video here)

 In a little more detail:


 [NS63] introduced general problem solving.
 . . . not much happened (well not much we still speak of today) . . .
 1966-72, Stanford Research Institute developed a robot named “Shakey”.
 They needed a “planning” component taking decisions.
 They took inspiration from general problem solving and theorem proving, and
called the resulting algorithm STRIPS.

Michael Kohlhase: Artificial Intelligence 1 573 2025-02-06

History of Planning Algorithms


 Compilation into Logics/Theorem Proving:
 e.g. ∃s0 , a, s1 .at(A, s0 ) ∧ execute(s0 , a, s1 ) ∧ at(B, s1 )
 Popular when: Stone Age – 1990.
 Approach: From planning task description, generate PL1 formula φ that is
satisfiable iff there exists a plan; use a theorem prover on φ.
 Keywords/cites: Situation calculus, frame problem, . . .
 Partial order planning

 e.g. open = {at(B)}; apply move(A, B); ; open = {at(A)} . . .


 Popular when: 1990 – 1995.
 Approach: Starting at goal, extend partially ordered set of actions by inserting
achievers for open sub-goals, or by adding ordering constraints to avoid conflicts.
 Keywords/cites: UCPOP [PW92], causal links, flaw selection strategies, . . .

Michael Kohlhase: Artificial Intelligence 1 574 2025-02-06

History of Planning Algorithms, ctd.


 GraphPlan
 e.g. F0 = at(A); A0 = {move(A, B)}; F1 = {at(B)};
mutex A0 = {move(A, B), move(A, C)}.

 Popular when: 1995 – 2000.


 Approach: In a forward phase, build a layered “planning graph” whose “time
steps” capture which pairs of action can achieve which pairs of facts; in a back-
ward phase, search this graph starting at goals and excluding options proved to
not be feasible.
 Keywords/cites: [BF95; BF97; Koe+97], action/fact mutexes, step-optimal
plans, . . .
 Planning as SAT:
 SAT variables at(A)0 , at(B)0 , move(A, B)0 , move(A, C)0 , at(A)1 , at(B)1 ;
clauses to encode transition behavior e.g. at(B)1^F ∨ move(A, B)0^T ; unit clauses
to encode initial state at(A)0^T , at(B)0^F ; unit clauses to encode goal at(B)1^T .
 Popular when: 1996 – today.
 Approach: From planning task description, generate propositional CNF formula
φk that is satisfiable iff there exists a plan with k steps; use a SAT solver on φk ,
for different values of k.
 Keywords/cites: [KS92; KS98; RHN06; Rin10], SAT encoding schemes, Black-
Box, . . .

Michael Kohlhase: Artificial Intelligence 1 575 2025-02-06

History of Planning Algorithms, ctd.


 Planning as Heuristic Search:

 init at(A); apply move(A, B); generates state at(B); . . .


 Popular when: 1999 – today.
 Approach: Devise a method R to simplify (“relax”) any planning task Π; given
Π, solve R(Π) to generate a heuristic function h for informed search.
 Keywords/cites: [BG99; HG00; BG01; HN01; Ede01; GSS03; Hel06; HHH07;
HG08; KD09; HD09; RW10; NHH11; KHH12a; KHH12b; KHD13; DHK15],
critical path heuristics, ignoring delete lists, relaxed plans, landmark heuristics,
abstractions, partial delete relaxation, . . .

Michael Kohlhase: Artificial Intelligence 1 576 2025-02-06

The International Planning Competition (IPC)


 Definition 17.3.1. The International Planning Competition (IPC) is an event for
benchmarking planners (http://ipc.icapsconference.org/)
 How: Run competing planners on a set of benchmarks.
 When: Runs every two years since 2000, annually since 2014.
 What: Optimal track vs. satisficing track; others: uncertainty, learning, . . .

 Prerequisite/Result:

 Standard representation language: PDDL [McD+98; FL03; HE05; Ger+09]


 Problem Corpus: ≈ 50 domains, ≫ 1000 instances, 74 (!!) planners in 2011

Michael Kohlhase: Artificial Intelligence 1 577 2025-02-06

International Planning Competition


 Question: If planners x and y compete in IPC’YY, and x wins, is x “better than”
y?
 Answer: reserved for the plenary sessions ; be there!

 Generally: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 578 2025-02-06

Planning History, p.s.: Planning is Non-Trivial!


 Example 17.3.2. The Sussman anomaly is a simple blocksworld planning problem:

A
C B
A B C

Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:

 If we pursue on(A, B) by unstacking C, and


moving A onto B, we achieve the first subgoal,
but cannot achieve the second without undoing
the first.
 If we pursue on(B, C) by moving B onto C, we
achieve the second subgoal, but cannot achieve
the first without undoing the second.

Michael Kohlhase: Artificial Intelligence 1 579 2025-02-06

17.4 The STRIPS Planning Formalism


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26896.

STRIPS Planning
 Definition 17.4.1. STRIPS = Stanford Research Institute Problem Solver.

STRIPS is the simplest possible (reasonably expressive) logics based planning


language.
 STRIPS has only propositional variables as atomic formulae.

 Its preconditions/effects/goals are as canonical as imaginable:


 Preconditions, goals: conjunctions of atoms.
 Effects: conjunctions of literals
 We use the common special-case notation for this simple formalism.

 I’ll outline some extensions beyond STRIPS later on, when we discuss PDDL.
 Historical note: STRIPS [FN71] was originally a planner (cf. Shakey), whose
language actually wasn’t quite that simple.

Michael Kohlhase: Artificial Intelligence 1 580 2025-02-06

STRIPS Planning: Syntax


 Definition 17.4.2. A STRIPS task is a quadruple ⟨P , A, I , G⟩ where:
 P is a finite set of facts: atomic proposition in PL0 or PLnq .
 A is a finite set of actions; each a ∈ A is a triple a = ⟨prea , adda , dela ⟩ of
subsets of P referred to as the action’s preconditions, add list, and delete list
respectively; we require that adda ∩ dela = ∅.
 I ⊆ P is the initial state.
 G ⊆ P is the goal state.

We will often give each action a ∈ A a name (a string), and identify a with that
name.
 Note: We assume, for simplicity, that every action has cost 1. (Unit costs, cf.
??)

Michael Kohlhase: Artificial Intelligence 1 581 2025-02-06

“TSP” in Australia
 Example 17.4.3 (Salesman Travelling in Australia).

Strictly speaking, this is not actually a TSP problem instance; simplified/adapted


for illustration.

Michael Kohlhase: Artificial Intelligence 1 582 2025-02-06

STRIPS Encoding of “TSP”


 Example 17.4.4 (continuing).

 Facts P : {at(x), vis(x) | x ∈ {Sy, Ad, Br, Pe, Da}}.


 Initial state I: {at(Sy), vis(Sy)}.
 Goal state G:{at(Sy)} ∪ {vis(x) | x ∈ {Sy, Ad, Br, Pe, Da}}.
 Actions a ∈ A: drv(x, y) where x and y have a road.
Preconditions prea : {at(x)}.
Add list adda : {at(y), vis(y)}.
Delete list dela : {at(x)}.
 Plan: ⟨drv(Sy, Br), drv(Br, Sy), drv(Sy, Ad), drv(Ad, Pe), drv(Pe, Ad), . . .
. . . , drv(Ad, Da), drv(Da, Ad), drv(Ad, Sy)⟩

Michael Kohlhase: Artificial Intelligence 1 583 2025-02-06

STRIPS Planning: Semantics



 Idea: We define a plan for a STRIPS task Π as a solution to an induced search


problem ΘΠ . (save work by reduction)
 Definition 17.4.5. Let Π := ⟨P , A, I , G⟩ be a STRIPS task. The search problem
induced by Π is ΘΠ = ⟨S P , A, T , I, S G ⟩ where:

 The states (also world state) S P := P(P ) are the subsets of P .


 A is just Π’s action. (so we can define plans easily)
 The transition model T is {s −a→ apply(s, a) | prea ⊆ s}.
If prea ⊆ s, then a ∈ A is applicable in s and apply(s, a) := (s ∪ adda )\dela .
If prea ̸⊆s, then apply(s, a) is undefined.
 I is Π’s initial state.
 The goal states S G = {s ∈ S P | G ⊆ s} are those that satisfy Π’s goal state.
An (optimal) plan for Π is an (optimal) solution for ΘΠ , i.e., a path from I to some
s′ ∈ S G . Π is solvable if a plan for Π exists.
 Definition 17.4.6. For a plan a = ⟨a1 , . . ., an ⟩, we define

apply(s, a) := apply(. . . apply(apply(s, a1 ), a2 ) . . . , an )

if each ai is applicable in the respective state; else, apply(s, a) is undefined.

Michael Kohlhase: Artificial Intelligence 1 584 2025-02-06
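The following Python sketch spells out this semantics directly (the class and function names are our own notation, not a particular planner's API): a state is a set of facts, and apply implements (s ∪ adda)\dela.

from typing import FrozenSet, Iterable, List, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    dele: FrozenSet[str]          # "del" is a Python keyword, hence "dele"

class StripsTask(NamedTuple):
    facts: FrozenSet[str]
    actions: List[Action]
    init: FrozenSet[str]
    goal: FrozenSet[str]

def applicable(s: FrozenSet[str], a: Action) -> bool:
    return a.pre <= s                         # pre_a ⊆ s

def apply(s: FrozenSet[str], a: Action) -> FrozenSet[str]:
    assert applicable(s, a), f"{a.name} is not applicable"
    return (s | a.add) - a.dele               # (s ∪ add_a) \ del_a

def is_plan(task: StripsTask, plan: Iterable[Action]) -> bool:
    s = task.init
    for a in plan:
        if not applicable(s, a):
            return False
        s = apply(s, a)
    return task.goal <= s                     # G ⊆ final state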

STRIPS Encoding of Simplified TSP


 Example 17.4.7 (Simplified traveling salesman problem in Australia).

Let TSP− be the STRIPS task, ⟨P , A, I , G⟩, where

 Facts P : {at(x), vis(x) | x ∈ {Sy, Ad, Br}}.


 Initial state I: {at(Sy), vis(Sy)}.
 Goal state G: {vis(x) | x ∈ {Sy, Ad, Br}} (note: no at(Sy))
 Actions A: a ∈ A: drv(x, y) where x and y have a road.
 preconditions prea : {at(x)}.
 add list adda : {at(y), vis(y)}.
 delete list dela : {at(x)}.

Michael Kohlhase: Artificial Intelligence 1 585 2025-02-06
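Reusing the sketch above, TSP− can be written down and one of its plans checked mechanically (the plan below is the “upper path” discussed in the questionnaire that follows):

cities = ["Sy", "Ad", "Br"]
roads = [("Sy", "Ad"), ("Ad", "Sy"), ("Sy", "Br"), ("Br", "Sy")]

actions = [Action(f"drv({x},{y})",
                  pre=frozenset({f"at({x})"}),
                  add=frozenset({f"at({y})", f"vis({y})"}),
                  dele=frozenset({f"at({x})"}))
           for (x, y) in roads]
tsp_minus = StripsTask(
    facts=frozenset({f"at({x})" for x in cities} | {f"vis({x})" for x in cities}),
    actions=actions,
    init=frozenset({"at(Sy)", "vis(Sy)"}),
    goal=frozenset({f"vis({x})" for x in cities}))

by_name = {a.name: a for a in actions}
plan = [by_name["drv(Sy,Br)"], by_name["drv(Br,Sy)"], by_name["drv(Sy,Ad)"]]
print(is_plan(tsp_minus, plan))    # True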


Questionnaire: State Space of TSP−


 The state space of the search problem ΘTSP− induced by TSP− from ?? is

[Figure: the reachable part of the state space — nodes are fact sets such as {at(Sy), vis(Sy)},
{at(Br), vis(Sy), vis(Br)}, {at(Ad), vis(Sy), vis(Ad)}, . . ., connected by drv(x, y) edges;
the initial state is drawn dashed, the goal states with thick borders.]

 Question: Are there any plans for TSP− in this graph?

 Answer: Yes, two – plans for TSP− are solutions for ΘTSP− (dashed node =̂ I,
thick nodes =̂ G):
 drv(Sy, Br), drv(Br, Sy), drv(Sy, Ad) (upper path)
 drv(Sy, Ad), drv(Ad, Sy), drv(Sy, Br). (lower path)

 Question: Is the graph above actually the state space induced by TSP− ?


 Answer: No, only the part reachable from I. The state space of ΘTSP− also
includes e.g. the states {vis(Sy)} and {at(Sy), at(Br)}.

Michael Kohlhase: Artificial Intelligence 1 586 2025-02-06

The Blocksworld
 Definition 17.4.8. The blocks world is a simple planning domain: a set of wooden
blocks of various shapes and colors sit on a table. The goal is to build one or more
vertical stacks of blocks. Only one block may be moved at a time: it may either be
placed on the table or placed atop another block.

 Example 17.4.9.
E
D C B
E A B C A D

Initial State Goal State

 Facts: on(x, y), onTable(x), clear(x), holding(x), armEmpty.


 initial state: {onTable(E), clear(E), . . . , onTable(C), on(D, C), clear(D), armEmpty}.
 Goal state: {on(E, C), on(C, A), on(B, D)}.
 Actions: stack(x, y), unstack(x, y), putdown(x), pickup(x).
 stack(x, y)?
pre : {holding(x), clear(y)}
add : {on(x, y), armEmpty, clear(x)}
del : {holding(x), clear(y)}.

Michael Kohlhase: Artificial Intelligence 1 587 2025-02-06

STRIPS for the Blocksworld


 Question: Which are correct encodings (ones that are part of some correct overall
model) of the STRIPS Blocksworld pickup(x) action schema?

(A) pre: {onTable(x), clear(x), armEmpty}, add: {holding(x)}, del: {onTable(x)}
(B) pre: {onTable(x), clear(x), armEmpty}, add: {holding(x)}, del: {armEmpty}
(C) pre: {onTable(x), clear(x), armEmpty}, add: {holding(x)}, del: {onTable(x), armEmpty, clear(x)}
(D) pre: {onTable(x), clear(x), armEmpty}, add: {holding(x)}, del: {onTable(x), armEmpty}

Recall: an action a is represented by a tuple ⟨prea , adda , dela ⟩ of lists of facts.


 Hint: The only differences between them are the delete lists
 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 588 2025-02-06

The next example for a planning task is not obvious at first sight, but has been quite influential,
showing that many industry problems can be specified declaratively by formalizing the domain
and the particular planning tasks in PDDL and then using off-the-shelf planners to solve them.
[KS00] reports that this has significantly reduced labor costs and increased maintainability of the
implementation.

Miconic-10: A Real-World Example


 Example 17.4.10. Elevator control as a planning problem; details at [KS00]
Specify mobility needs before boarding, let a planner schedule/optimize trips.

[Figure: an elevator with waiting passengers of different types:]
 VIP: Served first.
 D: Lift may only go down when inside; similar for U.
 NA: Never-alone.
 AT: Attendant.
 A, B: Never together in the same elevator.
 P: Normal passenger.

Michael Kohlhase: Artificial Intelligence 1 589 2025-02-06

17.5 Partial Order Planning


In this section we introduce a new and different planning algorithm: partial order planning that
works on several subgoals independently without having to specify in which order they will be
pursued and later combines them into a global plan. A Video Nugget covering this section can
be found at https://fau.tv/clip/id/28843.
To fortify our intuitions about partial order planning let us have another look at the Sussman
anomaly, where pursuing two subgoals independently and then reconciling them is a prerequi-
site.

Planning History, p.s.: Planning is Non-Trivial!


 Example 17.5.1. The Sussman anomaly is a simple blocksworld planning problem:

A
C B
A B C

Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:

 If we pursue on(A, B) by unstacking C, and


moving A onto B, we achieve the first subgoal,
but cannot achieve the second without undoing
the first.

 If we pursue on(B, C) by moving B onto C, we


achieve the second subgoal, but cannot achieve
the first without undoing the second.

Michael Kohlhase: Artificial Intelligence 1 590 2025-02-06

Before we go into the details, let us try to understand the main ideas of partial order planning.

Partial Order Planning


 Definition 17.5.2. Any algorithm that can place two actions into a plan without
specifying which comes first is called partial order planning.

 Ideas for partial order planning:


 Organize the planning steps in a DAG that supports multiple paths from initial
to goal state
 nodes (steps) are labeled with actions (actions can occur multiple times)
 edges with propositions added by source and presupposed by target
acyclicity of the graph induces a partial ordering on steps.
 additional temporal constraints resolve subgoal interactions and induce a linear
order.
 Advantages of partial order planning:
 problems can be decomposed ; can work well with non-cooperative environ-
ments.
 efficient by least-commitment strategy
 causal links (edges) pinpoint unworkable subplans early.

Michael Kohlhase: Artificial Intelligence 1 591 2025-02-06

We now make the ideas discussed above concrete by giving a mathematical formulation. It is
advantageous to cast a partially ordered plan as a labeled DAG rather than a partial ordering
since it draws the attention to the difference between actions and steps.

Partially Ordered Plans


 Definition 17.5.3. Let ⟨P , A, I , G⟩ be a STRIPS task, then a partially ordered
plan P = ⟨V , E ⟩ is a labeled DAG, where the nodes in V (called steps) are labeled
with actions from A, or are a

 start step, which has label “effect” I, or a


 finish step, which has label “precondition” G.
Every edge (S,T ) ∈ E is either labeled by:

 A non-empty set p ⊆ P of facts that are effects of the action of S and the
preconditions of that of T . We call such a labeled edge a causal link and write
it S −p→ T.
 ≺, then call it a temporal constraint and write it as S ≺ T .

An open condition is a precondition of a step not yet causally linked.

 Definition 17.5.4. Let Π be a partially ordered plan, then we call a step U possibly
intervening in a causal link S −p→ T , iff Π ∪ {S ≺ U , U ≺ T } is acyclic.

 Definition 17.5.5. A precondition is achieved iff it is the effect of an earlier step


and no possibly intervening step undoes it.
 Definition 17.5.6. A partially ordered plan Π is called complete iff every precon-
dition is achieved.

 Definition 17.5.7. Partial order planning is the process of computing complete


and acyclic partially ordered plans for a given planning task.

Michael Kohlhase: Artificial Intelligence 1 592 2025-02-06

A Notation for STRIPS Actions


 Definition 17.5.8 (Notation). In diagrams, we often write STRIPS actions into
boxes with preconditions above and effects below.
 Example 17.5.9.
 Actions: Buy(x)
 Preconditions: At(p), Sells(p, x)
 Effects: Have(x)
[Box notation: the preconditions At(p), Sells(p, x) are written above the action name Buy(x),
the effect Have(x) below it.]

 Notation: A causal link S −p→ T can also be denoted by a direct arrow between the
effects p of S and the preconditions p of T in the STRIPS action notation above.
Show temporal constraints as dashed arrows.

Michael Kohlhase: Artificial Intelligence 1 593 2025-02-06

Planning Process
 Definition 17.5.10. Partial order planning is search in the space of partial plans
via the following operations:
 add link from an existing action to an open precondition,
 add step (an action with links to other steps) to fulfil an open precondition,
 order one step wrt. another (by adding temporal constraints) to remove possible
conflicts.

 Idea: Gradually move from incomplete/vague plans to complete, correct plans.


backtrack if an open condition is unachievable or if a conflict is unresolvable.

Michael Kohlhase: Artificial Intelligence 1 594 2025-02-06

Example: Shopping for Bananas, Milk, and a Cordless Drill


 Example 17.5.11.
[Figures: stepwise construction of the partially ordered plan. The Start step has effects
At(Home), Sell(SM, Milk), Sell(SM, Ban), Sell(HWS, Drill); the Finish step has preconditions
Have(Milk), At(Home), Have(Ban), Have(Drill).]

1. Initialize the plan with only the Start and Finish steps; all preconditions of Finish are open.
2. Add Buy(Drill) to achieve Have(Drill); its preconditions At(HWS) and Sell(HWS, Drill) become open.
3. Add Go(HWS), starting from At(Home), to achieve At(HWS).
4. Add Buy(Milk) and Go(SM) to achieve Have(Milk); the precondition At(X) of Go(SM) is still open.
5. Add Buy(Ban), sharing At(SM) and Sell(SM, Ban), to achieve Have(Ban).
6. Add Go(Home), starting from At(SM), to achieve At(Home) for the Finish step.
7. Finally, resolve the open precondition At(X) of Go(SM) to At(HWS) (the trip to the supermarket
   starts from the hardware store), and add temporal constraints to resolve the remaining conflicts
   (see the discussion below).
Michael Kohlhase: Artificial Intelligence 1 595 2025-02-06

Here we show a successful search for a partially ordered plan. We start out by initializing the plan
with the respective start and finish steps. Then we consecutively add steps to fulfill the open
preconditions – marked in red – starting with those of the finish step.
In the end we add three temporal constraints that complete the partially ordered plan.
The search process for the links and steps is relatively plausible and standard in this example, but
we do not have any idea where the temporal constraints should systematically come from. We
look at this next.

Clobbering and Promotion/Demotion


 Definition 17.5.12. In a partially ordered plan, a step C clobbers a causal link
L := S −p→ T , iff it destroys the condition p achieved by L.
 Definition 17.5.13. If C clobbers S −p→ T in a partially ordered plan Π, then we
can solve the induced conflict by

 demotion: add a temporal constraint C ≺ S to Π, or
 promotion: add T ≺ C to Π.

 Example 17.5.14. Go(Home) clobbers At(SM): the step Go(Home) destroys the condition
At(SM) of the causal link Go(SM) −At(SM)→ Buy(Milk). Demotion =̂ put Go(Home) before
Go(SM); promotion =̂ put Go(Home) after Buy(Milk).
Michael Kohlhase: Artificial Intelligence 1 596 2025-02-06
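The following Python sketch (the helper names are our own) shows how a clobbered causal link can be repaired by trying demotion first and promotion second, rejecting any choice that would make the ordering cyclic. Note that it only checks ordering consistency; a choice accepted here may still have to be undone by backtracking later.

def has_cycle(orderings):
    # orderings: set of (before, after) pairs; DFS-based cycle check
    graph = {}
    for u, v in orderings:
        graph.setdefault(u, set()).add(v)
    visiting, done = set(), set()
    def dfs(u):
        if u in done: return False
        if u in visiting: return True
        visiting.add(u)
        found = any(dfs(v) for v in graph.get(u, ()))
        visiting.discard(u); done.add(u)
        return found
    return any(dfs(u) for u in list(graph))

def resolve_threat(orderings, clobberer, producer, consumer):
    # demotion: clobberer ≺ producer; promotion: consumer ≺ clobberer
    for constraint in [(clobberer, producer), (consumer, clobberer)]:
        candidate = set(orderings) | {constraint}
        if not has_cycle(candidate):
            return candidate
    return None                                 # both options close a cycle -> backtrack

# Example 17.5.14: Go(Home) deletes At(SM) and threatens Go(SM) --At(SM)--> Buy(Milk).
# Go(Home) itself needs At(SM), achieved by Go(SM), so demotion would create a cycle
# and promotion Buy(Milk) ≺ Go(Home) is chosen.
orderings = {("Start", "Go(SM)"), ("Go(SM)", "Buy(Milk)"), ("Go(SM)", "Go(Home)"),
             ("Buy(Milk)", "Finish"), ("Go(Home)", "Finish")}
print(resolve_threat(orderings, "Go(Home)", "Go(SM)", "Buy(Milk)"))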

POP algorithm sketch


 Definition 17.5.15. The POP algorithm for constructing complete partially or-
dered plans:
function POP (initial, goal, operators) : plan
plan:= Make−Minimal−Plan(initial, goal)
loop do
if Solution?(goal,plan) then return plan
Sneed , c := Select−Subgoal(plan)
Choose−Operator(plan, operators, Sneed ,c)
Resolve−Threats(plan)
end
function Select−Subgoal (plan) returns Sneed , c
pick a plan step Sneed from Steps(plan)
with a precondition c that has not been achieved
return Sneed , c

Michael Kohlhase: Artificial Intelligence 1 597 2025-02-06

POP algorithm contd.


 Definition 17.5.16. The missing parts for the POP algorithm.

function Choose−Operator (plan, operators, Sneed , c)
 choose a step Sadd from operators or Steps(plan) that has c as an effect
 if there is no such step then fail
 add the causal link Sadd −c→ Sneed to Links(plan)
 add the temporal constraint Sadd ≺ Sneed to Orderings(plan)
 if Sadd is a newly added step from operators then
   add Sadd to Steps(plan)
   add Start ≺ Sadd ≺ F inish to Orderings(plan)

function Resolve−Threats (plan)
 for each Sthreat that threatens a causal link Si −c→ Sj in Links(plan) do
   choose either
     demotion: Add Sthreat ≺ Si to Orderings(plan)
     promotion: Add Sj ≺ Sthreat to Orderings(plan)
   if not Consistent(plan) then fail

Michael Kohlhase: Artificial Intelligence 1 598 2025-02-06

Properties of POP
 Nondeterministic algorithm: backtracks at choice points on failure:

 choice of Sadd to achieve Sneed ,


 choice of demotion or promotion for clobberer,
 selection of Sneed is irrevocable.
 Observation 17.5.17. POP is sound, complete, and systematic (i.e. no repetition)

 There are extensions for disjunction, universals, negation, conditionals.


 It can be made efficient with good heuristics derived from problem description.
 Particularly good for problems with many loosely related subgoals.

Michael Kohlhase: Artificial Intelligence 1 599 2025-02-06

Example: Solving the Sussman Anomaly


394 CHAPTER 17. PLANNING I: FRAMEWORK

Michael Kohlhase: Artificial Intelligence 1 600 2025-02-06

Example: Solving the Sussman Anomaly (contd.)


 Example 17.5.18. Solving the Sussman anomaly

Start step effects: On(C, A), On(A, T ), Cl(B), On(B, T ), Cl(C);
Finish step preconditions: On(A, B), On(B, C).

[Figures: stepwise refinement of the partial order plan.]

1. Initialize the partial order plan with Start and Finish.
2. Refine for the subgoal On(B, C): add Move(B, C) with preconditions Cl(B), Cl(C) and
   effects On(B, C), ¬Cl(C).
3. Refine for the subgoal On(A, B): add Move(A, B) with preconditions Cl(A), Cl(B) and
   effects On(A, B), ¬Cl(B).
4. Refine for the subgoal Cl(A): add Move(C, T ) with precondition Cl(C) and effects
   Cl(A), On(C, T ).
5. Move(A, B) clobbers Cl(B) ; demote.
6. Move(B, C) clobbers Cl(C) ; demote.
7. The result is a totally ordered plan: Move(C, T ), then Move(B, C), then Move(A, B).

Michael Kohlhase: Artificial Intelligence 1 601 2025-02-06

17.6 The PDDL Language


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26897.

PDDL: Planning Domain Description Language


 Definition 17.6.1. The Planning Domain Description Language (PDDL) is a stan-
dardized representation language for planning benchmarks in various extensions of
the STRIPS formalism.

 Definition 17.6.2. PDDL is not a propositional language



 Representation is lifted, using object variables to be instantiated from a finite


set of objects. (Similar to predicate logic)
 Action schemas parameterized by objects.
 Predicates to be instantiated with objects.

 Definition 17.6.3. A PDDL planning task comes in two pieces


 The problem file gives the objects, the initial state, and the goal state.
 The domain file gives the predicates and the actions.

Michael Kohlhase: Artificial Intelligence 1 602 2025-02-06

History and Versions:


• Used in the International Planning Competition (IPC).

• 1998: PDDL [McD+98].


• 2000: “PDDL subset for the 2000 competition” [Bac00].
• 2002: PDDL2.1, Levels 1-3 [FL03].
• 2004: PDDL2.2 [HE05].

• 2006: PDDL3 [Ger+09].

The Blocksworld in PDDL: Domain File

E
D C B
E A B C A D

Initial State Goal State

(define (domain blocksworld)


(:predicates (clear ?x) (holding ?x) (on ?x ?y)
(on−table ?x) (arm−empty))
(:action stack
:parameters (?x ?y)
:precondition (and (clear ?y) (holding ?x))
:effect (and (arm−empty) (on ?x ?y)
(not (clear ?y)) (not (holding ?x))))
. . .)

Michael Kohlhase: Artificial Intelligence 1 603 2025-02-06

The Blocksworld in PDDL: Problem File



E
D C B
E A B C A D

Initial State Goal State

(define (problem bw−abcde)


(:domain blocksworld)
(:objects a b c d e)
(:init (on−table a) (clear a)
(on−table b) (clear b)
(on−table e) (clear e)
(on−table c) (on d c) (clear d)
(arm−empty))
(:goal (and (on e c) (on c a) (on b d))))

Michael Kohlhase: Artificial Intelligence 1 604 2025-02-06

Miconic-ADL “Stop” Action Schema in PDDL


(:action stop
  :parameters (?f − floor)
  :precondition (and (lift−at ?f)
    (imply
      (exists (?p − conflict−A)
        (or (and (not (served ?p)) (origin ?p ?f))
            (and (boarded ?p) (not (destin ?p ?f)))))
      (forall (?q − conflict−B)
        (and (or (destin ?q ?f) (not (boarded ?q)))
             (or (served ?q) (not (origin ?q ?f))))))
    (imply
      (exists (?p − conflict−B)
        (or (and (not (served ?p)) (origin ?p ?f))
            (and (boarded ?p) (not (destin ?p ?f)))))
      (forall (?q − conflict−A)
        (and (or (destin ?q ?f) (not (boarded ?q)))
             (or (served ?q) (not (origin ?q ?f))))))
    (imply
      (exists (?p − never−alone)
        (or (and (origin ?p ?f) (not (served ?p)))
            (and (boarded ?p) (not (destin ?p ?f)))))
      (exists (?q − attendant)
        (or (and (boarded ?q) (not (destin ?q ?f)))
            (and (not (served ?q)) (origin ?q ?f)))))
    (forall (?p − going−nonstop)
      (imply (boarded ?p) (destin ?p ?f)))
    (or (forall (?p − vip) (served ?p))
        (exists (?p − vip) (or (origin ?p ?f) (destin ?p ?f))))
    (forall (?p − passenger)
      (imply (no−access ?p ?f) (not (boarded ?p)))))
  ...)

Michael Kohlhase: Artificial Intelligence 1 605 2025-02-06

Planning Domain Description Language


 Question: What is PDDL good for?
(A) Nothing.
(B) Free beer.
(C) Those AI planning guys.
(D) Being lazy at work.

 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 1 606 2025-02-06

17.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26900.

Summary
 General problem solving attempts to develop solvers that perform well across a large
class of problems.
 Planning, as considered here, is a form of general problem solving dedicated to the
class of classical search problems. (Actually, we also address inaccessible, stochastic,
dynamic, continuous, and multi-agent settings.)

 Heuristic search planning has dominated the International Planning Competition


(IPC). We focus on it here.
 STRIPS is the simplest possible, while reasonably expressive, language for our pur-
poses. It uses Boolean variables (facts), and defines actions in terms of precondition,
add list, and delete list.

 PDDL is the de-facto standard language for describing planning problems.


 Plan existence (bounded or not) is PSPACE-complete to decide for STRIPS. If
we bound plans polynomially, we get down to NP-completeness.

Michael Kohlhase: Artificial Intelligence 1 607 2025-02-06

Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].

– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the course) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 18

Planning II: Algorithms

18.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26901.

Reminder: Our Agenda for This Topic


 ??: Background, planning languages, complexity.
 Sets up the framework. computational complexity is essential to distinguish
different algorithmic problems, and for the design of heuristic functions.

 This Chapter: How to automatically generate a heuristic function, given planning


language input?
 Focussing on heuristic search as the solution method, this is the main question
that needs to be answered.

Michael Kohlhase: Artificial Intelligence 1 608 2025-02-06

Reminder: Search

 Starting at the initial state, produce all successor states step by step (search tree):

(a) initial state: (3,3,1)
(b) after expansion of (3,3,1): (2,3,0), (3,2,0), (2,2,0), (1,3,0), (3,1,0)
(c) after expansion of (3,2,0): (3,3,1), . . .

In planning, this is referred to as forward search, or forward state-space search.

Michael Kohlhase: Artificial Intelligence 1 609 2025-02-06

Search in the State Space?

 Use heuristic function to guide the search towards the goal!

Michael Kohlhase: Artificial Intelligence 1 610 2025-02-06



Reminder: Informed Search

[Figure: initial state and goal state connected by several alternative paths, each annotated
with a cost estimate h.]

 Heuristic function h estimates the cost of an optimal path from a state s to the
goal state; search prefers to expand states s with small h(s).
 Live Demo vs. Breadth-First Search:

http://qiao.github.io/PathFinding.js/visual/

Michael Kohlhase: Artificial Intelligence 1 611 2025-02-06


18.2. HOW TO RELAX 403

Reminder: Heuristic Functions


 Definition 18.1.1. Let Π be a STRIPS task with states S. A heuristic function,
short heuristic, for Π is a function h : S → N ∪ {∞} so that h(s) = 0 whenever s is
a goal state.

 Exactly like our definition from ??. Except, because we assume unit costs here, we
use N instead of R+ .
 Definition 18.1.2. Let Π be a STRIPS task with states S. The perfect heuristic
h∗ assigns every s ∈ S the length of a shortest path from s to a goal state, or ∞
if no such path exists. A heuristic h for Π is admissible if, for all s ∈ S, we have
h(s) ≤ h∗ (s).
 Exactly like our definition from ??, except for path length instead of path cost (cf.
above).
 In all cases, we attempt to approximate h∗ (s), the length of an optimal plan for s.
Some algorithms guarantee to lower bound h∗ (s).

Michael Kohlhase: Artificial Intelligence 1 612 2025-02-06

Our (Refined) Agenda for This Chapter


 How to Relax: How to relax a problem?
 Basic principle for generating heuristic functions.
 The Delete Relaxation: How to relax a planning problem?

 The delete relaxation is the most successful method for the automatic generation
of heuristic functions. It is a key ingredient to almost all IPC winners of the last
decade. It relaxes STRIPS tasks by ignoring the delete lists.
 The h+ Heuristic: What is the resulting heuristic function?

 h+ is the “ideal” delete relaxation heuristic.


 Approximating h+ : How to actually compute a heuristic?
 Turns out that, in practice, we must approximate h+ .

Michael Kohlhase: Artificial Intelligence 1 613 2025-02-06

18.2 How to Relax in Planning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26902.
We will now instantiate our general knowledge about heuristic search to the planning domain. As
always, the main problem is to find good heuristics. We will follow the intuitions of our discussion
in ?? and consider full solutions to relaxed problems as a source for heuristics.

How to Relax

 Recall: We introduced the concept of a relaxed search problem (allow cheating)


to derive heuristics from them.
 Observation: This can be generalized to arbitrary problem solving.

 Definition 18.2.1 (The General Case).

P N ∪ {∞}
h∗P

P′ h∗P ′
R

1. You have a class P of problems, whose perfect heuristic h∗P you wish to estimate.
2. You define a class P ′ of simpler problems, whose perfect heuristic h∗P ′ can be
used to estimate h∗P .
3. You define a transformation – the relaxation mapping R – that maps instances
Π ∈ P into instances Π′ ∈ P ′ .
4. Given Π ∈ P, you let Π′ := R(Π), and estimate h∗ P (Π) by h∗ P ′ (Π′ ).

 Definition 18.2.2. For planning tasks, we speak of relaxed planning.

Michael Kohlhase: Artificial Intelligence 1 614 2025-02-06

Reminder: Heuristic Functions from Relaxed Problems

 Problem Π: Find a route from Saarbrücken to Edinburgh.

Michael Kohlhase: Artificial Intelligence 1 615 2025-02-06

Reminder: Heuristic Functions from Relaxed Problems


18.2. HOW TO RELAX 405

 Relaxed Problem Π′ : Throw away the map.

Michael Kohlhase: Artificial Intelligence 1 616 2025-02-06

Reminder: Heuristic Functions from Relaxed Problems

 Heuristic function h: Straight line distance.

Michael Kohlhase: Artificial Intelligence 1 617 2025-02-06

Relaxation in Route-Finding
406 CHAPTER 18. PLANNING II: ALGORITHMS

 Problem class P: Route finding.


 Perfect heuristic h∗P for P: Length of a shortest route.

 Simpler problem class P ′ : Route finding on an empty map.


 Perfect heuristic h∗P ′ for P ′ : Straight-line distance.
 Transformation R: Throw away the map.

Michael Kohlhase: Artificial Intelligence 1 618 2025-02-06

How to Relax in Planning? (A Reminder!)


 Example 18.2.3 (Logistics).

 facts P : {truck(x) | x ∈ {A, B, C, D}} ∪ {pack(x) | x ∈ {A, B, C, D, T }}.


 initial state I: {truck(A), pack(C)}.
 goal state G: {truck(A), pack(D)}.
 actions A: (Notated as “precondition ⇒ adds, ¬ deletes”)
 drive(x, y), where x and y have a road: “truck(x) ⇒ truck(y), ¬truck(x)”.
 load(x): “truck(x), pack(x) ⇒ pack(T ), ¬pack(x)”.
 unload(x): “truck(x), pack(T ) ⇒ pack(x), ¬pack(T )”.

 Example 18.2.4 (“Only-Adds” Relaxation). Drop the preconditions and deletes.


 “drive(x, y): ⇒ truck(y)”;
 “load(x): ⇒ pack(T )”;
 “unload(x): ⇒ pack(x)”.
 Heuristic value for I is?
 hR (I) = 1: A plan for the relaxed task is ⟨unload(D)⟩.

Michael Kohlhase: Artificial Intelligence 1 619 2025-02-06
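Remark: the ideal Only-Adds heuristic is easy to state in code. Since relaxed actions have neither preconditions nor delete lists, an optimal only-adds plan is just a minimum-cardinality set of actions whose add lists together cover the goal facts still missing in the current state. The brute-force Python sketch below (reusing the Action tuples from the STRIPS sketch in the previous chapter; only feasible for tiny tasks) computes exactly that:

from itertools import combinations

def h_only_adds(state, goal, actions):
    # minimum number of actions whose add lists together cover goal \ state;
    # order is irrelevant because there are no preconditions and no deletes
    missing = set(goal) - set(state)
    if not missing:
        return 0
    adds = [a.add for a in actions]
    for k in range(1, len(adds) + 1):
        for combo in combinations(adds, k):
            if missing <= set().union(*combo):
                return k
    return float("inf")       # some goal fact is not added by any action

# Logistics example: state AC, goal AD  ->  h = 1 via unload(D), which adds pack(D)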

We will start with a very simple relaxation, which could be termed “positive thinking”: we do not

consider preconditions of actions and leave out the delete lists as well.

How to Relax During Search: Overview


 Attention: Search uses the real (un-relaxed) Π. The relaxation is applied (e.g.,
in Only-Adds, the simplified actions are used) only within the call to h(s)!!!

Problem Π Heuristic search on Π Solution to Π

state s h(s) = h∗ P ′ (R(Πs ))

R h∗P ′
R(Πs )

 Here, Πs is Π with initial state replaced by s, i.e., Π := ⟨P , A, I , G⟩ changed


to Πs := ⟨P , A, {s}, G⟩: The task of finding a plan for search state s.
 A common student error is to instead apply the relaxation once to the whole
problem, then doing the whole search “within the relaxation”.
 The next slide illustrates the correct search process in detail.

Michael Kohlhase: Artificial Intelligence 1 620 2025-02-06

How to Relax During Search: Only-Adds


Real problem (used for the search itself):
 States written as two letters: truck position and package position, e.g. AC =̂
{truck(A), pack(C)}; T as second letter means the package is in the truck.
 Initial state I: AC; goal G: AD.
 Actions A with pre, add, del: drXY (drive from X to Y ), loX (load at X), ulX (unload at X).

Relaxed problem (used only inside each call to the heuristic):
 For the current search state s, goal AD, and actions with add lists only, hR (s) is the length
of a shortest only-adds plan, e.g. hR (AC) = 1 via ⟨ulD⟩ and hR (BC) = 2 via ⟨drBA, ulD⟩.

Greedy best-first search (tie-breaking: alphabetic) then proceeds as follows:

1. Expand AC (hR = 1); successor BC via drAB with hR (BC) = 2.
2. Expand BC; successor CC via drBC with hR (CC) = 2; successor AC via drBA is a duplicate
   state and is pruned.
3. Expand CC; successors DC (via drCD), CT (via loC), and BC (via drCB, duplicate, pruned),
   with hR (DC) = hR (CT ) = 2.
4. Expand CT ; successors BT , DT , and CC (duplicate). The search then expands BT , AT ,
   AA, BA, . . . in turn, with hR values of 1 or 2 throughout.

Note again: the relaxation is only used to compute hR (s) for the states s encountered; the
search itself always works on the real (un-relaxed) problem.

Michael Kohlhase: Artificial Intelligence 1 621 2025-02-06


18.3. DELETE RELAXATION 415

Only-Adds is a “Native” Relaxation


 Definition 18.2.5 (Native Relaxations). Confusing special case where P ′ ⊆ P.

P N ∪ {∞}
h∗P

P′ ⊆ P h∗P ′
R

 Problem class P: STRIPS tasks.


 Perfect heuristic h∗P for P: Length h∗ of a shortest plan.
 Transformation R: Drop the preconditions and delete lists.
 Simpler problem class P ′ is a special case of P, P ′ ⊆ P: STRIPS tasks with
empty preconditions and delete lists.
 Perfect heuristic for P ′ : Shortest plan for only-adds STRIPS task.

Michael Kohlhase: Artificial Intelligence 1 622 2025-02-06

18.3 The Delete Relaxation


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26903.
We turn to a more realistic relaxation, where we only disregard the delete list.

How the Delete Relaxation Changes the World (I)


 Relaxation mapping R saying that:

“When the world changes, its previous state remains true as well.”
Real world: (before)

Real world:
(after)

Relaxed
world: (before)
416 CHAPTER 18. PLANNING II: ALGORITHMS

Relaxed
world: (after)

Michael Kohlhase: Artificial Intelligence 1 623 2025-02-06

How the Delete Relaxation Changes the World (II)


 Relaxation mapping R saying that:

Real world: (before)

Real world: (after)

Relaxed world: (before)

Relaxed world: (after)


18.3. DELETE RELAXATION 417

Michael Kohlhase: Artificial Intelligence 1 624 2025-02-06

How the Delete Relaxation Changes the World (III)


 Relaxation mapping R saying that:

Real world:

Relaxed world:

Michael Kohlhase: Artificial Intelligence 1 625 2025-02-06

The Delete Relaxation


 Definition 18.3.1 (Delete Relaxation). Let Π := ⟨P , A, I , G⟩ be a STRIPS task.
The delete relaxation of Π is the task Π+ = ⟨P , A+ , I, G⟩ where A+ :={a+ | a ∈ A}
with prea+ :=prea , adda+ :=adda , and dela+ :=∅.

 In other words, the class of simpler problems P ′ is the set of all STRIPS tasks with
empty delete lists, and the relaxation mapping R drops the delete lists.
 Definition 18.3.2 (Relaxed Plan). Let Π := ⟨P , A, I , G⟩ be a STRIPS task, and
let s be a state. A relaxed plan for s is a plan for ⟨P , A, s, G⟩+ . A relaxed plan for
I is called a relaxed plan for Π.
 A relaxed plan for s is an action sequence that solves s when pretending that all
delete lists are empty.
 Also called delete-relaxed plan: “relaxation” is often used to mean delete relaxation
by default.

Michael Kohlhase: Artificial Intelligence 1 626 2025-02-06
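In code, the delete relaxation is a one-liner over the STRIPS representation sketched in the previous chapter (again our own notation, not a planner's API):

def delete_relax(task: StripsTask) -> StripsTask:
    # Definition 18.3.1: same facts, initial state, and goal; every action keeps
    # its preconditions and add list but loses its delete list
    relaxed = [Action(a.name + "+", a.pre, a.add, frozenset()) for a in task.actions]
    return StripsTask(task.facts, relaxed, task.init, task.goal)

# A relaxed plan for a state s is then a plan for
# delete_relax(StripsTask(task.facts, task.actions, s, task.goal)).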

A Relaxed Plan for “TSP” in Australia

1. Initial state: {at(Sy), vis(Sy)}.
2. drv(Sy, Br)+ : {at(Br), vis(Br), at(Sy), vis(Sy)}.
3. drv(Sy, Ad)+ : {at(Ad), vis(Ad), at(Br), vis(Br), at(Sy), vis(Sy)}.
4. drv(Ad, Pe)+ : {at(Pe), vis(Pe), at(Ad), vis(Ad), at(Br), vis(Br), at(Sy), vis(Sy)}.
5. drv(Ad, Da)+ : {at(Da), vis(Da), at(Pe), vis(Pe), at(Ad), vis(Ad), at(Br), vis(Br), at(Sy), vis(Sy)}.

Michael Kohlhase: Artificial Intelligence 1 627 2025-02-06

A Relaxed Plan for “Logistics”

 Facts P : {truck(x) | x ∈ {A, B, C, D}} ∪ {pack(x) | x ∈ {A, B, C, D, T }}.


 Initial state I: {truck(A), pack(C)}.
 Goal G: {truck(A), pack(D)}.

 Relaxed actions A+ : (Notated as “precondition ⇒ adds”)

 drive(x, y)+ : “truck(x) ⇒ truck(y)”.
 load(x)+ : “truck(x), pack(x) ⇒ pack(T )”.
 unload(x)+ : “truck(x), pack(T ) ⇒ pack(x)”.

Relaxed plan:
⟨drive(A, B)+ , drive(B, C)+ , load(C)+ , drive(C, D)+ , unload(D)+ ⟩

 We don’t need to drive the truck back, because “it is still at A”.

Michael Kohlhase: Artificial Intelligence 1 628 2025-02-06

PlanEx+
 Definition 18.3.3 (Relaxed Plan Existence Problem). By PlanEx+ , we denote
the problem of deciding, given a STRIPS task Π := ⟨P , A, I , G⟩, whether or not
there exists a relaxed plan for Π.

 This is easier than PlanEx for general STRIPS!


 PlanEx+ is in P.
 Proof: The following algorithm decides PlanEx+
1.
   var F := I
   while G ̸⊆ F do
     F ′ := F ∪ ⋃{adda | a ∈ A, prea ⊆ F }
     if F ′ = F then return ‘‘unsolvable’’ endif (∗)
     F := F ′
   endwhile
   return ‘‘solvable’’
2. The algorithm terminates after at most |P | iterations, and thus runs in poly-
nomial time.
3. Correctness: See slide 632. (A hedged Python sketch of this fixpoint test follows below.)

Michael Kohlhase: Artificial Intelligence 1 629 2025-02-06
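To make the fixpoint test above concrete, here is a minimal Python sketch (not part of the official lecture materials): it represents STRIPS tasks as sets of fact strings, implements the delete relaxation of Definition 18.3.1, and decides PlanEx+ for the “Logistics” example. All names (Action, StripsTask, relaxed_plan_exists) are illustrative assumptions, not a reference implementation.

# Illustrative sketch only -- names and representation are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    dele: frozenset               # "del" is a reserved word in Python

@dataclass(frozen=True)
class StripsTask:
    actions: tuple
    init: frozenset
    goal: frozenset

def relax(task):
    """Delete relaxation (Definition 18.3.1): empty every delete list."""
    return StripsTask(tuple(Action(a.name + "+", a.pre, a.add, frozenset())
                            for a in task.actions),
                      task.init, task.goal)

def relaxed_plan_exists(task):
    """The polynomial fixpoint test for PlanEx+ from the slide above."""
    F = set(task.init)
    while not task.goal <= F:
        F_new = F | {p for a in task.actions if a.pre <= F for p in a.add}
        if F_new == F:            # (*) no new facts reachable
            return False
        F = F_new
    return True

# The "Logistics" example: truck at A, package at C; goal: truck at A, package at D.
adj = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "D"), ("D", "C")]
acts = [Action(f"drive({x},{y})", frozenset({f"truck({x})"}),
               frozenset({f"truck({y})"}), frozenset({f"truck({x})"})) for x, y in adj]
acts += [Action(f"load({x})", frozenset({f"truck({x})", f"pack({x})"}),
                frozenset({"pack(T)"}), frozenset({f"pack({x})"})) for x in "ABCD"]
acts += [Action(f"unload({x})", frozenset({f"truck({x})", "pack(T)"}),
                frozenset({f"pack({x})"}), frozenset({"pack(T)"})) for x in "ABCD"]
logistics = StripsTask(tuple(acts), frozenset({"truck(A)", "pack(C)"}),
                       frozenset({"truck(A)", "pack(D)"}))
print(relaxed_plan_exists(logistics))   # True, matching Example 18.3.4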

Deciding PlanEx+ in “TSP” in Australia

Iterations on F :

1. {at(Sy), vis(Sy)}
2. ∪ {at(Ad), vis(Ad), at(Br), vis(Br)}
3. ∪ {at(Da), vis(Da), at(Pe), vis(Pe)}

Michael Kohlhase: Artificial Intelligence 1 630 2025-02-06

Deciding PlanEx+ in “Logistics”


 Example 18.3.4 (The solvable Case).
Iterations on F :
1. {truck(A), pack(C)}

2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{truck(D), pack(T )}
5. ∪{pack(A), pack(B), pack(D)}

 Example 18.3.5 (The unsolvable Case).


Iterations on F :
1. {truck(A), pack(C)}

2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{pack(T )}
5. ∪{pack(A), pack(B)}

6. ∪∅

Michael Kohlhase: Artificial Intelligence 1 631 2025-02-06

PlanEx+ Algorithm: Proof


Proof: To show: The algorithm returns “solvable” iff there is a relaxed plan for Π.
1. Denote by Fi the content of F after the ith iteration of the while-loop, and by
Ai := {a ∈ A | prea ⊆ Fi } the set of actions whose add lists are collected in iteration i + 1.
2. All a ∈ A0 are applicable in I, all a ∈ A1 are applicable in apply(I, A0+ ), and so
forth.
3. Thus Fi = apply(I, ⟨A0+ , . . . , Ai−1+ ⟩). (Within each Aj , we can sequence the
actions in any order.)
4. Direction “⇒”: If “solvable” is returned after iteration n then G ⊆ Fn = apply(I, ⟨A0+ , . . . , An−1+ ⟩),
so ⟨A0+ , . . . , An−1+ ⟩ can be sequenced to a relaxed plan, which shows the claim.
5. Direction “⇐”:
5.1. Let ⟨a0+ , . . . , an−1+ ⟩ be a relaxed plan, hence G ⊆ apply(I, ⟨a0+ , . . . , an−1+ ⟩).
5.2. Assume, for the moment, that we drop line (*) from the algorithm. It is then
easy to see that ai ∈ Ai and apply(I, ⟨a0+ , . . . , ai−1+ ⟩) ⊆ Fi , for all i.
5.3. We get G ⊆ apply(I, ⟨a0+ , . . . , an−1+ ⟩) ⊆ Fn , and the algorithm returns “solvable”
as desired.
5.4. Assume to the contrary of the claim that, in an iteration i < n, (*) fires.
Then G ̸⊆ F and F = F ′ . But, with F = F ′ , F = Fj for all j > i, and we get
G ̸⊆ Fn in contradiction.

Michael Kohlhase: Artificial Intelligence 1 632 2025-02-06

18.4 The h+ Heuristic


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26905.

Hold on a Sec – Where are we?


(Diagram: R maps the problem class P to P ′ ⊆ P; h∗P : P → N ∪ {∞} and h∗P ′ : P ′ → N ∪ {∞}.)

 P: STRIPS tasks; h∗P : Length h∗ of a shortest plan.


 P ′ ⊆ P: STRIPS tasks with empty delete lists.
 R: Drop the delete lists.

 Heuristic function: Length of a shortest relaxed plan (h∗ ◦ R).


 PlanEx+ is not actually what we’re looking for. PlanEx+ =̂ relaxed plan existence;
we want relaxed plan length h∗ ◦ R.

Michael Kohlhase: Artificial Intelligence 1 633 2025-02-06

h+ : The Ideal Delete Relaxation Heuristic


 Definition 18.4.1 (Optimal Relaxed Plan). Let ⟨P , A, I , G⟩ be a STRIPS
task, and let s be a state. An optimal relaxed plan for s is an optimal plan for
⟨P , A, {s}, G⟩+ .
 Same as slide 626, just adding the word “optimal”.
 Here’s what we’re looking for:
 Definition 18.4.2. Let Π := ⟨P , A, I , G⟩ be a STRIPS task with states S. The
ideal delete relaxation heuristic h+ for Π is the function h+ : S → N ∪ {∞} where
h+ (s) is the length of an optimal relaxed plan for s if a relaxed plan for s exists,
and h+ (s) = ∞ otherwise.
 In other words, h+ = h∗ ◦ R, cf. previous slide.

Michael Kohlhase: Artificial Intelligence 1 634 2025-02-06
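Continuing the sketch from the previous code block (and again purely illustrative, reusing the assumed StripsTask/Action classes), h+ can be computed exactly by breadth-first search in the delete-relaxed task: since relaxed states only grow, we can search directly over the reached fact sets. This is worst-case exponential — h+ is NP-hard to compute — but fine for toy tasks; practical systems such as FF only approximate it.

# Exact h+ by BFS over relaxed fact sets; a sketch, not an efficient planner.
from collections import deque

def h_plus(task, s):
    """Length of an optimal relaxed plan for state s (a set of facts), or infinity."""
    start = frozenset(s)
    if task.goal <= start:
        return 0
    dist, queue = {start: 0}, deque([start])
    while queue:
        F = queue.popleft()
        for a in task.actions:
            if a.pre <= F:
                G = F | a.add              # relaxed successor: deletes ignored
                if G not in dist:
                    dist[G] = dist[F] + 1
                    if task.goal <= G:
                        return dist[G]     # BFS: first goal-containing set is optimal
                    queue.append(G)
    return float("inf")

# For the Logistics task sketched above: h_plus(logistics, logistics.init) == 5,
# matching the relaxed plan <drive(A,B), drive(B,C), load(C), drive(C,D), unload(D)>.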

h+ is Admissible
 Lemma 18.4.3. Let Π := ⟨P , A, I , G⟩ be a STRIPS task, and let s be a state. If
⟨a1 , . . ., an ⟩ is a plan for Πs := ⟨P , A, {s}, G⟩, then ⟨a1+ , . . ., an+ ⟩ is a plan for Πs+ .

 Proof sketch: Show by induction over 0 ≤ i ≤ n that


apply(s, ⟨a1 , . . . , ai ⟩) ⊆ apply(s, ⟨a1+ , . . . , ai+ ⟩).

 If we ignore deletes, the states along the plan can only get bigger.
 Theorem 18.4.4. h+ is admissible.

 Proof:
1. Let Π := ⟨P , A, I , G⟩ be a STRIPS task with states S, and let s ∈ S.
2. h+ (s) is defined as the optimal plan length in Πs+ .
3. With the lemma above, any plan for Πs also constitutes a plan for Πs+ .
4. Thus the optimal plan length in Πs+ can only be shorter than that in Πs , and the
claim follows.

Michael Kohlhase: Artificial Intelligence 1 635 2025-02-06

How to Relax During Search: Ignoring Deletes


Real problem:

 Initial state I: AC; goal G: AD.
 Actions A: pre, add, del. (drXY, loX, ulX)

Relaxed problem (used to compute h+ for each search state):

 Same facts and goal; actions A with pre and add only (delete lists ignored).

Greedy best-first search (tie-breaking: alphabetic) on the real problem, guided by h+ ;
states are written as pairs “truck location, package location”, where T means “package in
the truck”:

 AC: h+ (AC) = 5, e.g. ⟨drAB, drBC, drCD, loC, ulD⟩.
 Expand AC: drAB yields BC with h+ (BC) = 5, e.g. ⟨drBA, drBC, drCD, loC, ulD⟩.
 Expand BC: drBC yields CC with h+ (CC) = 5, e.g. ⟨drCB, drBA, drCD, loC, ulD⟩;
  drBA yields AC — duplicate state, prune.
 Expand CC: drCD yields DC with h+ (DC) = 5, e.g. ⟨drDC, drCB, drBA, loC, ulD⟩;
  loC yields CT with h+ (CT ) = 4, e.g. ⟨drCB, drBA, drCD, ulD⟩;
  drCB yields BC — duplicate state, prune.
 Expand CT (best h+ on the frontier): successors BT , DT (both h+ = 4) and CC (duplicate, pruned).
 Expand BT : successors AT (h+ = 4), BB (h+ = 5), and CT (duplicate).
 Expand AT : successors AA (h+ = 5) and BT (duplicate).
 Expand DT : ulD yields DD with h+ (DD) = 3; CT is a duplicate.
 Expand DD: drDC yields CD with h+ (CD) = 2; DT is a duplicate.
 Expand CD: drCB yields BD with h+ (BD) = 1; DD is a duplicate.
 Expand BD: drBA yields AD with h+ (AD) = 0 — a goal state; CD is a duplicate.

The plan found is ⟨drAB, drBC, loC, drCD, ulD, drDC, drCB, drBA⟩; along this path the
h+ values decrease 5, 5, 5, 4, 4, 3, 2, 1, 0. (The step-by-step search tree pictures of the
original slides are omitted here.)

Michael Kohlhase: Artificial Intelligence 1 636 2025-02-06

Of course there are also bad cases. Here is one.

h+ in the Blocksworld

(Figure — Initial State: the arm holds A, B is stacked on D, C and D stand on the table;
Goal State: A on B, B on C.)

 Optimal plan: ⟨putdown(A), unstack(B, D), stack(B, C), pickup(A), stack(A, B)⟩.
 Optimal relaxed plan: ⟨stack(A, B), unstack(B, D), stack(B, C)⟩.

 Observation: What can we say about the “search space surface” at the initial
state here?

 The initial state lies on a local minimum under h+ , together with the successor
state s where we stacked A onto B. All other direct neighbors of these two states
have a strictly higher h+ value.

Michael Kohlhase: Artificial Intelligence 1 637 2025-02-06

18.5 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26906.

Summary
 Heuristic search on classical search problems relies on a function h mapping states
s to an estimate h(s) of their goal state distance. Such functions h are derived by
solving relaxed problems.

 In planning, the relaxed problems are generated and solved automatically. There
are four known families of suitable relaxation methods: abstractions, landmarks,
critical paths, and ignoring deletes (aka delete relaxation).
 The delete relaxation consists in dropping the deletes from STRIPS tasks. A relaxed
plan is a plan for such a relaxed task. h+ (s) is the length of an optimal relaxed plan
for state s. h+ is NP-hard to compute.
 hFF approximates h+ by computing some, not necessarily optimal, relaxed plan.
That is done by a forward pass (building a relaxed planning graph), followed by a
backward pass (extracting a relaxed plan).

Michael Kohlhase: Artificial Intelligence 1 638 2025-02-06

Topics We Didn’t Cover Here


 Abstractions, Landmarks, Critical-Path Heuristics, Cost Partitions, Compil-
ability between Heuristic Functions, Planning Competitions:
 Tractable fragments: Planning sub-classes that can be solved in polynomial time.
Often identified by properties of the “causal graph” and “domain transition graphs”.
 Planning as SAT: Compile length-k bounded plan existence into satisfiability of
a CNF formula φ. Extensive literature on how to obtain small φ, how to schedule
different values of k, how to modify the underlying SAT solver.

 Compilations: Formal framework for determining whether planning formalism X


is (or is not) at least as expressive as planning formalism Y .
 Admissible pruning/decomposition methods: Partial-order reduction, symme-
try reduction, simulation-based dominance pruning, factored planning, decoupled
search.

 Hand-tailored planning: Automatic planning is the extreme case where the com-
puter is given no domain knowledge other than “physics”. We can instead allow the

user to provide search control knowledge, trading off modeling effort against search
performance.
 Numeric planning, temporal planning, planning under uncertainty . . .

Michael Kohlhase: Artificial Intelligence 1 639 2025-02-06

Suggested Reading (RN: Same As Previous Chapter):


• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the course) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/
ki11.pdf
Chapter 19

Searching, Planning, and Acting in


the Real World

Outline
 So Far: we made idealizing/simplifying assumptions:
The environment is fully observable and deterministic.

 Outline: In this chapter we will lift some of them


 The real world (things go wrong)
 Agents and Belief States
 Conditional planning
 Monitoring and replanning
 Note: The considerations in this chapter apply to both search and planning.

Michael Kohlhase: Artificial Intelligence 1 640 2025-02-06

19.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26908.

The real world


 Example 19.1.1. We have a flat tire – what to do?


Michael Kohlhase: Artificial Intelligence 1 641 2025-02-06

Generally: Things go wrong (in the real world)


 Example 19.1.2 (Incomplete Information).
 Unknown preconditions, e.g., Intact(Spare)?
 Disjunctive effects, e.g., Inflate(x) causes Inflated(x) ∨ SlowHiss(x) ∨ Burst(x) ∨
BrokenPump ∨ . . .

 Example 19.1.3 (Incorrect Information).


 Current state incorrect, e.g., spare NOT intact
 Missing/incorrect effects in actions.

 Definition 19.1.4. The qualification problem in planning is that we can never finish
listing all the required preconditions and possible conditional effects of actions.
 Root Cause: The environment is partially observable and/or non-deterministic.
 Technical Problem: We cannot know the “current state of the world”, but search/planning
algorithms are based on this assumption.

 Idea: Adapt search/planning algorithms to work with “sets of possible states”.

Michael Kohlhase: Artificial Intelligence 1 642 2025-02-06

What can we do if things (can) go wrong?


 One Solution: Sensorless planning: plans that work regardless of state/outcome.
 Problem: Such plans may not exist! (but they often do in practice)
 Another Solution: Conditional plans:
 Plan to obtain information, (observation actions)
 Subplan for each contingency.

 Example 19.1.5 (A conditional Plan). (AAA =̂ ADAC)
[Check(T1), if Intact(T1) then Inflate(T1) else CallAAA fi]
 Problem: Expensive because it plans for many unlikely cases.

 Still another Solution: Execution monitoring/replanning


 Assume normal states/outcomes, check progress during execution, replan if nec-
essary.
 Problem: Unanticipated outcomes may lead to failure. (e.g., no AAA card)

 Observation 19.1.6. We really need a combination; plan for likely/serious even-


tualities, deal with others when they arise, as they must eventually.

Michael Kohlhase: Artificial Intelligence 1 643 2025-02-06

19.2 The Furniture Coloring Example


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29180.
We now introduce a planning example that shows off the various features.

The Furniture-Coloring Example: Specification


 Example 19.2.1 (Coloring Furniture).

Paint a chair and a table in matching colors.


 The initial state is:

 we have two cans of paint of unknown color,


 the color of the furniture is unknown as well,
 only the table is in the agent’s field of view.
 Actions:

 remove lid from can


 paint object with paint from open can.

Michael Kohlhase: Artificial Intelligence 1 644 2025-02-06

We formalize the example in PDDL for simplicity. Note that the :percept scheme is not part of
the official PDDL, but fits in well with the design.

The Furniture-Coloring Example: PDDL


 Example 19.2.2 (Formalization in PDDL).
 The PDDL domain file is as expected (actions below)
(define (domain furniture−coloring)
(:predicates (object ?x) (can ?x) (inview ?x) (color ?x ?y))
...)

 The PDDL problem file has a “free” variable ?c for the (undetermined) joint
color.
(define (problem tc−coloring)
(:domain furniture−objects)
(:objects table chair c1 c2)
(:init (object table) (object chair) (can c1) (can c2) (inview table))
(:goal (color chair ?c) (color table ?c)))

 Two action schemata: remove can lid to open and paint with open can
(:action remove−lid
:parameters (?x)
:precondition (can ?x)
:effect (open ?x))
(:action paint
:parameters (?x ?y)
:precondition (and (object ?x) (can ?y) (color ?y ?c) (open ?y))
:effect (color ?x ?c))
The paint action has a universal variable ?c ⇝ we cannot just give paint a
color argument in a partially observable environment.
 Sensorless Plan: Open one can, paint chair and table in its color.
 Note: Contingent planning can create better plans, but needs perception
 Two percept schemata: color of an object and color in a can
(:percept color
:parameters (?x ?c)
:precondition (and (object ?x) (inview ?x)))
(:percept can−color
:parameters (?x ?c)
:precondition (and (can ?x) (inview ?x) (open ?x)))
To perceive the color of an object, it must be in view; to perceive the color in a can, the can must additionally be open.
Note: In a fully observable world, the percepts would not have preconditions.
 An action schema: look at an object that causes it to come into view.
(:action lookat
:parameters (?x)
:precondition (and (inview ?y) (notequal ?x ?y))
:effect (and (inview ?x) (not (inview ?y))))

 Contingent Plan:
1. look at furniture to determine color, if same ; done.
2. else, open the cans and look at the paint in them
3. if paint in one can is the same as an object, paint the other with this color
4. else paint both in any color

Michael Kohlhase: Artificial Intelligence 1 645 2025-02-06

19.3 Searching/Planning with Non-Deterministic Actions


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29181.

Conditional Plans
 Definition 19.3.1. Conditional plans extend the possible actions in plans by condi-
tional steps that execute sub plans depending on whether K + P ⊨ C, where K + P
is the current knowledge base + the percepts.

 Definition 19.3.2. Conditional plans can contain


 conditional step: [. . . , if C then PlanA else PlanB fi, . . .],
 while step: [. . . , while C do Plan done, . . .], and
 the empty plan ∅ to make modeling easier.

 Definition 19.3.3. If the possible percepts are limited to determining the current
state in a conditional plan, then we speak of a contingency plan.
 Note: Need some plan for every possible percept! Compare to
game playing: some response for every opponent move.
backchaining: some rule such that every premise satisfied.
 Idea: Use an AND–OR tree search (very similar to backward chaining algorithm)

Michael Kohlhase: Artificial Intelligence 1 646 2025-02-06

Contingency Planning: The Erratic Vacuum Cleaner


 Example 19.3.4 (Erratic vacuum world).
A variant suck action: if the square is

 dirty: clean the square, sometimes also removing dirt in the adjacent square.
 clean: sometimes deposits dirt on the carpet.

(Figure: the AND-OR search tree for this problem; OR nodes branch on the actions Suck/Right/Left,
AND nodes branch on the possible outcome states 1–8, with GOAL and LOOP leaves.)
Solution: [suck, if State = 5 then [right, suck] else [] fi]

Michael Kohlhase: Artificial Intelligence 1 647 2025-02-06

Conditional AND-OR Search (Data Structure)



 Idea: Use AND-OR trees as data structures for representing problems (or goals)
that can be reduced to conjunctions and disjunctions of subproblems (or sub-
goals).
 Definition 19.3.5. An AND-OR graph is a graph whose non-terminal nodes
are partitioned into AND nodes and OR nodes. A valuation of an AND-OR graph
T is an assignment of T or F to the nodes of T . A valuation of the terminal nodes
of T can be extended to all nodes recursively: Assign T to an
 OR node, iff at least one of its children is T.
 AND node, iff all of its children are T.

A solution for T is a valuation that assigns T to the initial nodes of T .


 Idea: A planning task with non deterministic actions generates an AND-OR graph
T , where a valuation assigns T to a terminal node iff it is a goal node. A solution then
corresponds to a conditional plan.

Michael Kohlhase: Artificial Intelligence 1 648 2025-02-06

Conditional AND-OR Search (Example)


 Definition 19.3.6. An AND-OR tree is an AND-OR graph that is also a tree.
Notation: AND nodes are written with arcs connecting the child edges.
 Example 19.3.7 (An AND-OR-tree).

Michael Kohlhase: Artificial Intelligence 1 649 2025-02-06

Conditional AND-OR Search (Algorithm)


 Definition 19.3.8. AND-OR search is an algorithm for searching AND–OR graphs
generated by nondeterministic environments.
function AND/OR−GRAPH−SEARCH(prob) returns a conditional plan, or fail
  OR−SEARCH(prob.INITIAL−STATE, prob, [])

function OR−SEARCH(state,prob,path) returns a conditional plan, or fail
  if prob.GOAL−TEST(state) then return the empty plan
  if state is on path then return fail
  for each action in prob.ACTIONS(state) do
    plan := AND−SEARCH(RESULTS(state,action),prob,[state | path])
    if plan ̸= fail then return [action | plan]
  return fail

function AND−SEARCH(states,prob,path) returns a conditional plan, or fail
  for each si in states do
    pi := OR−SEARCH(si ,prob,path)
    if pi = fail then return fail
  return [if s1 then p1 else if s2 then p2 else . . . if sn−1 then pn−1 else pn ]

 Cycle Handling: If a state has been seen before ; fail

 fail does not mean there is no solution, but


 if there is a non-cyclic solution, then it is reachable by an earlier incarnation!

Michael Kohlhase: Artificial Intelligence 1 650 2025-02-06
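A compact Python rendering of this AND-OR search for the erratic vacuum world (Example 19.3.4) might look as follows. This is a hypothetical sketch: the state encoding, the results function, and the plan representation (nested dicts for the AND branches) are my own choices, not fixed by the lecture.

# Sketch of AND-OR-GRAPH-SEARCH for the erratic vacuum world; illustrative only.
def results(state, action):
    """Nondeterministic transition model. A state is (location, (dirtA, dirtB))."""
    loc, (dA, dB) = state
    if action == "Left":  return {("A", (dA, dB))}
    if action == "Right": return {("B", (dA, dB))}
    # erratic Suck: cleaning a dirty square may also clean the other square;
    # sucking in a clean square may deposit dirt there
    here_dirty = dA if loc == "A" else dB
    if here_dirty:
        cleaned = ("A", (0, dB)) if loc == "A" else ("B", (dA, 0))
        return {cleaned, (loc, (0, 0))}
    dirtied = ("A", (1, dB)) if loc == "A" else ("B", (dA, 1))
    return {state, dirtied}

def or_search(state, path):
    if state[1] == (0, 0):                  # goal: both squares clean
        return []                           # the empty plan
    if state in path:                       # state already on path -> fail
        return None
    for action in ("Suck", "Left", "Right"):
        plan = and_search(results(state, action), path + [state])
        if plan is not None:
            return [action, plan]
    return None

def and_search(states, path):
    branches = {}
    for s in states:
        p = or_search(s, path)
        if p is None:
            return None
        branches[s] = p                     # "if s then p ..." for each outcome
    return branches

# Robot in A, both squares dirty (state 1 on the slide):
print(or_search(("A", (1, 1)), []))
# e.g. ['Suck', {('A', (0, 0)): [], ('A', (0, 1)): ['Right', {...}]}] -- i.e. the
# conditional plan [suck, if State = 5 then [right, suck] else [] fi].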

The Slippery Vacuum Cleaner (try, try, try, . . . try again)


 Example 19.3.9 (Slippery Vacuum World).
Moving sometimes fails ; the problem induces a cyclic AND-OR graph. (Figure omitted.)

Two possible solutions (depending on what our plan language allows)

 [L1 : left, if AtR then L1 else [if CleanL then ∅ else suck fi] fi] or
 [while AtR do [left] done, if CleanL then ∅ else suck fi]
 We have an infinite loop but the plan eventually works unless the action always fails.

Michael Kohlhase: Artificial Intelligence 1 651 2025-02-06

19.4 Agent Architectures based on Belief States


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29182.
We are now ready to proceed to environments which can only be partially observed and where actions
are non deterministic. Both sources of uncertainty conspire to allow us only partial knowledge
about the world, so that we can only optimize “expected utility” instead of “actual utility” of our
actions.

World Models for Uncertainty


 Problem: We do not know with certainty what state the world is in!
 Idea: Just keep track of all the possible states it could be in.
 Definition 19.4.1. A model-based agent has a world model consisting of

 a belief state that has information about the possible states the world may be
in, and
 a sensor model that updates the belief state based on sensor information
 a transition model that updates the belief state based on actions.
 Idea: The agent environment determines what the world model can be.

 In a fully observable, deterministic environment,


 we can observe the initial state and subsequent states are given by the actions
alone.
 thus the belief state is a singleton (we call its member the world state) and the
transition model is a function from states and actions to states: a transition
function.

Michael Kohlhase: Artificial Intelligence 1 652 2025-02-06

That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.

World Models by Agent Type in AI-1


 Search-based Agents: In a fully observable, deterministic environment
 goal-based agent with world state =̂ “current state”
 no inference. (goal =̂ goal state from search problem)

 CSP-based Agents: In a fully observable, deterministic environment

 goal-based agent with world state =̂ constraint network,
 inference =̂ constraint propagation. (goal =̂ satisfying assignment)
 Logic-based Agents: In a fully observable, deterministic environment

 model-based agent with world state =̂ logical formula
 inference =̂ e.g. DPLL or resolution.
 Planning Agents: In a fully observable, deterministic environment
 goal-based agent with world state =̂ PL0, transition model =̂ STRIPS,
 inference =̂ state/plan space search. (goal: complete plan/execution)

Michael Kohlhase: Artificial Intelligence 1 653 2025-02-06

Let us now see what happens when we lift the restrictions of total observability and determin-
ism.

World Models for Complex Environments


 In a fully observable, but stochastic environment,
 the belief state must deal with a set of possible states.
 ; generalize the transition function to a transition relation.
 Note: This even applies to online problem solving, where we can just perceive the
state. (e.g. when we want to optimize utility)
 In a deterministic, but partially observable environment,

 the belief state must deal with a set of possible states.


 we can use transition functions.
 We need a sensor model, which predicts the influence of percepts on the belief
state – during update.
 In a stochastic, partially observable environment,

 mix the ideas from the last two. (sensor model + transition relation)

Michael Kohlhase: Artificial Intelligence 1 654 2025-02-06

Preview: New World Models (Belief) ; new Agent Types


 Probabilistic Agents: In a partially observable environment
 belief state =̂ Bayesian networks,
 inference =̂ probabilistic inference.

 Decision-Theoretic Agents: In a partially observable, stochastic environment

 belief state + transition model =̂ decision networks,
 inference =̂ maximizing expected utility.

 We will study them in detail in the coming semester.

Michael Kohlhase: Artificial Intelligence 1 655 2025-02-06

19.5 Searching/Planning without Observations


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29183.

Conformant/Sensorless Planning
 Definition 19.5.1. Conformant or sensorless planning tries to find plans that work

without any sensing. (not even the initial state)

 Example 19.5.2 (Sensorless Vacuum Cleaner World).


States: integer dirt and robot locations
Actions: left, right, suck, noOp
Goal states: notdirty?

 Observation 19.5.3. In a sensorless world we do not know the initial state. (or
any state after)
 Observation 19.5.4. Sensorless planning must search in the space of belief states
(sets of possible actual states).

 Example 19.5.5 (Searching the Belief State Space).


 Start in {1, 2, 3, 4, 5, 6, 7, 8}
 Solution: [right, suck, left, suck]:
right → {2, 4, 6, 8}, suck → {4, 8}, left → {3, 7}, suck → {7}

Michael Kohlhase: Artificial Intelligence 1 656 2025-02-06

Search in the Belief State Space: Let’s Do the Math


 Recap: We describe a search problem Π := ⟨S , A, T , I , G ⟩ via its states S,
actions A, transition model T : A×S → P(S), goal states G, and initial state I.
 Problem: What is the corresponding sensorless problem?

 Let’s think: Let Π := ⟨S , A, T , I , G ⟩ be a (physical) problem

 States S b : The belief states are the 2|S| subsets of S.
 The initial state I b is just S (no information)
 Goal states G b := {S ∈ S b | S ⊆ G} (all possible states must be physical goal states)
 Actions Ab : we just take A. (that’s the point!)
 Transition model T b : Ab ×S b → P(S b ): i.e. what is T b (a, S) for a ∈ A and
S ⊆ S? This is slightly tricky as a need not be applicable to all s ∈ S.
1. if actions are harmless to the environment, take T b (a, S) := ⋃s∈S T (a, s).
2. if not, better take T b (a, S) := ⋂s∈S T (a, s). (the safe bet)
 Observation 19.5.6. In belief-state space the problem is always fully observable!

Michael Kohlhase: Artificial Intelligence 1 657 2025-02-06
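As a sanity check of this construction, here is a tiny, hypothetical Python sketch for the sensorless vacuum world: physical states are (location, dirtA, dirtB), belief states are frozensets of them, and the belief transition uses variant 1. (the union), which is fine here because all actions are always applicable. The function names are my own.

# Belief-state successors for the sensorless vacuum world; illustrative sketch.
def phys_result(state, action):
    """Deterministic physical transition (a single outcome per action)."""
    loc, dA, dB = state
    if action == "left":  return ("A", dA, dB)
    if action == "right": return ("B", dA, dB)
    if action == "suck":  return ("A", 0, dB) if loc == "A" else ("B", dA, 0)
    return state                                        # noOp

def belief_result(belief, action):
    """T^b(a, S) as the union over the member states (choice 1. above)."""
    return frozenset(phys_result(s, action) for s in belief)

# Start with total ignorance (all 8 physical states), then run [right, suck, left, suck]:
b = frozenset((l, dA, dB) for l in "AB" for dA in (0, 1) for dB in (0, 1))
for a in ["right", "suck", "left", "suck"]:
    b = belief_result(b, a)
    print(a, len(b))          # belief shrinks 4, 2, 2, 1 as in Example 19.5.5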

Let us see if we can understand the options for T b (a, S) a bit better. The first question is when we
want an action a to be applicable to a belief state S ⊆ S, i.e. when should T b (a, S) be non-empty.

In the first case, ab would be applicable iff a is applicable to some s ∈ S, in the second case if a
is applicable to all s ∈ S. So we only want to choose the first case if actions are harmless.
The second question we ask ourselves is what should be the results of applying a to S ⊆ S?,
again, if actions are harmless, we can just collect the results, otherwise, we need to make sure that
all members of the result ab are reached for all possible states in S.

State Space vs. Belief State Space


 Example 19.5.7 (State/Belief State Space in the Vacuum World). In the
vacuum world all actions are always applicable (1./2. equal)

(Figure: the state space for the vacuum world; links denote actions: L = Left, R = Right, S = Suck.)

Michael Kohlhase: Artificial Intelligence 1 658 2025-02-06

Evaluating Conformant Planning


 Upshot: We can build belief-space problem formulations automatically,
 but they are exponentially bigger in theory, in practice they are often similar;
 e.g. 12 reachable belief states out of 2⁸ = 256 for the vacuum example.

 Problem: Belief states are HUGE; e.g. the initial belief state for the 10 × 10 vacuum
world contains 100 · 2¹⁰⁰ ≈ 10³² physical states


 Idea: Use planning techniques: compact descriptions for
 belief states; e.g. all for the initial state or not leftmost column after left.
 actions as belief state to belief state operations.
 This actually works: Therefore we talk about conformant planning!

Michael Kohlhase: Artificial Intelligence 1 659 2025-02-06

19.6 Searching/Planning with Observation


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29184.

Conditional planning (Motivation)


 Note: So far, we have never used the agent’s sensors.

 In ??, since the environment was observable and deterministic we could just use
offline planning.
 In ?? because we chose to.
 Note: If the world is nondeterministic or partially observable then percepts usually
provide information, i.e., split up the belief state

 Idea: This can systematically be used in search/planning via belief-state search,


but we need to rethink/specialize the Transition model.

Michael Kohlhase: Artificial Intelligence 1 660 2025-02-06

A Transition Model for Belief-State Search


 We extend the ideas from slide 657 to include partial observability.
 Definition 19.6.1. Given a (physical) search problem Π := ⟨S , A, T , I , G ⟩, we de-
fine the belief state search problem induced by Π to be ⟨P(S), A, T b , S, {S ∈ S b | S ⊆ G}⟩,
where the transition model T b is constructed in three stages:
 The prediction stage: given a belief state b and an action a we define b̂ :=
PRED(b, a) for some function PRED : P(S)×A → P(S).
 The observation prediction stage determines the set of possible percepts that
could be observed in the predicted belief state: PossPERC(b̂) = {PERC(s) | s ∈ b̂}.
 The update stage determines, for each possible percept, the resulting belief
state: UPDATE(b̂, o) := {s | o = PERC(s) and s ∈ b̂}
The functions PRED and PERC are the main parameters of this model. We define
RESULT(b, a) := {UPDATE(PRED(b, a), o) | o ∈ PossPERC(PRED(b, a))}

 Observation 19.6.2. We always have UPDATE(b̂, o) ⊆ b̂.


 Observation 19.6.3. If sensing is deterministic, belief states for different possible
percepts are disjoint, forming a partition of the original predicted belief state.

Michael Kohlhase: Artificial Intelligence 1 661 2025-02-06
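To see the three stages in action, here is an illustrative continuation of the vacuum-world sketch from above (it reuses the assumed phys_result function; PERC, PRED, UPDATE and RESULT are implemented naively, and the encoding is again my own choice, not the lecture's):

# Prediction / observation prediction / update for the local-sensing vacuum world.
def pred(belief, action):                    # prediction stage
    return frozenset(phys_result(s, action) for s in belief)

def perc(state):                             # what the agent would sense in state s
    loc, dA, dB = state
    return (loc, "Dirty" if (dA if loc == "A" else dB) else "Clean")

def poss_perc(belief):                       # observation prediction stage
    return {perc(s) for s in belief}

def update(belief, o):                       # update stage
    return frozenset(s for s in belief if perc(s) == o)

def result(belief, action):
    b_hat = pred(belief, action)
    return {o: update(b_hat, o) for o in poss_perc(b_hat)}

# Deterministic world, Example 19.6.4: belief {1, 3} =̂ {("A",1,1), ("A",1,0)}
b = frozenset({("A", 1, 1), ("A", 1, 0)})
print(result(b, "right"))
# {("B","Dirty"): {("B",1,1)}, ("B","Clean"): {("B",1,0)}} -- two singleton beliefs.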

Example: Local Sensing Vacuum Worlds


 Example 19.6.4 (Transitions in the Vacuum World).
Deterministic World: Right is applied in the initial belief state {1, 3}, resulting in the
predicted belief state {2, 4}; for those states, the possible percepts are [B, Dirty] and
[B, Clean], leading to two belief states, each of which is a singleton.
The action Right is deterministic, sensing disambiguates to singletons.

Slippery World: Right is applied in the initial belief state, giving a predicted belief state
with four possible physical states; for those, the possible percepts are [A, Dirty], [B, Dirty],
and [B, Clean], leading to three belief states.
The action Right is non-deterministic, sensing disambiguates somewhat.

Michael Kohlhase: Artificial Intelligence 1 662 2025-02-06

Belief-State Search with Percepts

 Observation: The belief-state transition model induces an AND-OR graph.
 Idea: Use AND-OR search as in non deterministic environments.
 Example 19.6.5. AND-OR graph for initial percept [A, Dirty].
(Figure: the first level of the AND-OR search tree for the local-sensing vacuum world;
Suck is the first action in the solution.)
Solution: [Suck, Right, if Bstate = {6} then Suck else [] fi]
 Note: Belief-state-problem ; conditional step tests on belief-state percept (plan
would not be executable in a partially observable environment otherwise)
1 5 5 6 2
Michael Kohlhase: Artificial Intelligence 1 663 2025-02-06

3 7 7 4 6

Example: Agent Localization


8

Figure 4.16 Two prediction–update cycles of belief-state maintenance in the kindergarten


 Example 19.6.6. An agent inhabits a maze of which it has an accurate map. It has
vacuum world with local sensing.
four sensors that can (reliably) detect walls. The M ove action is non-deterministic,
moving the agent randomly into one of the adjacent squares.
1. Initial belief state ; bb1 all possible locations.
2. Initial percept: N W S (walls north, west, and south) ; bb2 = UPDATE(bb1 , N W S)

(a) Possible locations of robot after E = 1011


3. Agent executes M ove ; bb3 = PRED(bb12 , M ove) = one step away from these.
4. Next percept: N S ; bb4 = UPDATE(bb3 , N S)
19.6. SEARCHING/PLANNING WITH OBSERVATION 449
(a) Possible locations of robot after E1 = 1011

(b) Possible locations of robot after E1 = 1011, E2 = 1010


All in all, bb4 = UPDATE(PRED(UPDATE(bb1 , N W S), M ove), N S) localizes the
agent. Figure 4.17 Possible positions of the robot, !, (a) after one observation, E1 = 1011, and
(b) after moving one square and making a second observation, E2 = 1010. When sensors are
 Observation:
noiseless and enlarges
the transition
PRED modelthe belief there
is accurate, state, while
is only one UPDATE shrinks
possible location it again.
for the robot
consistent with this sequence of two observations.

Michael Kohlhase: Artificial Intelligence 1 664 2025-02-06

Contingent Planning
 Definition 19.6.7. The generation of plan with conditional branching based on
percepts is called contingent planning, solutions are called contingent plans.
 Appropriate for partially observable or non-deterministic environments.
 Example 19.6.8. Continuing ??.
One of the possible contingent plan is
((lookat table) (lookat chair)
(if (and (color table c) (color chair c)) (noop)
((removelid c1) (lookat c1) (removelid c2) (lookat c2)
(if (and (color table c) (color can c)) ((paint chair can))
(if (and (color chair c) (color can c)) ((paint table can))
((paint chair c1) (paint table c1)))))))
 Note: Variables in this plan are existential; e.g. in
 line 2: If there is some joint color c of the table and chair ; done.
 line 4/5: Condition can be satisfied by [c1 /can] or [c2 /can] ; instantiate ac-
cordingly.
 Definition 19.6.9. During plan execution the agent maintains the belief state b,
chooses the branch depending on whether b ⊨ c for the condition c.
 Note: The planner must make sure b ⊨ c can always be decided.

Michael Kohlhase: Artificial Intelligence 1 665 2025-02-06

Contingent Planning: Calculating the Belief State


 Problem: How do we compute the belief state?
 Recall: Given a belief state b, the new belief state b̂ is computed based on
prediction with the action a and the refinement with the percept p.

 Here: Given an action a and percepts p = p1 ∧ . . . ∧ pn , we have

 b̂ = (b\dela ) ∪ adda (as for the sensorless agent)

 If n = 1 and (:percept p1 :precondition c) is the only percept axiom, also add p
and c to b̂. (add c as otherwise p impossible)
 If n > 1 and (:percept pi :precondition ci ) are the percept axioms, also add p
and c1 ∨ . . . ∨ cn to b̂. (belief state no longer a conjunction of literals)

 Idea: Given such a mechanism for generating (exact or approximate) updated belief
states, we can generate contingent plans with an extension of AND-OR search over
belief states.
 Extension: This also works for non-deterministic actions: we extend the represen-
tation of effects to disjunctions.

Michael Kohlhase: Artificial Intelligence 1 666 2025-02-06
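A very rough sketch of this update (illustrative only, reusing the assumed Action class from the earlier sketch, and restricted to the simple case where each observed percept has exactly one percept axiom — the disjunctive case would need a richer belief representation than a set of literals):

# Belief update for contingent planning: progression by the action, then
# refinement by the observed percepts; belief states are sets of literals here.
def update_belief(belief, action, observed, percept_precondition):
    b_hat = (belief - action.dele) | action.add          # as for the sensorless agent
    for p in observed:                                   # each observed percept p ...
        b_hat = b_hat | {p} | percept_precondition[p]    # ... plus its precondition c
    return b_hat

Here percept_precondition is assumed to map each percept schema to the literal set of its (single) :percept precondition.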

AI-1 Survey on ALeA

 Online survey evaluating ALeA until 28.02.25 24:00 (Feb last)


 Works on all common devices (mobile phone, notebook, etc.)
 Is in English; takes about 10 - 20 min
depending on proficiency in English and experience with ALeA

 Questions about how ALeA is used, what it is like using ALeA, and questions about
demography
 Token is generated at the end of the survey (SAVE THIS CODE!)
 Completed surveys count as a successful prepquiz in AI-1!
 Look for Quiz 15 in the usual place (single question)
 just submit the token to get full points
 The token can also be used to exercise the rights of the GDPR.
 The survey has no time limit and is free, anonymous, can be paused and continued later
on, and can be cancelled.

Michael Kohlhase: Artificial Intelligence 1 667 2025-02-06

Find the Survey Here



https://ddi-survey.cs.fau.de/limesurvey/index.php/667123?lang=en
This URL will also be posted on the forum tonight.

Michael Kohlhase: Artificial Intelligence 1 668 2025-02-06

19.7 Online Search


A Video Nugget covering this section can be found at https://fau.tv/clip/id/29185.

Online Search and Replanning


 Note: So far we have concentrated on offline problem solving, where the agent
only acts (plan execution) after search/planning terminates.
 Recall: In online problem solving an agent interleaves computation and action: it
computes one action at a time based on incoming perceptions.
 Online problem solving is helpful in
 dynamic or semidynamic environments. (long computation times can be
harmful)
 stochastic environments. (solve contingencies only when they arise)

 Online problem solving is necessary in unknown environments ; exploration prob-


lem.

Michael Kohlhase: Artificial Intelligence 1 669 2025-02-06

Online Search Problems


 Observation: Online problem solving even makes sense in deterministic, fully
observable environments.
 Definition 19.7.1. An online search problem consists of a set S of states, and

 a function Actions(s) that returns a list of actions allowed in state s.
 the step cost function c, where c(s, a, s′ ) is the cost of executing action a in
state s with outcome s′ . (cost unknown before executing a)
 a goal test Goal Test.

 Note: We can only determine RESULT(s, a) by being in s and executing a.


 Definition 19.7.2. The competitive ratio of an online problem solving agent is the
quotient of
 offline performance, i.e. cost of optimal solutions with full information and
 online performance, i.e. the actual cost induced by online problem solving.

Michael Kohlhase: Artificial Intelligence 1 670 2025-02-06

Online Search Problems (Example)

 Example 19.7.3 (A simple maze problem).
The agent starts at S and must reach G but knows nothing
of the environment. In particular it does not know that
 Up(1, 1) results in (1,2) and
 Down(1, 1) results in (1,1) (i.e. back)
(Figure: a simple 3×3 maze with start S and goal G.)

Michael Kohlhase: Artificial Intelligence 1 671 2025-02-06

Online Search Obstacles (Dead Ends)

 Definition 19.7.4. We call a state a dead end, iff no state is reachable from it by
an action. An action that leads to a dead end is called irreversible.
 Note: With irreversible actions the competitive ratio can be infinite.
 Observation 19.7.5. No online algorithm can avoid dead ends in all state spaces.
 Example 19.7.6. Two state spaces that lead an online agent into dead ends:
(Figure (a): two state spaces that might lead an online search agent into a dead end.)
Any agent will fail in at least one of the spaces.
 Definition 19.7.7. We call ?? an adversary argument.
 Example 19.7.8. Forcing an online agent into an arbitrarily inefficient route:
(Figure (b): a two-dimensional environment in which, whichever choice the agent makes, the
adversary blocks that route with another long, thin wall, so that the path followed is much
longer than the best possible path.)
 Observation: Dead ends are a real problem for robots: ramps, stairs, cliffs, . . .
 Definition 19.7.9. A state space is called safely explorable, iff a goal state is
reachable from every reachable state.
 We will always assume this in the following.

Michael Kohlhase: Artificial Intelligence 1 672 2025-02-06

Online Search Agents


 Observation: Online and offline search algorithms differ considerably:
 For an offline agent, the environment is visible a priori.
 An online agent builds a “map” of the environment from percepts in visited
states.
Therefore, e.g. A∗ can expand any node in the fringe, but an online agent must go
there to explore it.
 Intuition: It seems best to expand nodes in “local order” to avoid spurious travel.

 Idea: Depth first search seems a good fit. (must only travel for backtracking)

Michael Kohlhase: Artificial Intelligence 1 673 2025-02-06

Online DFS Search Agent


 Definition 19.7.10. The online depth first search algorithm:
function ONLINE−DFS−AGENT(s′ ) returns an action
  inputs: s′ , a percept that identifies the current state
  persistent: result, a table mapping (s, a) to s′ , initially empty
              untried, a table mapping s to a list of untried actions
              unbacktracked, a table mapping s to a list backtracks not tried
              s, a, the previous state and action, initially null
  if Goal Test(s′ ) then return stop
  if s′ ̸∈ untried then untried[s′ ] := Actions(s′ )
  if s is not null then
    result[s, a] := s′
    add s to the front of unbacktracked[s′ ]
  if untried[s′ ] is empty then
    if unbacktracked[s′ ] is empty then return stop
    else a := an action b such that result[s′ , b] = pop(unbacktracked[s′ ])
  else a := pop(untried[s′ ])
  s := s′
  return a
 Note: result is the “environment map” constructed as the agent explores.



Michael Kohlhase: Artificial Intelligence 1 674 2025-02-06
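A direct Python transcription of ONLINE-DFS-AGENT might look like this (illustrative; the environment interface — an actions function, a goal test, and the percept s′ passed to each call — is assumed here, not prescribed by the notes):

# Sketch of an online DFS agent; called once per percept, returns the next action
# (or None for "stop"). Not the lecture's reference implementation.
class OnlineDFSAgent:
    def __init__(self, actions, goal_test):
        self.actions, self.goal_test = actions, goal_test
        self.result = {}                  # the learned "environment map": (s, a) -> s'
        self.untried = {}                 # s -> actions not yet tried in s
        self.unbacktracked = {}           # s -> predecessors not yet backtracked to
        self.s, self.a = None, None       # previous state and action

    def __call__(self, s_prime):
        if self.goal_test(s_prime):
            return None                                            # stop
        if s_prime not in self.untried:
            self.untried[s_prime] = list(self.actions(s_prime))
        if self.s is not None:
            self.result[(self.s, self.a)] = s_prime
            self.unbacktracked.setdefault(s_prime, []).insert(0, self.s)
        if not self.untried[s_prime]:
            if not self.unbacktracked.get(s_prime):
                return None                                        # stop
            back = self.unbacktracked[s_prime].pop(0)
            # an action b known to lead from s_prime back to `back`
            self.a = next(b for (st, b), r in self.result.items()
                          if st == s_prime and r == back)
        else:
            self.a = self.untried[s_prime].pop(0)
        self.s = s_prime
        return self.a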

19.8 Replanning and Execution Monitoring

A Video Nugget covering this section can be found at https://fau.tv/clip/id/29186.

Replanning (Ideas)

 Idea: We can turn a planner P into an online problem solver by adding an action
RePlan(g) without preconditions that re-starts P in the current state with goal g.
 Observation: Replanning induces a tradeoff between pre-planning and re-planning.
 Example 19.8.1. The plan [RePlan(g)] is a (trivially) complete plan for any goal
g. (not helpful)
 Example 19.8.2. A plan with sub-plans for every contingency (e.g. what to do if
a meteor strikes) may be too costly/large. (wasted effort)
 Example 19.8.3. But when a tire blows while driving into the desert, we want to
have water pre-planned. (due diligence against catastrophes)
 Observation: In stochastic or partially observable environments we also need some
form of execution monitoring to determine the need for replanning (plan repair).
Michael Kohlhase: Artificial Intelligence 1 675 2025-02-06

Replanning for Plan Repair

 Generally: Replanning when the agent’s model of the world is incorrect.

Figure 11.12 At first, the sequence “whole plan” is expected to get the agent from S to G.
The agent executes steps of the plan until it expects to be in state E, but observes that it is
actually in O. The agent then replans for the minimal repair plus continuation to reach G.
19.8. REPLANNING AND EXECUTION MONITORING 455

 The agent executes wholeplan step by step, monitoring the rest (plan).
 After a few steps the agent expects to be in E, but observes state O.
 Replanning: by calling the planner recursively
 find state P in wholeplan and a plan repair from O to P . (P may be G)
 minimize the cost of repair + continuation

Michael Kohlhase: Artificial Intelligence 1 676 2025-02-06

Factors in World Model Failure ; Monitoring


 Generally: The agent’s world model can be incorrect, because
 an action has a missing precondition (need a screwdriver for remove−lid)
 an action misses an effect (painting a table gets paint on the floor)
 it is missing a state variable (amount of paint in a can: no paint ; no color)
 no provisions for exogenous events (someone knocks over a paint can)
 Observation: Without a way for monitoring for these, planning is very brittle.

 Definition 19.8.5. There are three levels of execution monitoring: before executing
an action
 action monitoring checks whether all preconditions still hold.
 plan monitoring checks that the remaining plan will still succeed.
 goal monitoring checks whether there is a better set of goals it could try to
achieve.
 Note: ?? was a case of action monitoring leading to replanning.

Michael Kohlhase: Artificial Intelligence 1 677 2025-02-06

Integrated Execution Monitoring and Planning


 Problem: Need to upgrade planning data structures by bookkeeping for execution
monitoring.
 Observation: With their causal links, partially ordered plans already have most of
the infrastructure for action monitoring:
Preconditions of remaining plan
=̂ all preconditions of remaining steps not achieved by remaining steps
=̂ all causal links “crossing the current time point”

 Idea: On failure, resume planning (e.g. by POP) to achieve open conditions from
current state.
 Definition 19.8.6. IPEM (Integrated Planning, Execution, and Monitoring):

 keep updating Start to match current state


 links from actions replaced by links from Start when done

Michael Kohlhase: Artificial Intelligence 1 678 2025-02-06

Execution Monitoring Example


 Example 19.8.7 (Shopping for a drill, milk, and bananas). Start/end at home,
drill sold by hardware store, milk/bananas by supermarket.

Michael Kohlhase: Artificial Intelligence 1 679 2025-02-06


Chapter 20

Semester Change-Over

20.1 What did we learn in AI 1?


A Video Nugget covering this section can be found at https://fau.tv/clip/id/26916.

Topics of AI-1 (Winter Semester)


 Getting Started
 What is Artificial Intelligence? (situating ourselves)
 Logic programming in Prolog (An influential paradigm)
 Intelligent Agents (a unifying framework)
 Problem Solving
 Problem Solving and search (Black Box World States and Actions)
 Adversarial search (Game playing) (A nice application of search)
 constraint satisfaction problems (Factored World States)
 Knowledge and Reasoning
 Formal Logic as the mathematics of Meaning
 Propositional logic and satisfiability (Atomic Propositions)
 First-order logic and theorem proving (Quantification)
 Logic programming (Logic + Search; Programming)
 Description logics and semantic web
 Planning

 Planning Frameworks
 Planning Algorithms
 Planning and Acting in the real world

Michael Kohlhase: Artificial Intelligence 1 680 2025-02-06


Rational Agents as an Evaluation Framework for AI


 Agents interact with the environment

General agent schema

Figure 2.1 Agents interact with environments through sensors and actuators.

Simple Reflex Agents

Figure 2.9 Schematic diagram of a simple reflex agent.
Figure 2.10 A simple reflex agent. It acts according to a rule whose condition matches the
current state, as defined by the percept.

Reflex Agents with State

Figure 2.11 A model-based reflex agent.
Figure 2.12 A model-based reflex agent. It keeps track of the current state of the world,
using an internal model. It then chooses an action in the same way as the reflex agent.

Goal-Based Agents

Figure 2.13 A model-based, goal-based agent. It keeps track of the world state as well as
a set of goals it is trying to achieve, and chooses an action that will (eventually) lead to the
achievement of its goals.

Utility-Based Agent

Figure 2.14 A model-based, utility-based agent. It uses a model of the world, along with
a utility function that measures its preferences among states of the world. Then it chooses the
action that leads to the best expected utility, where expected utility is computed by averaging
over all possible outcome states, weighted by the probability of the outcome.

Learning Agents

Figure 2.15 A general learning agent.


Michael Kohlhase: Artificial Intelligence 1 681 2025-02-06

Rational Agent

 Idea: Try to design agents that are successful (do the right thing)
 Definition 20.1.1. An agent is called rational, if it chooses whichever action maximizes
the expected value of the performance measure given the percept sequence to date. This is
called the MEU principle.
 Note: A rational agent need not be perfect
  only needs to maximize expected value (rational ̸= omniscient)
   need not predict e.g. very unlikely but catastrophic events in the future
  percepts may not supply all relevant information (rational ̸= clairvoyant)
   if we cannot perceive things we do not need to react to them.
   but we may need to try to find out about hidden dangers (exploration)
  action outcomes may not be as expected (rational ̸= successful)
   but we may need to take action to ensure that they do (more often) (learning)
 Rational ; exploration, learning, autonomy

Michael Kohlhase: Artificial Intelligence 1 682 2025-02-06

Symbolic AI: Adding Knowledge to Algorithms


 Problem Solving (Black Box States, Transitions, Heuristics)

 Framework: Problem Solving and Search (basic tree/graph walking)


 Variant: Game playing (Adversarial search) (minimax + αβ-Pruning)
 Constraint Satisfaction Problems (heuristic search over partial assignments)
 States as partial variable assignments, transitions as assignment

 Heuristics informed by current restrictions, constraint graph


 Inference as constraint propagation (transferring possible values across arcs)
 Describing world states by formal language (and drawing inferences)

 Propositional logic and DPLL (deciding entailment efficiently)


 First-order logic and ATP (reasoning about infinite domains)
 Digression: Logic programming (logic + search)
 Description logics as moderately expressive, but decidable logics

 Planning: Problem Solving using white-box world/action descriptions


 Framework: describing world states in logic as sets of propositions and actions
by preconditions and add/delete lists
 Algorithms: e.g heuristic search by problem relaxations

Michael Kohlhase: Artificial Intelligence 1 683 2025-02-06

Topics of AI-2 (Summer Semester)


 Uncertain Knowledge and Reasoning
 Uncertainty
 Probabilistic reasoning
 Making Decisions in Episodic Environments
 Problem Solving in Sequential Environments
 Foundations of machine learning
 Learning from Observations
 Knowledge in Learning
 Statistical Learning Methods
 Communication (If there is time)
 Natural Language Processing
 Natural Language for Communication

Michael Kohlhase: Artificial Intelligence 1 684 2025-02-06



Artificial Intelligence I/II

Prof. Dr. Michael Kohlhase


Professur für Wissensrepräsentation und -verarbeitung
Informatik, FAU Erlangen-Nürnberg
[email protected]

20.2 Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible. If you have questions please make sure you discuss them
with the instructor, the teaching assistants, or your fellow students. There are three sensible
venues for such discussions: online in the lectures, in the tutorials, which we discuss now, or in
the course forum – see below. Finally, it is always a very good idea to form study groups with
your friends.

Tutorials for Artificial Intelligence 1


 Approach: Weekly tutorials and homework assignments (first one in week two)
 Goal 1: Reinforce what was taught in the lectures. (you need practice)

 Goal 2: Allow you to ask any question you have in a protected environment.
 Instructor/Lead TA: Florian Rabe (KWARC Postdoc)
 Room: 11.137 @ Händler building, [email protected]

 Tutorials: One each taught by Florian Rabe, . . . .


 Life-saving Advice: Go to your tutorial, and prepare for it by having looked at
the slides and the homework assignments!

Michael Kohlhase: Artificial Intelligence 2 685 2025-02-06

Now we come to a topic that is always interesting to the students: the grading scheme.

Assessment, Grades
 Overall (Module) Grade:

 Grade via the exam (Klausur) ; 100% of the grade.


 Up to 10% bonus on top for an exam with ≥ 50% points. (< 50% ; no bonus)
 Bonus points ≙ percentage sum of the best 10 prepquizzes divided by 100.
 Exam: 90 minutes exam conducted in presence on paper! (∼ Oct. 1. 2025)

 Retake Exam: 90 min exam six months later. (∼ April 1. 2026)

 Register for exams in https://campo.fau.de. (there is a deadline!)


 Note: You can de-register from an exam on https://campo.fau.de up to three
working days before exam. (do not miss that if you are not prepared)

Michael Kohlhase: Artificial Intelligence 2 686 2025-02-06

AI-2 Homework Assignments


 Goal: Homework assignments reinforce what was taught in lectures.

 Homework Assignments: Small individual problem/programming/proof task


 but take time to solve (at least read them directly ; questions)
 Didactic Intuition: Homework assignments give you material to test your under-
standing and show you how to apply it.
 Homeworks give no points, but without trying you are unlikely to pass the exam.
 Homeworks will be mainly peer-graded in the ALeA system.
 Didactic Motivation: Through peer grading students are able to see mistakes
in their thinking and can correct any problems in future assignments. By grading
assignments, students may learn how to complete assignments more accurately and
how to improve their future results. (not just us being lazy)

Michael Kohlhase: Artificial Intelligence 2 687 2025-02-06

It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take very little home from the course. Just sitting in the course and nodding is not
enough!

AI-2 Homework Assignments – Howto

 Homework Workflow: in ALeA (see below)


 Homework assignments will be published on thursdays: see https://courses.
voll-ki.fau.de/hw/ai-1
 Submission of solutions via the ALeA system in the week after
 Peer grading/feedback (and master solutions) via answer classes.
 Quality Control: TAs and instructors will monitor and supervise peer grading.
 Experiment: Can we motivate enough of you to make peer assessment self-
sustaining?

 I am appealing to your sense of community responsibility here . . .


 You should only expect others to grade your submission if you grade theirs
(cf. Kant’s “Moral Imperative”)
 Make no mistake: The grader usually learns at least as much as the gradee.
 Homework/Tutorial Discipline:

 Start early! (many assignments need more than one evening’s work)
 Don’t start by sitting at a blank screen (talking & study groups help)
 Humans will be trying to understand the text/code/math when grading it.
 Go to the tutorials, discuss with your TA! (they are there for you!)

Michael Kohlhase: Artificial Intelligence 2 688 2025-02-06



Prerequisites for AI-2


 Content Prerequisites: the mandatory courses in CS@FAU; Sem 1-4, in particular:
 course “Mathematik C4” (InfMath4). (for stochastics)
 (very) elementary complexity theory. (big Oh and friends)
also AI-1 (“Artificial Intelligence I”) (of course)

 Intuition: (take them with a kilo of salt)


 This is what I assume you know! (I have to assume something)
 In many cases, the dependency of AI-2 on these is partial and “in spirit”.
 If you have not taken these (or do not remember), read up on them as needed!

 The real Prerequisite: Motivation, Interest, Curiosity, hard work. (AI-2 is


non-trivial)
 You can do this course if you want! (and I hope you are successful)

Michael Kohlhase: Artificial Intelligence 2 689 2025-02-06

One special case of academic rules that affects students is the question of cheating, which we will
cover next.

Cheating [adapted from CMU:15-211 (P. Lee, 2003)]

 There is no need to cheat in this course!! (hard work will usually do)
 Note: Cheating prevents you from learning (you are cutting into your own flesh)
 We expect you to know what is useful collaboration and what is cheating.
 You have to hand in your own original code/text/math for all assignments
 You may discuss your homework assignments with others, but if doing so impairs
your ability to write truly original code/text/math, you will be cheating
 Copying from peers, books or the Internet is plagiarism unless properly attributed
(even if you change most of the actual words)
 I am aware that there may have been different standards about this at your previous
university! (these are the ground rules here)
 There are data mining tools that monitor the originality of text/code.
 Procedure: If we catch you at cheating. . . (correction: if we suspect cheating)
 We will confront you with the allegation and impose a grade sanction.
 If you have a reasonable explanation we lift that. (you have to convince us)
 Note: Both active (copying from others) and passive cheating (allowing others to
copy) are penalized equally.

Michael Kohlhase: Artificial Intelligence 2 690 2025-02-06

We are fully aware that the border between cheating and useful and legitimate collaboration is

difficult to find and will depend on the special case. Therefore it is very difficult to put this into
firm rules. We expect you to develop a firm intuition about behavior with integrity over the course
of your stay at FAU. Do use the opportunity to discuss the AI-2 topics with others. After all, one
of the non-trivial skills you want to learn in the course is how to talk about Artificial Intelligence
topics. And that takes practice, practice, and practice. Due to the current AI hype, the course
Artificial Intelligence is very popular and thus many degree programs at FAU have adopted it for
their curricula. Sometimes the course setup that fits the CS program does not fit the others
very well; therefore there are some special conditions, which I want to state here.

Special Admin Conditions


 Some degree programs do not “import” the course Artificial Intelligence 1, and thus
you may not be able to register for the exam via https://campo.fau.de.
 Just send me an e-mail and come to the exam, (we do the necessary admin)
 Tell your program coordinator about AI-1/2 so that they remedy this situation

 In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.
 ECTS credits need to be divisible by five ⇝ 7.5 + 7.5 = 15.

Michael Kohlhase: Artificial Intelligence 2 691 2025-02-06

I can only warn about what I am aware of, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.

20.3 Overview over AI and Topics of AI-II


We restart the new semester by reminding ourselves of (the problems, methods, and issues of)
Artificial Intelligence, and what has been achieved so far.

20.3.1 What is Artificial Intelligence?


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/21701.
The first question we have to ask ourselves is “What is Artificial Intelligence?”, i.e. how can we
define it. And already that poses a problem since the natural definition “like human intelligence,
but artificially realized” presupposes a definition of intelligence, which is equally problematic; even
Psychologists and Philosophers – the subjects nominally “in charge” of natural intelligence – have
problems defining it, as witnessed by the plethora of theories e.g. found at [WHI].

What is Artificial Intelligence? Definition



 Definition 20.3.1 (According to Wikipedia). Artificial Intelligence (AI)
is intelligence exhibited by machines
 Definition 20.3.2 (also). Artificial Intelli-
gence (AI) is a sub-field of computer science
that is concerned with the automation of in-
telligent behavior.
 BUT: it is already difficult to define intel-
ligence precisely.

 Definition 20.3.3 (Elaine Rich). Artificial Intelligence (AI) studies how we can make
the computer do things that humans can still do better at the moment.
Michael Kohlhase: Artificial Intelligence 2 692 2025-02-06

Maybe we can get around the problems of defining “what artificial intelligence is”, by just describing
the necessary components of AI (and how they interact). Let’s have a try to see whether that is
more informative.

What is Artificial Intelligence? Components


 Elaine Rich: AI studies how we can make the computer do things that humans
can still do better at the moment.

 This needs a combination of

the ability to learn

Inference

Perception

Language understanding

Emotion

Michael Kohlhase: Artificial Intelligence 2 693 2025-02-06

Note that this list of components is controversial as well. Some say that it lumps together cognitive
capacities that should be distinguished or forgets others, . . . . We state it here much more to get
AI-2 students to think about the issues than to make it normative.

20.3.2 Artificial Intelligence is here today!


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/21697.
The components of Artificial Intelligence are quite daunting, and none of them are fully understood,
much less achieved artificially. But for some tasks we can get by with much less. And indeed that
is what the field of Artificial Intelligence does in practice – but keeps the lofty ideal around. This
practice of “trying to achieve AI in selected and restricted domains” (cf. the discussion starting
with slide 32) has borne rich fruits: systems that meet or exceed human capabilities in such areas.
Such systems are in common use in many domains of application.

Artificial Intelligence is here today!



 in outer space
  in outer space systems need autonomous control:
  remote control impossible due to time lag
 in artificial limbs
  the user controls the prosthesis via existing nerves, can e.g. grip a sheet of paper.
 in household appliances
  The iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges.
  general robotic household help is on the horizon.
 in hospitals
  in the USA 90% of the prostate operations are carried out by RoboDoc
  Paro is a cuddly robot that eases solitude in nursing homes.

Michael Kohlhase: Artificial Intelligence 2 694 2025-02-06

We will conclude this subsection with a note of caution.

The AI Conundrum
 Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
 But: researchers at the Dartmouth Conference (1956) really thought they would
solve/reach AI in two/three decades.

 Consequence: AI still asks the big questions. (and still promises answers soon)
 Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
 AI Conundrum: Once AI solves a subfield it is called “computer science”.
(becomes a separate subfield of CS)
 Example 20.3.4. Functional/Logic Programming, automated theorem proving,
Planning, machine learning, Knowledge Representation, . . .
 Still Consequence: AI research was alternatingly flooded with money and cut off
brutally.

Michael Kohlhase: Artificial Intelligence 2 695 2025-02-06

All of these phenomena can be seen in the growth of AI as an academic discipline over the course
of its now over 70 year long history.

The current AI Hype — Part of a longer Story


 The history of AI as a discipline has been very much tied to the amount of funding
– that allows us to do research and development.
 Funding levels are tied to public perception of success (especially for AI)
 Definition 20.3.5. An AI winter is a time period of low public perception and
funding for AI,
mostly because AI has failed to deliver on its – sometimes overblown – promises
An AI summer is a time period of high public perception and funding for AI
 A potted history of AI (AI summers and winters)

(Timeline figure, 1950–2021: Turing Test and the Dartmouth Conference mark the start; the Lighthill report precedes AI Winter 1 (1974–1980); AI Winter 2 follows (1987–1994); the WWW brings a data/computing explosion; excitement fades while some applications profit a lot; AI becomes scarily effective and ubiquitous; AI consequences, biases, and regulation become topics; eventually the AI bubble bursts and the next AI winter comes.)

Michael Kohlhase: Artificial Intelligence 2 696 2025-02-06

Of course, the future of AI is still unclear, we are currently in a massive hype caused by the advent
of deep neural networks being trained on all the data of the Internet, using the computational
power of huge compute farms owned by an oligopoly of massive technology companies – we are
definitely in an AI summer.
But AI as an academic community and the tech industry also make outrageous promises, and
the media pick it up and distort it out of proportion, . . . So public opinion could flip again, sending
AI into the next winter.

20.3.3 Ways to Attack the AI Problem


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/21717.
There are currently three main avenues of attack to the problem of building artificially intelligent
systems. The (historically) first is based on the symbolic representation of knowledge about the
world and uses inference-based methods to derive new knowledge on which to base action decisions.
The second uses statistical methods to deal with uncertainty about the world state and learning
methods to derive new (uncertain) world assumptions to act on.

Four Main Approaches to Artificial Intelligence


 Definition 20.3.6. Symbolic AI is a subfield of AI based on the assumption that
many aspects of intelligence can be achieved by the manipulation of symbols, com-
bining them into meaning-carrying structures (expressions) and manipulating them
(using processes) to produce new expressions.
 Definition 20.3.7. Statistical AI remedies the two shortcomings of symbolic AI
approaches: that all concepts represented by symbols are crisply defined, and that all
aspects of the world are knowable/representable in principle. Statistical AI adopts
sophisticated mathematical models of uncertainty and uses them to create more
accurate world models and reason about them.
 Definition 20.3.8. Subsymbolic AI (also called connectionism or neural AI) is a
subfield of AI that posits that intelligence is inherently tied to brains, where infor-
mation is represented by a simple sequence of pulses that are processed in parallel via
simple calculations realized by neurons, and thus concentrates on neural computing.

 Definition 20.3.9. Embodied AI posits that intelligence cannot be achieved by


reasoning about the state of the world (symbolically, statistically, or connectivist),
but must be embodied i.e. situated in the world, equipped with a “body” that can

interact with it via sensors and actuators. Here, the main method for realizing
intelligent behavior is by learning from the world.

Michael Kohlhase: Artificial Intelligence 2 697 2025-02-06

As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersection
of computer science (logic, programming, applied statistics), Cognitive Science (psychology, neu-
roscience), philosophy (can machines think, what does that mean?), linguistics (natural language
understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.

Two ways of reaching Artificial Intelligence?


 We can classify the AI approaches by their coverage and the analysis depth (they
are complementary)

              Coverage →      Narrow                  Wide
Analysis ↑
Deep                          symbolic (AI-1)         not there yet (cooperation?)
Shallow                       no-one wants this       statistical/subsymbolic (AI-2)

 This semester we will cover foundational aspects of symbolic AI (deep/narrow


processing)
 next semester concentrate on statistical/subsymbolic AI.
(shallow/wide-coverage)

Michael Kohlhase: Artificial Intelligence 2 698 2025-02-06

We combine the topics in this way in this course, not only because this reproduces the histor-
ical development but also as the methods of statistical and subsymbolic AI share a common
basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas, where symbolic AI and statistical/subsymbolic AI
have their respective fortes correspond to natural application areas.

Environmental Niches for both Approaches to AI


 Observation: There are two kinds of applications/tasks in AI

 Consumer tasks: consumer grade applications have tasks that must be fully
generic and wide coverage. ( e.g. machine translation like Google Translate)
 Producer tasks: producer grade applications must be high-precision, but can be

domain-specific (e.g. multilingual documentation, machinery-control, program


verification, medical technology)

(Diagram: precision vs. coverage. Producer tasks sit near 100% precision at a coverage of about 10^(3±1) concepts; consumer tasks sit near 50% precision at about 10^(6±1) concepts.)

after Aarne Ranta [Ran17].

 General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
 A domain of producer tasks I am interested in: mathematical/technical documents.

Michael Kohlhase: Artificial Intelligence 2 699 2025-02-06

An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also provide comprehensive
machine operation manuals, a non-trivial undertaking, since no two machines are identical and
they must be translated into many languages, leading to hundreds of documents. As those manuals
share a lot of semantic content, their management should be supported by AI techniques. It is
critical that these methods maintain a high precision, as operation errors can easily lead to very
costly machine damage and loss of production. On the other hand, the domain of these manuals is
quite restricted. A machine tool has only a couple of hundred components that can be described
by a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.

20.3.4 AI in the KWARC Group

The KWARC Research Group


 Observation: The ability to represent knowledge about the world and to draw
logical inferences is one of the central components of intelligent behavior.

 Thus: reasoning components of some form are at the heart of many AI systems.
 KWARC Angle: Scaling up (web-coverage) without dumbing down (too much)
 Content markup instead of full formalization (too tedious)
 User support and quality control instead of “The Truth” (elusive anyway)
 use Mathematics as a test tube (Mathematics ≙ Anything Formal)
 care more about applications than about philosophy (we cannot help getting
this right anyway as logicians)
 The KWARC group was established at Jacobs Univ. in 2004, moved to FAU Erlan-
gen in 2016

 see http://kwarc.info for projects, publications, and links

Michael Kohlhase: Artificial Intelligence 2 700 2025-02-06

Overview: KWARC Research and Projects

Applications: eMath 3.0, Active Documents, Active Learning, Semantic Spreadsheets/CAD/CAM, Change Management, Global Digital Math Library, Math Search Systems, SMGloM: Semantic Multilingual Math Glossary, Serious Games, . . .

Foundations of Math: MathML, OpenMath; advanced Type Theories; Mmt: Meta Meta Theory; Logic Morphisms/Atlas; Theorem Prover/CAS Interoperability; Mathematical Models/Simulation

KM & Interaction: Semantic Interpretation (aka. Framing); math-literate interaction; MathHub: math archives & active docs; Active documents: embedded semantic services; Model-based Education

Semantization: LATEXML: LATEX ; XML; STEX: Semantic LATEX; invasive editors; Context-Aware IDEs; Mathematical Corpora; Linguistics of Math; ML for Math Semantics Extraction

Foundations: Computational Logic, Web Technologies, OMDoc/Mmt

Michael Kohlhase: Artificial Intelligence 2 701 2025-02-06

Research Topics in the KWARC Group


 We are always looking for bright, motivated KWARCies.
 We have topics in for all levels! (Enthusiast, Bachelor, Master, Ph.D.)

 List of current topics: https://gl.kwarc.info/kwarc/thesis-projects/


 Automated Reasoning: Maths Representation in the Large
 Logics development, (Meta)n -Frameworks
 Math Corpus Linguistics: Semantics Extraction
 Serious Games, Cognitive Engineering, Math Information Retrieval, Legal Rea-
soning, . . .
 . . . last but not least: KWARC is the home of ALeA!
 We always try to find a topic at the intersection of your and our interests.
 We also sometimes have positions! (HiWi, Ph.D.: 1/2 E-13, PostDoc: full E-13)

Michael Kohlhase: Artificial Intelligence 2 702 2025-02-06

20.3.5 Agents and Environments in AI2


This part of the lecture notes addresses inference and agent decision making in partially observable
environments, i.e. where we only know probabilities instead of certainties whether propositions
are true/false. We cover basic probability theory and – based on that – Bayesian Networks and

simple decision making in such environments. Finally we extend this to probabilistic temporal
models and their decision theory.

20.3.5.1 Recap: Rational Agents as a Conceptual Framework


A Video Nugget covering this subsubsection can be found at https://fau.tv/clip/id/27585.

Agents and Environments


 Definition 20.3.10. An agent is anything that
 perceives its environment via sensors (a means of sensing the environment)
 acts on it with actuators (means of changing the environment).
Definition 20.3.11. Any recognizable, coherent employment of the actuators of an
agent is called an action.

 Example 20.3.12. Agents include humans, robots, softbots, thermostats, etc.


 remark: The notion of an agent and its environment is intentionally designed to
be inclusive. We will classify and discuss subclasses of both later

Michael Kohlhase: Artificial Intelligence 2 703 2025-02-06

One possible objection to this is that the agent and the environment are conceptualized as separate
entities; in particular, that the image suggests that the agent itself is not part of the environment.
Indeed that is intended, since it makes thinking about agents and environments easier and is of
little consequence in practice. In particular, the offending separation is relatively easily fixed if
needed.

Agent Schema: Visualizing the Internal Agent Structure


 Agent Schema: We will use the following kind of agent schema to visualize the
internal structure of an agent:

Figure 2.1 Agents interact with environments through sensors and actuators.
Different agents differ on the contents of the white box in the center.
Michael Kohlhase: Artificial Intelligence 2 704 2025-02-06

Rationality

 Idea: Try to design agents that are successful! (aka. “do the right thing”)
 Problem: What do we mean by “successful”, how do we measure “success”?
 Definition 20.3.13. A performance measure is a function that evaluates a sequence
of environments.
 Example 20.3.14. A performance measure for a vacuum cleaner could
  award one point per “square” cleaned up in time T ?
  award one point per clean “square” per time step, minus one per move?
  penalize for > k dirty squares?
 Definition 20.3.15. An agent is called rational, if it chooses whichever action maximizes
the expected value of the performance measure given the percept sequence to date.
 Critical Observation: We only need to maximize the expected value, not the
actual value of the performance measure!

 Question: Why is rationality a good quality to aim for?

Michael Kohlhase: Artificial Intelligence 2 705 2025-02-06
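To make the MEU principle concrete, here is a minimal Python sketch; the outcome distributions and values are made-up illustration data, not part of the vacuum-cleaner example above.

def expected_value(outcomes):
    # outcomes: list of (probability, performance value) pairs for one action
    return sum(p * v for p, v in outcomes)

def rational_choice(actions):
    # actions: dict mapping each available action to its outcome distribution
    return max(actions, key=lambda a: expected_value(actions[a]))

actions = {
    "Suck":  [(1.0, 1.0)],               # certainly cleans the current square
    "Right": [(0.4, 2.0), (0.6, 0.0)],   # might find a dirtier square, might not
}
print(rational_choice(actions))          # Suck (expected value 1.0 vs. 0.8)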

Let us see how the observation that we only need to maximize the expected value, not the actual
value of the performance measure affects the consequences.

Consequences of Rationality: Exploration, Learning, Autonomy


 Note: A rational agent need not be perfect:
 It only needs to maximize expected value (rational ̸= omniscient)
 need not predict e.g. very unlikely but catastrophic events in the future
 Percepts may not supply all relevant information (rational ̸= clairvoyant)

 if we cannot perceive things we do not need to react to them.


 but we may need to try to find out about hidden dangers (exploration)
 Action outcomes may not be as expected (rational ̸= successful)
 but we may need to take action to ensure that they do (more often)
(learning)
 Note: Rationality may entail exploration, learning, autonomy (depending on the
environment / task)
 Definition 20.3.16. An agent is called autonomous, if it does not rely on the prior
knowledge about the environment of the designer.
 Autonomy avoids fixed behaviors that can become unsuccessful in a changing en-
vironment. (anything else would be
irrational)
 The agent may have to learn all relevant traits, invariants, properties of the envi-
ronment and actions.

Michael Kohlhase: Artificial Intelligence 2 706 2025-02-06

For the design of an agent for a specific task – i.e. to choose an agent architecture and design an
agent program, we have to take into account the performance measure, the environment, and the
characteristics of the agent itself; in particular its actions and sensors.

PEAS: Describing the Task Environment


 Observation: To design a rational agent, we must specify the task environment in
terms of performance measure, environment, actuators, and sensors, together called
the PEAS components.
 Example 20.3.17. When designing an automated taxi:
 Performance measure: safety, destination, profits, legality, comfort, . . .
 Environment: US streets/freeways, traffic, pedestrians, weather, . . .
 Actuators: steering, accelerator, brake, horn, speaker/display, . . .
 Sensors: video, accelerometers, gauges, engine sensors, keyboard, GPS, . . .
 Example 20.3.18 (Internet Shopping Agent). The task environment:
 Performance measure: price, quality, appropriateness, efficiency
 Environment: current and future WWW sites, vendors, shippers
 Actuators: display to user, follow URL, fill in form
 Sensors: HTML pages (text, graphics, scripts)

Michael Kohlhase: Artificial Intelligence 2 707 2025-02-06

The PEAS criteria are essentially a laundry list of what an agent design task description should
include.

Environment types

 Observation 20.3.19. Agent design is largely determined by the type of environ-


ment it is intended for.
 Problem: There is a vast number of possible kinds of environments in AI.

 Solution: Classify along a few “dimensions”. (independent characteristics)


 Definition 20.3.20. For an agent a we classify the environment e of a by its type,
which is one of the following. We call e
1. fully observable, iff the a’s sensors give it access to the complete state of the
environment at any point in time, else partially observable.
2. deterministic, iff the next state of the environment is completely determined by
the current state and a’s action, else stochastic.
3. episodic, iff a’s experience is divided into atomic episodes, where it perceives and
then performs a single action. Crucially, the next episode does not depend on
previous ones. Non-episodic environments are called sequential.
4. dynamic, iff the environment can change without an action performed by a, else
static. If the environment does not change but a’s performance measure does,
we call e semidynamic.
5. discrete, iff the sets of e’s states and a’s actions are countable, else continuous.
6. single-agent, iff only a acts on e; else multi-agent (when must we count parts of
e as agents?)

Michael Kohlhase: Artificial Intelligence 2 708 2025-02-06

Simple reflex agents


 Definition 20.3.21. A simple reflex agent is an agent a that only bases its actions
on the last percept: so the agent function simplifies to f a : P → A.
 Agent Schema:

Figure 2.9 Schematic diagram of a simple reflex agent.


 Example 20.3.22 (Agent Program).
procedure Reflex−Vacuum−Agent [location,status] returns an action
  if status = Dirty then . . .

function SIMPLE-REFLEX-AGENT(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← INTERPRET-INPUT(percept)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

Figure 2.10 A simple reflex agent. It acts according to a rule whose condition matches
the current state, as defined by the percept.

Michael Kohlhase: Artificial Intelligence 2 709 2025-02-06

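Example 20.3.22 can be spelled out as a tiny Python sketch of the vacuum-cleaner agent; the percept format (location, status) and the action names are assumptions.

def reflex_vacuum_agent(percept):
    # a simple reflex agent: the action depends only on the last percept
    location, status = percept
    if status == "Dirty":
        return "Suck"
    return "Right" if location == "A" else "Left"

print(reflex_vacuum_agent(("A", "Dirty")))   # Suck
print(reflex_vacuum_agent(("B", "Clean")))   # Left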

Model-based Reflex Agents: Idea


 Idea: Keep track of the state of the world we cannot see in an internal model.
 Agent Schema:


Figure 2.11 A model-based reflex agent.


Michael Kohlhase: Artificial Intelligence 2 710 2025-02-06
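Before this is made precise in Definition 20.3.23 below, here is a minimal Python sketch of the loop such an agent runs; the three interfaces (sensor model, transition model, action function) are assumptions mirroring the schema above.

def make_model_based_agent(sensor_model, transition_model, action_fn):
    # sensor_model(state, percept): fold the new percept into the internal state
    # transition_model(state, action): predict the effect of the most recent action
    # action_fn(state): choose the next action from the updated state
    state, last_action = None, None

    def agent(percept):
        nonlocal state, last_action
        state = sensor_model(state, percept)
        if last_action is not None:
            state = transition_model(state, last_action)
        last_action = action_fn(state)
        return last_action

    return agent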

function MODEL-BASED-REFLEX-AGENT(percept) returns an action
  persistent: state, the agent’s current conception of the world state
              model, a description of how the next state depends on current state and action
              rules, a set of condition–action rules
              action, the most recent action, initially none
  state ← UPDATE-STATE(state, action, percept, model)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action

Figure 2.12 A model-based reflex agent. It keeps track of the current state of the world,
using an internal model. It then chooses an action in the same way as the reflex agent.

Model-based Reflex Agents: Definition

 Definition 20.3.23. A model-based agent is an agent whose actions depend on
  a world model: a set S of possible states.
  a sensor model S that given a state s and a percept p determines a new state S(s, p).
  a transition model T , that predicts a new state T (s, a) from a state s and an action a.
  an action function f that maps (new) states to actions.
If the world model of a model-based agent A is in state s and A has taken action a, A will
transition to state s′ = T (S(s, p), a) and take action a′ = f (s′ ).
 Note: Different percept sequences lead to different states, so the agent function
fa : P → A no longer depends only on the last percept.
 Example 20.3.24 (Tail Lights Again). Model-based agents can do the ?? if the
states include a concept of tail light brightness.

Michael Kohlhase: Artificial Intelligence 2 711 2025-02-06
agent is that it does not have to describe “what the world is like now” in a literal sense. For
20.3.5.2 Sources of Uncertainty
A Video Nugget covering this subsubsection can be found at https://fau.tv/clip/id/27582.

Sources of Uncertainty in Decision-Making



Where’s that d. . . Wumpus?


And where am I, anyway??

 Non-deterministic actions:

 “When I try to go forward in this dark cave, I might actually go forward-left or


forward-right.”
 Partial observability with unreliable sensors:
 “Did I feel a breeze right now?”;
 “I think I might smell a Wumpus here, but I got a cold and my nose is blocked.”
 “According to the heat scanner, the Wumpus is probably in cell [2,3].”
 Uncertainty about the domain behavior:
 “Are you sure the Wumpus never moves?”

Michael Kohlhase: Artificial Intelligence 2 712 2025-02-06

Unreliable Sensors
 Robot Localization: Suppose we want to support localization using landmarks
to narrow down the area.
 Example 20.3.25. If you see the Eiffel tower, then you’re in Paris.

 Difficulty: Sensors can be imprecise.


 Even if a landmark is perceived, we cannot conclude with certainty that the
robot is at that location.
 This is the half-scale Las Vegas copy, you dummy.
 Even if a landmark is not perceived, we cannot conclude with certainty that the
robot is not at that location.
 Top of Eiffel tower hidden in the clouds.
 Only the probability of being at a location increases or decreases.

Michael Kohlhase: Artificial Intelligence 2 713 2025-02-06

20.3.5.3 Agent Architectures based on Belief States

We are now ready to proceed to environments which can only be partially observed and where actions
are non deterministic. Both sources of uncertainty conspire to allow us only partial knowledge
about the world, so that we can only optimize “expected utility” instead of “actual utility” of our
actions.

World Models for Uncertainty


 Problem: We do not know with certainty what state the world is in!
 Idea: Just keep track of all the possible states it could be in.
 Definition 20.3.26. A model-based agent has a world model consisting of

 a belief state that has information about the possible states the world may be
in, and
 a sensor model that updates the belief state based on sensor information
 a transition model that updates the belief state based on actions.
 Idea: The agent environment determines what the world model can be.

 In a fully observable, deterministic environment,


 we can observe the initial state and subsequent states are given by the actions
alone.
 thus the belief state is a singleton (we call its member the world state) and the
transition model is a function from states and actions to states: a transition
function.

Michael Kohlhase: Artificial Intelligence 2 714 2025-02-06
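A minimal Python sketch of maintaining such a belief state (the transition and sensor models are assumed to be given as plain functions; this is an illustration, not a full implementation):

def predict(belief, action, transition):
    # transition model: transition(s, a) returns the set of possible successor states
    return {s2 for s in belief for s2 in transition(s, action)}

def update(belief, percept, consistent):
    # sensor model: keep only those states in which the percept is possible
    return {s for s in belief if consistent(s, percept)}

# one step of belief tracking: act, then observe
# new_belief = update(predict(belief, action, transition), percept, consistent)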

That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.

World Models by Agent Type in AI-1


 Search-based Agents: In a fully observable, deterministic environment
  goal-based agent with world state ≙ “current state”
  no inference. (goal ≙ goal state from search problem)
 CSP-based Agents: In a fully observable, deterministic environment
  goal-based agent with world state ≙ constraint network,
  inference ≙ constraint propagation. (goal ≙ satisfying assignment)
 Logic-based Agents: In a fully observable, deterministic environment
  model-based agent with world state ≙ logical formula
  inference ≙ e.g. DPLL or resolution.
 Planning Agents: In a fully observable, deterministic environment
  goal-based agent with world state ≙ PL0, transition model ≙ STRIPS,
  inference ≙ state/plan space search. (goal: complete plan/execution)

Michael Kohlhase: Artificial Intelligence 2 715 2025-02-06

Let us now see what happens when we lift the restrictions of total observability and determin-
ism.

World Models for Complex Environments


 In a fully observable, but stochastic environment,
 the belief state must deal with a set of possible states.
 ; generalize the transition function to a transition relation.
 Note: This even applies to online problem solving, where we can just perceive the
state. (e.g. when we want to optimize utility)
 In a deterministic, but partially observable environment,

 the belief state must deal with a set of possible states.


 we can use transition functions.
 We need a sensor model, which predicts the influence of percepts on the belief
state – during update.
 In a stochastic, partially observable environment,

 mix the ideas from the last two. (sensor model + transition relation)

Michael Kohlhase: Artificial Intelligence 2 716 2025-02-06

Preview: New World Models (Belief) ; new Agent Types


 Probabilistic Agents: In a partially observable environment
  belief state ≙ Bayesian networks,
  inference ≙ probabilistic inference.
 Decision-Theoretic Agents: In a partially observable, stochastic environment
  belief state + transition model ≙ decision networks,
  inference ≙ maximizing expected utility.

 We will study them in detail this semester.

Michael Kohlhase: Artificial Intelligence 2 717 2025-02-06

Overview: AI2
 Basics of probability theory (probability spaces, random variables, conditional
probabilities, independence,...)

 Probabilistic reasoning: Computing the a posteriori probabilities of events given evidence, causal reasoning (representing distributions efficiently, Bayesian networks, ...)
 Probabilistic Reasoning over time (Markov chains, Hidden Markov models,...)
⇒ We can update our world model episodically based on observations (i.e. sensor
data)
 Decision theory: Making decisions under uncertainty (Preferences, Utilities,
Decision networks, Markov Decision Procedures,...)
⇒ We can choose the right action based on our world model and the likely outcomes
of our actions

 Machine learning: Learning from data (Decision Trees, Classifiers, Neural


Networks,...)

Michael Kohlhase: Artificial Intelligence 2 718 2025-02-06


Part V

Reasoning with Uncertain Knowledge

This part of the lecture notes addresses inference and agent decision making in partially observable
environments, i.e. where we only know probabilities instead of certainties whether propositions
are true/false. We cover basic probability theory and – based on that – Bayesian Networks and
simple decision making in such environments. Finally we extend this to probabilistic temporal
models and their decision theory.
Chapter 21

Quantifying Uncertainty

21.1 Probability Theory

Probabilistic Models
 Definition 21.1.1 (Mathematically (slightly simplified)). A probability space
or (probability model) is a pair ⟨Ω, P ⟩ such that:

 Ω is a set of outcomes (called the sample space),


 P is a function P(Ω) → [0,1], such that:
    – P(Ω) = 1 and
    – P(⋃_i A_i) = Σ_i P(A_i) for all pairwise disjoint A_i ∈ P(Ω).
P is called a probability measure.
These properties are called the Kolmogorov axioms.

 Intuition: We run some experiment, the outcome of which is any ω ∈ Ω. P (X)


is the probability that the result of the experiment is any one of the outcomes in
X. Naturally, the probability that any outcome occurs is 1 (hence P (Ω) = 1).
The probability of pairwise disjoint sets of outcomes should just be the sum of their
probabilities.

 Example 21.1.2 (Dice throws). Assume we throw a (fair) die two times. Then the sample space is {(i, j) | 1 ≤ i, j ≤ 6}. We define P by letting P({A}) = 1/36 for every A ∈ Ω.
Since the probability of any outcome is the same, we say P is uniformly distributed.

Michael Kohlhase: Artificial Intelligence 2 719 2025-02-06
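To make the definition concrete, here is a minimal Python sketch of the two-dice probability space from Example 21.1.2. The names (Omega, P) are only illustrative; the point is that for a finite, uniform sample space the Kolmogorov axioms can be checked by simple counting.

from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))        # sample space of Example 21.1.2

def P(event):
    """Probability measure: event is any subset (here: list) of Omega."""
    return Fraction(len(event), len(Omega))         # uniform distribution: 1/36 per outcome

assert P(Omega) == 1                                # P(Omega) = 1
A = [w for w in Omega if w[0] == 1]                 # "first die shows 1"
B = [w for w in Omega if w[0] == 6]                 # "first die shows 6", disjoint from A
assert P(A + B) == P(A) + P(B)                      # additivity on disjoint events
print(P(A))                                         # 1/6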

The definition is simplified in two places: Firstly, we assume that P is defined on the full power
set. This is not always possible, especially if Ω is uncountable. In that case we need an additional
set of “events” instead, and lots of mathematical machinery to make sure that we can safely take
unions, intersections, complements etc. of these events.
Secondly, we would technically only demand that P is additive on countably many disjoint
sets.
In this course we will assume that our sample space is at most countable anyway; usually even
finite.


Random Variables
In practice, we are rarely interested in the specific outcome of an experiment, but
rather in some property of the outcome. This is especially true in the very common
situation where we don’t even know the precise probabilities of the individual outcomes.

 Example 21.1.3. The probability that the sum of our two dice throws is 7 is
P({(i, j) ∈ Ω | i + j = 7}) = P({(6, 1), (1, 6), (5, 2), (2, 5), (4, 3), (3, 4)}) = 6/36 = 1/6.

 Definition 21.1.4 (Again, slightly simplified). Let D be a set. A random


variable is a function X : Ω → D. We call D (somewhat confusingly) the domain
of X, denoted dom(X).
For x ∈ D, we define the probability of x as P (X = x) := P ({ω ∈ Ω | X(ω) = x}).

 Definition 21.1.5. We say that a random variable X is finite domain, iff its domain
dom(X) is finite and Boolean, iff dom(X) = {T, F}.
For a Boolean random variable, we will simply write P (X) for P (X = T) and
P (¬X) for P (X = F).

Michael Kohlhase: Artificial Intelligence 2 720 2025-02-06

Note that a random variable, according to the formal definition, is neither random nor a variable:
It is a function with clearly defined domain and codomain – and what we call the domain of the
“variable” is actually its codomain... are you confused yet? ,
This confusion is a side-effect of the mathematical formalism. In practice, a random variable is
some indeterminate value that results from some statistical experiment – i.e. it is random, because
the result is not predetermined, and it is a variable, because it can take on different values.
It just so happens that if we want to model this scenario mathematically, a function is the most
natural way to do so.

Some Examples
 Example 21.1.6. Summing up our two dice throws is a random variable S : Ω → [2,12] with S((i, j)) = i + j. The probability that they sum up to 7 is written as P(S = 7) = 1/6.

 Example 21.1.7. The first and second of our two dice throws are random variables
First, Second : Ω → [1,6] with First((i, j)) = i and Second((i, j)) = j.

 Remark 21.1.8. Note, that the identity Ω → Ω is a random variable as well.


 Example 21.1.9. We can model toothache, cavity and gingivitis as Boolean
random variables, with the underlying probability space being...??
 Example 21.1.10. We can model tomorrow’s weather as a random variable with
domain {sunny, rainy, foggy, warm, cloudy, humid, ...}, with the underlying prob-
ability space being...??

⇒ This is why probabilistic reasoning is necessary: We can rarely reduce probabilistic


scenarios down to clearly defined, fully known probability spaces and derive all the
interesting things from there.
But: The definitions here allow us to reason about probabilities and random variables
in a mathematically rigorous way, e.g. to make our intuitions and assumptions
precise, and prove our methods to be sound.

Michael Kohlhase: Artificial Intelligence 2 721 2025-02-06

Propositions
This is nice and all, but in practice we are interested in “compound” probabilities
like:

“What is the probability that the sum of our two dice throws is 7, but neither of the
two dice is a 3?”

Idea: Reuse the syntax of propositional logic and define the logical connectives for
random variables!
Example 21.1.11. We can express the above as: P (¬(First = 3) ∧ ¬(Second =
3) ∧ (S = 7))
Definition 21.1.12. Let X1 , X2 be random variables, x1 ∈ dom(X1 ) and x2 ∈
dom(X2 ). We define:
1. P (X1 ̸= x1 ):=P (¬(X1 = x1 )) := P ({ω ∈ Ω | X1 (ω) ̸= x1 })=1 − P (X1 = x1 ).

2. P ((X1 = x1 ) ∧ (X2 = x2 )) := P ({ω ∈ Ω | (X1 (ω) = x1 ) ∧ (X2 (ω) = x2 )})


=P ({ω ∈ Ω | X1 (ω) = x1 } ∩ {ω ∈ Ω | X2 (ω) = x2 }).
3. P ((X1 = x1 ) ∨ (X2 = x2 )) := P ({ω ∈ Ω | (X1 (ω) = x1 ) ∨ (X2 (ω) = x2 )})
=P ({ω ∈ Ω | X1 (ω) = x1 } ∪ {ω ∈ Ω | X2 (ω) = x2 }).
It is also common to write P (A, B) for P (A ∧ B)
Example 21.1.13. P((First ≠ 3) ∧ (Second ≠ 3) ∧ (S = 7)) = P({(1, 6), (6, 1), (2, 5), (5, 2)}) = 1/9

Michael Kohlhase: Artificial Intelligence 2 722 2025-02-06
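The following small Python sketch (a direct transcription of the dice example; the variable names are only illustrative) shows how such compound events are just subsets of Ω obtained by filtering on the random variables:

from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))
def P(event): return Fraction(len(event), len(Omega))

First  = lambda w: w[0]          # random variables as functions Omega -> D
Second = lambda w: w[1]
S      = lambda w: w[0] + w[1]

# Example 21.1.13: P((First != 3) and (Second != 3) and (S = 7))
event = [w for w in Omega if First(w) != 3 and Second(w) != 3 and S(w) == 7]
print(P(event))                  # 1/9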

Events
Definition 21.1.14 (Again slightly simplified). Let ⟨Ω, P ⟩ be a probability space.
An event is a subset of Ω.
Definition 21.1.15 (Convention). We call an event (by extension) anything that
represents a subset of Ω: any statement formed from the logical connectives and values
of random variables, on which P (·) is defined.

Problem 1.1
Remember: We can define A ∨ B := ¬(¬A ∧ ¬B), T := A ∨ ¬A and F := ¬T
– is this compatible with the definition of probabilities on propositional formulae? And
why is P (X1 ̸= x1 ) = 1 − P (X1 = x1 )?

Problem 1.2 (Inclusion-Exclusion-Principle)


Show that P (A ∨ B) = P (A) + P (B) − P (A ∧ B).

Problem 1.3
Show that P (A) = P (A ∧ B) + P (A ∧ ¬B)

Michael Kohlhase: Artificial Intelligence 2 723 2025-02-06

Conditional Probabilities
 As we gather new information, our beliefs (should ) change, and thus our probabil-
ities!
 Example 21.1.16. Your “probability of missing the connection train” increases
when you are informed that your current train has 30 minutes delay.

 Example 21.1.17. The “probability of cavity” increases when the doctor is in-
formed that the patient has a toothache.
 Example 21.1.18. The probability that S = 3 is clearly higher if I know that
First = 1 than otherwise – or if I know that First = 6!

 Definition 21.1.19. Let A and B be events where P(B) ≠ 0. The conditional probability of A given B is defined as:

    P(A|B) := P(A ∧ B) / P(B)

We also call P(A) the prior probability of A, and P(A|B) the posterior probability.

 Intuition: If we assume B to hold, then we are only interested in the “part” of Ω


where A is true relative to B.
Alternatively: We restrict our sample space Ω to the subset of outcomes where
B holds. We then define a new probability space on this subset by scaling the
probability measure so that it sums to 1 – which we do by dividing by P (B). (We
“update our beliefs based on new evidence”)

Michael Kohlhase: Artificial Intelligence 2 724 2025-02-06

Examples
 Example 21.1.20. If we assume First = 1, then P(S = 3|First = 1) should be precisely P(Second = 2) = 1/6. We check:

    P(S = 3|First = 1) = P((S = 3) ∧ (First = 1)) / P(First = 1) = (1/36) / (1/6) = 1/6

 Example 21.1.21. Assume the prior probability P (cavity) is 0.122. The probability
that a patient has both a cavity and a toothache is P (cavity ∧toothache) = 0.067.
The probability that a patient has a toothache is P (toothache) = 0.15.
If the patient complains about a toothache, we can update our estimation by computing the posterior probability:

    P(cavity|toothache) = P(cavity ∧ toothache) / P(toothache) = 0.067 / 0.15 ≈ 0.45.

 Note: We just computed the probability of some underlying disease based on the
presence of a symptom!
Or more generally: We computed the probability of a cause from observing its effect.

Michael Kohlhase: Artificial Intelligence 2 725 2025-02-06
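As a quick sanity check, the computation of Example 21.1.20 can be replayed on the dice sample space; this is only an illustrative sketch, and the helper names are not part of the notes:

from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))
def P(event): return Fraction(len(event), len(Omega))

def P_cond(A, B):
    """P(A|B) = P(A and B) / P(B) for events given as subsets of Omega."""
    A_and_B = [w for w in Omega if w in A and w in B]
    return P(A_and_B) / P(B)

A = [w for w in Omega if w[0] + w[1] == 3]   # S = 3
B = [w for w in Omega if w[0] == 1]          # First = 1
print(P_cond(A, B))                          # 1/6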

Some Rules
Equations on unconditional probabilities have direct analogues for conditional proba-
bilities.
Problem 1.4
Convince yourself of the following:
 P (A|C) = 1 − P (¬A|C).

 P (A|C) = P (A ∧ B|C) + P (A ∧ ¬B|C).


 P (A ∨ B|C) = P (A|C) + P (B|C) − P (A ∧ B|C).

But the analogous rules do not hold for the conditioning event on the right hand side of the "|"!


Problem 1.5
Find counterexamples for the following (false) claims:
 P (A|C) = 1 − P (A|¬C)
 P (A|C) = P (A|B ∧ C) + P (A|B ∧ ¬C).

 P (A|B ∨ C) = P (A|B) + P (A|C) − P (A|B ∧ C).

Michael Kohlhase: Artificial Intelligence 2 726 2025-02-06

Bayes’ Rule

 Note: By definition, P(A|B) = P(A ∧ B)/P(B). In practice, we often know the conditional probability already, and use it to compute the probability of the conjunction instead: P(A ∧ B) = P(A|B) · P(B) = P(B|A) · P(A).

 Theorem 21.1.22 (Bayes' Theorem). Given propositions A and B where P(A) ≠ 0 and P(B) ≠ 0, we have:

    P(A|B) = P(B|A) · P(A) / P(B)
 Proof:
  1. P(A|B) = P(A ∧ B)/P(B) = (P(B|A) · P(A))/P(B)
...okay, that was straightforward... what's the big deal?

 (Somewhat Dubious) Claim: Bayes’ Rule is the entire scientific method con-
densed into a single equation!

This is an extreme overstatement, but there is a grain of truth in it.

Michael Kohlhase: Artificial Intelligence 2 727 2025-02-06

Bayes’ Theorem - Why the Hype?


Say we have a hypothesis H about the world. (e.g. “The universe had a
beginning”)
We have some prior belief P (H).
We gather evidence E. (e.g. “We observe a cosmic microwave background at 2.7K
everywhere”)
Bayes’ Rule tells us how to update our belief in H based on H’s ability to predict
E (the likelihood P (E|H)) – and, importantly, the ability of competing hypotheses to
predict the same evidence. (This is actually how scientific hypotheses should be
evaluated)
    P(H|E) = P(E|H) · P(H) / P(E) = P(E|H) · P(H) / (P(E|H) · P(H) + P(E|¬H) · P(¬H))

    (posterior = likelihood · prior, divided by the total probability of the evidence – the denominator sums the likelihood-weighted priors of H and its competition ¬H)

...if I keep gathering evidence and update, ultimately the impact of the prior belief
will diminish.

“You’re entitled to your own priors, but not your own likelihoods”

Michael Kohlhase: Artificial Intelligence 2 728 2025-02-06

Independence
 Question: What is the probability that S = 7 and the patient has a toothache?
Or less contrived: What is the probability that the patient has a gingivitis and a
cavity?

 Definition 21.1.23. Two events A and B are called independent, iff P (A ∧ B) =


P (A) · P (B).
Two random variables X1 , X2 are called independent, iff for all x1 ∈ dom(X1 ) and
x2 ∈ dom(X2 ), the events X1 = x1 and X2 = x2 are independent.
We write A ⊥ B or X1 ⊥ X2 , respectively.

 Theorem 21.1.24. Equivalently: Given events A and B with P (B) ̸= 0, then A


and B are independent iff P (A|B) = P (A) (equivalently: P (B|A) = P (B)).
 Proof:
  1. "⇒": By definition, P(A|B) = P(A ∧ B)/P(B) = (P(A) · P(B))/P(B) = P(A).
  2. "⇐": Assume P(A|B) = P(A). Then P(A ∧ B) = P(A|B) · P(B) = P(A) · P(B).
 Note: Independence asserts that two events are “not related” – the probability of
one does not depend on the other.
Mathematically, we can determine independence by checking whether P (A ∧ B) =
P (A) · P (B).
In practice, this is impossible to check. Instead, we assume independence based on
domain knowledge, and then exploit this to compute P (A ∧ B).

Michael Kohlhase: Artificial Intelligence 2 729 2025-02-06

Independence (Examples)
 Example 21.1.25.
   First = 2 and Second = 3 are independent – more generally, First and Second are independent. (The outcome of the first die does not affect the outcome of the second die.)
    Quick check: P((First = a) ∧ (Second = b)) = 1/36 = P(First = a) · P(Second = b) ✓
   First and S are not independent. (The outcome of the first die affects the sum of the two dice.)
    Counterexample: P((First = 1) ∧ (S = 4)) = 1/36 ≠ P(First = 1) · P(S = 4) = 1/6 · 1/12 = 1/72
   But: P((First = a) ∧ (S = 7)) = 1/36 = 1/6 · 1/6 = P(First = a) · P(S = 7) – so the events First = a and S = 7 are independent. (Why?)

 Example 21.1.26.
 Are cavity and toothache independent?
...since cavities can cause a toothache, that would probably be a bad design
decision...
 Are cavity and gingivitis independent? Cavities do not cause gingivitis, and
gingivitis does not cause cavities, so... yes... right? (...as far as I know. I’m
not a dentist.)
Probably not! A patient who has cavities has probably worse dental hygiene
than those who don’t, and is thus more likely to have gingivitis as well.
⇒ cavity may be evidence that raises the probability of gingivitis, even if they
are not directly causally related.

Michael Kohlhase: Artificial Intelligence 2 730 2025-02-06

Conditional Independence – Motivation


 A dentist can diagnose a cavity by using a probe, which may (or may not) catch in
a cavity.

 Say we know from clinical studies that P (cavity) = 0.2, P (toothache|cavity) =


0.6, P (toothache|¬cavity) = 0.1, P (catch|cavity) = 0.9, and P (catch|¬cavity) =
0.2.
 Assume the patient complains about a toothache, and our probe indeed catches in
the aching tooth. What is the likelihood of having a cavity P (cavity|toothache ∧
catch)?

⇒ Use Bayes' rule:

    P(cavity|toothache ∧ catch) = P(toothache ∧ catch|cavity) · P(cavity) / P(toothache ∧ catch)

 Note: P(toothache ∧ catch) = P(toothache ∧ catch|cavity) · P(cavity) + P(toothache ∧ catch|¬cavity) · P(¬cavity)
⇒ Now we're only missing P(toothache ∧ catch|cavity = b) for b ∈ {T, F}.
... Now what?
 Are toothache and catch independent, maybe? No: Both have a common (possi-
ble) cause, cavity.
Also, there’s this pesky P (·|cavity) in the way. . . ...wait a minute...

Michael Kohlhase: Artificial Intelligence 2 731 2025-02-06

Conditional Independence – Definition


 Assuming the patient has (or does not have) a cavity, the events toothache and
catch are independent: Both are caused by a cavity, but they don’t influence each
other otherwise.
i.e. cavity “contains all the information” that links toothache and catch in the first
place.

 Definition 21.1.27. Given events A, B, C with P (C) ̸= 0, then A and B are


called conditionally independent given C, iff P (A ∧ B|C) = P (A|C) · P (B|C).
Equivalently: iff P (A|B ∧ C) = P (A|C), or P (B|A ∧ C) = P (B|C).

Let Y be a random variable. We call two random variables X1 , X2 conditionally


independent given Y , iff for all x1 ∈ dom(X1 ), x2 ∈ dom(X2 ) and y ∈ dom(Y ),
the events X1 = x1 and X2 = x2 are conditionally independent given Y = y.

 Example 21.1.28. Let's assume toothache and catch are conditionally independent given cavity/¬cavity. Then we can finally compute:

  P(cavity|toothache ∧ catch)
    = P(toothache ∧ catch|cavity) · P(cavity) / P(toothache ∧ catch)
    = P(toothache|cavity) · P(catch|cavity) · P(cavity) / (P(toothache|cavity) · P(catch|cavity) · P(cavity) + P(toothache|¬cavity) · P(catch|¬cavity) · P(¬cavity))
    = 0.6 · 0.9 · 0.2 / (0.6 · 0.9 · 0.2 + 0.1 · 0.2 · 0.8) ≈ 0.87

Michael Kohlhase: Artificial Intelligence 2 732 2025-02-06
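The computation in Example 21.1.28 is easy to replay numerically; the following sketch just plugs the probabilities stated above into the formula (the variable names are illustrative):

p_cavity            = 0.2
p_tooth_given_cav   = 0.6    # P(toothache | cavity)
p_tooth_given_nocav = 0.1    # P(toothache | not cavity)
p_catch_given_cav   = 0.9    # P(catch | cavity)      (sensitivity)
p_catch_given_nocav = 0.2    # 1 - specificity

# conditional independence of toothache and catch given cavity lets us factor the numerator:
num   = p_tooth_given_cav * p_catch_given_cav * p_cavity
denom = num + p_tooth_given_nocav * p_catch_given_nocav * (1 - p_cavity)
print(round(num / denom, 2))   # 0.87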



Conditional Independence
 Lemma 21.1.29. If A and B are conditionally independent given C, then P(A|B ∧ C) = P(A|C).
Proof:
  P(A|B ∧ C) = P(A ∧ B ∧ C)/P(B ∧ C) = (P(A ∧ B|C) · P(C))/P(B ∧ C) = (P(A|C) · P(B|C) · P(C))/P(B ∧ C) = (P(A|C) · P(B ∧ C))/P(B ∧ C) = P(A|C)

 Question: If A and B are conditionally independent given C, does this imply that
A and B are independent? No. See previous slides for a counterexample.
 Question: If A and B are independent, does this imply that A and B are also
conditionally independent given C? No. For example: First and Second are inde-
pendent, but not conditionally independent given S = 4.

 Question: Okay, so what if A, B and C are all pairwise independent? Are A


and B conditionally independent given C now ? Still no. Remember: First =
a, Second = b and S = 7 are all independent, but First and Second are not
conditionally independent given S = 7.
 Question: When can we infer conditional independence from a “more general”
notion of independence?
We need mutual independence. Roughly: A set of events is called mutually inde-
pendent, if every event is independent from any conjunction of the others. (Not
really relevant for this course though)

Michael Kohlhase: Artificial Intelligence 2 733 2025-02-06

Summary
 Probability spaces serve as a mathematical model (and hence justification) for
everything related to probabilities.
 The “atoms” of any statement of probability are the random variables. (Important
special cases: Boolean and finite domain)
 We can define probabilities on compound (propositional logical) statements, with (outcomes of) random variables as "propositional variables".
 Conditional probabilities represent posterior probabilities given some observed out-
comes.
 independence and conditional independence are strong assumptions that allow us
to simplify computations of probabilities
 Bayes’ Theorem

Michael Kohlhase: Artificial Intelligence 2 734 2025-02-06



So much about the math...


We now have a mathematical setup for probabilities.
But: The math does not tell us what probabilities are:
Assume we can mathematically derive this to be the case: the probability of rain
tomorrow is 0.3. What does this even mean?

 Frequentist: The probability of an event is the limit of its relative frequency in a


large number of trials.
In other words: “In 30% of the cases where we have similar weather conditions, it
rained the next day.”
Objection: Okay, but what about unique events? “The probability of me passing the
exam is 80%” – does this mean anything, if I only take the exam once? Am I
comparable to “similar students”? What counts as sufficiently “similar”?
 Bayesian: Probabilities are degrees of belief. It means you should be 30% confident
that it will rain tomorrow.
Objection: And why should I? Is this not purely subjective then?

Michael Kohlhase: Artificial Intelligence 2 735 2025-02-06

Pragmatics
Pragmatically, both interpretations amount to the same thing: I should act as if
I’m 30% confident that it will rain tomorrow. (Whether by fiat, or because in 30% of
comparable cases, it rained.)

Objection: Still: why should I? And why should my beliefs follow the seemingly
arbitrary Kolmogorov axioms?

 [DF31]: If an agent has a belief that violates the Kolmogorov axioms, then there
exists a combination of “bets” on propositions so that the agent always loses money.
 In other words: If your beliefs are not consistent with the mathematics, and you
act in accordance with your beliefs, there is a way to exploit this inconsistency to
your disadvantage.

 ...and, more importantly, your AI agents! ,

Michael Kohlhase: Artificial Intelligence 2 736 2025-02-06

21.2 Probabilistic Reasoning Techniques

Okay, now how do I implement this?


This is a computer science course. We need to implement this stuff.

Do we... implement random variables as functions? Is a probability space a... class maybe?
No. As mentioned, we rarely know the probability space entirely. Instead we will use probability distributions, which are just arrays (of arrays of...) of probabilities. And then we represent those as sparsely as possible, by exploiting independence, conditional independence, ...

Michael Kohlhase: Artificial Intelligence 2 737 2025-02-06

Probability Distributions
 Definition 21.2.1. The probability distribution for a random variable X, written
P(X), is the vector of probabilities for the (ordered) domain of X.
 Note: The values in a probability distribution are all positive and sum to 1.
(Why?)
 Example 21.2.2. P(First) = P(Second) = ⟨1/6, 1/6, 1/6, 1/6, 1/6, 1/6⟩. (Both First and Second are uniformly distributed.)
 Example 21.2.3. The probability distribution P(S) is ⟨1/36, 1/18, 1/12, 1/9, 5/36, 1/6, 5/36, 1/9, 1/12, 1/18, 1/36⟩.
Note the symmetry, with a "peak" at 7 – the random variable is (approximately, because our domain is discrete rather than continuous) normally distributed (or gaussian distributed, or follows a bell-curve, ...).

 Example 21.2.4. Probability distributions for Boolean random variables are natu-
rally pairs (probabilities for T and F), e.g.:

P(toothache) = ⟨0.15, 0.85⟩


P(cavity) = ⟨0.122, 0.878⟩
 More generally:
Definition 21.2.5. A probability distribution is a vector v of values v_i ∈ [0,1] such that Σ_i v_i = 1.

Michael Kohlhase: Artificial Intelligence 2 738 2025-02-06

The Full Joint Probability Distribution


 Definition 21.2.6. Given random variables X 1 , . . ., X n , the full joint probability
distribution, denoted P(X 1 , . . ., X n ), is the n-dimensional array of size |D1 × . . . ×
Dn | that lists the probabilities of all conjunctions of values of the random variables.

 Example 21.2.7. P(cavity, toothache, gingivitis) could look something like this:

toothache ¬toothache
gingivitis ¬gingivitis gingivitis ¬gingivitis
cavity 0.007 0.06 0.005 0.05
¬cavity 0.08 0.003 0.045 0.75

 Example 21.2.8. P(First, S)

    First \ S    2     3     4     5     6     7     8     9     10    11    12
    1            1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0     0
    2            0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0
    3            0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0
    4            0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0
    5            0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0
    6            0     0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36

Note that if we know the value of First, the value of S is completely determined by the value of Second.

Michael Kohlhase: Artificial Intelligence 2 739 2025-02-06

Conditional Probability Distributions


 Definition 21.2.9. Given random variables X and Y , the conditional probability
distribution of X given Y , written P(X|Y ) is the table of all conditional probabilities
of values of X given values of Y .

 For sets of variables analogously: P(X 1 , . . ., X n |Y 1 , . . ., Y m ).


 Example 21.2.10. P(cavity|toothache):

toothache ¬toothache
cavity P (cavity|toothache) = 0.45 P (cavity|¬toothache) = 0.065
¬cavity P (¬cavity|toothache) = 0.55 P (¬cavity|¬toothache) = 0.935

 Example 21.2.11. P(First|S)

    First \ S    2    3    4    5    6    7    8    9    10   11   12
    1            1    1/2  1/3  1/4  1/5  1/6  0    0    0    0    0
    2            0    1/2  1/3  1/4  1/5  1/6  1/5  0    0    0    0
    3            0    0    1/3  1/4  1/5  1/6  1/5  1/4  0    0    0
    4            0    0    0    1/4  1/5  1/6  1/5  1/4  1/3  0    0
    5            0    0    0    0    1/5  1/6  1/5  1/4  1/3  1/2  0
    6            0    0    0    0    0    1/6  1/5  1/4  1/3  1/2  1

 Note: Every “column” of a conditional probability distribution is itself a probability


distribution. (Why?)

Michael Kohlhase: Artificial Intelligence 2 740 2025-02-06

Convention
We now “lift” multiplication and division to the level of whole probability distribu-
tions:

 Definition 21.2.12. Whenever we use P in an equation, we take this to mean a


system of equations, for each value in the domains of the random variables involved.
Example 21.2.13.
   P(X, Y) = P(X|Y) · P(Y) represents the system of equations P(X = x ∧ Y = y) = P(X = x|Y = y) · P(Y = y) for all x, y in the respective domains.
   P(X|Y) := P(X, Y)/P(Y) represents the system of equations P(X = x|Y = y) := P((X = x) ∧ (Y = y))/P(Y = y).
   Bayes' Theorem: P(X|Y) = P(Y|X) · P(X)/P(Y) represents the system of equations P(X = x|Y = y) = P(Y = y|X = x) · P(X = x)/P(Y = y).

Michael Kohlhase: Artificial Intelligence 2 741 2025-02-06

So, what’s the point?


 Obviously, the probability distribution contains all the information about a specific
random variable we need.
 Observation: The full joint probability distribution of variables X 1 , . . ., X n con-
tains all the information about the random variables and their conjunctions we need.

 Example 21.2.14. We can read off the probability P (toothache) from the full
joint probability distribution as 0.007+0.06+0.08+0.003=0.15, and the probability
P (toothache ∧ cavity) as 0.007 + 0.06 = 0.067
 We can actually implement this! (They’re just (nested) arrays)

But just as we often don’t have a fully specified probability space to work in, we often
don’t have a full joint probability distribution for our random variables either.

Also: Given random variables X_1, ..., X_n, the full joint probability distribution has ∏_{i=1}^{n} |dom(X_i)| entries! (P(First, S) already has 6 · 11 = 66 entries!)
⇒ The rest of this section deals with keeping things small, by computing probabilities instead of storing them all.

Michael Kohlhase: Artificial Intelligence 2 742 2025-02-06
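As a sketch of the "nested arrays" remark, the full joint probability distribution of Example 21.2.7 can be stored as a nested dictionary, and the probabilities of Example 21.2.14 are obtained by summing the relevant entries (marginalization); names and layout are illustrative:

joint = {   # joint[cavity][toothache][gingivitis], values from Example 21.2.7
    True:  {True:  {True: 0.007, False: 0.06},
            False: {True: 0.005, False: 0.05}},
    False: {True:  {True: 0.08,  False: 0.003},
            False: {True: 0.045, False: 0.75}},
}

def P(cavity=None, toothache=None, gingivitis=None):
    """Sum all entries consistent with the given values (None = marginalize out)."""
    total = 0.0
    for c in (True, False):
        for t in (True, False):
            for g in (True, False):
                if (cavity is None or c == cavity) and \
                   (toothache is None or t == toothache) and \
                   (gingivitis is None or g == gingivitis):
                    total += joint[c][t][g]
    return total

print(round(P(toothache=True), 3))                 # 0.15
print(round(P(toothache=True, cavity=True), 3))    # 0.067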

Probabilistic Reasoning
 Probabilistic reasoning refers to inferring probabilities of events from the proba-
bilities of other events
as opposed to determining the probabilities e.g. empirically, by gathering (sufficient
amounts of representative) data and counting.
 Note: In practice, we are primarily interested in, and have access to, conditional
probabilities rather than the unconditional probabilities of conjunctions of events:

 We don’t reason in a vacuum: Usually, we have some evidence and want to infer
the posterior probability of some related event. (e.g. infer a plausible cause
given some symptom)
⇒ we are interested in the conditional probability P (hypothesis|observation).
 “80% of patients with a cavity complain about a toothache” (i.e. P (toothache|cavity))
is more the kind of data people actually collect and publish than “1.2% of the gen-
eral population have both a cavity and a toothache” (i.e. P (cavity∧toothache)).

 Consider the probe catching in a cavity. The probe is a diagnostic tool, which
is usually evaluated in terms of its sensitivity P (catch|cavity) and specificity
P (¬catch|¬cavity). (You have probably heard these words a lot since 2020...)

Michael Kohlhase: Artificial Intelligence 2 743 2025-02-06

Naive Bayes Models


Consider again the dentistry example with random variables cavity, toothache, and
catch. We assume cavity causes both toothache and catch, and that toothache and
catch are conditionally independent given cavity:
(Graph: Cavity → Toothache, Cavity → Catch)

We likely know the sensitivity P (catch|cavity) and specificity P (¬catch|¬cavity),


which jointly give us P(catch|cavity), and from medical studies, we should be able to de-
termine P (cavity) (the prevalence of cavities in the population) and P(toothache|cavity).

This kind of situation is surprisingly common, and deserves a name

Michael Kohlhase: Artificial Intelligence 2 744 2025-02-06

Naive Bayes Models


(Graph: Cavity → Toothache, Cavity → Catch)

Definition 21.2.15. A naive Bayes model (or, less accurately, Bayesian classifier, or,
derogatorily, idiot Bayes model) consists of:
1. random variables C, E 1 , . . ., E n such that all the E 1 , . . ., E n are conditionally inde-
pendent given C,
2. the probability distribution P(C), and

3. the conditional probability distributions P(E i |C).


We call C the cause and the E 1 , . . ., E n the effects of the model.
Convention: Whenever we draw a graph of random variables, we take the arrows to
connect causes to their direct effects, and assert that unconnected nodes are condi-
tionally independent given all their ancestors. We will make this more precise later.

Can we compute the full joint probability distribution P(cavity, toothache, catch)
from this information?
Michael Kohlhase: Artificial Intelligence 2 745 2025-02-06

Recovering the Full Joint Probability Distribution


 Lemma 21.2.16 (Product rule). P(X, Y ) = P(X|Y ) · P(Y ).

We can generalize this to more than two variables, by repeatedly applying the prod-
uct rule:
 Lemma 21.2.17 (Chain rule). For any sequence of random variables X 1 , . . ., X n :

P(X 1 , . . ., X n ) = P(X 1 |X 2 , . . ., X n )·P(X 2 |X 3 , . . .X n )·. . .·P(X n−1 |X n )·P (X n )

.
Hence:

 Theorem 21.2.18. Given a naive Bayes model with effects E_1, ..., E_n and cause C, we have

    P(C, E_1, ..., E_n) = P(C) · (∏_{i=1}^{n} P(E_i|C)).

Proof: Using the chain rule:


1. P(E 1 , . . ., E n , C) = P(E 1 |E 2 , . . ., E n , C) · . . . · P(E n |C) · P(C)
2. Since all the E i are conditionally independent, we can drop them on the right
hand sides of the P(E j |..., C)

Michael Kohlhase: Artificial Intelligence 2 746 2025-02-06

Marginalization
Great, so now we can compute P(C|E_1, ..., E_n) = P(C, E_1, ..., E_n) / P(E_1, ..., E_n)...
...except that we don't know P(E_1, ..., E_n) :-/
...except that we can compute the full joint probability distribution, so we can recover it:
Lemma 21.2.19 (Marginalization). Given random variables X_1, ..., X_n and Y_1, ..., Y_m, we have

    P(X_1, ..., X_n) = Σ_{y_1 ∈ dom(Y_1), ..., y_m ∈ dom(Y_m)} P(X_1, ..., X_n, Y_1 = y_1, ..., Y_m = y_m).

(This is just a fancy way of saying "we can add the relevant entries of the full joint probability distribution")
Example 21.2.20. Say we observed toothache = T and catch = T. Using marginalization, we can compute

  P(cavity|toothache ∧ catch)
    = P(cavity ∧ toothache ∧ catch) / P(toothache ∧ catch)
    = P(cavity ∧ toothache ∧ catch) / Σ_{c ∈ {cavity, ¬cavity}} P(c ∧ toothache ∧ catch)
    = P(cavity) · P(toothache|cavity) · P(catch|cavity) / Σ_{c ∈ {cavity, ¬cavity}} P(c) · P(toothache|c) · P(catch|c)

Michael Kohlhase: Artificial Intelligence 2 747 2025-02-06



Unknowns

What if we don't know catch? (I'm not a dentist, I don't have a probe...)
We split our effects into {E_1, ..., E_n} = {O_1, ..., O_nO} ∪ {U_1, ..., U_nU} – the observed and unknown random variables.
Let D_U := dom(U_1) × ... × dom(U_nU). Then

  P(C|O_1, ..., O_nO)
    = P(C, O_1, ..., O_nO) / P(O_1, ..., O_nO)
    = (Σ_{u ∈ D_U} P(C, O_1, ..., O_nO, U_1 = u_1, ..., U_nU = u_nU)) / (Σ_{c ∈ dom(C)} Σ_{u ∈ D_U} P(O_1, ..., O_nO, C = c, U_1 = u_1, ..., U_nU = u_nU))
    = (Σ_{u ∈ D_U} P(C) · (∏_{i=1}^{nO} P(O_i|C)) · (∏_{j=1}^{nU} P(U_j = u_j|C))) / (Σ_{c ∈ dom(C)} Σ_{u ∈ D_U} P(C = c) · (∏_{i=1}^{nO} P(O_i|C = c)) · (∏_{j=1}^{nU} P(U_j = u_j|C = c)))
    = (P(C) · (∏_{i=1}^{nO} P(O_i|C)) · (Σ_{u ∈ D_U} ∏_{j=1}^{nU} P(U_j = u_j|C))) / (Σ_{c ∈ dom(C)} P(C = c) · (∏_{i=1}^{nO} P(O_i|C = c)) · (Σ_{u ∈ D_U} ∏_{j=1}^{nU} P(U_j = u_j|C = c)))

...oof...
Michael Kohlhase: Artificial Intelligence 2 748 2025-02-06

Unknowns

  P(C|O_1, ..., O_nO) = (P(C) · (∏_{i=1}^{nO} P(O_i|C)) · (Σ_{u ∈ D_U} ∏_{j=1}^{nU} P(U_j = u_j|C))) / (Σ_{c ∈ dom(C)} P(C = c) · (∏_{i=1}^{nO} P(O_i|C = c)) · (Σ_{u ∈ D_U} ∏_{j=1}^{nU} P(U_j = u_j|C = c)))

First, note that Σ_{u ∈ D_U} ∏_{j=1}^{nU} P(U_j = u_j|C = c) = 1 (we're summing over all possible events on the (conditionally independent) U_1, ..., U_nU given C = c), so

  P(C|O_1, ..., O_nO) = (P(C) · (∏_{i=1}^{nO} P(O_i|C))) / (Σ_{c ∈ dom(C)} P(C = c) · (∏_{i=1}^{nO} P(O_i|C = c)))

Secondly, note that the denominator is
1. the same for any given observations O_1, ..., O_nO, independent of the value of C, and
2. the sum over all the numerators in the full distribution.

That is: The denominator only serves to scale what is almost already the distribution P(C|O_1, ..., O_nO) to sum up to 1.

Michael Kohlhase: Artificial Intelligence 2 749 2025-02-06

Normalization
Definition 21.2.21 (Normalization). Given a vector w := ⟨w_1, ..., w_k⟩ of numbers in [0,1] where Σ_{i=1}^{k} w_i ≤ 1, the normalized vector α(w) is defined (component-wise) as

    (α(w))_i := w_i / Σ_{j=1}^{k} w_j.

Note that Σ_{i=1}^{k} (α(w))_i = 1, i.e. α(w) is a probability distribution.

This finally gives us:

Theorem 21.2.22 (Inference in a Naive Bayes model). Let C, E_1, ..., E_n be a naive Bayes model and E_1, ..., E_n = O_1, ..., O_nO, U_1, ..., U_nU. Then

    P(C|O_1 = o_1, ..., O_nO = o_nO) = α(P(C) · (∏_{i=1}^{nO} P(O_i = o_i|C)))

Note that this is entirely independent of the unknown random variables U_1, ..., U_nU!

Also, note that this is just a fancy way of saying "first, compute all the numerators, then divide all of them by their sum".

Michael Kohlhase: Artificial Intelligence 2 750 2025-02-06

Dentistry Example
Putting things together, we get:

  P(cavity|toothache = T) = α(P(cavity) · P(toothache = T|cavity))
                          = α(⟨P(cavity) · P(toothache|cavity), P(¬cavity) · P(toothache|¬cavity)⟩)

Say we have P(cavity) = 0.1, P(toothache|cavity) = 0.8, and P(toothache|¬cavity) = 0.05. Then

  P(cavity|toothache = T) = α(⟨0.1 · 0.8, 0.9 · 0.05⟩) = α(⟨0.08, 0.045⟩)

Since 0.08 + 0.045 = 0.125, we get

  P(cavity|toothache = T) = ⟨0.08/0.125, 0.045/0.125⟩ = ⟨0.64, 0.36⟩

Michael Kohlhase: Artificial Intelligence 2 751 2025-02-06
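The same computation as a small Python sketch of Theorem 21.2.22 (the numbers are exactly those assumed in the example above; the function and variable names are illustrative):

def normalize(w):                        # the alpha operator
    s = sum(w)
    return [x / s for x in w]

P_cavity   = {True: 0.1, False: 0.9}     # P(cavity)
P_tooth_if = {True: 0.8, False: 0.05}    # P(toothache = T | cavity)

# P(cavity | toothache = T) = alpha( P(cavity) * P(toothache = T | cavity) )
posterior = normalize([P_cavity[c] * P_tooth_if[c] for c in (True, False)])
print([round(p, 2) for p in posterior])  # [0.64, 0.36]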

Naive Bayes Classification


We can use a naive Bayes model as a very simple classifier :

 Assume we want to classify newspaper articles as one of the categories politics,


sports, business, fluff, etc. based on the words they contain.
 Given a large set of articles, we can determine the relevant probabilities by counting
the occurrences of the categories P(category), and of words per category – i.e.
P(wordi |category) for some (huge) list of words (wordi )ni=1 .

 We assume that the occurrence of each word is conditionally independent of the


occurrence of any other word given the category of the document. (This
assumption is clearly wrong, but it makes the model simple and often works well in
practice.) (⇒ “Idiot Bayes model”)
 Given a new article, we just count the occurrences k_i of the words in it and compute

    P(category|word_1 = k_1, ..., word_n = k_n) = α(P(category) · (∏_{i=1}^{n} P(word_i = k_i|category)))

 We then choose the category with the highest probability.

Michael Kohlhase: Artificial Intelligence 2 752 2025-02-06
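A minimal sketch of such a classifier is shown below. All words, categories, and probability values are made-up placeholders (not from any real corpus); in practice the probabilities would be estimated by counting as described above, and one works in log-space to avoid numerical underflow for long documents.

import math

P_cat = {"politics": 0.3, "sports": 0.3, "business": 0.4}        # P(category), placeholder values
P_word = {                                                       # P(word occurs | category), placeholder values
    "election": {"politics": 0.6,  "sports": 0.05, "business": 0.2},
    "goal":     {"politics": 0.05, "sports": 0.7,  "business": 0.05},
    "market":   {"politics": 0.1,  "sports": 0.05, "business": 0.6},
}

def classify(words):
    scores = {}
    for c, prior in P_cat.items():
        # unnormalized log-posterior: log P(c) + sum of log P(word | c)
        scores[c] = math.log(prior) + sum(math.log(P_word[w][c]) for w in words if w in P_word)
    return max(scores, key=scores.get)    # argmax, normalization is not needed here

print(classify(["market", "election"]))   # "business" with these placeholder numbers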

Inference by Enumeration
The rules we established for naive Bayes models, i.e. Bayes’s theorem, the prod-
uct rule and chain rule, marginalization and normalization, are general techniques for
probabilistic reasoning, and their usefulness is not limited to the naive Bayes models.
More generally:
Theorem 21.2.23. Let Q, E_1, ..., E_nE, U_1, ..., U_nU be random variables and D := dom(U_1) × ... × dom(U_nU). Then

    P(Q|E_1 = e_1, ..., E_nE = e_nE) = α(Σ_{u ∈ D} P(Q, E_1 = e_1, ..., E_nE = e_nE, U_1 = u_1, ..., U_nU = u_nU)).

We call Q the query variable, E_1, ..., E_nE the evidence, and U_1, ..., U_nU the unknown (or hidden) variables, and computing a conditional probability this way enumeration.
Note that this is just a “mathy” way of saying we
1. sum over all relevant entries of the full joint probability distribution of the variables,
and
2. normalize the result to yield a probability distribution.

Michael Kohlhase: Artificial Intelligence 2 753 2025-02-06

We will fortify our intuition about naive Bayes models with a variant of the Wumpus world we looked at in ??, when we investigated whether logic was up to the job of guiding an agent in the Wumpus cave.

Example: The Wumpus is Back


 We have a maze where
 Every cell except [1, 1] possibly contains a pit, with 20%
probability.
 pits cause a breeze in neighboring cells (we forget the
wumpus and the gold for now)

 Where should the agent go, if there is a breeze at [1, 2] and


[2, 1]?
 Pure logical inference can conclude nothing about which
square is most likely to be safe!

We can model this using the Boolean random variables:
   P_{i,j} for i, j ∈ {1, 2, 3, 4}, stating there is a pit at square [i, j], and
   B_{i,j} for (i, j) ∈ {(1, 1), (1, 2), (2, 1)}, stating there is a breeze at square [i, j]
⇒ let's apply our machinery!

Michael Kohlhase: Artificial Intelligence 2 754 2025-02-06

Wumpus: Probabilistic Model


First: Let’s try to compute the full joint probability distribution
P(P 1,1 , . . ., P 4,4 , B 1,1 , B 1,2 , B 2,1 ).
1. By the product rule, this is equal to
P(B 1,1 , B 1,2 , B 2,1 |P 1,1 , . . ., P 4,4 ) · P(P 1,1 , . . ., P 4,4 ).
2. Note that P(B 1,1 , B 1,2 , B 2,1 |P 1,1 , . . ., P 4,4 ) is either 1 (if all
the B i,j are consistent with the positions of the pits P k,l ) or
0 (otherwise).
3. Since the pits are Q4,4 spread independently, we have
P(P 1,1 , . . ., P 4,4 ) = i,j=1,1 P(P i,j )
⇒ We know all of these probabilities.
⇒ We can now use enumeration
P to compute
P(P i,j | < known >) = α( <unknowns> P(P i,j , < known >, < unknowns >))

Michael Kohlhase: Artificial Intelligence 2 755 2025-02-06

Wumpus Continued
Problem: We only know P_{i,j} for three fields. If we want to compute e.g. P_{1,3} via enumeration, that leaves 2^{4·4−4} = 2^{12} = 4096 terms to sum over!
Let's do better.
 Let b := ¬B_{1,1} ∧ B_{1,2} ∧ B_{2,1} (all the breezes we know about)
 Let p := ¬P_{1,1} ∧ ¬P_{1,2} ∧ ¬P_{2,1} (all the pits we know about)
 Let F := {P_{3,1} ∧ P_{2,2}, ¬P_{3,1} ∧ P_{2,2}, P_{3,1} ∧ ¬P_{2,2}, ¬P_{3,1} ∧ ¬P_{2,2}} (the current "frontier")
 Let O be (the set of assignments for) all the other variables P_{i,j} (i.e. except p, F and our query P_{1,3}).

Then the observed breezes b are conditionally independent of O given p and F. (Whether there is a pit anywhere else does not influence the breezes we observe.)

⇒ P(b|P_{1,3}, p, O, F) = P(b|P_{1,3}, p, F). Let's exploit this!

Michael Kohlhase: Artificial Intelligence 2 756 2025-02-06



Optimized Wumpus

  P(P_{1,3}|p, b) = α(Σ_{o ∈ O, f ∈ F} P(P_{1,3}, b, p, f, o))
                  = α(Σ_{o ∈ O, f ∈ F} P(b|P_{1,3}, p, o, f) · P(P_{1,3}, p, f, o))
                  = α(Σ_{f ∈ F} Σ_{o ∈ O} P(b|P_{1,3}, p, f) · P(P_{1,3}, p, f, o))
                  = α(Σ_{f ∈ F} P(b|P_{1,3}, p, f) · (Σ_{o ∈ O} P(P_{1,3}, p, f, o)))
                  = α(Σ_{f ∈ F} P(b|P_{1,3}, p, f) · (Σ_{o ∈ O} P(P_{1,3}) · P(p) · P(f) · P(o)))
                  = α(P(P_{1,3}) · P(p) · (Σ_{f ∈ F} P(b|P_{1,3}, p, f) · P(f) · (Σ_{o ∈ O} P(o))))

where P(b|P_{1,3}, p, f) ∈ {0, 1} and Σ_{o ∈ O} P(o) = 1.

⇒ this is just a sum over the frontier, i.e. 4 terms

So: P(P_{1,3}|p, b) = α(⟨0.2 · (0.8)³ · (1 · 0.04 + 1 · 0.16 + 1 · 0.16 + 0), 0.8 · (0.8)³ · (1 · 0.04 + 1 · 0.16 + 0 + 0)⟩) ≈ ⟨0.31, 0.69⟩
Analogously: P(P_{3,1}|p, b) = ⟨0.31, 0.69⟩ and P(P_{2,2}|p, b) = ⟨0.86, 0.14⟩ (⇒ avoid [2, 2]!)

Michael Kohlhase: Artificial Intelligence 2 757 2025-02-06
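The frontier computation above is small enough to replay directly. The following sketch enumerates the four frontier assignments and drops the constant factor P(p), since it cancels under α anyway; the consistency check encodes which assignments explain the observed breezes (all names are illustrative):

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

p_pit = 0.2
frontier = [(True, True), (False, True), (True, False), (False, False)]  # (P_3_1, P_2_2)

def P_f(f):
    return (p_pit if f[0] else 1 - p_pit) * (p_pit if f[1] else 1 - p_pit)

def consistent(p13, f):
    # breeze at [1,2] needs a pit in {P_1_3, P_2_2}; breeze at [2,1] needs one in {P_3_1, P_2_2}
    return (p13 or f[1]) and (f[0] or f[1])

terms = []
for p13 in (True, False):
    prior = p_pit if p13 else 1 - p_pit
    terms.append(prior * sum(P_f(f) for f in frontier if consistent(p13, f)))

print([round(x, 2) for x in normalize(terms)])   # [0.31, 0.69]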

Cooking Recipe
In general, when you want to reason probabilistically, a good heuristic is:

1. Try to frame the full joint probability distribution in terms of the probabilities you
know. Exploit product rule/chain rule, independence, conditional independence,
marginalization and domain knowledge (as e.g. P(b|p, f ) ∈ {0, 1})

⇒ the problem can be solved at all!


2. Simplify: Start with the equation for enumeration:

    P(Q|E_1, ...) = α(Σ_{u ∈ U} P(Q, E_1, ..., U_1 = u_1, ...))

3. Substitute by the result of 1., and again, exploit all of our machinery
4. Implement the resulting (system of) equation(s)

5. ???
6. Profit

Michael Kohlhase: Artificial Intelligence 2 758 2025-02-06

Summary
 Probability distributions and conditional probability distributions allow us to repre-
sent random variables as convenient datastructures in an implementation
(Assuming they are finite domain...)

 The full joint probability distribution allows us to compute all probabilities of state-
ments about the random variables contained (But possibly
inefficient)
 Marginalization and normalization are the specific techniques for extracting the
specific probabilities we are interested in from the full joint probability distribution.
 The product and chain rule, exploiting (conditional) independence, Bayes’ Theorem,
and of course domain specific knowledge allow us to do so much more efficiently.
 Naive Bayes models are one example where all these techniques come together.

Michael Kohlhase: Artificial Intelligence 2 759 2025-02-06


Chapter 22

Probabilistic Reasoning: Bayesian Networks

22.1 Introduction
John, Mary, and My Brand-New Alarm
Example 22.1.1 (From Russell/Norvig).

 I got very valuable stuff at home. So I bought an alarm. Unfortunately, the alarm
just rings at home, doesn’t call me on my mobile.
 I’ve got two neighbors, Mary and John, who’ll call me if they hear the alarm.
 The problem is that, sometimes, the alarm is caused by an earthquake.

 Also, John might confuse the alarm with his telephone, and Mary might miss the
alarm altogether because she typically listens to loud music.
⇒ Random variables: Burglary, Earthquake, Alarm, John, Mary.
Given that both John and Mary call me, what is the probability of a burglary?

⇒ This is almost a naive Bayes model, but with multiple causes (Burglary and
Earthquake) for the Alarm, which in turn may cause John and/or Mary.

Michael Kohlhase: Artificial Intelligence 2 760 2025-02-06

John, Mary, and My Alarm: Assumptions

We assume:
 We (should) know P(Alarm|Burglary, Earthquake), P(John|Alarm), and P(Mary|Alarm).
 Burglary and Earthquake are independent.
 John and Mary are conditionally independent given Alarm.
 Moreover: Both John and Mary are conditionally independent of any other random variables in the graph given Alarm. (Only Alarm causes them, and everything else only causes them indirectly through Alarm.)

(Graph: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls)

First Step: Construct the full joint probability distribution,
Second Step: Use enumeration to compute P(Burglary|John = T, Mary = T).

Michael Kohlhase: Artificial Intelligence 2 761 2025-02-06

John, Mary, and My Alarm: The Distribution

P(John, Mary, Alarm, Burglary, Earthquake)


=P(John|Mary, Alarm, Burglary, Earthquake) · P(Mary|Alarm, Burglary, Earthquake)
· P(Alarm|Burglary, Earthquake) · P(Burglary|Earthquake) · P(Earthquake)
=P(John|Alarm) · P(Mary|Alarm) · P(Alarm|Burglary, Earthquake) · P(Burglary) · P(Earthquake)

We plug into the equation for enumeration:

  P(Burglary|John = T, Mary = T) = α(P(Burglary) · Σ_{a ∈ {T,F}} P(John|Alarm = a) · P(Mary|Alarm = a)
      · Σ_{q ∈ {T,F}} P(Alarm = a|Burglary, Earthquake = q) · P(Earthquake = q))

⇒ Now let's scale things up to arbitrarily many variables!

Michael Kohlhase: Artificial Intelligence 2 762 2025-02-06

Bayesian Networks: Definition


Definition 22.1.2. A Bayesian network consists of
1. a directed acyclic graph ⟨X , E⟩ of random variables X = {X 1 , . . ., X n }, and

2. a conditional probability distribution P(X i |Parents(X i )) for every X i ∈ X (also


called the CPT for conditional probability table)
such that every X i is conditionally independent of any conjunctions of non-descendents
of X i given Parents(X i ).
Definition 22.1.3. Let ⟨X, E⟩ be a directed acyclic graph, X ∈ X, and E* the reflexive transitive closure of E. The non-descendents of X are the elements of the set NonDesc(X) := {Y | (X, Y) ∉ E*} \ Parents(X).


Note that the roots of the graph are conditionally independent given the empty set;
i.e. they are independent.
Theorem 22.1.4. The full joint probability distribution of a Bayesian network ⟨X , E⟩
is given by Y
P(X 1 , . . ., X n ) = P(X i |Parents(X i ))
X i ∈X

Michael Kohlhase: Artificial Intelligence 2 763 2025-02-06

Some Applications
 A ubiquitous problem: Observe "symptoms", need to infer "causes". Example application domains (pictured on the original slide): medical diagnosis, face recognition, self-localization, nuclear test ban verification.

Michael Kohlhase: Artificial Intelligence 2 764 2025-02-06

22.2 Constructing Bayesian Networks

Compactness of Bayesian Networks


 Definition 22.2.1. Given random variables X_1, ..., X_n with finite domains D_1, ..., D_n, the size of B := ⟨{X_1, ..., X_n}, E⟩ is defined as

    size(B) := Σ_{i=1}^{n} |D_i| · (∏_{X_j ∈ Parents(X_i)} |D_j|)

 Note: size(B) ≙ the total number of entries in the conditional probability distributions.
 Note: Smaller BN ⇝ fewer probabilities to assess, more efficient inference.

 Observation 22.2.2. The explicit full joint probability distribution has size ∏_{i=1}^{n} |D_i|.
 Observation 22.2.3. If |Parents(X_i)| ≤ k for every X_i, and D_max is the largest random variable domain, then size(B) ≤ n · |D_max|^{k+1}.
 Example 22.2.4. For |D_max| = 2, n = 20, k = 4 we have 2^20 = 1048576 probabilities, but a Bayesian network of size ≤ 20 · 2^5 = 640 ... !
 In the worst case, size(B) = n · (∏_{i=1}^{n} |D_i|), namely if every variable depends on all its predecessors in the chosen variable ordering.
 Intuition: BNs are compact – i.e. of small size – if each variable is directly influenced only by few of its predecessor variables.

Michael Kohlhase: Artificial Intelligence 2 765 2025-02-06
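For the alarm network from the previous section (all five variables Boolean, with the graph structure drawn there), the size can be computed as in the following sketch; the dictionary names are only illustrative:

domain_size = {"Burglary": 2, "Earthquake": 2, "Alarm": 2, "John": 2, "Mary": 2}
parents = {"Burglary": [], "Earthquake": [], "Alarm": ["Burglary", "Earthquake"],
           "John": ["Alarm"], "Mary": ["Alarm"]}

size = 0
for x, d in domain_size.items():
    prod = 1
    for p in parents[x]:
        prod *= domain_size[p]        # product over the parents' domain sizes
    size += d * prod                  # |D_i| * prod_{X_j in Parents(X_i)} |D_j|
print(size)                           # 20, versus 2**5 = 32 entries in the explicit full joint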

Keeping Networks Small


To keep our Bayesian networks small, we can:
1. Reduce the number of edges: ⇒ Order the variables to allow for exploiting
conditional independence (causes before effects), or
2. represent the conditional probability distributions efficiently:
(a) For Boolean random variables X, we only need to store P(X = T|Parents(X))
(P(X = F|Parents(X)) = 1 − P(X = T|Parents(X))) (Cuts the number of
entries in half!)
(b) Introduce different kinds of nodes exploiting domain knowledge; e.g. determin-
istic and noisy disjunction nodes.

Michael Kohlhase: Artificial Intelligence 2 766 2025-02-06

Reducing Edges: Variable Order Matters


Given a set of random variables X 1 , . . ., X n , consider the following (impractical,
but illustrative) pseudo-algorithm for constructing a Bayesian network:

 Definition 22.2.5 (BN construction algorithm).


1. Initialize BN := ⟨{X 1 , . . ., X n }, E⟩ where E = ∅.
2. Fix any variable ordering, X 1 , . . ., X n .
3. for i := 1, . . . , n do
a. Choose a minimal set Parents(X i ) ⊆ {X 1 , . . . ,X i−1 } such that

P(X i |X i−1 , . . . ,X 1 ) = P(X i |Parents(X i ))

b. For each X j ∈ Parents(X i ), insert (X j ,X i ) into E.


c. Associate X i with P(X i |Parents(X i )).
 Attention: Which variables we need to include into Parents(X i ) depends on what
“{X 1 , . . . ,X i−1 }” is . . . !

 Thus: The size of the resulting BN depends on the chosen variable ordering
X 1 , . . ., X n .
 In Particular: The size of a Bayesian network is not a fixed property of the domain.
It depends on the skill of the designer.

Michael Kohlhase: Artificial Intelligence 2 767 2025-02-06

John and Mary Depend on the Variable Order!


 Example 22.2.6. Mary, John, Alarm, Burglary, Earthquake.

Michael Kohlhase: Artificial Intelligence 2 768 2025-02-06

Note: For ?? we try to determine whether – given different value assignments to potential parents – the probability of X_i being true differs. If yes, we include these parents. In the particular case:
1. M to J yes because the common cause may be the alarm.

2. M, J to A yes because they may have heard alarm.


3. A to B yes because if A then higher chance of B.
4. However, M/J to B no because M/J only react to the alarm so if we have the value of A then
values of M/J don’t provide more information about B.

5. A to E yes because if A then higher chance of E.


6. B to E yes because, if A and not B then chances of E are higher than if A and B.

John and Mary Depend on the Variable Order! Ctd.


 Example 22.2.7. Mary, John, Earthquake, Burglary, Alarm.
518 CHAPTER 22. PROBABILISTIC REASONING: BAYESIAN NETWORKS

Michael Kohlhase: Artificial Intelligence 2 769 2025-02-06

Again: Given different value assignments to potential parents, does the probability of Xi being
true differ? If yes, include these parents.
1. M to J as before.
2. M, J to E as probability of E is higher if M/J is true.
3. Same for B; E to B because, given M and J are true, if E is true as well then prob of B is
lower than if E is false.
4. M /J/B/E to A because if M /J/B/E is true (even when changing the value of just one of
these) then probability of A is higher.

John and Mary, What Went Wrong?

 Intuition: These BNs link from effects to their causes!


⇒ Even though Mary and John are conditionally independent given Alarm, this is
not exploited, since Alarm is not ordered before Mary and John!
⇒ Rule of Thumb: We should order causes before symptoms.

Michael Kohlhase: Artificial Intelligence 2 770 2025-02-06

Representing Conditional Distributions: Deterministic Nodes


Definition 22.2.8. A node X in a Bayesian network is called deterministic, if its value is completely determined by the values of Parents(X).

Example 22.2.9. The sum of two dice throws S is entirely determined by the values of the two dice First and Second.
Example 22.2.10. In the Wumpus example, the breezes are entirely determined by the pits.

⇒ Deterministic nodes model direct, causal relationships.


⇒ If X is deterministic, then P (X|Parents(X)) ∈ {0, 1}
⇒ we can replace the conditional probability distribution P(X|Parents(X)) by a
boolean function.
Michael Kohlhase: Artificial Intelligence 2 771 2025-02-06

Representing Conditional Distributions: Noisy Nodes


Sometimes, values of nodes are "almost deterministic":
Example 22.2.11 (Inhibited Causal Dependencies).
Assume the network with edges Cold → Fever, Flu → Fever, Malaria → Fever (shown as a graph on the original slide) contains all possible causes of fever. (Or add a dummy-node for "other causes".)
If there is a fever, then one of them (at least) must be the cause, but none of them necessarily causes a fever: the causal relation between parent and child is inhibited.

⇒ We can model the inhibitions by individual inhibition factors q_d.

Definition 22.2.12. The conditional probability distribution of a noisy disjunction node X with Parents(X) = X_1, ..., X_n in a Bayesian network is given by

    P(X|X_1, ..., X_n) = 1 − (∏_{{j | X_j = T}} q_j),

where the q_i are the inhibition factors of X_i ∈ Parents(X), defined as q_i := P(¬X|¬X_1, ..., ¬X_{i−1}, X_i, ¬X_{i+1}, ..., ¬X_n).

⇒ Instead of a distribution with 2^k parameters, we only need k parameters!

Michael Kohlhase: Artificial Intelligence 2 772 2025-02-06

Representing Conditional Distributions: Noisy Nodes


 Example 22.2.13. Assume the following inhibition factors for ??:

    q_cold = P(¬fever|cold, ¬flu, ¬malaria) = 0.6
    q_flu = P(¬fever|¬cold, flu, ¬malaria) = 0.2
    q_malaria = P(¬fever|¬cold, ¬flu, malaria) = 0.1

If we model Fever as a noisy disjunction node, then the general rule P(¬Fever|Parents(Fever)) = ∏_{{j | X_j = T}} q_j for the CPT gives the following table:

Cold Flu Malaria P (Fever) P (¬Fever)


F F F 0.0 1.0
F F T 0.9 0.1
F T F 0.8 0.2
F T T 0.98 0.02 = 0.2 · 0.1
T F F 0.4 0.6
T F T 0.94 0.06 = 0.6 · 0.1
T T F 0.88 0.12 = 0.6 · 0.2
T T T 0.988 0.012 = 0.6 · 0.2 · 0.1

Michael Kohlhase: Artificial Intelligence 2 773 2025-02-06
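The table can be reproduced mechanically from the three inhibition factors; the following sketch computes P(Fever = T) for all parent assignments via the noisy-OR rule (an empty product is 1, so P(Fever) = 0 when no cause is present). The dictionary names are illustrative.

from itertools import product

q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}    # inhibition factors from Example 22.2.13

for cold, flu, malaria in product((False, True), repeat=3):
    assignment = {"cold": cold, "flu": flu, "malaria": malaria}
    p_not_fever = 1.0
    for cause, present in assignment.items():
        if present:
            p_not_fever *= q[cause]              # multiply inhibition factors of the active causes
    print(cold, flu, malaria, round(1 - p_not_fever, 3))   # reproduces the P(Fever) column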

Representing Conditional Distributions: Summary


 Note that deterministic nodes and noisy disjunction nodes are just two examples of
“specialized” kinds of nodes in a Bayesian network.
 In general, noisy logical relationships in which a variable depends on k parents can be described by O(k) parameters instead of O(2^k) for the full conditional probability table. This can make assessment (and learning) tractable.
 Example 22.2.14. The CPCS network [Pra+94] uses noisy-OR and noisy-MAX dis-
tributions to model relationships among diseases and symptoms in internal medicine.
With 448 nodes and 906 links, it requires only 8,254 values instead of 133,931,430
for a network with full conditional probability distributions.

Michael Kohlhase: Artificial Intelligence 2 774 2025-02-06

22.3 Inference in Bayesian Networks

Probabilistic Inference Tasks in Bayesian Networks


Remember:
Definition 22.3.1 (Probabilistic Inference Task). Let X_1, ..., X_n = Q_1, ..., Q_nQ, E_1, ..., E_nE, U_1, ..., U_nU be a set of random variables. A probabilistic inference task consists of computing the conditional probability distribution P(Q_1, ..., Q_nQ|E_1 = e_1, ..., E_nE = e_nE).
We call
   Q_1, ..., Q_nQ the query variables,
   E_1, ..., E_nE the evidence variables, and
   U_1, ..., U_nU the hidden variables.


We know the full joint probability distribution: P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i|Parents(X_i))

And we know about enumeration:

  P(Q_1, ..., Q_nQ|E_1 = e_1, ..., E_nE = e_nE)
    = α(Σ_{u ∈ D_U} P(Q_1, ..., Q_nQ, E_1 = e_1, ..., E_nE = e_nE, U_1 = u_1, ..., U_nU = u_nU))

(where D_U = dom(U_1) × ... × dom(U_nU))

Michael Kohlhase: Artificial Intelligence 2 775 2025-02-06

Enumeration: The Alarm-Example


Remember our example: P(Burglary|John, Mary)    (hidden variables: Alarm, Earthquake)

  = α(Σ_{b_a, b_e ∈ {T,F}} P(John, Mary, Alarm = b_a, Earthquake = b_e, Burglary))
  = α(Σ_{b_a, b_e ∈ {T,F}} P(John|Alarm = b_a) · P(Mary|Alarm = b_a) · P(Alarm = b_a|Earthquake = b_e, Burglary) · P(Earthquake = b_e) · P(Burglary))

⇒ These are 5 factors in 4 summands (b_a, b_e ∈ {T, F}) over two cases (Burglary ∈ {T, F}),
⇒ 38 arithmetic operations (+3 for α)
General worst case: O(n · 2^n)

Let's do better!
Michael Kohlhase: Artificial Intelligence 2 776 2025-02-06

Enumeration: First Improvement


Some abbreviations: j := John, m := Mary, a := Alarm, e := Earthquake, b := Burglary.

  P(b|j, m) = α(Σ_{b_a, b_e ∈ {T,F}} P(j|a = b_a) · P(m|a = b_a) · P(a = b_a|e = b_e, b) · P(e = b_e) · P(b))

Let's "optimize":

  P(b|j, m) = α(P(b) · (Σ_{b_e ∈ {T,F}} P(e = b_e) · (Σ_{b_a ∈ {T,F}} P(a = b_a|e = b_e, b) · P(j|a = b_a) · P(m|a = b_a))))

⇒ 3 factors in 2 summands + 2 factors in 2 summands + two factors in the outer product, over two cases = 28 arithmetic operations (+3 for α)

Michael Kohlhase: Artificial Intelligence 2 777 2025-02-06

Second Improvement: Variable Elimination 1

Consider P(j|b = T). Using enumeration:

  = α(P(b) · (Σ_{b_e ∈ {T,F}} P(e = b_e) · (Σ_{a_e ∈ {T,F}} P(a = a_e|e = b_e, b) · P(j|a = a_e) · (Σ_{a_m ∈ {T,F}} P(m = a_m|a = a_e)))))

where the innermost sum Σ_{a_m ∈ {T,F}} P(m = a_m|a = a_e) = 1.

⇒ P(John|Burglary = T) does not depend on Mary (duh...)

More generally:
Lemma 22.3.2. Given a query P(Q_1, ..., Q_nQ|E_1 = e_1, ..., E_nE = e_nE), we can ignore (and remove) all hidden leafs of the Bayesian network.
...doing so yields new leafs, which we can then ignore again, etc., until:
Lemma 22.3.3. Given a query P(Q_1, ..., Q_nQ|E_1 = e_1, ..., E_nE = e_nE), we can ignore (and remove) all hidden variables that are not ancestors of any of the Q_1, ..., Q_nQ or E_1, ..., E_nE.

Michael Kohlhase: Artificial Intelligence 2 778 2025-02-06

Enumeration: First Algorithm


Assume the X 1 , . . ., X n are topologically sorted (causes before effects)

function Enumerate-Query(Q, ⟨E_1 = e_1, ..., E_{n_E} = e_{n_E}⟩)
  P := ⟨⟩   /* will become P(Q | E_i = e_i) */
  X_1, ..., X_n := variables filtered according to Lemma 22.3.3, topologically sorted
  for all q ∈ dom(Q) do
    P_q := EnumAll(⟨X_1, ..., X_n⟩, ⟨E_1 = e_1, ..., E_{n_E} = e_{n_E}, Q = q⟩)
  return α(P)

function EnumAll(⟨Y_1, ..., Y_{n_Y}⟩, ⟨A_1 = a_1, ..., A_{n_A} = a_{n_A}⟩)
  /* By construction, Parents(Y_1) ⊂ {A_1, ..., A_{n_A}} */
  if n_Y = 0 then return 1.0
  else if Y_1 = A_j then
    return P(A_j = a_j | Parents(A_j)) · EnumAll(⟨Y_2, ..., Y_{n_Y}⟩, ⟨A_1 = a_1, ..., A_{n_A} = a_{n_A}⟩)
  else
    return ∑_{y ∈ dom(Y_1)} P(Y_1 = y | Parents(Y_1)) · EnumAll(⟨Y_2, ..., Y_{n_Y}⟩, ⟨A_1 = a_1, ..., A_{n_A} = a_{n_A}, Y_1 = y⟩)

General worst case: O(2^n) – better, but still not great
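To make the recursion concrete, here is a minimal Python sketch of inference by enumeration. It is not the course's reference implementation; the dictionary-based network encoding, the restriction to Boolean variables, and the alarm CPT numbers in the usage example are assumptions made for illustration.

def enumerate_query(query, evidence, variables, network):
    """Return the normalized distribution P(query | evidence)."""
    result = {}
    for q in (True, False):
        result[q] = enum_all(variables, {**evidence, query: q}, network)
    total = sum(result.values())                      # normalization (alpha)
    return {q: p / total for q, p in result.items()}

def enum_all(variables, assignment, network):
    # network[var] = (parents, cpt); cpt maps parent-value tuples to P(var = True | parents)
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    parents, cpt = network[first]
    p_true = cpt[tuple(assignment[p] for p in parents)]
    if first in assignment:                           # evidence or query variable
        p = p_true if assignment[first] else 1.0 - p_true
        return p * enum_all(rest, assignment, network)
    # hidden variable: sum over both values
    return sum((p_true if val else 1.0 - p_true) *
               enum_all(rest, {**assignment, first: val}, network)
               for val in (True, False))

# Usage on the alarm example (CPT numbers are the usual textbook values, assumed here):
alarm_net = {
    "Burglary":   ((), {(): 0.001}),
    "Earthquake": ((), {(): 0.002}),
    "Alarm":      (("Burglary", "Earthquake"),
                   {(True, True): 0.95, (True, False): 0.94,
                    (False, True): 0.29, (False, False): 0.001}),
    "John":       (("Alarm",), {(True,): 0.90, (False,): 0.05}),
    "Mary":       (("Alarm",), {(True,): 0.70, (False,): 0.01}),
}
order = ["Burglary", "Earthquake", "Alarm", "John", "Mary"]   # topologically sorted
print(enumerate_query("Burglary", {"John": True, "Mary": True}, order, alarm_net))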

Michael Kohlhase: Artificial Intelligence 2 779 2025-02-06

Enumeration: Example
Variable order: b, e, a, j, m

P_0 := P(b) · (P(e) · (P(a|b, e) · P(j|a) · P(m|a) · 1.0 + P(¬a|b, e) · P(j|¬a) · P(m|¬a) · 1.0)
             + P(¬e) · (P(a|b, ¬e) · P(j|a) · P(m|a) · 1.0 + P(¬a|b, ¬e) · P(j|¬a) · P(m|¬a) · 1.0))

P_1 := P(¬b) · (P(e) · (P(a|¬b, e) · P(j|a) · P(m|a) · 1.0 + P(¬a|¬b, e) · P(j|¬a) · P(m|¬a) · 1.0)
              + P(¬e) · (P(a|¬b, ¬e) · P(j|a) · P(m|a) · 1.0 + P(¬a|¬b, ¬e) · P(j|¬a) · P(m|¬a) · 1.0))

⇒ ⟨P_0/(P_0 + P_1), P_1/(P_0 + P_1)⟩

for

P(b|j = T, m = T) = α(P(b) · (∑_{b_e ∈ {T,F}} P(e = b_e) · (∑_{b_a ∈ {T,F}} P(a = b_a | e = b_e, b) · P(j | a = b_a) · P(m | a = b_a))))

Michael Kohlhase: Artificial Intelligence 2 780 2025-02-06

The Evaluation of P (b|j, m) as a “Search Tree”


P(b|j, m) = α(P(b) · (∑_{b_e ∈ {T,F}} P(e = b_e) · (∑_{b_a ∈ {T,F}} P(a = b_a | e = b_e, b) · P(j | a = b_a) · P(m | a = b_a))))

Note: Enumerate-Query corresponds to a depth-first traversal of an arithmetic expression tree.

Michael Kohlhase: Artificial Intelligence 2 781 2025-02-06

Variable Elimination 2
P(b|j, m) = α(P(b) · (∑_{b_e ∈ {T,F}} P(e = b_e) · (∑_{b_a ∈ {T,F}} P(a = b_a | e = b_e, b) · P(j | a = b_a) · P(m | a = b_a))))

The last two factors P(j | a = b_a), P(m | a = b_a) only depend on a, but are "trapped"
behind the summation over e, hence computed twice in two distinct recursive calls to
EnumAll.
Idea: Instead of left-to-right (top-down DFS), operate right-to-left (bottom-up) and
store intermediate "factors" along with their "dependencies":

α(P(b) · (∑_{b_e ∈ {T,F}} P(e = b_e) · (∑_{b_a ∈ {T,F}} P(a = b_a | e = b_e, b) · P(j | a = b_a) · P(m | a = b_a))))

with factors f_7(b) = P(b), f_5(e) = P(e = b_e), f_3(a,b,e) = P(a = b_a | e = b_e, b), f_2(a) = P(j | a = b_a), f_1(a) = P(m | a = b_a);
the inner sum over b_a yields f_4(b,e), and the outer sum over b_e yields f_6(b).

Michael Kohlhase: Artificial Intelligence 2 782 2025-02-06

Variable Elimination: Example


We only show variable elimination by example: (implementation details get tricky, but the idea is simple)

P(b) · (∑_{b_e ∈ {T,F}} P(e = b_e) · (∑_{b_a ∈ {T,F}} P(a = b_a | e = b_e, b) · P(j | a = b_a) · P(m | a = b_a)))

Assume reverse topological order of variables: m, j, a, e, b


 m is an evidence variable with value T and dependency a, which is a hidden variable.
We introduce a new “factor” f (a):=f1 (a) := ⟨P (m|a), P (m|¬a)⟩.
 j works analogously, f2 (a) := ⟨P (j|a), P (j|¬a)⟩. We “multiply” with the existing
factor, yielding f (a) := ⟨f1 (a) · f2 (a), f1 (¬a) · f2 (¬a)⟩=⟨P (m|a) · P (j|a), P (m|¬a) ·
P (j|¬a)⟩
 a is a hidden variable with dependencies e (hidden) and b (query).
1. We introduce a new “factor” f3 (a, e, b), a 2×2×2 table with the relevant conditional
probabilities P(a|e, b).
2. We multiply each entry of f3 with the relevant entries of the existing factor f ,
yielding f (a, e, b).
3. We “sum out” the resulting factor over a, yielding a new factor f (e, b) = f (a, e, b)+
f (¬a, e, b).
 ...

⇒ can speed things up by a factor of 1000! (or more, depending on the order of
variables!)
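A minimal Python sketch of the two factor operations used above (pointwise product and "summing out"). The representation of factors as dictionaries over Boolean value tuples is an assumption for illustration, not the course's implementation:

from itertools import product

# A factor is (variables, table) where table maps value tuples (over those
# variables, in order) to numbers. All variables are Boolean here.
def multiply(f, g):
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = list(dict.fromkeys(f_vars + g_vars))      # union, order preserved
    out_tab = {}
    for values in product((True, False), repeat=len(out_vars)):
        assignment = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(assignment[v] for v in f_vars)] *
                           g_tab[tuple(assignment[v] for v in g_vars)])
    return (out_vars, out_tab)

def sum_out(var, f):
    f_vars, f_tab = f
    out_vars = [v for v in f_vars if v != var]
    out_tab = {}
    for values, p in f_tab.items():
        assignment = dict(zip(f_vars, values))
        key = tuple(assignment[v] for v in out_vars)
        out_tab[key] = out_tab.get(key, 0.0) + p
    return (out_vars, out_tab)

# e.g. summing a out of f(a, e, b) = f3(a,e,b) * f2(a) * f1(a) yields f(e, b):
# f_eb = sum_out("a", multiply(f3, multiply(f2, f1)))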

Michael Kohlhase: Artificial Intelligence 2 783 2025-02-06

The Complexity of Exact Inference


 Definition 22.3.4. A graph G is called singly connected, or a polytree (otherwise
multiply connected), if there is at most one undirected path between any two nodes
in G.
 Theorem 22.3.5 (Good News). On singly connected Bayesian networks, variable
elimination runs in polynomial time.
 Is our BN for Mary & John a polytree? (Yes.)
 Theorem 22.3.6 (Bad News). For multiply connected Bayesian networks, prob-
abilistic inference is #P-hard. (#P is harder than NP, i.e.
NP ⊆ #P)

 So?: Life goes on . . . In the hard cases, if need be we can throw exactitude to
the winds and approximate.
 Example 22.3.7. Sampling techniques as in MCTS.

Michael Kohlhase: Artificial Intelligence 2 784 2025-02-06

22.4 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29228.

Summary
 Bayesian networks (BN) are a wide-spread tool to model uncertainty, and to reason
about it. A BN represents conditional independence relations between random vari-
ables. It consists of a graph encoding the variable dependencies, and of conditional
probability tables (CPTs).
 Given a variable ordering, the BN is small if every variable depends on only a few
of its predecessors.
 Probabilistic inference requires to compute the probability distribution of a set
of query variables, given a set of evidence variables whose values we know. The
remaining variables are hidden.
 Inference by enumeration takes a BN as input, then applies Normalization+Marginalization,
the chain rule, and exploits conditional independence. This can be viewed as a tree
search that branches over all values of the hidden variables.

 Variable elimination avoids unnecessary computation. It runs in polynomial time for


poly-tree BNs. In general, exact probabilistic inference is #P-hard. Approximate
probabilistic inference methods exist.

Michael Kohlhase: Artificial Intelligence 2 785 2025-02-06

Topics We Didn’t Cover Here


 Inference by sampling: A whole zoo of methods for doing this exists.

 Clustering: Pre-combining subsets of variables to reduce the running time of in-


ference.
 Compilation to SAT: More precisely, to “weighted model counting” in CNF for-
mulas. Model counting extends DPLL with the ability to determine the number
of satisfying interpretations. Weighted model counting allows us to define a mass for
each such interpretation (= the probability of an atomic event).
 Dynamic BN: BN with one slice of variables at each “time step”, encoding proba-
bilistic behavior over time.
 Relational BN: BN with predicates and object variables.

 First-order BN: Relational BN with quantification, i.e. probabilistic logic. E.g.,


the BLOG language developed by Stuart Russell and co-workers.

Michael Kohlhase: Artificial Intelligence 2 786 2025-02-06

Reading:
• Chapter 14: Probabilistic Reasoning of [RN03].
– Section 14.1 roughly corresponds to my “What is a Bayesian Network?”.
– Section 14.2 roughly corresponds to my “What is the Meaning of a Bayesian Network?” and
“Constructing Bayesian Networks”. The main change I made here is to define the semantics
of the BN in terms of the conditional independence relations, which I find clearer than RN’s
definition that uses the reconstructed full joint probability distribution instead.
– Section 14.4 roughly corresponds to my “Inference in Bayesian Networks”. RN give full details
on variable elimination, which makes for nice ongoing reading.
– Section 14.3 discusses how CPTs are specified in practice.
– Section 14.5 covers approximate sampling-based inference.
– Section 14.6 briefly discusses relational and first-order BNs.
– Section 14.7 briefly discusses other approaches to reasoning about uncertainty.

All of this is nice as additional background reading.


Chapter 23

Making Simple Decisions Rationally

23.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/30338.

Overview
We now know how to update our world model, represented as (a set of) random
variables, given observations. Now we need to act.
For that we need to answer two questions:
Questions:
 Given a world model and a set of actions, what will the likely consequences of each
action be?
 How “good” are these consequences?

Idea:
 Represent actions as “special random variables”:
Given disjoint actions a1 , . . ., an , introduce a random variable A with domain {a1 , . . ., an }.
Then we can model/query P(X|A = ai ).
 Assign numerical values to the possible outcomes of actions (i.e. a function
u : dom(X) → R) indicating their desirability.

 Choose the action that maximizes the expected value of u

Definition 23.1.1. Decision theory investigates decision problems, i.e. how a model-
based agent a deals with choosing among actions based on the desirability of their
outcomes given by a real-valued utility function u on states s ∈ S: i.e. u : S → R.

Michael Kohlhase: Artificial Intelligence 2 787 2025-02-06

Decision Theory
If our states are random variables, then we obtain a random variable for the utility
function:
Observation: Let X_i : Ω → D_i be random variables on a probability model ⟨Ω, P⟩ and
f : D_1 × ... × D_n → E. Then F(x) := f(X_1(x), ..., X_n(x)) is a random variable Ω → E.
Definition 23.1.2. Given a probability model ⟨Ω, P⟩ and a random variable X : Ω → D
with D ⊆ R, then E(X) := ∑_{x∈D} P(X = x) · x is called the expected value (or
expectation) of X. (Assuming the sum/series is actually defined!)
Analogously, let e_1, ..., e_n be a sequence of events. Then the expected value of X
given e_1, ..., e_n is defined as E(X|e_1, ..., e_n) := ∑_{x∈D} P(X = x|e_1, ..., e_n) · x.
Putting things together:
Definition 23.1.3. Let A : Ω → D be a random variable (where D is a set of actions),
X_i : Ω → D_i random variables (the state), and u : D_1 × ... × D_n → R a utility function.
Then the expected utility of the action a ∈ D is the expected value of u (interpreted
as a random variable) given A = a; i.e.

EU(a) := ∑_{⟨x_1,...,x_n⟩ ∈ D_1 × ... × D_n} P(X_1 = x_1, ..., X_n = x_n | A = a) · u(x_1, ..., x_n)
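A small Python sketch of this definition, together with choosing the action of maximal expected utility. The dictionary-based distributions, the toy state space, and the concrete numbers are assumptions for illustration only:

# Expected utility of an action, given P(state | action) as a dictionary mapping
# state tuples (x_1, ..., x_n) to probabilities, and a utility function on states.
def expected_utility(action, state_dist_given, utility):
    dist = state_dist_given(action)          # e.g. obtained by Bayesian network inference
    return sum(p * utility(state) for state, p in dist.items())

def best_action(actions, state_dist_given, utility):
    return max(actions, key=lambda a: expected_utility(a, state_dist_given, utility))

# Toy usage: state = (wet,), actions influence P(wet). Numbers are assumed.
P_wet = {"take_umbrella": 0.05, "no_umbrella": 0.40}

def dist_given(action):
    p = P_wet[action]
    return {(True,): p, (False,): 1 - p}

def utility(state):
    return -100 if state[0] else 10          # being wet is bad

print(best_action(["take_umbrella", "no_umbrella"], dist_given, utility))
# -> "take_umbrella" (EU 4.5 vs. -34)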

Michael Kohlhase: Artificial Intelligence 2 788 2025-02-06

Utility-based Agents
 Definition 23.1.4. A utility-based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
 Agent Schema:

[Figure 2.14 from Russell/Norvig: A model-based, utility-based agent. It uses a model of the world, along with
a utility function that measures its preferences among states of the world. Then it chooses the
action that leads to the best expected utility, where expected utility is computed by averaging
over all possible outcome states, weighted by the probability of the outcome.]

Michael Kohlhase: Artificial Intelligence 2 789 2025-02-06

Maximizing Expected Utility (Ideas)


Definition 23.1.5 (MEU principle for Rationality). We call an action rational if it
maximizes expected utility (MEU). A utility-based agent is called rational, iff it always
chooses a rational action.

Hooray: This solves all of AI. (in principle)
Problem: There is a long, long way towards an operationalization ;)
Note: An agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities.


Example 23.1.6. A simple reflex agent for tic tac toe based on a perfect lookup table is
rational if we take (the negative of) “winning/drawing in n steps” as the utility function.
Example 23.1.7 (AI1). Heuristics in tree search (greedy search, A∗ ) and game-play
(minimax, alpha-beta pruning) maximize “expected” utility.
⇒ In fully observable, deterministic environments, “expected utility” reduces to a
specific determined utility value:
EU(a) = U(T(S(s, e), a)), where e is the most recent percept, s the current state, S
the sensor function, and T the transition function.
Now let’s figure out how to actually assign utilities!

Michael Kohlhase: Artificial Intelligence 2 790 2025-02-06

23.2 Decision Networks


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30345.
Now that we understand multi-attribute utility functions, we can complete our design of a utility-
based agent, which we now recapitulate as a refresher. As we already use Bayesian networks for
the belief state of an utility-based agent, integrating utilities and possible actions into the network
suggests itself naturally. This leads to the notion of a decision network.

Decision networks
Definition 23.2.1. A decision network is a Bayesian net-
work with two additional kinds of nodes:

 action nodes, representing a set of possible actions,


and (square nodes)
 A single utility node (also called value node).
(diamond node)

General Algorithm: Given evidence E_j = e_j and action nodes A_1, ..., A_k, compute
the expected utility of each action, given the evidence, i.e. return the sequence of
actions

argmax_{a_1,...,a_k} ∑_{⟨x_1,...,x_n⟩} P(X_i = x_i | A_1 = a_1, ..., A_k = a_k, E_j = e_j) · U(X_i = x_i)

where the sum is the expected utility of a_1, ..., a_k, and the conditional probabilities
are computed by the usual Bayesian network inference.

Note the sheer amount of summands in the sum above in the general case! (⇒
We will simplify where possible later)

Michael Kohlhase: Artificial Intelligence 2 791 2025-02-06

Decision Networks: Example


 Example 23.2.2 (A Decision-Network for Aortic Coarctation). from [Luc96]

Michael Kohlhase: Artificial Intelligence 2 792 2025-02-06

23.3 Preferences and Utilities


Preferences in Deterministic Environments
Problem: How do we determine the utility of a state? (We cannot directly measure
our satisfaction/happiness in a possibly future state...) (What unit would we even
use?)
Example 23.3.1. I have to decide whether to go to class today (or sleep in). What is
the utility of this lecture? (obviously 42)
Idea: We can let people/agents choose between two states (subjective preference)
and derive a utility from these choices.
Example 23.3.2. Give me your cell-phone or I will give you a bloody nose. ⇝
To make a decision in a deterministic environment, the agent must determine whether
it prefers a state without a phone to one with a bloody nose.
Definition 23.3.3. Given states A and B (we call them prizes) an agent can express
preferences of the form

 A≻B   A preferred over B
 A∼B   indifference between A and B
 A⪰B   B not preferred over A
i.e. Given a set S (of states), we define binary relations ≻ and ∼ on S.

Michael Kohlhase: Artificial Intelligence 2 793 2025-02-06

Preferences in Non-Deterministic Environments


Problem: In nondeterministic environments we do not have full information about the
states we choose between.
Example 23.3.4 (Airline Food). Do you want chicken or pasta? (but we cannot see
through the tin foil)


Definition 23.3.5.
Let S be a set of states. We call a random variable X with domain {A_1, ..., A_n} ⊆ S
a lottery and write [p_1,A_1; ...; p_n,A_n], where p_i = P(X = A_i).
Idea: A lottery represents the result of a nondeterministic action that can have out-
comes Ai with prior probability pi . For the binary case, we use [p,A;1−p,B]. We can
then extend preferences to include lotteries, as a measure of how strongly we prefer one
prize over another.
Convention: We assume S to be closed under lotteries, i.e. lotteries themselves are
also states. That allows us to consider lotteries such as [p,A;1−p,[q,B;1−q,C]].

Michael Kohlhase: Artificial Intelligence 2 794 2025-02-06

Rational Preferences
Note: Preferences of a rational agent must obey certain constraints – An agent with
rational preferences can be described as an MEU-agent.
Definition 23.3.6. We call a set ≻ of preferences rational, iff the following constraints
hold:
Orderability A≻B ∨ B≻A ∨ A∼B
Transitivity A≻B ∧ B≻C ⇒ A≻C
Continuity A≻B≻C ⇒ (∃p.[p,A;1−p,C]∼B)
Substitutability A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C]
Monotonicity A≻B ⇒ ((p > q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B])
Decomposability [p,A;1−p,[q,B;1−q,C]]∼[p,A ; ((1 − p)q),B ; ((1 − p)(1 − q)),C]

From a set of rational preferences, we can obtain a meaningful utility function.

Michael Kohlhase: Artificial Intelligence 2 795 2025-02-06

The rationality constraints can be understood as follows:


Orderability: A≻B ∨ B≻A ∨ A∼B Given any two prizes or lotteries, a rational agent must either
prefer one to the other or else rate the two as equally preferable. That is, the agent cannot
avoid deciding. Refusing to bet is like refusing to allow time to pass.
Transitivity: A≻B ∧ B≻C ⇒ A≻C
Continuity: A≻B≻C ⇒ (∃p.[p,A;1−p,C]∼B) If some lottery B is between A and C in preference,
then there is some probability p for which the rational agent will be indifferent between getting
B for sure and the lottery that yields A with probability p and C with probability 1 − p.
Substitutability: A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C] If an agent is indifferent between two lotteries
A and B, then the agent is indifferent between two more complex lotteries that are the same
except that B is substituted for A in one of them. This holds regardless of the probabilities and
the other outcome(s) in the lotteries.
Monotonicity: A≻B ⇒ ((p > q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B]) Suppose two lotteries have the same
two possible outcomes, A and B. If an agent prefers A to B, then the agent must prefer the
lottery that has a higher probability for A (and vice versa).
Decomposability: [p,A;1−p,[q,B;1−q,C]]∼[p,A;((1−p)q),B ;((1−p)(1−q)),C] Compound lotteries
can be reduced to simpler ones using the laws of probability. This has been called the “no fun
in gambling” rule because it says that two consecutive lotteries can be compressed into a single
equivalent lottery: the following two are equivalent:
[Diagram: the compound lottery [p,A;1−p,[q,B;1−q,C]] and the equivalent simple lottery [p,A ; ((1 − p)q),B ; ((1 − p)(1 − q)),C]]

Rational preferences contd.


 Violating the rationality constraints from Definition 23.3.6 leads to self-evident irrationality.
 Example 23.3.7. An agent with intransitive preferences can be induced to give
away all its money:
 If B≻C, then an agent who has C would pay (say) 1 cent to get B
 If A≻B, then an agent who has B would pay (say) 1 cent to get A
 If C≻A, then an agent who has A would pay (say) 1 cent to get C

Michael Kohlhase: Artificial Intelligence 2 796 2025-02-06

23.4 Utilities
Ramsey's Theorem and Value Functions
 Theorem 23.4.1. (Ramsey, 1931; von Neumann and Morgenstern, 1944)
Given a rational set of preferences there exists a real valued function U such that
U(A) ≥ U(B), iff A⪰B and U([p_1,S_1; ...; p_n,S_n]) = ∑_i p_i · U(S_i)

 This is an existence theorem, uniqueness not guaranteed.


 Note: Agent behavior is invariant w.r.t. positive linear transformations, i.e. an
agent with utility function U ′ (x) = k 1 U (x) + k 2 where k 1 > 0 behaves exactly like
one with U .

 Observation: With deterministic prizes only (no lottery choices), only a total
ordering on prizes can be determined.
 Definition 23.4.2. We call a total ordering on states a value function or ordinal
utility function. (If we don’t need to care about relative utilities of states, e.g. to
compute non-trivial expected utilities, that’s all we need anyway!)

Michael Kohlhase: Artificial Intelligence 2 797 2025-02-06

Utilities
 Intuition: Utilities map states to real numbers.
 Question: Which numbers exactly?
 Definition 23.4.3 (Standard approach to assessment of human utilities).
Compare a given state A to a standard lottery Lp that has
 “best possible prize” u⊤ with probability p
 “worst possible catastrophe” u⊥ with probability 1 − p
adjust lottery probability p until A∼Lp . Then U (A) = p.

 Example 23.4.4. Choose u⊤ =̂ current state, u⊥ =̂ instant death. Then

pay $30 ∼ [0.999999, continue as before; 0.000001, instant death]

Michael Kohlhase: Artificial Intelligence 2 798 2025-02-06

Popular Utility Functions


 Definition 23.4.5. Normalized utilities: u⊤ = 1, u⊥ = 0.
(Not very meaningful, but at least it’s independent of the specific problem...)
 Obviously: Money (Very intuitive, often easy to determine, but actually not
well-suited as a utility function (see later))

 Definition 23.4.6. Micromorts: one millionth chance of instant death.


(useful for Russian roulette, paying to reduce product risks, etc.)
But: Not necessarily a good measure of risk, if the risk is “merely” severe injury or
illness. . .
Better:
 Definition 23.4.7. QALYs: quality adjusted life years
QALYs are useful for medical decisions involving substantial risk.

Michael Kohlhase: Artificial Intelligence 2 799 2025-02-06

Comparing Utilities
Problem: What is the monetary value of a micromort?
Just ask people: What would you pay to avoid playing Russian roulette with a million-
barrelled revolver? (Usually: quite a lot!)

But their behavior suggests a lower price:


 Driving in a car for 370 km incurs a risk of one micromort;
 Over the life of your car – say, 150,000 km – that's 400 micromorts.

 People appear to be willing to pay about 10,000€ more for a safer car that halves
the risk of death. (⇝ 25€ per micromort)
This figure has been confirmed across many individuals and risk types.
Of course, this argument holds only for small risks. Most people won’t agree to kill
themselves for 25M€. (Also: People are pretty bad at estimating and comparing
risks, especially if they are small.) (Various cognitive biases and heuristics are at work
here!)

Michael Kohlhase: Artificial Intelligence 2 800 2025-02-06

Money vs. Utility


 Money does not behave as a utility function should.

 Given a lottery L with expected monetary value EMV(L), usually U (L) < U (EMV(L)),
i.e., people are risk averse.
 Utility curve: For what probability p am I indifferent between a prize x and a
lottery [p,M $;1−p,0$] for large numbers M ?

 Typical empirical data, extrapolated with risk-prone behavior for debtors:

 Empirically: Comes close to the logarithm on the natural numbers.

Michael Kohlhase: Artificial Intelligence 2 801 2025-02-06

23.5 Multi-Attribute Utility


Video Nuggets covering this section can be found at https://fau.tv/clip/id/30343 and
https://fau.tv/clip/id/30344.
In this section we will make the ideas introduced above more practical. The discussion above
conceived utility functions as functions on atomic states, which were good enough for introducing
the theory. But when we build decision models for a utility-based agent, we want to characterize
states by attributes that are already random variables in the Bayesian network we use to represent
the belief state. For factored states, the utility function can be expressed as a multivariate function
on attribute values.

Utility Functions on Attributes


Recap: So far we understand how to obtain utility functions u : S → R on states s ∈ S
from (rational) preferences.
But in practice, our actions often impact multiple distinct “attributes” that need to
be weighed against each other.
⇒ Lotteries become complex very quickly
Definition 23.5.1. Let X 1 , . . ., X n be random variables with domains D1 , . . ., Dn .
Then we call a function u : D1 × . . . × Dn → R a (multi-attribute) utility function on
attributes X 1 , . . ., X n .

Note: In the general (worst) case, a multi-attribute utility function on n random
variables with domain sizes k each requires k^n parameters to represent.
But: A utility function on multiple attributes often has “internal structure” that we
can exploit to simplify things.
For example, the distinct attributes are often “independent” with respect to their
utility (a higher-quality product is better than a lower-quality one that costs the same,
and a cheaper product is better than an expensive one of the same quality)

Michael Kohlhase: Artificial Intelligence 2 802 2025-02-06

Multi-Attribute Utility: Example


 Example 23.5.2 (Assessing an Airport Site).

[Diagram: Air Traffic → Deaths, Litigation → Noise, Construction → Cost, all feeding into U]

 Attributes: Deaths, Noise, Cost.
 Question: What is U(Deaths, Noise, Cost) for a projected airport?

 How can complex utility function be assessed from preference behaviour?


 Idea 1: Identify conditions under which decisions can be made without complete
identification of U (X 1 , . . ., X n ).
 Idea 2: Identify various types of independence in preferences and derive consequent
canonical forms for U (X 1 , . . ., X n ).

Michael Kohlhase: Artificial Intelligence 2 803 2025-02-06

Strict Dominance
First Assumption: U is often monotone in each argument. (wlog. growing)
Definition 23.5.3. (Informally) An action A strictly dominates an action B, iff every
possible outcome of A is at least as good as every possible outcome of B.

If A strictly dominates B, we can just ignore B entirely.

Observation: Strict dominance seldom holds in practice (life is difficult) but is useful
for narrowing down the field of contenders.

Michael Kohlhase: Artificial Intelligence 2 804 2025-02-06

Stochastic Dominance
Definition 23.5.4. Let X_1, X_2 be distributions with domains ⊆ R.
X1 stochastically dominates X2 iff for all t ∈ R, we have P (X1 ≥ t) ≥ P (X2 ≥ t),
and for some t, we have P (X1 ≥ t) > P (X2 ≥ t).
Observation 23.5.5. If U is monotone in X1 , and P(X1 |a) stochastically dominates
P(X1 |b) for actions a, b, then a is always the better choice than b, with all other
attributes Xi being equal.
⇒ If some action P(Xi |a) stochastically dominates P(Xi |b) for all attributes Xi ,
we can ignore b.
Observation: Stochastic dominance can often be determined without exact distribu-
tions using qualitative reasoning.
Example 23.5.6 (Construction cost increases with distance). If airport location
S_1 is closer to the city than S_2, then S_1 stochastically dominates S_2 on cost.

Michael Kohlhase: Artificial Intelligence 2 805 2025-02-06

We have seen how we can do inference with attribute-based utility functions; let us now consider the
computational implications. We observe that we have just replaced one evil – exponentially many
states (in terms of the attributes) – by another – exponentially many parameters of the utility
functions.
So we do what we always do in AI-2: we look for structure in the domain and do more theory to
be able to turn such structures into computationally improved representations.

Preference structure: Deterministic


 Recall: In deterministic environments an agent has a value function.
 Definition 23.5.7. X_1 and X_2 are preferentially independent of X_3 iff the preference
between ⟨x_1, x_2, z⟩ and ⟨x'_1, x'_2, z⟩ does not depend on z. (i.e. the tradeoff
between x_1 and x_2 is independent of z)
 Example 23.5.8. E.g., ⟨Noise, Cost, Safety⟩ are preferentially independent:
⟨20,000 suffer, 4.6 G$, 0.06 deaths/mpm⟩ vs. ⟨70,000 suffer, 4.2 G$, 0.06 deaths/mpm⟩

 Theorem 23.5.9 (Leontief, 1947). If every pair of attributes is preferentially


independent of its complement, then every subset of attributes is preferentially in-
dependent of its complement: mutual preferential independence.
 Theorem 23.5.10 (Debreu, 1960). Mutual preferential independence implies
that there is an additive value function: V(S) = ∑_i V_i(X_i(S)), where V_i is a value
function referencing just one variable X_i.
 Hence assess n single-attribute functions. (often a good approximation)

 Example 23.5.11. The value function for the airport decision might be

V(noise, cost, deaths) = −noise · 10^4 − cost − deaths · 10^12

Michael Kohlhase: Artificial Intelligence 2 806 2025-02-06

Preference structure: Stochastic


Definition 23.5.12. X is utility independent of Y iff preferences over lotteries in X
do not depend on particular values in Y
Definition 23.5.13. A set X is mutually utility independent (MUI), iff each subset is
utility independent of its complement.
Theorem 23.5.14. For a MUI set of attributes X, there is a multiplicative utility
function of the form: [Kee74]

U = ∑_{{X_0,...,X_k} ⊆ X} ∏_{i=1}^{k} U_i(X_i = x_i)

⇒ U can be represented using n single-attribute utility functions.


System Support: Routine procedures and software packages for generating preference
tests to identify various canonical families of utility functions.

Michael Kohlhase: Artificial Intelligence 2 807 2025-02-06

Decision networks - Improvements


Ways to improve inference in decision networks:

 Exploit “inner structure” of the utility function to simplify the computation,


 eliminate dominated actions,
 label pairs of nodes with stochastic dominance: If (the utility of) some attribute
dominates (the utility of) another attribute, focus on the dominant one
(e.g. if price is always more important than quality, ignore quality whenever the
price between two choices differs)
 various techniques for variable elimination,
 policy iteration (more on that when we talk about Markov decision procedures)

Michael Kohlhase: Artificial Intelligence 2 808 2025-02-06

23.6 The Value of Information


Video Nuggets covering this section can be found at https://fau.tv/clip/id/30346 and
https://fau.tv/clip/id/30347.

So far we have tacitly been concentrating on actions that directly affect the environment. We
will now come to a type of action we have hypothesized in the beginning of the course, but have
completely ignored up to now: information gathering actions.

What if we do not have all information we need?


We now know how to exploit the information we have to make decisions. But if we
knew more, we might be able to make even better decisions in the long run – potentially
at the cost of some utility now. (exploration vs. exploitation)
Example 23.6.1 (Medical Diagnosis).
 We do not expect a doctor to already know the results of the diagnostic tests when
the patient comes in.
 Tests are often expensive, and sometimes hazardous. (directly or by delaying
treatment)
 Therefore: Only test, if
 knowing the results leads to a significantly better treatment plan,
 information from test results is not drowned out by the a-priori likelihood.

Definition 23.6.2. Information value theory is concerned with an agent making rational
decisions about information gathering.

Michael Kohlhase: Artificial Intelligence 2 809 2025-02-06

Value of Information by Example


Idea: Compute the expected gain in utility from acquiring information.
Example 23.6.3 (Buying Oil Drilling Rights). There are n blocks of drilling rights
available, exactly one block actually has oil worth k€, in particular:
 The prior probability of a block having oil is 1/n each (mutually exclusive).
 The current price of each block is k/n €.
 A “consultant” offers an accurate survey of block (say) 3. How much should we be
willing to pay for the survey?

Solution: Compute the expected value of the best action given the information, minus
the expected value of the best action without information.
Example 23.6.4 (Oil Drilling Rights contd.).
 The survey may say "oil in block 3" with probability 1/n; then we buy block 3 for k/n €
and make a profit of (k − k/n) €.
 The survey may say "no oil in block 3" with probability (n−1)/n; then we buy another block,
and make an expected profit of k/(n−1) − k/n €.
 Without the survey, the expected profit is 0.
 Expected profit with the survey: (1/n) · ((n−1)k/n) + ((n−1)/n) · (k/(n(n−1))) = k/n.
 So, we should pay up to k/n € for the information. (as much as block 3 is worth!)
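A quick numeric sanity check of this computation as a Python sketch; the concrete values n = 10 and k = 1000 are just assumed example numbers:

n, k = 10, 1000.0                        # assumed example values
profit_oil    = k - k / n                # survey says "oil in block 3"
profit_no_oil = k / (n - 1) - k / n      # expected profit from buying another block
expected_with_survey = (1 / n) * profit_oil + ((n - 1) / n) * profit_no_oil
print(expected_with_survey, k / n)       # both are approx. 100.0, i.e. k/n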

Michael Kohlhase: Artificial Intelligence 2 810 2025-02-06

General formula (VPI)


Definition 23.6.5. Let A be the set of available actions and F a random variable. Given
evidence E_i = e_i, let α be the action that maximizes expected utility a priori, and α_f the
action that maximizes expected utility given F = f, i.e.: α = argmax_{a∈A} EU(a | E_i = e_i)
and α_f = argmax_{a∈A} EU(a | E_i = e_i, F = f).
The value of perfect information (VPI) on F given evidence E_i = e_i is defined as

VPI_{E_i = e_i}(F) := (∑_{f ∈ dom(F)} P(F = f | E_i = e_i) · EU(α_f | E_i = e_i, F = f)) − EU(α | E_i = e_i)

Intuition: The VPI is the expected gain from knowing the value of F relative to
the current expected utility, and considering the relative probabilities of the possible
outcomes of F .
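A minimal Python sketch of this definition. How the conditional distribution of F and the expected utilities are obtained (e.g. from a decision network) is left abstract, so the function signatures here are assumptions for illustration:

def vpi(actions, F_values, p_f, eu_given):
    """Value of perfect information on a variable F.

    p_f(f)          -- P(F = f | current evidence)
    eu_given(a, f)  -- EU(a | current evidence, F = f); f = None means "no extra evidence"
    """
    eu_prior = max(eu_given(a, None) for a in actions)            # EU of the a-priori best action
    eu_informed = sum(p_f(f) * max(eu_given(a, f) for a in actions)
                      for f in F_values)
    return eu_informed - eu_prior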
Michael Kohlhase: Artificial Intelligence 2 811 2025-02-06

Properties of VPI
 Observation 23.6.6 (VPI is Non-negative).
VPI_E(F) ≥ 0 for all F and E (in expectation, not post hoc)
 Observation 23.6.7 (VPI is Non-additive).
VPIE (F, G) ̸= VPIE (F ) + VPIE (G) (consider, e.g., obtaining F twice)
 Observation 23.6.8 (VPI is Order-independent).

VPIE (F, G) = VPIE (F ) + VPIE,F (G) = VPIE (G) + VPIE,G (F )

 Note: When more than one piece of evidence can be gathered,


maximizing VPI for each to select one is not always optimal
; evidence-gathering becomes a sequential decision problem.

Michael Kohlhase: Artificial Intelligence 2 812 2025-02-06

Qualitative behavior of VPI


 Question: Say we have three distributions for P (U |Ej )

Qualitatively: What is the value of information (VPI) in these three cases?


 Answers: reserved for the plenary sessions ⇝ be there!

Michael Kohlhase: Artificial Intelligence 2 813 2025-02-06

We will now use information value theory to specialize our utility-based agent from above.

A simple Information-Gathering Agent

 Definition 23.6.9. A simple information gathering agent. (gathers info before


acting)
function Information−Gathering−Agent (percept) returns an action
persistent: D, a decision network
integrate percept into D
j := argmax_k VPI_E(E_k)/Cost(E_k)
if VPI_E(E_j) > Cost(E_j) return Request(E_j)
else return the best action from D

The next percept after Request(Ej ) provides a value for Ej .


 Problem: The information gathering implemented here is myopic, i.e. only ac-
quires a single evidence variable, or acts immediately. (cf. greedy
search)
 But it works relatively well in practice. (e.g. outperforms humans for selecting
diagnostic tests)

 Strategies for nonmyopic information gathering exist (Not discussed in this


course)

Michael Kohlhase: Artificial Intelligence 2 814 2025-02-06

Summary
 An MEU agent maximizes expected utility.
 Decision theory provides a framework for rational decision making.

 Decision networks augment Bayesian networks with action nodes and a utility node.
 rational preferences allow us to obtain a utility function (orderability, transitivity,
continuity, substitutability, monotonicity, decomposability)
 multi-attribute utility functions can usually be “destructured” to allow for better
inference and representation (can be monotone, attributes may dominate others,
actions may dominate others, may be multiplicative,...)
 information value theory tells us when to explore rather than exploit, using
 VPI (value of perfect information) to determine how much to “pay” for information.

Michael Kohlhase: Artificial Intelligence 2 815 2025-02-06


Chapter 24

Temporal Probability Models

24.1 Modeling Time and Uncertainty

Stochastic Processes
The world changes in stochastically predictable ways.
Example 24.1.1.
 The weather changes, but the weather tomorrow is somewhat predictable given
today’s weather and other factors, (which in turn (somewhat) depends on
yesterday’s weather, which in turn...)

 the stock market changes, but the stock price tomorrow is probably related to
today’s price,
 A patient’s blood sugar changes, but their blood sugar is related to their blood
sugar 10 minutes ago (in particular if they didn’t eat anything in between)

How do we model this?


Definition 24.1.2. Let ⟨Ω, P⟩ be a probability space and ⟨S, ⪯⟩ a (not necessarily totally)
ordered set.
A sequence of random variables (X t )t∈S with dom(X t ) = D is called a stochastic
process over the time structure S.
Intuition: X t models the outcome of the random variable X at time step t. The
sample space Ω corresponds to the set of all possible sequences of outcomes.
Note: We will almost exclusively use ⟨S, ⪯⟩ = ⟨N, ≤⟩.
Definition 24.1.3. Given a stochastic process X_t over S and a, b ∈ S with a ⪯ b, we
write X_{a:b} for the sequence X_a, X_{a+1}, ..., X_{b−1}, X_b and E^{=e}_{a:b} for E_a = e_a, ..., E_b = e_b.
Michael Kohlhase: Artificial Intelligence 2 816 2025-02-06

Stochastic Processes (Running Example)


Example 24.1.4 (Umbrellas). You are a security guard in a secret underground
facility and want to know whether it is raining outside. Your only source of information is whether
the director comes in with an umbrella.
 We have a stochastic process Rain0 , Rain1 , Rain2 , . . . of hidden variables, and


 a related stochastic process Umbrella0 , Umbrella1 , Umbrella2 , . . . of evidence


variables.
...and a combined stochastic process ⟨Rain0 , Umbrella0 ⟩, ⟨Rain1 , Umbrella1 ⟩, . . .
Note that Umbrellat only depends on Raint , not on e.g. Umbrellat−1 (except
indirectly through Raint / Raint−1 ).
Definition 24.1.5. We call a stochastic process of hidden variables a state variable.

Michael Kohlhase: Artificial Intelligence 2 817 2025-02-06

Markov Processes
Idea: Construct a Bayesian network from these variables (parents?)
...without everything exploding in size...?
Definition 24.1.6. Let (X_t)_{t∈S} be a stochastic process. X has the (nth order) Markov
property iff X_t only depends on a bounded subset of X_{0:t−1} – i.e. for all t ∈ S we have
P(X_t | X_0, ..., X_{t−1}) = P(X_t | X_{t−n}, ..., X_{t−1}) for some n ∈ S.
A stochastic process with the Markov property for some n is called a (nth order)
Markov process.
Important special cases:
Definition 24.1.7.

 First-order Markov property: P(Xt |X0:t−1 ) = P(Xt |Xt−1 )

Xt−2 Xt−1 Xt Xt+1 Xt+2

A first order Markov process is called a Markov chain.


 Second-order Markov property: P(Xt |X0:t−1 ) = P(Xt |Xt−2 , Xt−1 )

Xt−2 Xt−1 Xt Xt+1 Xt+2

Michael Kohlhase: Artificial Intelligence 2 818 2025-02-06

Markov Process Example: The Umbrella


Example 24.1.8 (Umbrellas continued). We model the situation in a Bayesian net-
work:
Raint−1 Raint Raint+1

Umbrellat−1 Umbrellat Umbrellat+1

Problem: This network does not actually have the First-order Markov property...
Possible fixes: We have two ways to fix this:

1. Increase the order of the Markov process. (more dependencies ⇒ more complex
inference)
2. Add more state variables, e.g., Tempt , Pressuret . (more information sources)

Michael Kohlhase: Artificial Intelligence 2 819 2025-02-06

Markov Process Example: Robot Motion


Example 24.1.9 (Random Robot Motion). Assume we want to track a robot wan-
dering randomly on the X/Y plane, whose position we can only observe roughly (e.g.
by approximate GPS coordinates:) Markov chain
Vt−1 Vt Vt+1

Xt−1 Xt Xt+1

Zt−1 Zt Zt+1

 the velocity V i may change unpredictably.


 the exact position X i depends on previous position X i−1 and velocity V i−1
 the position X i influences the observed position Z i .

Example 24.1.10 (Battery Powered Robot). If the robot has a battery, the Markov
property is violated!
 Battery exhaustion has a systematic effect on the change in velocity.
 This depends on how much power was used by all previous manoeuvres.

Michael Kohlhase: Artificial Intelligence 2 820 2025-02-06

Markov Process Example: Robot Motion


Idea: We can restore the Markov property by including a state variable for the charge
level B t . (Better still: Battery level sensor)
Example 24.1.11 (Battery Powered Robot Motion).
Mt−1 Mt Mt+1

Bt−1 Bt Bt+1

Vt−1 Vt Vt+1

Xt−1 Xt Xt+1

Zt−1 Zt Zt+1

 Battery level B i is influenced by previous level B i−1 and velocity V i−1 .


 Velocity V i is influenced by previous level B i−1 and velocity V i−1 .

 Battery meter M i is only influenced by Battery level B i .

Michael Kohlhase: Artificial Intelligence 2 821 2025-02-06

Stationary Markov Processes as Transition Models


Remark 24.1.12. Given a stochastic process with state variables X_t and evidence vari-
ables E_t, then P(X_t | X_{0:t−1}) is a transition model and P(E_t | X_{0:t}, E_{1:t−1}) a sensor model
in the sense of a model-based agent.
Note that we assume that the X_t do not depend on the E_t.
Also note that with the (nth order) Markov property, the transition model simplifies to P(X_t | X_{t−n}, ..., X_{t−1}).
Problem: Even with the Markov property the transition model is infinite. (t ∈ N)
Definition 24.1.13. A Markov chain is called stationary if the transition model is
independent of time, i.e. P(X t |X t−1 ) is the same for all t.
Example 24.1.14 (Umbrellas are stationary). P(Raint |Raint−1 ) does not depend
on t. (need only one table)

Raint−1 Raint Raint+1


Rt−1 P (Rt )
T 0.7
F 0.3

Umbrellat−1 Umbrellat Umbrellat+1

Don’t confuse “stationary” (Markov processes) with “static” (environments).


We restrict ourselves to stationary Markov processes in AI-2.

Michael Kohlhase: Artificial Intelligence 2 822 2025-02-06

Markov Sensor Models


Recap: The sensor model P(E t |X0:t , E1:t−1 ) allows us (using Bayes rule et al) to
update our belief state about X t given the observations E0:t .
Problem: The evidence variables E t could depend on any of the variables X0:t , E1:t−1 ...

Definition 24.1.15. We say that a sensor model has the sensor Markov property, iff
P(E t |X0:t , E1:t−1 ) = P(E t |X t ) – i.e., the sensor model depends only on the current
state.

Assumptions on Sensor Models: We usually assume the sensor Markov property and
make it stationary as well: P(E t |X t ) is fixed for all t.
Definition 24.1.16 (Note).
 If a Markov chain X is stationary and discrete, we can represent the transition
model as a matrix Tij := P (X t = j|X t−1 = i).

 If a sensor model has the sensor Markov property, we can represent each observation
E t = et at time t as the diagonal matrix Ot with Otii := P (E t = et |X t = i).
 A pair ⟨X, E⟩ where X is a (stationary) Markov chain, E i only depends on X i ,
and E has the sensor Markov property is called a (stationary) Hidden Markov Model
(HMM). (X and E are single variables)

Michael Kohlhase: Artificial Intelligence 2 823 2025-02-06

Umbrellas, the full Story


Example 24.1.17 (Umbrellas, Transition & Sensor Models).

Raint−1 Raint Raint+1


Rt−1 P (Rt )
T 0.7
F 0.3 Rt P (Ut )
T 0.9
F 0.2

Umbrellat−1 Umbrellat Umbrellat+1

This is a hidden Markov model


Observation 24.1.18. If we know the initial prior probabilities P(X 0 ) (=
b time t = 0),
then we can compute the full joint probability distribution as
t
Y
P(X0:t , E1:t ) = P(X 0 ) · ( P(X i |X i−1 ) · P(E i |X i ))
i=1
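As a sketch, this is how the joint probability of a concrete rain/umbrella sequence could be computed in Python, using the CPTs from the tables above; the prior P(R_0) = ⟨0.5, 0.5⟩ and the list-based encoding are assumptions for illustration:

# Joint probability P(X_{0:t}, E_{1:t}) for the umbrella HMM, using the CPTs above.
P_R0 = {True: 0.5, False: 0.5}                     # prior (assumed, as in the filtering example below)
P_R_given_prev = {True: 0.7, False: 0.3}           # P(R_t = T | R_{t-1})
P_U_given_R = {True: 0.9, False: 0.2}              # P(U_t = T | R_t)

def joint(rains, umbrellas):
    """rains = [r_0, ..., r_t], umbrellas = [u_1, ..., u_t] (Booleans)."""
    p = P_R0[rains[0]]
    for i in range(1, len(rains)):
        p_r = P_R_given_prev[rains[i - 1]]
        p *= p_r if rains[i] else 1 - p_r          # transition model P(X_i | X_{i-1})
        p_u = P_U_given_R[rains[i]]
        p *= p_u if umbrellas[i - 1] else 1 - p_u  # sensor model P(E_i | X_i)
    return p

print(joint([True, True, True], [True, True]))     # P(r_0, r_1, r_2, u_1, u_2)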

Michael Kohlhase: Artificial Intelligence 2 824 2025-02-06

24.2 Inference: Filtering, Prediction, and Smoothing

Inference tasks
Definition 24.2.1. Given a Markov process with state variables X t and evidence
variables E t , we are interested in the following Markov inference tasks:
 Filtering (or monitoring) P(X_t | E^{=e}_{1:t}): Given the sequence of observations up until
time t, compute the likely state of the world at current time t.
 Prediction (or state estimation) P(X_{t+k} | E^{=e}_{1:t}) for k > 0: Given the sequence of
observations up until time t, compute the likely future state of the world at time t + k.
 Smoothing (or hindsight) P(X_{t−k} | E^{=e}_{1:t}) for 0 < k < t: Given the sequence of
observations up until time t, compute the likely past state of the world at time t − k.
 Most likely explanation argmax_{x_{1:t}} P(X_{1:t} = x_{1:t} | E^{=e}_{1:t}): Given the sequence of
observations up until time t, compute the most likely sequence of states that led to these observations.

Note: The most likely sequence of states is not (necessarily) the sequence of most
likely states ;-)
In this section, we assume X and E to represent multiple variables, where X jointly
forms a Markov chain and the E jointly have the sensor Markov property.
In the case where X and E are stationary single variables, we have a stationary
hidden Markov model and can use the matrix forms.
Michael Kohlhase: Artificial Intelligence 2 825 2025-02-06

Filtering (Computing the Belief State given Evidence)


Note:

 Using the full joint probability distribution, we can compute any conditional prob-
ability we want, but not necessarily efficiently.
 We want to use filtering to update our ‘‘world model” P(X t ) based on a new
observation E t = et and our previous world model P(X t−1 ).

⇒ We want a function P(X_t | E^{=e}_{1:t}) = F(e_t, P(X_{t−1} | E^{=e}_{1:t−1})), where the second argument
is itself of the form F(e_{t−1}, ...).

Spoiler:

F(e_t, P(X_{t−1} | E^{=e}_{1:t−1})) = α(O_t · T^T · P(X_{t−1} | E^{=e}_{1:t−1}))

Michael Kohlhase: Artificial Intelligence 2 826 2025-02-06

Filtering Derivation

P(X_t | E^{=e}_{1:t}) = P(X_t | E_t = e_t, E^{=e}_{1:t−1})   (dividing up evidence)
= α(P(E_t = e_t | X_t, E^{=e}_{1:t−1}) · P(X_t | E^{=e}_{1:t−1}))   (using Bayes' rule)
= α(P(E_t = e_t | X_t) · P(X_t | E^{=e}_{1:t−1}))   (sensor Markov property)
= α(P(E_t = e_t | X_t) · (∑_{x∈dom(X)} P(X_t | X_{t−1} = x, E^{=e}_{1:t−1}) · P(X_{t−1} = x | E^{=e}_{1:t−1})))   (marginalization)
= α(P(E_t = e_t | X_t) · (∑_{x∈dom(X)} P(X_t | X_{t−1} = x) · P(X_{t−1} = x | E^{=e}_{1:t−1})))   (conditional independence)

where the first factor is the sensor model, P(X_t | X_{t−1} = x) is the transition model, and
P(X_{t−1} = x | E^{=e}_{1:t−1}) is the recursive call.

Reminder: In a stationary HMM, we have the matrices T_{ij} = P(X_t = j | X_{t−1} = i)
and O_{tii} = P(E_t = e_t | X_t = i).
Then interpreting P(X_{t−1} | E^{=e}_{1:t−1}) as a vector, the above corresponds exactly to the
matrix multiplication α(O_t · T^T · P(X_{t−1} | E^{=e}_{1:t−1})).

Definition 24.2.2. We call the inner part of the above expression the forward algorithm,
i.e. P(X_t | E^{=e}_{1:t}) = α(FORWARD(e_t, P(X_{t−1} | E^{=e}_{1:t−1}))) =: f_{1:t}.

Michael Kohlhase: Artificial Intelligence 2 827 2025-02-06

Filtering the Umbrellas


Example 24.2.3. Let's assume:
 P(R_0) = ⟨0.5, 0.5⟩, (Note that with growing t (and evidence), the impact of the
prior at t = 0 vanishes anyway)
 P(R_{t+1}|R_t) = 0.6, P(¬R_{t+1}|¬R_t) = 0.8, P(U_t|R_t) = 0.9 and P(¬U_t|¬R_t) = 0.85

⇒ T = [[0.6, 0.4], [0.2, 0.8]]

 The director carries an umbrella on days 1 and 2, and not on day 3.

⇒ O_1 = O_2 = [[0.9, 0], [0, 0.15]] and O_3 = [[0.1, 0], [0, 0.85]].

Then:
 f_{1:1} := P(R_1|U_1 = T) = α(P(U_1 = T|R_1) · (∑_{b∈{T,F}} P(R_1|R_0 = b) · P(R_0 = b)))
   = α(⟨0.9, 0.15⟩ · (⟨0.6, 0.4⟩ · 0.5 + ⟨0.2, 0.8⟩ · 0.5)) = α(⟨0.36, 0.09⟩) = ⟨0.8, 0.2⟩
 Using matrices: α(O_1 · T^T · ⟨0.5, 0.5⟩) = α([[0.9, 0], [0, 0.15]] · [[0.6, 0.2], [0.4, 0.8]] · ⟨0.5, 0.5⟩)
   = α([[0.9·0.6, 0.9·0.2], [0.15·0.4, 0.15·0.8]] · ⟨0.5, 0.5⟩)
   = α(⟨0.9·0.6·0.5 + 0.9·0.2·0.5, 0.15·0.4·0.5 + 0.15·0.8·0.5⟩) = α(⟨0.36, 0.09⟩)

Michael Kohlhase: Artificial Intelligence 2 828 2025-02-06

Filtering the Umbrellas (Continued)


Example 24.2.4. f_{1:1} := P(R_1|U_1 = T) = ⟨0.8, 0.2⟩
 f_{1:2} := P(R_2|U_2 = T, U_1 = T) = α(O_2 · T^T · f_{1:1}) = α(P(U_2 = T|R_2) · (∑_{b∈{T,F}} P(R_2|R_1 = b) · f_{1:1}(b)))
   = α(⟨0.9, 0.15⟩ · (⟨0.6, 0.4⟩ · 0.8 + ⟨0.2, 0.8⟩ · 0.2)) = α(⟨0.468, 0.072⟩) = ⟨0.87, 0.13⟩
 f_{1:3} := P(R_3|U_3 = F, U_2 = T, U_1 = T) = α(O_3 · T^T · f_{1:2})
   = α(P(U_3 = F|R_3) · (∑_{b∈{T,F}} P(R_3|R_2 = b) · f_{1:2}(b)))
   = α(⟨0.1, 0.85⟩ · (⟨0.6, 0.4⟩ · 0.87 + ⟨0.2, 0.8⟩ · 0.13)) = α(⟨0.0547, 0.3853⟩) = ⟨0.12, 0.88⟩
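The same computation as a small numpy sketch; the array-based encoding of T, O_t, and the messages is an assumption for illustration, and the printed values reproduce f_{1:1}, f_{1:2}, f_{1:3} above up to rounding:

import numpy as np

T = np.array([[0.6, 0.4],                    # P(R_t | R_{t-1}): rows = previous state (T, F)
              [0.2, 0.8]])
O_umbrella    = np.diag([0.9, 0.15])         # sensor matrix for U_t = T
O_no_umbrella = np.diag([0.1, 0.85])         # sensor matrix for U_t = F

def forward(f, O):
    """One filtering step: f_{1:t} = alpha * O_t * T^T * f_{1:t-1}."""
    f = O @ T.T @ f
    return f / f.sum()                       # normalization (alpha)

f = np.array([0.5, 0.5])                     # prior P(R_0)
for O in (O_umbrella, O_umbrella, O_no_umbrella):   # evidence: U_1, U_2, not U_3
    f = forward(f, O)
    print(f)                                 # -> [0.8, 0.2], [0.87, 0.13], [0.12, 0.88] (rounded)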

Michael Kohlhase: Artificial Intelligence 2 829 2025-02-06

Prediction in Markov Chains


Prediction: P(X_{t+k} | E^{=e}_{1:t}) for k > 0.
Intuition: Prediction is filtering without new evidence – i.e. we can use filtering until
t, and then continue as follows:
Lemma 24.2.5. By the same reasoning as filtering:

P(X_{t+k+1} | E^{=e}_{1:t}) = ∑_{x∈dom(X)} P(X_{t+k+1} | X_{t+k} = x) · P(X_{t+k} = x | E^{=e}_{1:t}) = T^T · P(X_{t+k} | E^{=e}_{1:t})

(the first factor is the transition model, the second the recursive call; the last equality is the matrix form for HMMs)

Observation 24.2.6. As k → ∞, P(X_{t+k} | E^{=e}_{1:t}) converges towards a fixed point called
the stationary distribution of the Markov chain. (which we can compute from the
equation S = T^T · S)
⇒ the impact of the evidence vanishes.
⇒ The stationary distribution only depends on the transition model.
⇒ There is a small window of time (depending on the transition model) where
the evidence has enough impact to allow for prediction beyond the mere stationary
distribution, called the mixing time of the Markov chain.


⇒ Predicting the future is difficult, and the further into the future, the more difficult
it is (Who knew...)
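For the umbrella chain, the stationary distribution can be found by solving S = T^T · S, or simply by iterating the prediction step; a small numpy sketch, assuming the transition matrix from the filtering example:

import numpy as np

T = np.array([[0.6, 0.4],
              [0.2, 0.8]])

S = np.array([0.5, 0.5])
for _ in range(100):          # iterate the prediction step until it converges
    S = T.T @ S
print(S)                      # -> approx. [0.3333, 0.6667], the stationary distribution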

Michael Kohlhase: Artificial Intelligence 2 830 2025-02-06

Smoothing
Smoothing: P(X_{t−k} | E^{=e}_{1:t}) for k > 0.
Intuition: Use filtering to compute P(X_{t−k} | E^{=e}_{1:t−k}), then recurse backwards from t until t − k.

P(X_{t−k} | E^{=e}_{1:t}) = P(X_{t−k} | E^{=e}_{t−(k−1):t}, E^{=e}_{1:t−k})   (Divide the evidence)
= α(P(E^{=e}_{t−(k−1):t} | X_{t−k}, E^{=e}_{1:t−k}) · P(X_{t−k} | E^{=e}_{1:t−k}))   (Bayes Rule)
= α(P(E^{=e}_{t−(k−1):t} | X_{t−k}) · P(X_{t−k} | E^{=e}_{1:t−k}))   (cond. independence)
= α(f_{1:t−k} × b_{t−(k−1):t})

where we abbreviate b_{t−(k−1):t} := P(E^{=e}_{t−(k−1):t} | X_{t−k}) and f_{1:t−k} := P(X_{t−k} | E^{=e}_{1:t−k}),
and × denotes component-wise multiplication.

Michael Kohlhase: Artificial Intelligence 2 831 2025-02-06

Smoothing (continued)
Definition 24.2.7 (Backward message).

b_{t−k:t} = P(E^{=e}_{t−k:t} | X_{t−(k+1)})
= ∑_{x∈dom(X)} P(E^{=e}_{t−k:t} | X_{t−k} = x, X_{t−(k+1)}) · P(X_{t−k} = x | X_{t−(k+1)})
= ∑_{x∈dom(X)} P(E^{=e}_{t−k:t} | X_{t−k} = x) · P(X_{t−k} = x | X_{t−(k+1)})
= ∑_{x∈dom(X)} P(E_{t−k} = e_{t−k}, E^{=e}_{t−(k−1):t} | X_{t−k} = x) · P(X_{t−k} = x | X_{t−(k+1)})
= ∑_{x∈dom(X)} P(E_{t−k} = e_{t−k} | X_{t−k} = x) · P(E^{=e}_{t−(k−1):t} | X_{t−k} = x) · P(X_{t−k} = x | X_{t−(k+1)})

where the first factor is the sensor model, the second is b_{t−(k−1):t}, and the third the transition model.

Note: in a stationary hidden Markov model, we get the matrix formulation b_{t−k:t} = T · O_{t−k} · b_{t−(k−1):t}.
Definition 24.2.8. We call the associated algorithm the backward algorithm, i.e.
P(X_{t−k} | E^{=e}_{1:t}) = α(FORWARD(e_{t−k}, f_{1:t−(k+1)}) × BACKWARD(e_{t−(k−1)}, b_{t−(k−2):t}))
where the FORWARD term is f_{1:t−k} and the BACKWARD term is b_{t−(k−1):t}.
As a starting point for the recursion, we let b_{t+1:t} be the uniform vector with 1 in every
component.

Michael Kohlhase: Artificial Intelligence 2 832 2025-02-06

Smoothing example
Example 24.2.9 (Smoothing Umbrellas). Reminder: We assumed P(R_0) = ⟨0.5, 0.5⟩,
P(R_{t+1}|R_t) = 0.6, P(¬R_{t+1}|¬R_t) = 0.8, P(U_t|R_t) = 0.9, P(¬U_t|¬R_t) = 0.85
⇒ T = [[0.6, 0.4], [0.2, 0.8]], O_1 = O_2 = [[0.9, 0], [0, 0.15]] and O_3 = [[0.1, 0], [0, 0.85]].
(The director carries an umbrella on days 1 and 2, and not on day 3)
f_{1:1} = ⟨0.8, 0.2⟩, f_{1:2} = ⟨0.87, 0.13⟩ and f_{1:3} = ⟨0.12, 0.88⟩
Let's compute

P(R_1 | U_1 = T, U_2 = T, U_3 = F) = α(f_{1:1} × b_{2:3})

 We need to compute b_{2:3} and b_{3:3}:
 b_{3:3} = T · O_3 · b_{4:3} = [[0.6, 0.4], [0.2, 0.8]] · [[0.1, 0], [0, 0.85]] · ⟨1, 1⟩ = ⟨0.4, 0.7⟩
 b_{2:3} = T · O_2 · b_{3:3} = [[0.6, 0.4], [0.2, 0.8]] · [[0.9, 0], [0, 0.15]] · ⟨0.4, 0.7⟩ = ⟨0.258, 0.156⟩

⇒ α(⟨0.8, 0.2⟩ × ⟨0.258, 0.156⟩) = α(⟨0.2064, 0.0312⟩) = ⟨0.87, 0.13⟩
⇒ Given the evidence U_2, ¬U_3, the posterior probability for R_1 went up from 0.8 to 0.87!

Michael Kohlhase: Artificial Intelligence 2 833 2025-02-06

Forward/Backward Algorithm for Smoothing


Definition 24.2.10. Forward backward algorithm: returns the sequence of posterior
distributions P(X 1 ). . .P(X t ) given evidence e1 , . . ., et :
function Forward-Backward(⟨e1 , . . ., et ⟩,P(X 0 ))
f := ⟨P(X 0 )⟩
b := ⟨1, 1, . . .⟩
S := ⟨P(X 0 )⟩
for i = 1, . . . , t do
fi := FORWARD(fi−1 , ei ) /* filtering */
for i = t, . . . , 1 do
Si := α(fi × b) /* smoothing */
b := BACKWARD(b, ei )
return S

Time complexity linear in t (polytree inference), Space complexity O(t · |f |).
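A compact numpy sketch of the forward-backward algorithm on the umbrella evidence [U_1 = T, U_2 = T, U_3 = F]; the array-based encoding is an assumption for illustration, and the first smoothed distribution reproduces the ⟨0.87, 0.13⟩ computed above:

import numpy as np

T = np.array([[0.6, 0.4], [0.2, 0.8]])
def O(u):                                    # sensor matrix for evidence U_t = u
    return np.diag([0.9, 0.15]) if u else np.diag([0.1, 0.85])

def forward_backward(evidence, prior):
    t = len(evidence)
    f = [prior]                              # forward messages f_{1:0}, ..., f_{1:t}
    for e in evidence:
        v = O(e) @ T.T @ f[-1]
        f.append(v / v.sum())
    S = [None] * t
    b = np.ones(2)                           # backward message b_{t+1:t}
    for i in range(t - 1, -1, -1):
        s = f[i + 1] * b
        S[i] = s / s.sum()                   # smoothed P(X_{i+1} | e_{1:t})
        b = T @ O(evidence[i]) @ b           # BACKWARD step
    return S

for s in forward_backward([True, True, False], np.array([0.5, 0.5])):
    print(s)                                 # first line: approx. [0.87, 0.13]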

Michael Kohlhase: Artificial Intelligence 2 834 2025-02-06

Country dance algorithm


Idea: If T and Oi are invertible, we can avoid storing all forward messages in the
smoothing algorithm by running filtering backwards:

f_{1:i+1} = α(O_{i+1} · T^T · f_{1:i})

⇒ f_{1:i} = α((T^T)^{−1} · O_{i+1}^{−1} · f_{1:i+1})

⇒ we can trade space complexity for time complexity:
 In the first for-loop, we only compute the final f_{1:t} (No need to store the
intermediate results)
 In the second for-loop, we compute both f_{1:i} and b_{t−i:t} (Only one copy of f_{1:i},
b_{t−i:t} is stored)
⇒ constant space.

But: Requires that both matrices are invertible, i.e. every observation must be
possible in every state. (Possible hack: increase the probabilities of 0 to “negligibly
small”)

Michael Kohlhase: Artificial Intelligence 2 835 2025-02-06

Most Likely Explanation


Smoothing allows us to compute the sequence of most likely states X_1, ..., X_t given
E^{=e}_{1:t}. What if we want the most likely sequence of states? i.e. max_{x_1,...,x_t} P(X_{1:t} = x_{1:t} | E^{=e}_{1:t})?
Example 24.2.11. Given the sequence U1 , U2 , ¬U3 , U4 , U5 , the most likely state for R3
is F, but the most likely sequence might be that it rained throughout...
Prominent Application: In speech recognition, we want to find the most likely word
sequence, given what we have heard. (can be quite noisy)
Idea:
 For every xt ∈ dom(X) and 0 ≤ i ≤ t, recursively compute the most likely path
X 1 , . . ., X i ending in X i = xi given the observed evidence.

 remember the xi−1 that most likely leads to xi .


 Among the resulting paths, pick the one to the X t = xt with the most likely path,
 and then recurse backwards.

⇒ we want to know max_{x_1,...,x_{t−1}} P(X_{1:t−1} = x_{1:t−1}, X_t | E^{=e}_{1:t}), and then pick the x_t with the
maximal value.
Michael Kohlhase: Artificial Intelligence 2 836 2025-02-06

Most Likely Explanation (continued)


By the same reasoning as for filtering:

max_{x_1,...,x_{t−1}} P(X_{1:t−1} = x_{1:t−1}, X_t | E^{=e}_{1:t})
= α(P(E_t = e_t | X_t) · max_{x_{t−1}} (P(X_t | X_{t−1} = x_{t−1}) · max_{x_1,...,x_{t−2}} P(X_{1:t−2} = x_{1:t−2}, X_{t−1} = x_{t−1} | E^{=e}_{1:t−1})))

where the first factor is the sensor model, P(X_t | X_{t−1} = x_{t−1}) the transition model, and
the innermost maximum is abbreviated as m_{1:t−1}(x_{t−1}).

m1:t (i) gives the maximal probability that the most likely path up to t leads to state
X t = i.
Note that we can leave out the α, since we’re only interested in the maximum.
Example 24.2.12. For the sequence [T, T, F, T, T]:
[Figure 15.5 from Russell/Norvig: (a) possible state sequences for Rain_t viewed as paths through a graph of
the possible states at each time step; (b) operation of the Viterbi algorithm for the umbrella observation
sequence [T, T, F, T, T], showing for each t the message m_{1:t}, i.e. the probability of the best sequence
reaching each state at time t: m_{1:1} = ⟨.8182, .1818⟩, m_{1:2} = ⟨.5155, .0491⟩, m_{1:3} = ⟨.0361, .1237⟩,
m_{1:4} = ⟨.0334, .0173⟩, m_{1:5} = ⟨.0210, .0024⟩.]

bold arrows: best predecessor measured by "best preceding sequence probability × transition probability"

Michael Kohlhase: Artificial Intelligence 2 837 2025-02-06

The Viterbi Algorithm

Definition 24.2.13. The Viterbi algorithm now proceeds as follows:

function Viterbi(⟨e_1, ..., e_t⟩, P(X_0))
  m := P(X_0)   /* m_{1:i} */
  prev := ⟨⟩    /* the most likely predecessor of each possible x_i */
  for i = 1, . . . , t do
    m′ := max_{x_{i−1}} (P(E_i = e_i | X_i) · P(X_i | X_{i−1} = x_{i−1}) · m_{x_{i−1}})
    prev_i := argmax_{x_{i−1}} (P(E_i = e_i | X_i) · P(X_i | X_{i−1} = x_{i−1}) · m_{x_{i−1}})
    m ←− m′
  P := ⟨0, 0, ..., argmax_{x∈dom(X)} m_x⟩
  for i = t − 1, . . . , 0 do
    P_i := prev_{i,P_{i+1}}
  return P

Observation 24.2.14. Viterbi has linear time complexity and linear space complexity
(needs to keep the most likely sequence leading to each state).
Equation (15.11) is identical to the filtering equation (15.5) except that

Michael Kohlhase: Artificial Intelligence 2 838 2025-02-06
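To make the recursion concrete, here is a minimal Python sketch of the Viterbi algorithm for the
umbrella world of Example 24.2.12. It is an illustration only: the state encoding (0 for rain, 1
for no rain) and all identifiers are ours, not part of the reference material.

  # States: 0 = Rain, 1 = no Rain; an observation is True iff the umbrella is seen.
  PRIOR = [0.5, 0.5]                              # P(X_0)
  TRANS = [[0.7, 0.3], [0.3, 0.7]]                # TRANS[a][b] = P(X_i = b | X_{i-1} = a)
  SENSOR = {True: [0.9, 0.2], False: [0.1, 0.8]}  # SENSOR[e][b] = P(E_i = e | X_i = b)

  def viterbi(evidence):
      # first message m_{1:1}: one filtering step (the normalization alpha can be dropped)
      m = [SENSOR[evidence[0]][j] * sum(PRIOR[i] * TRANS[i][j] for i in (0, 1)) for j in (0, 1)]
      back = []                                   # back pointers: best predecessor of each state
      for e in evidence[1:]:
          best = [max((0, 1), key=lambda i: TRANS[i][j] * m[i]) for j in (0, 1)]
          m = [SENSOR[e][j] * TRANS[best[j]][j] * m[best[j]] for j in (0, 1)]
          back.append(best)
      path = [max((0, 1), key=lambda j: m[j])]    # most likely final state
      for best in reversed(back):                 # walk the back pointers
          path.append(best[path[-1]])
      return list(reversed(path))

  print(viterbi([True, True, False, True, True])) # -> [0, 0, 1, 0, 0]: rain on days 1, 2, 4, 5

The back pointers play the role of the prev entries in the pseudocode above, and only one message
vector m is kept at any time, which is where the linear space bound comes from.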

24.3 Hidden Markov Models – Extended Example

Example: Robot Localization using Common Sense


Example 24.3.1 (Robot Localization in a Maze). A robot has four sonar sensors
that tell it about obstacles in four directions: N, S, W, E.
We write e.g. N S E for the result where the sensors detect obstacles to the north, south,
and east.
We filter out the impossible states:

a) Possible robot locations after e1 = N S W



b) Possible robot locations after e1 = N S W and e2 = N S

Remark 24.3.2. This only works for perfect sensors. (else no impossible states)
What if our sensors are imperfect?

Michael Kohlhase: Artificial Intelligence 2 839 2025-02-06

HMM Example: Robot Localization (Modeling)


Example 24.3.3 (HMM-based Robot Localization). We have the following setup:
 A hidden Random variable X t for robot location (domain: 42 empty squares)
 Let N(i) be the set of fields neighboring field i

 The transition matrix for the move action (T has 42^2 = 1764 entries):

   P(X_{t+1} = j | X_t = i) = T_{ij} = 1/|N(i)| if j ∈ N(i), and 0 otherwise

 We do not know where the robot starts: P(X_0 = i) = 1/n (here n = 42)
 Evidence variable E_t: four bit presence/absence of obstacles in N, S, W, E. Let d_it be
the number of wrong bits and ϵ the error rate of the sensor. Then

   P(E_t = e_t | X_t = i) = O_{t,ii} = (1 − ϵ)^(4−d_it) · ϵ^(d_it)

(We assume the sensors are independent)


For example, the probability that the sensor on a square with obstacles in north and
south would produce N S E is (1 − ϵ)^3 · ϵ^1.
We can now use filtering for localization, smoothing to determine e.g. the starting
location, and the Viterbi algorithm to find out how the robot got to where it is now.

Michael Kohlhase: Artificial Intelligence 2 840 2025-02-06
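The model above translates directly into a filtering implementation. The following Python sketch
is ours and only illustrative: it uses a made-up 1×5 corridor instead of the 42-square maze, but
the transition matrix T, the diagonal sensor matrices O_t, and the update
f_{1:t+1} = α · O_{t+1} T^⊤ f_{1:t} are the ones defined in Example 24.3.3.

  import numpy as np

  # A 1 x 5 corridor (illustration only): every cell has walls to the N and S,
  # the leftmost cell also to the W and the rightmost also to the E.
  walls = [(1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 0, 1)]  # (N, S, W, E)
  n = len(walls)

  T = np.zeros((n, n))                     # T[i, j] = P(X_{t+1} = j | X_t = i)
  for i in range(n):
      nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
      for j in nbrs:
          T[i, j] = 1 / len(nbrs)          # move to a uniformly chosen neighboring cell

  def sensor_matrix(reading, eps):
      # O[i, i] = P(E_t = reading | X_t = i) = (1 - eps)^(4 - d_i) * eps^d_i
      d = np.array([sum(r != w for r, w in zip(reading, ws)) for ws in walls])
      return np.diag((1 - eps) ** (4 - d) * eps ** d)

  def filter_step(f, reading, eps=0.2):
      f = sensor_matrix(reading, eps) @ T.T @ f   # f_{1:t+1} = alpha * O_{t+1} T^T f_{1:t}
      return f / f.sum()                          # normalization (the alpha)

  f = np.full(n, 1 / n)                    # uniform prior P(X_0)
  f = filter_step(f, (1, 1, 1, 0))         # sensors report obstacles N, S, W
  print(f.round(3))                        # the leftmost cell becomes the most likely location

Repeating filter_step for further readings implements localization; the analogous backward and
Viterbi recursions over the same matrices give smoothing and most likely paths.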

HMM Example: Robot Localization


We use the HMM filtering equation f_{1:t+1} = α · O_{t+1} T^⊤ f_{1:t} to compute the posterior
distribution over locations. (i.e. robot localization)
Example 24.3.4. Redoing ??, with ϵ = 0.2.
a) Posterior distribution over robot location after E_1 = N S W

b) Posterior distribution over robot location after E_1 = N S W and E_2 = N S

(figures omitted: the size of each disk corresponds to the probability that the robot is at
that location; the sensor error rate is ϵ = 0.2)

Still the same locations as in the "perfect sensing" case, but now other locations have
non-zero probability.
Michael Kohlhase: Artificial Intelligence 2 841 2025-02-06

HMM Example: Further Inference Applications

 Idea: We can use smoothing: b_{k+1:t} = T O_{k+1} b_{k+2:t} to find out where it started and
the Viterbi algorithm to find the most likely path it took.
 Example 24.3.5. Performance of HMM localization vs. observation length (various error
rates ϵ):

(figure omitted: (a) the localization error, defined as the Manhattan distance from the true
location, and (b) the Viterbi path accuracy, defined as the fraction of correct states on the
Viterbi path, both plotted against the number of observations for sensor error rates
ϵ = 0.00, 0.02, 0.05, 0.10, 0.20; data averaged over 400 runs)

Even when ϵ is 20% (so that the overall sensor reading is wrong 59% of the time), the robot
is usually able to work out its location to within two squares after 25 observations: the
algorithm integrates evidence over time and exploits the probabilistic constraints that the
transition model imposes on the location sequence.

Michael Kohlhase: Artificial Intelligence 2 842 2025-02-06

24.4 Dynamic Bayesian Networks

A Video Nugget covering this section can be found at https://fau.tv/clip/id/30355.

Dynamic Bayesian networks

 Definition 24.4.1. A Bayesian network D is called dynamic (a DBN), iff its random
variables are indexed by a time structure. We assume that D is

 time sliced, i.e. that the time slices Dt – the subgraphs of t-indexed random
variables and the edges between them – are isomorphic.
 a stationary Markov chain, i.e. that variables Xt can only have parents in Dt
and Dt−1 .

 Xt , Et contain arbitrarily many variables in a replicated Bayesian network.


 Example 24.4.2.

(figures omitted: the umbrella DBN and a DBN for robot motion)

Michael Kohlhase: Artificial Intelligence 2 843 2025-02-06

DBNs vs. HMMs


 Observation 24.4.3.
 Every HMM is a single-variable DBN. (trivially)
 Every DBN can be turned into an HMM. (combine variables into tuple ⇒ lose
information about dependencies)
 DBNs have sparse dependencies ; exponentially fewer parameters;

 Example 24.4.4 (Sparse Dependencies). With 20 Boolean state variables, three


parents each, a DBN has 20 · 2^3 = 160 parameters, the corresponding HMM has
2^20 · 2^20 ≈ 10^12.

Michael Kohlhase: Artificial Intelligence 2 844 2025-02-06

Exact inference in DBNs


 Definition 24.4.5 (Naive method). Unroll the network and run any exact algo-
rithm.
(figures omitted: the two-slice umbrella DBN with prior P(R_0) = 0.7, transition CPT
P(R_1 | R_0) = 0.7 / 0.3, and sensor CPT P(U_1 | R_1) = 0.9 / 0.2, and its unrolling into five
time slices Rain_0, . . ., Rain_5 with evidence nodes Umbrella_1, . . ., Umbrella_5 and identical
CPTs in every slice)

 Problem: Inference cost for each update grows with t.


 Definition 24.4.6. Rollup filtering: add slice t + 1, “sum out” slice t using variable
elimination.
 Observation: Largest factor is O(d^(n+1)), update cost O(d^(n+2)), where d is the
maximal domain size.
 Note: Much better than the HMM update cost of O(d^(2n))

Michael Kohlhase: Artificial Intelligence 2 845 2025-02-06

Summary
 Temporal probability models use state and evidence variables replicated over time.
 Markov property and stationarity assumption, so we need both
 a transition model P(Xt |Xt−1 ) and
 a sensor model P(Et |Xt ).
 Tasks are filtering, prediction, smoothing, most likely sequence; (all done
recursively with constant cost per time step)
 Hidden Markov models have a single discrete state variable; (used for speech
recognition)
 DBNs subsume HMMs, exact update intractable.

Michael Kohlhase: Artificial Intelligence 2 846 2025-02-06


556 CHAPTER 24. TEMPORAL PROBABILITY MODELS
Chapter 25

Making Complex Decisions

We will now pick up the thread from ?? but using temporal models instead of simply probabilistic
ones. We will first look at sequential decision theory in the special case where the environment
is stochastic but fully observable (Markov decision processes) and then lift that to obtain
POMDPs and present an agent design based on them.

Outline
We will now combine the ideas of stochastic process with that of acting based on
maximizing expected utility:

 Markov decision processes (MDPs) for sequential environments.

 Value/policy iteration for computing utilities in MDPs.


 Partially observable MDPs (POMDPs).
 Decision theoretic agents for POMDPs.

Michael Kohlhase: Artificial Intelligence 2 847 2025-02-06

25.1 Sequential Decision Problems

Sequential Decision Problems


 Definition 25.1.1. In sequential decision problems, the agent’s utility depends on
a sequence of decisions (or their result states).

 Definition 25.1.2. Utility functions on action sequences are often expressed in


terms of immediate rewards that are incurred upon reaching a (single) state.
 Methods: depend on the environment:
 If it is fully observable ; Markov decision process (MDPs)
 else ; partially observable MDP (POMDP).
 Sequential decision problems incorporate utilities, uncertainty, and sensing.
 Preview: Search problems and planning tasks are special cases.


(diagram omitted: search problems plus explicit actions and subgoals give planning; search
problems plus uncertainty and utility give Markov decision problems (MDPs); planning plus
uncertainty and utility, or MDPs plus explicit actions and subgoals, give decision-theoretic
planning; MDPs plus uncertain sensing (belief states) give partially observable MDPs (POMDPs))

Michael Kohlhase: Artificial Intelligence 2 848 2025-02-06

We will fortify our intuition by an example. It is specifically chosen to be very simple, but
to exhibit all the peculiarities of Markov decision problems, which we will generalize from this
example.

Markov Decision Problem: Running Example


 Example 25.1.3 (Running Example: The 4x3 World). A (fully observable) 4×3
environment with non-deterministic actions:

 States s ∈ S, actions a ∈ As.


 Transition model: P (s′ |s, a) =
b probability that a in s leads to s′ .
 reward function:

   R(s) := −0.04 (a small penalty) for nonterminal states, and ±1 for terminal states

Michael Kohlhase: Artificial Intelligence 2 849 2025-02-06

Perhaps more interesting than the components of an MDP is what is not a component: a
belief and/or sensor model. Recall that MDPs are for fully observable environments.

Markov Decision Process


 Motivation: Let us (for now) consider sequential decision problems in a fully
observable, stochastic environment with a Markovian transition model on a finite
set of states and an additive reward function. (We will switch to partially
observable ones later)
 Definition 25.1.4. A Markov decision process (MDP) ⟨S , A, T , s0 , R⟩ consists of
 a set of S of states (with initial state s0 ∈ S),

 for every state s, a set As of actions.


 a transition model T (s, a) = P(S|s, a), and
 a reward function R : S → R; we call R(s) a reward.

 Idea: We use the rewards as a utility function: The goal is to choose actions such
that the expected cumulative rewards for the “foreseeable future” is maximized
⇒ need to take future actions and future states into account

Michael Kohlhase: Artificial Intelligence 2 850 2025-02-06

Solving MDPs
 In MDPs, the aim is to find an optimal policy π(s), which tells us the best action
for every possible state s. (because we can’t predict where we might end up, we
need to consider all states)

 Definition 25.1.5. A policy π for an MDP is a function mapping each state s to


an action a ∈ As.
An optimal policy is a policy that maximizes the expected total rewards. (for some
notion of “total”...)

 Example 25.1.6. Optimal policy when state penalty R(s) is 0.04:

Note: When you run against a wall, you stay in your square.

Michael Kohlhase: Artificial Intelligence 2 851 2025-02-06

Risk and Reward

 Example 25.1.7. Optimal policy depends on the reward function R(s).

(figure omitted: the optimal policies in the 4x3 world for the reward ranges R(s) < −1.6284,
−0.4278 < R(s) < −0.0850, −0.0221 < R(s) < 0, and R(s) > 0)

 Question: Explain what you see in a qualitative manner!

 Answer: reserved for the plenary sessions ; be there!

Michael Kohlhase: Artificial Intelligence 2 852 2025-02-06

25.2 Utilities over Time


In this section we address the problem that even if the transition models are stationary, the
utilities may not be. In fact we generally have to take the utilities of state sequences into account
in sequential decision problems. If we can derive a notion of the utility of a (single) state from
that, we may be able to reuse the machinery we introduced above, so that is exactly what we will
attempt.

Utility of state sequences


Why rewards?

 Recall: We cannot observe/assess utility functions, only preferences ; induce


utility functions from rational preferences

 Problem: In MDPs we need to understand preferences between sequences of


states.
 Definition 25.2.1. We call preferences on reward sequences stationary, iff

[r, r0 , r1 , r2 , . . .]≻[r, r0′ , r1′ , r2′ , . . .] ⇔ [r0 , r1 , r2 , . . .]≻[r0′ , r1′ , r2′ , . . .]

(i.e. rewards over time are “independent” of each other)


 Good news:
Theorem 25.2.2. For stationary preferences, there are only two ways to combine
rewards over time.
 additive rewards: U ([s0 , s1 , s2 , . . .]) = R(s0 ) + R(s1 ) + R(s2 ) + · · ·
 discounted rewards: U([s_0, s_1, s_2, . . .]) = R(s_0) + γR(s_1) + γ^2 R(s_2) + · · · where
0 ≤ γ ≤ 1 is called the discount factor.
⇒ we can reduce utilities over time to rewards on individual states

Michael Kohlhase: Artificial Intelligence 2 853 2025-02-06

Utilities of State Sequences


Problem: Infinite lifetimes ; additive rewards may become infinite.
Possible Solutions:

1. Finite horizon: terminate utility computation at a fixed time T

U ([s0 , . . . , s∞ ]) = R(s0 ) + · · · + R(sT )

; nonstationary policy: π(s) depends on time left.


2. If there are absorbing states: for any policy π agent eventually “dies” with probability
1 ; expected utility of every state is finite.

3. Discounting: assuming γ < 1 and R(s) ≤ R_max,

   U([s_0, s_1, . . .]) = Σ_{t=0}^∞ γ^t R(s_t) ≤ Σ_{t=0}^∞ γ^t R_max = R_max/(1 − γ)

Smaller γ ; shorter horizon.

We will only consider discounted rewards in this course

Michael Kohlhase: Artificial Intelligence 2 854 2025-02-06

Why discounted rewards?


Discounted rewards are both convenient and (often) realistic:
 stationary preferences imply (additive rewards or) discounted rewards anyway,
 discounted rewards lead to finite utilities for (potentially) infinite sequences of states
(we can compute expected utilities for the entire future),
 discounted rewards lead to stationary policies, which are easier to compute and
often more adequate (unless we know that remaining time matters),
 discounted rewards mean we value short-term gains over long-term gains (all else
being equal), which is often realistic (e.g. the same amount of money gained early
gives more opportunity to spend/invest ⇒ potentially more utility in the long run)
 we can interpret the discount factor as a measure of uncertainty about future
rewards ⇒ more robust measure in uncertain environments.

Michael Kohlhase: Artificial Intelligence 2 855 2025-02-06

Utility of States
Remember: Given a sequence of states S = s0 , s1 , s2 , . . ., and a discount factor
0 ≤ γ < 1, the utility of the sequence is

   u(S) = Σ_{t=0}^∞ γ^t R(s_t)

Definition 25.2.3. Given a policy π and a starting state s0 , let Ssπ0 be the random
variable giving the sequence of states resulting from executing π at every state starting
at s0 . (Since the environment is stochastic, we don’t know the exact sequence.)
Then the expected utility obtained by executing π starting in s0 is given by

U π (s0 ):=EU(Ssπ0 ).

We define the optimal policy π∗_{s_0} := argmax_π U^π(s_0).

Note: This is perfectly well-defined, but almost always computationally infeasible.


(requires considering all possible (potentially infinite) sequences of states)

Michael Kohlhase: Artificial Intelligence 2 856 2025-02-06



Utility of States (continued)


Observation 25.2.4. π ∗s0 is independent of the state s0 .
Proof sketch: If π ∗a and π ∗b reach point c, then there is no reason to disagree from that
point on – or with π ∗c , and we expect optimal policies to “meet at some state” sooner
or later.
?? does not hold for finite horizon policies!
Definition 25.2.5. We call π ∗ := π ∗s for some s the optimal policy.

Definition 25.2.6. The utility U(s) of a state s is U^{π∗}(s).
Remark: R(s) =b "immediate reward", whereas U =b "long-term reward".
Given the utilities of the states, choosing the best action is just MEU: maximize the
expected utility of the immediate successor states
   π∗(s) = argmax_{a∈A(s)} ( Σ_{s′} P(s′|s, a) · U(s′) )

⇒ given the “true” utilities, we can compute the optimal policy and vice versa.

Michael Kohlhase: Artificial Intelligence 2 857 2025-02-06

Utility of States (continued)


 Example 25.2.7 (Running Example Continued).

(figures omitted: the expected utilities of the states and the resulting optimal policy in the 4x3 world)

 Question: Why do we go left in (3, 1) and not up? (follow the utility)

Michael Kohlhase: Artificial Intelligence 2 858 2025-02-06

25.3 Value/Policy Iteration


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30359.

Dynamic programming: the Bellman equation



 Problem: We have defined U(s) via the optimal policy: U(s) := U^{π∗}(s), but how
to compute it without knowing π∗?

 Observation: A simple relationship among utilities of neighboring states:

expected sum of rewards = current reward + γ · exp. reward sum after best action

 Theorem 25.3.1 (Bellman equation (1957)).


   U(s) = R(s) + γ · max_{a∈A(s)} Σ_{s′} U(s′) · P(s′|s, a)

We call this equation the Bellman equation


 Example 25.3.2. U (1, 1) = −0.04
+ γ max{0.8U (1, 2) + 0.1U (2, 1) + 0.1U (1, 1), up
0.9U (1, 1) + 0.1U (1, 2) left
0.9U (1, 1) + 0.1U (2, 1) down
0.8U (2, 1) + 0.1U (1, 2) + 0.1U (1, 1)} right
 Problem: One equation per state ; n nonlinear equations (max is not linear) in n unknowns.
; cannot use linear algebra techniques for solving them.

Michael Kohlhase: Artificial Intelligence 2 859 2025-02-06

Value Iteration Algorithm


 Idea: We use a simple iteration scheme to find a fixpoint:
1. start with arbitrary utility values,
2. update to make them locally consistent with the Bellman equation,
3. everywhere locally consistent ; global optimality.

 Definition 25.3.3. The value iteration algorithm for computing the utility function is given
by
function VALUE−ITERATION (mdp,ϵ) returns a utility fn.
inputs: mdp, an MDP with states S, actions A(s), transition model P (s′ |s, a),
rewards R(s), and discount γ
ϵ, the maximum error allowed in the utility of any state
local variables: U , U ′ , vectors of utilities for states in S, initially zero
δ, the maximum change in the utility of any state in an iteration
repeat
U := U ′ ; δ := 0
for each state s in S do
U′[s] := R(s) + γ · max_{a∈A(s)} ( Σ_{s′} U[s′] · P(s′|s, a) )
if |U ′ [s] − U [s]| > δ then δ := |U ′ [s] − U [s]|
until δ < ϵ(1 − γ)/γ
return U
 Remark: Retrieve the optimal policy with π[s] := argmax_{a∈A(s)} ( Σ_{s′} U[s′] · P(s′|s, a) )

Michael Kohlhase: Artificial Intelligence 2 860 2025-02-06
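As a complement to the pseudocode, here is a compact Python rendering of value iteration for a
generic MDP. It is a sketch under our own interface assumptions (the transition model is a nested
dictionary P[s][a] mapping successor states to probabilities, rewards are a dictionary R, and
terminal states simply have no actions); it is not the lecture's reference implementation.

  def value_iteration(S, A, P, R, gamma, eps):
      U = {s: 0.0 for s in S}                      # start with arbitrary (here: zero) utilities
      while True:
          U_new, delta = {}, 0.0
          for s in S:
              best = max((sum(p * U[s2] for s2, p in P[s][a].items()) for a in A(s)), default=0.0)
              U_new[s] = R[s] + gamma * best       # Bellman update
              delta = max(delta, abs(U_new[s] - U[s]))
          U = U_new
          if delta < eps * (1 - gamma) / gamma:    # termination as in the pseudocode (0 < gamma < 1)
              return U

  def greedy_policy(S, A, P, U):
      # retrieve the MEU policy from the utilities (see the Remark above)
      return {s: max(A(s), key=lambda a: sum(p * U[s2] for s2, p in P[s][a].items()))
              for s in S if A(s)}

For the 4x3 world one would instantiate P with the 0.8/0.1/0.1 movement model and R with −0.04
for nonterminal and ±1 for terminal states.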

Value Iteration Algorithm (Example)

 Example 25.3.4 (Iteration on the 4x3 world).

(figure omitted: (a) the evolution of the utilities of the states (4,3), (3,3), (1,1), (3,1), and
(4,1) over the number of value iterations; (b) the number of iterations required to guarantee an
error of at most ε = c · R_max, for c = 0.0001, 0.001, 0.01, 0.1, as a function of the discount
factor γ)

Michael Kohlhase: Artificial Intelligence 2 861 2025-02-06

Convergence

 Definition 25.3.5. The maximum norm is defined as ∥U∥ = max_s |U(s)|, so
∥U − V∥ = maximum difference between U and V.
 Let U^t and U^{t+1} be successive approximations to the true utility U during value iteration.
 Theorem 25.3.6. For any two approximations U^t and V^t

   ∥U^{t+1} − V^{t+1}∥ ≤ γ · ∥U^t − V^t∥

I.e., any distinct approximations get closer to each other over time.
In particular, any approximation gets closer to the true U over time
⇒ value iteration converges to a unique, stable, optimal solution.
 Theorem 25.3.7. If ∥U^{t+1} − U^t∥ < ϵ, then ∥U^{t+1} − U∥ < 2ϵγ/(1 − γ)
(once the change in U^t becomes small, we are almost done.)
 Remark: The policy resulting from U^t may be optimal long before the utilities converge!

Michael Kohlhase: Artificial Intelligence 2 862 2025-02-06

So we see that iteration with Bellman updates will always converge towards the utility of a state,
even without knowing the optimal policy. That gives us a first way of dealing with sequential
decision problems: we compute utility functions based on states and then use the standard MEU
machinery. We have seen above that optimal policies and state utilities are essentially inter-
changeable: we can compute one from the other. This leads to another approach to computing
state utilities: policy iteration, which we will discuss now.

Policy Iteration
 Recap: Value iteration computes utilities ; optimal policy by MEU.

 This even works if the utility estimate is inaccurate. (⇝ policy loss small)
 Idea: Search for optimal policy and utility values simultaneously [How60]: Iterate
 policy evaluation: given policy πi , calculate Ui = U πi , the utility of each state
were πi to be executed.
 policy improvement: calculate a new MEU policy πi+1 using one-step lookahead
Terminate if policy improvement yields no change in computed utilities.
 Observation 25.3.8. Upon termination Ui is a fixpoint of Bellman update
; Solution to Bellman equation ; πi is an optimal policy.

 Observation 25.3.9. Policy improvement improves policy and policy space is finite
; termination.

Michael Kohlhase: Artificial Intelligence 2 863 2025-02-06

Policy Iteration Algorithm


 Definition 25.3.10. The policy iteration algorithm is given by the following pseu-
docode:
function POLICY−ITERATION(mdp) returns a policy
inputs: mdp, an MDP with states S, actions A(s), transition model P(s′|s, a)
local variables: U, a vector of utilities for states in S, initially zero
                 π, a policy indexed by state, initially random
repeat
  U := POLICY−EVALUATION(π,U,mdp)
  unchanged? := true
  foreach state s in S do
    if max_{a∈A(s)} ( Σ_{s′} P(s′|s, a) · U(s′) ) > Σ_{s′} P(s′|s, π[s]) · U(s′) then do
      π[s] := argmax_{b∈A(s)} ( Σ_{s′} P(s′|s, b) · U(s′) )
      unchanged? := false
until unchanged?
return π

Michael Kohlhase: Artificial Intelligence 2 864 2025-02-06
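The loop structure of policy iteration is equally short in code. Here is a minimal Python sketch
under the same interface assumptions as before; POLICY−EVALUATION is treated as a given helper
(it is discussed on the next slide):

  def policy_iteration(S, A, P, policy_evaluation, pi0):
      pi = dict(pi0)                                 # initial (e.g. random) policy
      while True:
          U = policy_evaluation(pi)                  # utilities of all states under pi
          unchanged = True
          for s in S:
              if not A(s):                           # terminal states offer no choice
                  continue
              def q(a):                              # expected utility of doing a in s
                  return sum(p * U[s2] for s2, p in P[s][a].items())
              best = max(A(s), key=q)
              if q(best) > q(pi[s]):                 # one-step lookahead improvement
                  pi[s] = best
                  unchanged = False
          if unchanged:
              return pi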

Policy Evaluation
 Problem: How to implement the POLICY−EVALUATION algorithm?
 Solution: To compute utilities given a fixed π: For all s we have
   U(s) = R(s) + γ · ( Σ_{s′} U(s′) · P(s′|s, π(s)) )

(i.e. Bellman equation with the maximum replaced by the current policy π)
 Example 25.3.11 (Simplified Bellman Equations for π).

U i (1, 1) = −0.04 + 0.8U i (1, 2) + 0.1U i (1, 1) + 0.1U i (2, 1)


U i (1, 2) = −0.04 + 0.8U i (1, 3) + 0.1U i (1, 2)
..
.

 Observation 25.3.12. n simultaneous linear equations in n unknowns, solve in
O(n^3) with standard linear algebra methods.

Michael Kohlhase: Artificial Intelligence 2 865 2025-02-06
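Because the equations are linear once π is fixed, POLICY−EVALUATION can be implemented as a
single linear solve. A small numpy sketch (the index-based interface is our own assumption): for
γ < 1 the matrix I − γT_π is invertible, and the solve costs the O(n^3) mentioned above.

  import numpy as np

  def policy_evaluation(pi, P, R, gamma):
      # pi[s]: action chosen in state s;  P[s][a][s2] = P(s2 | s, a);  R[s]: reward of state s
      n = len(R)
      T_pi = np.array([[P[s][pi[s]][s2] for s2 in range(n)] for s in range(n)])
      # U = R + gamma * T_pi U   is equivalent to   (I - gamma * T_pi) U = R
      return np.linalg.solve(np.eye(n) - gamma * T_pi, np.array(R, dtype=float))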

Modified Policy Iteration


 Value iteration requires many iterations, but each one is cheap.
 Policy iteration often converges in few iterations, but each is expensive.
 Idea: Use a few steps of value iteration (but with π fixed), starting from the value
function produced the last time to produce an approximate value determination step.

 Often converges much faster than pure VI or PI.


 Leads to much more general algorithms where Bellman value updates and Howard
policy updates can be performed locally in any order.
 Remark: Reinforcement learning algorithms operate by performing such updates
based on the observed transitions made in an initially unknown environment.

Michael Kohlhase: Artificial Intelligence 2 866 2025-02-06

25.4 Partially Observable MDPs


We will now lift the last restriction we made in the decision problems for our agents: in the
definition of Markov decision processes we assumed that the environment was fully observable. As
we have seen in ??, this entails that the optimal policy only depends on the current state.

Partial Observability
 Definition 25.4.1. A partially observable MDP (a POMDP for short) is an MDP
together with an observation model O that has the sensor Markov property and is
stationary: O(s, e) = P (e|s).
 Example 25.4.2 (Noisy 4x3 World).

Add a partial and/or noisy sensor.


e.g. count number of adjacent walls (1 ≤ w ≤ 2)
with 0.1 error (noise)
If sensor reports 1, we are in (3, ?) (probably)

 Problem: Agent does not know which state it is in ; makes no sense to talk
about policy π(s)!
 Theorem 25.4.3 (Astrom 1965). The optimal policy in a POMDP is a function
π(b) where b is the belief state (probability distribution over states).

 Idea: Convert a POMDP into an MDP in belief state space, where T (b, a, b′ ) is
the probability that the new belief state is b′ given that the current belief state is b
and the agent does a. I.e., essentially a filtering update step.

Michael Kohlhase: Artificial Intelligence 2 867 2025-02-06

POMDP: Filtering at the Belief State Level


 Recap: Filtering updates the belief state for new evidence.

 For POMDPs, we also need to consider actions. (but the effect is the same)
 If b is the previous belief state and agent does action A = a and then perceives
E = e, then the new belief state is
   b′ = α · ( P(E = e|s′) · ( Σ_s P(s′|S = s, A = a) · b(s) ) )

We write b′ = FORWARD(b, a, e) in analogy to recursive state estimation.


 Fundamental Insight for POMDPs: The optimal action only depends on the
agent’s current belief state. (good, it does not know the state!)

 Consequence: The optimal policy can be written as a function π ∗ (b) from belief
states to actions.
 Definition 25.4.4. The POMDP decision cycle is to iterate over
1. Given the current belief state b, execute the action a = π ∗ (b)
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
 Intuition: POMDP decision cycle is search in belief state space.

Michael Kohlhase: Artificial Intelligence 2 868 2025-02-06
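The function FORWARD(b, a, e) is just this update written out. A small Python sketch for discrete
state spaces (function and argument names are our own):

  def forward(b, a, e, P_trans, P_sense):
      # b: dict state -> probability;  P_trans(s2, s, a) = P(s2 | s, a);  P_sense(e, s2) = P(e | s2)
      b_new = {}
      for s2 in b:
          pred = sum(P_trans(s2, s, a) * b[s] for s in b)   # prediction step over the old belief
          b_new[s2] = P_sense(e, s2) * pred                 # weight by the evidence
      alpha = sum(b_new.values())                           # normalization constant
      return {s2: v / alpha for s2, v in b_new.items()}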

Partial Observability contd.


 Recap: The POMDP decision cycle is search in belief state space.
 Observation 25.4.5. Actions change the belief state, not just the (physical) state.
 Thus POMDP solutions automatically include information gathering behavior.
 Problem: The belief state is continuous: If there are n states, b is an n-dimensional
real-valued vector.

 Example 25.4.6. The belief state of the 4x3 world is an 11-dimensional continuous
space. (11 states)
 Theorem 25.4.7. Solving POMDPs is very hard! (actually, PSPACE hard)

 In particular, none of the algorithms we have learned applies. (discreteness


assumption)
 The real world is a POMDP (with initially unknown transition model T and sensor
model O)

Michael Kohlhase: Artificial Intelligence 2 869 2025-02-06

Reducing POMDPs to Belief-State MDPs


 Idea: Calculating the probability that an agent in belief state b reaches belief state
b′ after executing action a.
 if we knew the action and the subsequent percept e, then b′ = FORWARD(b, a, e).
(deterministic update to the belief state)
 but we don’t, since b′ depends on e. (let’s calculate P (e|a, b))
 Idea: To compute P (e|a, b) — the probability that e is perceived after executing
a in belief state b — sum up over all actual states the agent might reach:
   P(e|a, b) = Σ_{s′} P(e|a, s′, b) · P(s′|a, b)
             = Σ_{s′} P(e|s′) · P(s′|a, b)
             = Σ_{s′} P(e|s′) · ( Σ_s P(s′|s, a) · b(s) )

Write the probability of reaching b′ from b, given action a, as P (b′ |b, a), then
   P(b′|b, a) = P(b′|a, b) = Σ_e P(b′|e, a, b) · P(e|a, b)
              = Σ_e P(b′|e, a, b) · ( Σ_{s′} P(e|s′) · ( Σ_s P(s′|s, a) · b(s) ) )

where P (b′ |e, a, b) is 1 if b′ = FORWARD(b, a, e) and 0 otherwise.

 Observation: This equation defines a transition model for belief state space!
 Idea: We can also define a reward function for belief states:
   ρ(b) := Σ_s b(s) · R(s)

i.e., the expected reward for the actual states the agent might be in.

 Together, P (b′ |b, a) and ρ(b) define an (observable) MDP on the space of belief
states.

 Theorem 25.4.8. An optimal policy π ∗ (b) for this MDP, is also an optimal policy
for the original POMDP.
 Upshot: Solving a POMDP on a physical state space can be reduced to solving
an MDP on the corresponding belief state space.

 Remember: The belief state is always observable to the agent, by definition.

Michael Kohlhase: Artificial Intelligence 2 871 2025-02-06

Ideas towards Value-Iteration on POMDPs


 Recap: The value iteration algorithm from ?? computes one utility value per state.
 Problem: We have infinitely many belief states ; be more creative!
 Observation: Consider an optimal policy π ∗
 applied in a specific belief state b: π ∗ generates an action,
 for each subsequent percept, the belief state is updated and a new action is
generated . . .
For this specific b: π∗ =b a conditional plan!
 Idea: Think about conditional plans and how the expected utility of executing a
fixed conditional plan varies with the initial belief state.(instead of optimal policies)

Definition 25.4.9. Given a set of percepts E and a set of actions A, a conditional


plan is either an action a ∈ A, or a tuple ⟨a, E ′ , p1 , p2 ⟩ such that a ∈ A, E ′ ⊆ E, and
p1 , p2 are conditional plans.
It represents the strategy “First execute a, If we subsequently perceive e ∈ E ′ ,
continue with p1 , otherwise continue with p2 .”
The depth of a conditional plan p is the maximum number of actions in any path
from p before reaching a single action plan.

Michael Kohlhase: Artificial Intelligence 2 872 2025-02-06

Expected Utilities of Conditional Plans on Belief States


 Observation 1: Let p be a conditional plan and αp (s) the utility of executing p
in state s.
 the expected utility of p in belief state b is Σ_s b(s) · α_p(s) = b·α_p as vectors.
 the expected utility of a fixed conditional plan varies linearly with b
 ; the “best conditional plan to execute” corresponds to a hyperplane in belief
state space.
 Observation 2: We can replace the original actions by conditional plans on those
actions!
Let π ∗ be the subsequent optimal policy. At any given belief state b,

 π ∗ will choose to execute the conditional plan with highest expected utility

 the expected utility of b under the π ∗ is the utility of that plan:



   U(b) = U^{π∗}(b) = max_p (b·α_p)

 If the optimal policy π ∗ chooses to execute p starting at b, then it is reasonable


to expect that it might choose to execute p in belief states that are very close
to b;
 if we bound the depth of the conditional plans, then there are only finitely many
such plans
 the continuous space of belief states will generally be divided into regions, each
corresponding to a particular conditional plan that is optimal in that region.

 Observation 3 (combined): The utility function U(b) on belief states, being the
maximum of a collection of hyperplanes, is piecewise linear and convex.

Michael Kohlhase: Artificial Intelligence 2 873 2025-02-06

A simple Illustrating Example


 Example 25.4.10. A world with states S0 and S1 , where R(S0 ) = 0 and R(S1 ) = 1
and two actions:
 “Stay” stays put with probability 0.9
 “Go” switches to the other state with probability 0.9.
 The sensor reports the correct state with probability 0.6.
Obviously, the agent should “Stay” when it thinks it’s in state S1 and “Go” when it
thinks it’s in state S0 .
 The belief state has dimension 1. (the two probabilities sum up to 1)

 Consider the one-step plans [Stay] and [Go] and their direct utilities:

α_[Stay](S0) = 0.9R(S0) + 0.1R(S1) = 0.1
α_[Stay](S1) = 0.9R(S1) + 0.1R(S0) = 0.9
α_[Go](S0) = 0.9R(S1) + 0.1R(S0) = 0.9
α_[Go](S1) = 0.9R(S0) + 0.1R(S1) = 0.1
 Let us visualize the hyperplanes b·α([Stay]) and b·α([Go]) .

(figure omitted: the utilities b·α_[Stay] and b·α_[Go] plotted over the belief b(1), i.e. the
probability of state S1)

 The maximum represents the utility function for the finite-horizon problem that
allows just one action
 in each “piece” the optimal action is the first action of the corresponding plan.
 Here the optimal one-step policy is to “Stay” when b(1) > 0.5 and “Go” other-
wise.

 compute the utilities for conditional plans of depth 2 by considering

 each possible first action,


 each possible subsequent percept, and then
 each way of choosing a depth-1 plan to execute for each percept:
There are eight of depth 2:
[Stay, if P = 0 then Stay else Stay fi], [Stay, if P = 0 then Stay else Go fi], . . .

(figure omitted: the utilities of the eight distinct two-step plans as functions of the initial
belief state b(1), the utilities of the four undominated two-step plans, and the utility function
for optimal eight-step plans)

 Four of them (dashed lines) are suboptimal for the whole belief space.
We call them dominated. (they can be ignored)
 There are four undominated plans, each optimal in their region; the regions partition the
belief-state space.
 Idea: Repeat for depth 3 and so on.

 Theorem 25.4.11 (POMDP Plan Utility). Let p be a depth-d conditional plan


whose initial action is a and whose depth-d − 1-subplan for percept e is p.e, then
   α_p(s) = R(s) + γ · ( Σ_{s′} P(s′|s, a) · ( Σ_e P(e|s′) · α_{p.e}(s′) ) )

 This recursion naturally gives us a value iteration algorithm,

Michael Kohlhase: Artificial Intelligence 2 877 2025-02-06

A Value Iteration Algorithm for POMDPs


Definition 25.4.12. The POMDP value iteration algorithm for POMDPs is given by
recursively updating
   α_p(s) = R(s) + γ · ( Σ_{s′} P(s′|s, a) · ( Σ_e P(e|s′) · α_{p.e}(s′) ) )

Observations: The complexity depends primarily on the generated plans:


 Given |A| actions and |E| possible observations, there are |A|^(|E|^(d−1)) distinct
depth-d plans.
 Even for the example with d = 8, we have 2^255 plans (144 undominated)

 The elimination of dominated plans is essential for reducing this doubly exponential
growth (but they are already constructed)
Hopelessly inefficient in practice – even the 3x4 POMDP is too hard!

Michael Kohlhase: Artificial Intelligence 2 878 2025-02-06
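Pruning dominated plans is itself easy to sketch: a plan is undominated iff its α-vector achieves
the maximum b·α_p for some belief. The following approximate Python check (our own helper; it
samples beliefs on a grid, so it can in principle miss a plan that is optimal only on a very small
region) illustrates the idea on the one-step plans of Example 25.4.10, together with a made-up
third plan that is dominated everywhere:

  def undominated(alphas, grid=101):
      # alphas: list of (alpha(S0), alpha(S1)) pairs; returns indices optimal for some b(1)
      best = set()
      for k in range(grid):
          b1 = k / (grid - 1)                               # belief in state S1
          values = [(1 - b1) * a0 + b1 * a1 for a0, a1 in alphas]
          best.add(max(range(len(alphas)), key=lambda i: values[i]))
      return sorted(best)

  # alpha_[Stay] = (0.1, 0.9), alpha_[Go] = (0.9, 0.1), plus a hypothetical dominated plan
  print(undominated([(0.1, 0.9), (0.9, 0.1), (0.2, 0.2)]))  # -> [0, 1]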

25.5 Online Agents with POMDPs


In the last section we have seen that even though we can in principle compute utilities of states –
and thus use the MEU principle – to make decisions in sequential decision problems, all methods
based on the “lifting idea” are hopelessly inefficient.
This section describes a different, approximate method for solving POMDPs, one based on
look-ahead search. A Video Nugget covering this section can be found at https://fau.tv/
clip/id/30361.

DDN: Decision Networks for POMDPs


 Idea: Let’s try to use the computationally efficient representations (dynamic
Bayesian networks and decision networks) for POMDPs.
 Definition 25.5.1. A dynamic decision network (DDN) is a graph-based represen-
tation of a POMDP, where

 Transition and sensor model are represented as a DBN.


 Action nodes and utility nodes are added as in decision networks.
 In a DDN, a filtering algorithm is used to incorporate each new percept and action
and to update the belief state representation.

 Decisions are made in DDN by projecting forward possible action sequences and
choosing the best one.
 DDNs – like the DBNs they are based on – are factored representations
; typically exponential complexity advantages!

Michael Kohlhase: Artificial Intelligence 2 879 2025-02-06

Structure of DDNs for POMDPs


 DDN for POMDPs: The generic structure of a dynamic decision network at time t is

(figure omitted: a DDN with action nodes A_{t−2}, . . ., A_{t+2}, state variables X_{t−1}, . . ., X_{t+3},
reward nodes R_{t−1}, . . ., R_{t+2}, a utility node U_{t+3}, and evidence variables E_{t−1}, . . ., E_{t+3};
variables with known values are shaded)

 POMDP state S_t becomes a set of random variables X_t
 there may be multiple evidence variables E_t
 Action at time t denoted by A_t; the agent must choose a value for A_t.
 Transition model: P(X_{t+1}|X_t, A_t); sensor model: P(E_t|X_t).
 Reward functions R_t and utility U_t of state S_t.
 Variables with known values are gray; rewards are given for t = 0, . . ., t + 2, but the utility
for t + 3 (=b discounted sum of the rest)
 Problem: How do we compute with that?
 Answer: All POMDP algorithms can be adapted to DDNs! (only need CPTs)

Michael Kohlhase: Artificial Intelligence 2 880 2025-02-06
Michael Kohlhase: Artificial Intelligence 2 880 2025-02-06

DBNs are factored representations in the terminology of Chapter 2; they typically have
an exponential complexity advantage over atomic representations and can model quite sub-
Lookahead: Searching over the Possible Action Sequences
stantial real-world problems. The agent design is therefore a practical implementation of the
utility-based agent sketched in Chapter 2.
In the DBN, the single state St becomes a set of state variables Xt , and there may be
 Idea:multiple
Search over variables
evidence the treeEtof possible
. We will use Aaction
t to refersequences
to the action at time t,(like
so thein game-play)
transition
model becomes P(Xt+1 |Xt , At ) and the sensor model becomes P(Et |Xt ). We will use Rt to
 Part of the
refer lookahead
to the solution
reward received oft the
at time and UDDN above
t to refer (three
to the utility of the state atsteps
time t.lookahead)
(Both
of these are random variables.) With this notation, a dynamic decision network looks like the
one shown in Figure 17.10.
Dynamic decision networks can be used as inputs for any POMDP algorithm, including
those for value and policy iteration methods. In this section, we focus on look-ahead methods
that project action sequences forward from the current belief state in much the same way as do
the game-playing algorithms of Chapter 5. The network in Figure 17.10 has been projected
three steps into the future; the current and future decisions A and the future observations
Section 17.4.
574 Partially Observable MDPs 665
CHAPTER 25. MAKING COMPLEX DECISIONS

At in P(Xt | E1:t)

Et+1 ...
... ... ... ...

At+1 in P(Xt+1 | E1:t+1) ...


... ... ...

Et+2 ...
... ... ...

At+2 in P(Xt+2 | E1:t+2) ...


... ... ...

Et+3 ...
... ... ...

U(Xt+3) ...
10 4 6 3

Figure 17.11 Part of the look-ahead solution of the DDN in Figure 17.10. Each decision
 circle =b chance nodes (the environment decides)
will be taken in the belief state indicated.
 triangle =
b belief state (each action decision is taken there)

E and rewards R are all unknown. Notice that the network includes nodes for the rewards
for Xt+1 and MichaelXt+2Kohlhase:
, but Artificial
the utility Intelligence 2
for Xt+3 . This is881because the agent 2025-02-06
must maximize the
(discounted) sum of all future rewards, and U (Xt+3 ) represents the reward for Xt+3 and all
subsequent rewards. As in Chapter 5, we assume that U is available only in some approximate
Designing Online
form: if exact utility values
Agents for POMDPs
were available, look-ahead beyond depth 1 would be unnecessary.
Section 17.4. Partially Observable MDPs 665
Figure 17.11 shows part of the search tree corresponding to the three-step look-ahead
DDN in Figure 17.10. Each of the triangular nodes is a belief state in which the agent makes
A in P(X | E )t t 1:t

a decision At+i for i = 0, 1, 2, . . .. The round (chance) nodes correspond to choices by the
E t+1 ...
... ... ... ...

environment, namely, what evidence Et+i arrives. Notice that there are no chance nodes
A in P(X | E )
t+1 t+1 1:t+1 ...
... ... ...

corresponding to the action outcomes; this is because the belief-state update for an action is
E t+2 ...
... ... ...

deterministic regardless of the actual outcome.


A in P(X | E )
t+2 t+2 1:t+2 ...
... ... ...

E
The belief state at each triangular node can be computed by applying a filtering al-
t+3 ...
... ... ...

gorithm to the sequence of percepts and actions leading to it. In this way, the algorithm
U(X ) t+3 ...
10 4 6 3

takes into account the fact that, for decision At+i , the agent will have available percepts
Figure 17.11 Part of the look-ahead solution of the DDN in Figure 17.10. Each decision
will be taken in the belief state indicated.

Et+1Belief
, . . . , state
Et+i , at even triangle
thoughcomputed at time t it bydoes filtering not know with what actions/percepts
those percepts leading to In
will be. it this
E and rewards R are all unknown. Notice that the network includes nodes for the rewards
way, a decision-theoretic for X agentand X automatically
, but the utility for X . This takes
is becauseintothe agentaccount
must maximizethe the value of information and
 for decision At+i willsumuse percepts (even
reward forifX values at time t unknown)
t+1 t+2 t+3
(discounted) of all future rewards, and U E(Xt+1:t+i ) represents the and all
will execute information-gathering actions where appropriate.
t+3 t+3
subsequent rewards. As in Chapter 5, we assume that U is available only in some approximate
form: if exact utility values were available, look-ahead beyond depth 1 would be unnecessary.
Athus a POMDP
decision can be agent
extracted
Figure 17.11 automatically
showsfrompart of thethe treetakes
search search corresponding into
tree account
to theby backing
three-step theupvalue
look-ahead of information
the utility values from
DDN in Figure 17.10. Each of the triangular nodes is a belief state in which the agent makes
and executes
the leaves, taking an average information
a decision A at
for i
t+i=gathering
the
0, 1, 2,chance
. . .. The round actions
nodes
(chance) nodesandwhere taking
correspond to appropriate.
choices the
by the maximum at the decision
environment, namely, what evidence E arrives. Notice that there are no chance nodes
t+i

nodes. This is similar


 Observation: Time The to the E
complexity
deterministic
XPECTIMINIMAX algorithm
corresponding to the action outcomes; this is because the belief-state update for an action is
regardless of the for exhaustive search up to depth d is O(|A| ·|E|
actual outcome.
for game trees with chance
d nodes,
d
)
except that
(|A| = (1) there
b number of actions, can also
gorithm to the
be |E|
sequence
rewards
of=bpercepts and
at non-leaf
actions of
leading percepts)
to it.
states
belief state at each triangular node can be computed by applying a filtering al-
number In this way,
and
the
(2)
algorithm
the decision nodes corre-
spond to belief states rather E , . . . , Ethan actual at time states. The what time
takes into account the fact that, for decision A , the agent will have available percepts
t+1 , even though
t+i t it does not know complexity
those percepts
t+i
will be. In this of an exhaustive search
todepth is d way, addecision-theoretic agent automatically takes into account the value of information and
than POMDP value iteration with O(|A| and ).
where is the number of available actions
d O(|A| · |E| ), |A| |E|d−1 |E| is the num-
Upshot: Much better will execute information-gathering actions where appropriate.
A decision can be extracted from the search tree by backing up the utility values from
ber of possible percepts. (Notice
the leaves, thatat thethis
taking an average chanceis farandless
nodes taking the than
maximum the at thenumber
decision of depth-d conditional
 Empirically: For except problems inalsowhich
nodes. This is similar to the E
that (1) there can be rewards atthe non-leafdiscount factor
algorithm for game trees with chance nodes,
XPECTIMINIMAX
states and (2) the decision nodes γ is not too close to 1, a
corre-

shallow search is often to depth dgood


is O(|A| · |E| enough
), where |A| isto d giveof available
the number near-optimalactions and |E| is thedecisions.
spond to belief states rather than actual states. The time complexity of an exhaustive search
d num-
ber of possible percepts. (Notice that this is far less than the number of depth-d conditional

Michael Kohlhase: Artificial Intelligence 2 882 2025-02-06

Summary

 Decision theoretic agents for sequential environments


 Building on temporal, probabilistic models/inference (dynamic Bayesian networks)
 MDPs for fully observable case.

 Value/Policy Iteration for MDPs ; optimal policies.


 POMDPs for partially observable case.
 POMDPs =b MDP on belief state space.
 The world is a POMDP with (initially) unknown transition and sensor models.

Michael Kohlhase: Artificial Intelligence 2 883 2025-02-06


Part VI

Machine Learning


This part introduces the foundations of machine learning methods in AI. We discuss the problem
of learning from observations in general, study inference-based techniques, and then go into
elementary statistical methods for learning.
The current hype topics of deep learning, reinforcement learning, and large language models
are only very superficially covered, leaving them to specialized courses.
Chapter 26

Learning from Observations

In this chapter we introduce the concepts, methods, and limitations of inductive learning, i.e.
learning from a set of given examples.

Outline
 Learning agents
 Inductive learning

 Decision tree learning


 Measuring learning performance
 Computational Learning Theory
 Linear regression and classification

 Neural Networks
 Support Vector Machines

Michael Kohlhase: Artificial Intelligence 2 884 2025-02-06

26.1 Forms of Learning

Learning (why is this a good idea)


 Learning is essential for unknown environments:
 i.e., when designer lacks omniscience.
 The world is a POMDP with (initially) unknown transition and sensor models.
 Learning is useful as a system construction method.

 i.e., expose the agent to reality rather than trying to write it down
 Learning modifies the agent’s decision mechanisms to improve performance.

Michael Kohlhase: Artificial Intelligence 2 885 2025-02-06


Recap: Learning Agents

Michael Kohlhase: Artificial Intelligence 2 886 2025-02-06

Recap: Learning Agents (continued)

 Definition 26.1.1. Performance element is what we called “agent” up to now.


 Definition 26.1.2. Critic/learning element/problem generator do the “improving”.
 Definition 26.1.3. Performance standard is fixed; (outside the environment)

 We can’t adjust performance standard to flatter own behaviour!


 No standard in the environment: e.g. ordinary chess and suicide chess look
identical.
 Essentially, certain kinds of percepts are “hardwired” as good/bad (e.g., pain, hunger)

 Definition 26.1.4. Learning element may use knowledge already acquired in the
performance element.
 Definition 26.1.5. Learning may require experimentation, i.e. actions an agent might
not normally consider, such as dropping rocks from the Tower of Pisa.

Michael Kohlhase: Artificial Intelligence 2 887 2025-02-06

Ways of Learning
 Supervised learning: There’s an unknown function f : A → B called the target
function. We do know a set of pairs T := {⟨ai , f (ai )⟩} of examples. The goal is to
find a hypothesis h ∈ H ⊆ A → B based on T , that is “approximately” equal to f .
(Most of the techniques we will consider)
 Unsupervised learning: Given a set of data A, find a pattern in the data; i.e. a
function f : A → B for some predetermined B. (Primarily
clustering /dimensionality reduction)
 Reinforcement learning: The agent receives a reward for each action performed. The
goal is to iteratively adapt the action function to maximize the total reward.
(Useful in e.g. game play)

Michael Kohlhase: Artificial Intelligence 2 888 2025-02-06

26.2 Supervised Learning

Supervised learning a.k.a. inductive learning (a.k.a. Science)


Definition 26.2.1. A supervised (or inductive) learning problem consists of the follow-
ing data:
 A set of hypotheses H consisting of functions A → B,

 a set of examples T ⊆ A × B called the training set, such that for every a ∈ A,
there is at most one b ∈ B with ⟨a, b⟩ ∈ T , (⇒ T is a function on some subset of
A)
We assume there is an unknown function f : A → B called the target function with
T ⊆ f.
Definition 26.2.2. Inductive learning algorithms solve inductive learning problems by
finding a hypothesis h ∈ H such that h ∼ f (for some notion of similarity).
Definition 26.2.3. We call a supervised learning problem with target function A → B
a classification problem if B is finite, and call the members of B classes.
We call it a regression problem if B = R.

Michael Kohlhase: Artificial Intelligence 2 889 2025-02-06

Inductive Learning Method


 Idea: Construct/adjust hypothesis h ∈ H to agree with a training set T .
 Definition 26.2.4. We call h consistent with f (on a set T ⊆ dom(f )), if it
agrees with f (on all examples in T ).
 Example 26.2.5 (Curve Fitting).
[Five plots: the training set, and hypotheses fitted to it:
 Linear hypothesis: partially, approximatively consistent
 Quadratic hypothesis: partially consistent
 Degree-4 hypothesis: consistent
 High-degree hypothesis: consistent]

 Ockham’s razor: maximize a combination of consistency and simplicity.

Michael Kohlhase: Artificial Intelligence 2 890 2025-02-06

Choosing the Hypothesis Space


 Observation: Whether we can find a consistent hypothesis for a given training
set depends on the chosen hypothesis space.
 Definition 26.2.6. We say that a supervised learning problem is realizable, iff
there is a hypothesis h ∈ H consistent with the training set T .

 Problem: We do not always know whether a given learning problem is realizable,


unless we have prior knowledge. (depending on the hypothesis space)
 Solution: Make H large, e.g. the class of all Turing machines.
 Tradeoff: The computational complexity of the supervised learning problem is
tied to the size of the hypothesis space. E.g. consistency is not even decidable for
general Turing machines.
 Much of the research in machine learning has concentrated on simple hypothesis
spaces.
 Preview: We will concentrate on propositional logic and related languages first.

Michael Kohlhase: Artificial Intelligence 2 891 2025-02-06

Independent and Identically Distributed


 Problem: We want to learn a hypothesis that fits the future data best.

 Intuition: This only works, if the training set is “representative” for the underlying
process.
 Idea: We think of examples (seen and unseen) as a sequence, and express the
“representativeness” as a stationarity assumption for the probability distribution.

 Method: Each example, before we see it, is a random variable E j ; the observed
value ej = (xj ,yj ) is a sample from its distribution.
 Definition 26.2.7. A sequence of E 1 , . . ., E n of random variables is independent
and identically distributed (short IID), iff they are
 independent, i.e. P(E j |E (j−1) , E (j−2) , . . .) = P(E j ) and
 identically distributed, i.e. P(E i ) = P(E j ) for all i and j.
 Example 26.2.8. A sequence of die tosses is IID. (fair or loaded does not matter)
 Stationarity Assumption: We assume that the set E of examples is IID in the
future.

Michael Kohlhase: Artificial Intelligence 2 892 2025-02-06



26.3 Learning Decision Trees

Attribute-based Representations
 Definition 26.3.1. In attribute-based representations, examples are described by
 attributes: (simple) functions on input samples, (think pre classifiers on
examples)
 their value, and (classify by attributes)
 classifications. (Boolean, discrete, continuous, etc.)

 Example 26.3.2 (In a Restaurant). Situations where I will/won’t wait for a table:

          Attributes                                                           Target
 Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
 X1       T    F    F    T    Some   $$$    F     T    French   0–10   T
 X2       T    F    F    T    Full   $      F     F    Thai     30–60  F
 X3       F    T    F    F    Some   $      F     F    Burger   0–10   T
 X4       T    F    T    T    Full   $      F     F    Thai     10–30  T
 X5       T    F    T    F    Full   $$$    F     T    French   >60    F
 X6       F    T    F    T    Some   $$     T     T    Italian  0–10   T
 X7       F    T    F    F    None   $      T     F    Burger   0–10   F
 X8       F    F    F    T    Some   $$     T     T    Thai     0–10   T
 X9       F    T    T    F    Full   $      T     F    Burger   >60    F
 X10      T    T    T    T    Full   $$$    F     T    Italian  10–30  F
 X11      F    F    F    F    None   $      F     F    Thai     0–10   F
 X12      T    T    T    T    Full   $      F     F    Burger   30–60  T

 Definition 26.3.3. For a boolean classification we say that an example is positive


(T) or negative (F) depending on its class.

Michael Kohlhase: Artificial Intelligence 2 893 2025-02-06

Decision Trees
 Decision trees are one possible representation for hypotheses.

 Example 26.3.4 (Restaurant continued). Here is the “true” tree for deciding
whether to wait:

Michael Kohlhase: Artificial Intelligence 2 894 2025-02-06

We evaluate the tree by going down the tree from the top, always taking the branch whose
attribute value matches the situation; we will eventually end up with a Boolean value: the result.
Using the attribute values from X3 in ?? to descend through the tree in ?? we indeed end up with
the result “true”. Note that

1. some of the attributes of the original example X3 are irrelevant.


2. the training set in ?? is realizable – i.e. the target is definable in hypothesis class of decision
trees.

Decision Trees (Definition)


 Definition 26.3.5. A decision tree for a given attribute-based representation is a
tree, where the non-leaf nodes are labeled by attributes, their outgoing edges by
disjoint sets of attribute values, and the leaf nodes are labeled by the classifications.
 Definition 26.3.6. We call an attribute together with a set of attribute values (an
inner node) with outgoing edge label an attribute test.
 the target function is a function A1 × . . . × An → C, where Ai are the domains of
the attributes and C is the set of classifications.

Michael Kohlhase: Artificial Intelligence 2 895 2025-02-06

Expressiveness
 Decision trees can express any function of the input attributes ⇒ H is the set of all functions A1 × . . . × An → C
 Example 26.3.7. For Boolean functions, a path from the root to a leaf corresponds
to a row in a truth table:

⇒ a decision tree corresponds to a truth table (Formula in DNF)

 Trivially, for any training set there is a consistent hypothesis with one path to a
leaf for each example, but it probably won’t generalize to new examples.
 Solution: Prefer to find more compact decision trees.

Michael Kohlhase: Artificial Intelligence 2 896 2025-02-06

Decision Tree learning


 Aim: Find a small decision tree consistent with the training examples.

 Idea: (recursively) choose “most significant” attribute as root of (sub)tree.


 Definition 26.3.8. The following algorithm performs decision tree learning (DTL)
function DTL(examples, attributes, def ault) returns a decision tree
if examples is empty then return def ault
else if all examples have the same classification then return the classification
else if attributes is empty then return MODE(examples)
else
best := Choose−Attribute(attributes, examples)
tree := a new decision tree with root test best
m := MODE(examples)
for each value vi of best do
examplesi := {elements of examples with best = vi }
subtree := DTL(examplesi , attributes \ best, m)
add a branch to tree with label vi and subtree subtree
return tree

MODE(examples) = most frequent classification value in examples.

Michael Kohlhase: Artificial Intelligence 2 897 2025-02-06

Note: We have three base cases:


1. empty examples ⇝ arises for empty branches of a non-Boolean parent attribute.
2. uniform example classifications ⇝ this is a “normal” leaf.

3. attributes empty ⇝ the target is not deterministic in the input attributes.

The recursive step picks an attribute and then subdivides the examples.
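To make the control flow of DTL concrete, here is a minimal Python sketch of the algorithm. It is
not part of the notes; the example representation (a list of (attribute-value dict, classification)
pairs), the values table, and the parameter choose_attribute are assumptions made for illustration.
Any attribute-selection heuristic can be plugged in, e.g. the information gain criterion introduced
in the next section.

from collections import Counter

def mode(examples):
    # most frequent classification value in examples
    return Counter(cls for _, cls in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose_attribute, values):
    """Decision tree learning; examples are (dict, classification) pairs,
    values[a] lists the possible values of attribute a."""
    if not examples:                              # base case 1: empty branch
        return default
    if len({cls for _, cls in examples}) == 1:    # base case 2: uniform classification
        return examples[0][1]
    if not attributes:                            # base case 3: target not deterministic
        return mode(examples)
    best = choose_attribute(attributes, examples)
    m = mode(examples)
    tree = {best: {}}                             # inner node = attribute test
    for v in values[best]:
        exs_v = [(x, c) for x, c in examples if x[best] == v]
        rest = [a for a in attributes if a != best]
        tree[best][v] = dtl(exs_v, rest, m, choose_attribute, values)
    return tree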

Choosing an Attribute
 Idea: A good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”.

 Example 26.3.9.

Attribute “Patrons?” is a better choice: it gives information about the classification.
 Can we make this more formal? ; Use information theory! (up next)

Michael Kohlhase: Artificial Intelligence 2 898 2025-02-06

26.4 Using Information Theory


Video Nuggets covering this section can be found at https://fau.tv/clip/id/20373 and
https://fau.tv/clip/id/30374.

Information Entropy
Intuition: Information answers questions – the less I know initially, the more Informa-
tion is contained in an answer.
Definition 26.4.1. Let ⟨p1 , . . ., pn ⟩ be the distribution of a random variable P . The
information (also called entropy) of P is

    I(⟨p1 , . . ., pn ⟩) := Σ_{i=1}^{n} −pi · log2 (pi )

Note: For pi = 0, we consider pi · log2 (pi ) = 0 (log2 (0) is undefined)

The unit of information is a bit, where 1b := I(⟨1/2, 1/2⟩) = 1.
Example 26.4.2 (Information of a Coin Toss).

 For a fair coin toss we have I(⟨1/2, 1/2⟩) = −1/2 · log2 (1/2) − 1/2 · log2 (1/2) = 1b.
 With a loaded coin (99% heads) we have I(⟨1/100, 99/100⟩) = 0.08b.
⇒ Information goes to 0 as the head probability goes to 1.
“How likely is the outcome actually going to tell me something informative?”

Michael Kohlhase: Artificial Intelligence 2 899 2025-02-06
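As a quick sanity check of the definition, here is a small Python helper (an illustration, not part of
the notes) that computes I(⟨p1 , . . ., pn ⟩); it reproduces the two coin-toss values above.

import math

def information(ps):
    """Entropy I(<p1,...,pn>) in bits; 0*log2(0) is treated as 0."""
    return sum(-p * math.log2(p) for p in ps if p > 0)

print(information([0.5, 0.5]))    # 1.0    (fair coin)
print(information([0.01, 0.99]))  # ~0.08  (loaded coin, 99% heads)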

Information Gain in Decision Trees


Idea: Suppose we have p examples classified as positive and n examples as negative.
We can then estimate the probability distribution of the classification C with
P(C) = ⟨p/(p + n), n/(p + n)⟩, and need I(P(C)) bits to correctly classify a new example.
Example 26.4.3. For 12 restaurant examples and p = n = 6, we need
I(P(WillWait)) = I(⟨6/12, 6/12⟩) = 1b of information. (i.e. exactly the information which of the two classes)
Treating attributes also as random variables, we can compute how much information
is needed after knowing the value for one attribute:
Example 26.4.4. If we know Pat = Full, we only need I(P(WillWait|Pat = Full)) = I(⟨4/6, 2/6⟩) ≊ 0.9 bits of information.
Note: The expected number of bits needed after an attribute test on A is

    Σ_a P (A = a) · I(P(C|A = a))

Definition 26.4.5. The information gain from an attribute test A is

    Gain(A) := I(P(C)) − Σ_a P (A = a) · I(P(C|A = a))

Michael Kohlhase: Artificial Intelligence 2 900 2025-02-06

Information Gain (continued)


 Definition 26.4.6. Assume we know the results of some attribute tests b := B 1 =
  b1 ∧ . . . ∧ B n = bn . Then the conditional information gain from an attribute test A is

      Gain(A|b) := I(P(C|b)) − Σ_a P (A = a|b) · I(P(C|a, b))

 Example 26.4.7. If the classification C is Boolean and we have p positive and n
  negative examples, the information gain is

      Gain(A) = I(⟨p/(p + n), n/(p + n)⟩) − Σ_a ((pa + na )/(p + n)) · I(⟨pa /(pa + na ), na /(pa + na )⟩)

  where pa and na are the positive and negative examples with A = a.
 Example 26.4.8.

      Gain(Patrons?) = 1 − (2/12 · I(⟨0, 1⟩) + 4/12 · I(⟨1, 0⟩) + 6/12 · I(⟨2/6, 4/6⟩)) ≈ 0.541b
      Gain(Type)     = 1 − (2/12 · I(⟨1/2, 1/2⟩) + 2/12 · I(⟨1/2, 1/2⟩) + 4/12 · I(⟨2/4, 2/4⟩) + 4/12 · I(⟨2/4, 2/4⟩)) ≈ 0b

 Idea: Choose the attribute that maximizes information gain.

Michael Kohlhase: Artificial Intelligence 2 901 2025-02-06
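The numbers in Example 26.4.8 can be checked mechanically; the following Python snippet (an
illustration, not part of the notes) computes the gain from the (positive, negative) counts of each
attribute value in the restaurant training set.

import math

def information(ps):
    return sum(-p * math.log2(p) for p in ps if p > 0)

def gain(splits, p, n):
    """splits: list of (p_a, n_a) counts per attribute value; p, n: totals."""
    remainder = sum((pa + na) / (p + n) * information([pa / (pa + na), na / (pa + na)])
                    for pa, na in splits)
    return information([p / (p + n), n / (p + n)]) - remainder

# restaurant data: p = n = 6
print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))          # Patrons?: ~0.541
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))  # Type:     0.0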

Restaurant Example contd.


 Example 26.4.9. Decision tree learned by DTL from the 12 examples using infor-
mation gain maximization for Choose−Attribute:

 Result: Substantially simpler than the “true” tree – a more complex hypothesis isn’t
justified by the small amount of data.

Michael Kohlhase: Artificial Intelligence 2 902 2025-02-06

26.5 Evaluating and Choosing the Best Hypothesis

Performance measurement
 Question: How do we know that h≊f ? (Hume’s Problem of Induction)
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples. (use same distribution over example space
as training set)

 Definition 26.5.1. The learning curve =b percentage correct on test set as a
function of training set size.
 Example 26.5.2. Restaurant data; graph averaged over 20 trials

Michael Kohlhase: Artificial Intelligence 2 903 2025-02-06



Performance measurement contd.


 Observation 26.5.3. The learning curve depends on
 realizable (can express target function) vs. non-realizable
non-realizability can be due to missing attributes or restricted hypothesis class
(e.g., thresholded linear function)
 redundant expressiveness (e.g., lots of irrelevant attributes)

Michael Kohlhase: Artificial Intelligence 2 904 2025-02-06

Generalization and Overfitting


 Observation: Sometimes a learned hypothesis is more specific than the experi-
ments warrant.
 Definition 26.5.4. We speak of overfitting, if a hypothesis h describes random error
in the (limited) training set rather than the underlying relationship. Underfitting
occurs when h cannot capture the underlying trend of the data.
 Qualitatively: Overfitting increases with the size of hypothesis space and the
number of attributes, but decreases with number of examples.
 Idea: Combat overfitting by “generalizing” decision trees computed by DTL.

Michael Kohlhase: Artificial Intelligence 2 905 2025-02-06

Decision Tree Pruning


 Idea: Combat overfitting by “generalizing” decision trees ; prune “irrelevant”
nodes.
 Definition 26.5.5. For decision tree pruning repeat the following on a learned
decision tree:
 Find a terminal test node n (only result leaves as children)
 If the test is irrelevant, i.e. has low information gain, prune it by replacing n with
a leaf node.

 Question: How big should the information gain be to split (; keep) a node?
 Idea: Use a statistical significance test.
 Definition 26.5.6. A result has statistical significance, if the probability that it
could arise from the null hypothesis (i.e. the assumption that there is no underlying
pattern) is very low (usually below 5%).

Michael Kohlhase: Artificial Intelligence 2 906 2025-02-06

Determining Attribute Irrelevance


 For decision tree pruning, the null hypothesis is that the attribute is irrelevant.

 Compute the probability that the example distribution (p positive, n negative) for
a terminal node deviates from the expected distribution under the null hypothesis.
 For an attribute A with d values, compare the actual numbers pk and nk in each
  subset sk with the expected numbers (expected if A is irrelevant)

      p̂k = p · (pk + nk )/(p + n)    and    n̂k = n · (pk + nk )/(p + n).

 A convenient measure of the total deviation is (sum of squared errors)

      ∆ = Σ_{k=1}^{d} ((pk − p̂k )²/p̂k + (nk − n̂k )²/n̂k )

 Lemma 26.5.7 (Neyman-Pearson). Under the null hypothesis, the value of ∆ is


distributed according to the χ2 distribution with d − 1 degrees of freedom. [JN33]
 Definition 26.5.8. Decision tree pruning with Pearson’s χ2 with d − 1 degrees of
freedom for ∆ is called χ2 pruning. (χ2 values from stats library.)
 Example 26.5.9. The Type attribute has four values, so three degrees of freedom,
and a total deviation of ∆ ≥ 7.82 would reject the null hypothesis at the 5% level.

Michael Kohlhase: Artificial Intelligence 2 907 2025-02-06
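To illustrate χ2 pruning, here is a small Python sketch (an illustration with made-up counts, not
from the notes) that computes the deviation ∆ for a candidate test node and compares it against
the 5% critical value for three degrees of freedom.

def chi2_deviation(splits):
    """splits: list of (p_k, n_k) counts per attribute value."""
    p = sum(pk for pk, _ in splits)
    n = sum(nk for _, nk in splits)
    delta = 0.0
    for pk, nk in splits:
        p_hat = p * (pk + nk) / (p + n)   # expected counts if the attribute
        n_hat = n * (pk + nk) / (p + n)   # is irrelevant (null hypothesis)
        delta += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat
    return delta

# hypothetical counts for a 4-valued attribute (3 degrees of freedom);
# the 5% critical value of the chi-squared distribution is ~7.81
splits = [(1, 1), (1, 1), (2, 2), (2, 2)]
print(chi2_deviation(splits), "-> prune" if chi2_deviation(splits) < 7.81 else "-> keep")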

Error Rates and Cross-Validation


 Recall: We want to learn a hypothesis that fits the future data best.
 Definition 26.5.10. Given an inductive learning problem with a set of examples
  T ⊆ A × B, we define the error rate of a hypothesis h ∈ H as the fraction of errors:

      |{⟨x, y⟩ ∈ T | h(x) ̸= y}| / |T |

 Caveat: A low error rate on the training set does not mean that a hypothesis
generalizes well.
 Idea: Do not use homework questions in the exam.

 Definition 26.5.11. The practice of splitting the data available for learning into
1. a training set from which the learning algorithm produces a hypothesis h and
2. a test set, which is used for evaluating h

is called holdout cross validation. (no peeking at test set allowed)

Michael Kohlhase: Artificial Intelligence 2 908 2025-02-06

Error Rates and Cross-Validation


 Question: What is a good ratio between training set and test set size?

 small training set ; poor hypothesis.


 small test set ; poor estimate of the accuracy.
 Definition 26.5.12. In k fold cross validation, we perform k rounds of learning,
each with 1/k of the data as test set and average over the k error rates.
 Intuition: Each example does double duty: for training and testing.

 k = 5 and k = 10 are popular ; good accuracy at k times computation time.


 Definition 26.5.13. If k = |dom(f )|, then k fold cross validation is called leave
one out cross validation (LOOCV).

Michael Kohlhase: Artificial Intelligence 2 909 2025-02-06

Model Selection
 Definition 26.5.14. The model selection problem is to determine – given data –
a good hypothesis space.
 Example 26.5.15. What is the best polynomial degree to fit the data?

 Observation 26.5.16. We can solve the problem of “learning from observations


f ” in a two-part process:

1. model selection determines a hypothesis space H,


2. optimization solves the induced inductive learning problem.
 Idea: Solve the two parts together by iteration over “size”. (they inform each
other)

 Problem: Need a notion of “size” ⇝ e.g. number of nodes in a decision tree.



 Concrete Problem: Find the “size” that best balances overfitting and underfitting
to optimize test set accuracy.

Michael Kohlhase: Artificial Intelligence 2 910 2025-02-06

Model Selection Algorithm (Wrapper)


 Definition 26.5.17. The model selection algorithm (MSA) jointly optimizes model
selection and optimization by partitioning and cross-validation:
function CROSS−VALIDATION−WRAPPER(Learner,k,examples) returns a hypothesis
local variables: errT , an array, indexed by size, storing training−set error rates
errV , an array, indexed by size, storing validation−set error rates
for size = 1 to ∞ do
errT [size], errV [size] := CROSS−VALIDATION(Learner,size,k,examples)
if errT has converged then do
best_size := the value of size with minimum errV [size]
return Learner(best_size,examples)

function CROSS−VALIDATION(Learner,size,k,examples) returns two values:


average training set error rate, average validation set error rate
f old_errT := 0; f old_errV := 0
for fold = 1 to k do
training_set, validation_set := PARTITION(examples,f old,k)
h := Learner(size,training_set)
f old_errT := f old_errT + ERROR−RATE(h,training_set)
f old_errV := f old_errV + ERROR−RATE(h,validation_set)
return f old_errT /k, f old_errV /k

function PARTITION(examples,f old,k) returns two sets:


a validation set of size |examples|/k and the rest; the split is different for each f old value

Michael Kohlhase: Artificial Intelligence 2 911 2025-02-06

Error Rates on Training/Validation Data

 Example 26.5.18 (An Error Curve for Restaurant Decision Trees).


Modify DTL to be breadth-first, information gain sorted, stop after k nodes.

[Plot: training set and validation set error rates vs. tree size (1–10 nodes); error rate axis from 0 to 60.]

Stops when the training set error rate converges; choose the optimal tree size from the validation
curve. (here a tree with 7 nodes)

Michael Kohlhase: Artificial Intelligence 2 912 2025-02-06

From Error Rates to Loss Functions


 So far we have been minimizing error rates. (better than maximizing ,)
 Example 26.5.19 (Classifying Spam). It is much worse to classify ham (legitimate
mails) as spam than vice versa. (message loss)
 Recall Rationality: Decision-makers should maximize expected utility (MEU).
 So: Machine learning should maximize “utility”. (not only minimize error rates)
 machine learning traditionally deals with utilities in form of “loss functions”.

 Definition 26.5.20. The loss function L is defined by setting L(x, y, ŷ) to be
the amount of utility lost by prediction h(x) = ŷ instead of f (x) = y. If L is
independent of x, we often use L(y, ŷ).
 Example 26.5.21. L(spam, ham) = 1, while L(ham, spam) = 10.

Michael Kohlhase: Artificial Intelligence 2 913 2025-02-06

Generalization Loss
 Note: L(y, y) = 0. (no loss if you are exactly correct)

 Definition 26.5.22 (Popular general loss functions).

      absolute value loss   L1 (y, ŷ) := |y − ŷ|                small errors are good
      squared error loss    L2 (y, ŷ) := (y − ŷ)²               ditto, but differentiable
      0/1 loss              L0/1 (y, ŷ) := 0 if y = ŷ, else 1   error rate
 Idea: Maximize expected utility by choosing hypothesis h that minimizes expected
loss over all (x,y) ∈ f .
 Definition 26.5.23. Let E be the set of all possible examples and P(X, Y ) the
  prior probability distribution over its components, then the expected generalization
  loss for a hypothesis h with respect to a loss function L is

      GenLossL (h) := Σ_{(x,y)∈E} L(y, h(x)) · P (x, y)

  and the best hypothesis h∗ := argmin_{h∈H} GenLossL (h).

Michael Kohlhase: Artificial Intelligence 2 914 2025-02-06

Empirical Loss

 Problem: P(X, Y ) is unknown ; learner can only estimate generalization loss:


 Definition 26.5.24. Let L be a loss function and E a set of examples with |E| = N ,
  then we call

      EmpLossL,E (h) := (1/N ) · Σ_{(x,y)∈E} L(y, h(x))

  the empirical loss and ĥ∗ := argmin_{h∈H} EmpLossL,E (h) the estimated best hypothesis.

 There are four reasons why ĥ∗ may differ from f :

  1. Realizability: if the problem is not realizable, then we have to settle for an approximation ĥ∗ of f .
  2. Variance: different subsets of f give different ĥ∗ ; more examples help.
  3. Noise: if f is non-deterministic, then we cannot expect perfect results.
  4. Computational complexity: if H is too large to systematically explore, we make
     do with a subset and get an approximation.

Michael Kohlhase: Artificial Intelligence 2 915 2025-02-06

Regularization

 Idea: Directly use empirical loss to solve model selection. (finding a good H)
Minimize the weighted sum of empirical loss and hypothesis complexity. (to avoid
overfitting).
 Definition 26.5.25. Let λ ∈ R, h ∈ H, and E a set of examples, then we call

CostL,E (h):=EmpLossL,E (h) + λComplexity(h)

the total cost of h on E.


 Definition 26.5.26. The process of finding a total cost minimizing hypothesis

      ĥ∗ := argmin_{h∈H} CostL,E (h)

is called regularization; Complexity is called the regularization function or hypothesis complexity.
 Example 26.5.27 (Regularization for Polynomials). A good regularization function for
polynomials is the sum of squares of the coefficients. ; keep away from wriggly curves!

Michael Kohlhase: Artificial Intelligence 2 916 2025-02-06

Minimal Description Length



 Remark: In regularization, empirical loss and hypothesis complexity are not mea-
sured in the same scale ; λ mediates between scales.
 Idea: Measure both in the same scale ; use information content, i.e. in bits.

 Definition 26.5.28. Let h ∈ H be a hypothesis and E a set of examples, then the


description length of (h,E) is computed as follows:
1. encode the hypothesis as a Turing machine program, count bits.
2. count data bits:
 correctly predicted example ; 0b
 incorrectly predicted example ; according to size of error.

The minimum description length or MDL hypothesis minimizes the total number of
bits required.
 This works well in the limit, but for smaller problems there is a difficulty in that the
choice of encoding for the program affects the outcome.
 e.g., how best to encode a decision tree as a bit string?

Michael Kohlhase: Artificial Intelligence 2 917 2025-02-06

The Scale of Machine Learning


 Traditional methods in statistics and early machine learning concentrated on small-
scale learning (50-5000
examples)
 Generalization error mostly comes from
 approximation error of not having the true f in the hypothesis space
 estimation error of too few training examples to limit variance.

 In recent years there has been more emphasis on large-scale learning. (millions of
examples)
 Generalization error is dominated by limits of computation
 there is enough data and a rich enough model that we could find an h that
is very close to the true f ,
 but the computation to find it is too complex, so we settle for a sub-optimal
approximation.
 Hardware advances (GPU farms, Amazon EC2, Google Data Centers, . . . ) help.

Michael Kohlhase: Artificial Intelligence 2 918 2025-02-06

26.6 Computational Learning Theory


Video Nuggets covering this section can be found at https://fau.tv/clip/id/30377 and
https://fau.tv/clip/id/30378.

A (General) Theory of Learning?


 Main Question: How can we be sure that our learning algorithm has produced a
hypothesis that will predict the correct value for previously unseen inputs?
 Formally: How do we know that the hypothesis h is close to the target function
f if we don’t know what f is?

 Other - more recent - Questions:


 How many examples do we need to get a good h?
 What hypothesis space H should we use?
 If H is very complex, can we even find the best h, or do we have to settle
for a local maximum in H?
 How complex should h be?
 How do we avoid overfitting?
 “Computational Learning Theory” tries to answer these using concepts from AI,
statistics, and theoretical CS.

Michael Kohlhase: Artificial Intelligence 2 919 2025-02-06

PAC Learning
 Basic idea of Computational Learning Theory:
 Any hypothesis h that is seriously wrong will almost certainly be “found out”
with high probability after a small number of examples, because it will make an
incorrect prediction.
 Thus, any h that is consistent with a sufficiently large set of training examples is
unlikely to be seriously wrong.
 ; h is probably approximately correct.
 Definition 26.6.1. Any learning algorithm that returns hypotheses that are prob-
ably approximately correct is called a PAC learning algorithm.
 Derive performance bounds for PAC learning algorithms in general, using the
 Stationarity Assumption (again): We assume that the set E of possible examples
is IID ; we have a fixed distribution P(E) = P(X, Y ) on examples.

 Simplifying Assumptions: f is a function (deterministic) and f ∈ H.

Michael Kohlhase: Artificial Intelligence 2 920 2025-02-06

PAC Learning
 Start with PAC theorems for Boolean functions, for which L0/1 is appropriate.
 Definition 26.6.2. The error rate error(h) of a hypothesis h is the probability that
  h misclassifies a new example:

      error(h) := GenLossL0/1 (h) = Σ_{(x,y)∈E} L0/1 (y, h(x)) · P (x, y)

 Intuition: error(h) is the probability that h misclassifies a new example.

 This is the same quantity as measured in the learning curves above.


 Definition 26.6.3. A hypothesis h is called approximatively correct, iff error(h) ≤ ϵ
for some small ϵ > 0.
We write Hb :={h ∈ H | error(h) > ϵ} for the “seriously bad” hypotheses.

Michael Kohlhase: Artificial Intelligence 2 921 2025-02-06

Sample Complexity
 Let’s compute the probability that hb ∈ Hb is consistent with the first N examples.
 We know error(hb ) > ϵ
  ; P (hb agrees with N examples) ≤ (1 − ϵ)^N . (independence)
  ; P (Hb contains consistent hyp.) ≤ |Hb | · (1 − ϵ)^N ≤ |H| · (1 − ϵ)^N . (Hb ⊆ H)
  ; to bound this by a small δ, show the algorithm N ≥ (1/ϵ) · (log2 (1/δ) + log2 (|H|)) examples.
 Definition 26.6.4. The number of required examples as a function of ϵ and δ is
  called the sample complexity of H.
 Example 26.6.5. If H is the set of n-ary Boolean functions, then |H| = 2^(2^n).
  ; sample complexity grows with O(log2 (2^(2^n))) = O(2^n ).
  There are 2^n possible examples,
  ; PAC learning for Boolean functions needs to see (nearly) all examples.

Michael Kohlhase: Artificial Intelligence 2 922 2025-02-06

Escaping Sample Complexity


 Problem: PAC learning for Boolean functions needs to see (nearly) all examples.

 H contains enough hypotheses to classify any given set of examples in all possible
ways.
 In particular, for any set of N examples, the set of hypotheses consistent with
those examples contains equal numbers of hypotheses that predict xN +1 to be
positive and hypotheses that predict xN +1 to be negative.

 Idea/Problem: restrict the H in some way (but we may lose realizability)


 Three Ways out of this Dilemma:
1. bring prior knowledge into the problem. (??)
2. prefer simple hypotheses. (e.g. decision tree pruning)

3. focus on “learnable subsets” of H. (next)

Michael Kohlhase: Artificial Intelligence 2 923 2025-02-06

PAC Learning: Decision Lists


 Idea: Apply PAC learning to a “learnable hypothesis space”.
 Definition 26.6.6. A decision list consists of a sequence of tests, each of which is
a conjunction of literals.
 If a test succeeds when applied to an example description, the decision list
specifies the value to be returned.
 If the test fails, processing continues with the next test in the list.
 Remark: Like decision trees, but restricted branching, but more complex tests.
 Example 26.6.7 (A decision list for the Restaurant Problem).

[Decision list: test Patrons(x, Some)? – if Yes return Yes; if No test Patrons(x, Full) ∧ Fri/Sat(x)? – if Yes return Yes; if No return No.]

 Lemma 26.6.8. Given arbitrary size conditions, decision lists can represent arbi-
trary Boolean functions.

 This directly defeats our purpose of finding a “learnable subset” of H.

Michael Kohlhase: Artificial Intelligence 2 924 2025-02-06

Decision Lists: Learnable Subsets (Size-Restricted Cases)


 Definition 26.6.9. The set of decision lists where tests are of conjunctions of at
most k literals is denoted by k−DL.

 Example 26.6.10. The decision list from ?? is in 2−DL.


 Observation 26.6.11. k−DL contains k−DT, the set of decision trees of depth
at most k.
 Definition 26.6.12. We denote the set of k−DL decision lists with at most n
Boolean attributes with k−DL(n). The set of conjunctions of at most k literals
over n attributes is written as Conj(k, n).
 Decision lists are constructed of optional yes/no tests (each conjunction can be absent, or
present with outcome Yes or No), so there are at most 3^|Conj(k,n)| distinct sets of component
tests. Each of these sets of tests can be in any order, so |k−DL(n)| ≤ 3^|Conj(k,n)| · |Conj(k, n)|!

Michael Kohlhase: Artificial Intelligence 2 925 2025-02-06



Decision Lists: Learnable Subsets (Sample Complexity)


 The number of conjunctions of k literals from n attributes is given by

      |Conj(k, n)| = Σ_{i=1}^{k} (2n choose i)

  thus |Conj(k, n)| = O(n^k ). Hence, we obtain (after some work)

      |k−DL(n)| = 2^O(n^k · log2 (n^k ))

 Plug this into the equation for the sample complexity: N ≥ (1/ϵ) · (log2 (1/δ) + log2 (|H|))
  to obtain

      N ≥ (1/ϵ) · (log2 (1/δ) + O(n^k · log2 (n^k )))
 Intuitively: Any algorithm that returns a consistent decision list will PAC learn a
k−DL function in a reasonable number of examples, for small k.

Michael Kohlhase: Artificial Intelligence 2 926 2025-02-06

Decision Lists Learning


 Idea: Use a greedy search algorithm that repeats
1. find test that agrees exactly with some subset E of the training set,
2. add it to the decision list under construction and removes E,
3. construct the remainder of the DL using just the remaining examples,

until there are no examples left.


 Definition 26.6.13. The following algorithm performs decision list learning
function DLL(E) returns a decision list, or failure
if E is empty then return (the trivial decision list) No
t := a test that matches a nonempty subset Et of E
such that the members of Et are all positive or all negative
if there is no such t then return failure
if the examples in Et are positive then o := Yes else o := No
return a decision list with initial test t and outcome o and remaining tests given by
DLL(E\Et )

Michael Kohlhase: Artificial Intelligence 2 927 2025-02-06

Decision Lists Learning in Comparison


 Learning curves: for DLL (and DTL for comparison)
[Plot: proportion correct on the test set (0.4–0.9) vs. training set size (0–100), for decision tree and decision list learning.]

 Upshot: The simpler DLL works quite well!

Michael Kohlhase: Artificial Intelligence 2 928 2025-02-06

26.7 Regression and Classification with Linear Models

Univariate Linear Regression


 Definition 26.7.1. A univariate or unary function is a function with one argument.

 Recall: A mapping f between vector spaces is called linear, iff it preserves plus
and scalar multiplication, i.e. f (α · v1 + v2 ) = α · f (v1 ) + f (v2 ).
 Observation 26.7.2. A univariate, linear function f : R → R is of the form f (x) =
w1 x + w0 for some wi ∈ R.

 Definition 26.7.3. Given a vector w := (w0 ,w1 ), we define hw (x):=w1 x + w0 .


 Definition 26.7.4. Given a set of examples E ⊆ R×R, the task of finding hw that
best fits E is called linear regression.
 Example 26.7.5. House price vs. house size in square feet for houses sold in Berkeley
  in July 2009, together with the linear function hypothesis that minimizes squared error
  loss: y = 0.232x + 246.
  [Plot: house price in $1000 (300–1000) vs. house size in square feet (500–3500).]

Michael Kohlhase: Artificial Intelligence 2 929 2025-02-06

Univariate Linear Regression by Loss Minimization



 Idea: Minimize squared error loss over {(xi ,yi ) | i ≤ N } (used already by Gauss)

      Loss(hw ) = Σ_{j=1}^{N} L2 (yj , hw (xj )) = Σ_{j=1}^{N} (yj − hw (xj ))² = Σ_{j=1}^{N} (yj − (w1 xj + w0 ))²

  Task: find w∗ := argmin_w Loss(hw ).
 Recall: Σ_{j=1}^{N} (yj − (w1 xj + w0 ))² is minimized, when the partial derivatives wrt.
  the wi are zero, i.e. when

      ∂/∂w0 (Σ_{j=1}^{N} (yj − (w1 xj + w0 ))²) = 0    and    ∂/∂w1 (Σ_{j=1}^{N} (yj − (w1 xj + w0 ))²) = 0

 Observation: These equations have a unique solution:

      w1 = (N (Σ_j xj yj ) − (Σ_j xj )(Σ_j yj )) / (N (Σ_j xj ²) − (Σ_j xj )²)        w0 = ((Σ_j yj ) − w1 (Σ_j xj )) / N

 Remark: Closed-form solutions only exist for linear regression, for other (dif-
ferentiable) hypothesis spaces use gradient descent methods for adjusting/learning
weights.

Michael Kohlhase: Artificial Intelligence 2 930 2025-02-06
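The closed-form solution above is easy to implement; here is a minimal Python sketch (for
illustration, not from the notes) with a tiny made-up data set that lies roughly on the line y = 2x + 1.

def fit_univariate(xs, ys):
    """Closed-form least-squares fit y = w1*x + w0."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1

print(fit_univariate([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8]))  # ~ (1.09, 1.94)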

A Picture of the Weight Space


 Remark: Many forms of learning involve adjusting weights to minimize loss.

 Definition 26.7.6. The weight space of a parametric model is the space of all
possible combinations of parameters (called the weights). Loss minimization in a
weight space is called weight fitting.

 The weight space of univariate linear regression is R².
  ; graph the loss function over R².
  Note: it is convex.
  [Plot: the loss surface over the weight space (w0 , w1 ).]

 Observation 26.7.7. The squared error loss function is convex for any linear
regression problem ; there are no local minima.

Michael Kohlhase: Artificial Intelligence 2 931 2025-02-06

Gradient Descent Methods


 If we do not have closed form solutions for minimizing loss, we need to search.

 Idea: Use local search (hill climbing) methods.


 Definition 26.7.8. The gradient descent algorithm for finding a minimum of a
continuous function F is hill climbing in the direction of the steepest descent, which
can be computed by the partial derivatives of F .
function gradient−descent(F ,w,α) returns a local minimum of F
  inputs: a differentiable function F and initial weights w.
  loop until w converges do
    for each wi do
      wi ←− wi − α · ∂F (w)/∂wi
    end for
  end loop

The parameter α is called the learning rate. It can be a fixed constant or it can
decay as learning proceeds.

Michael Kohlhase: Artificial Intelligence 2 932 2025-02-06

Gradient-Descent for Loss


 Let’s try gradient descent for Loss.
 Work out the partial derivatives for one example (x,y):

      ∂Loss(w)/∂wi = ∂((y − hw (x))²)/∂wi = 2(y − hw (x)) · ∂(y − (w1 x + w0 ))/∂wi

  and thus

      ∂Loss(w)/∂w0 = −2(y − hw (x))        ∂Loss(w)/∂w1 = −2(y − hw (x)) · x

  Plug this into the gradient descent updates:

      w0 ←− w0 − α · (−2(y − hw (x)))        w1 ←− w1 − α · (−2(y − hw (x)) · x)

Michael Kohlhase: Artificial Intelligence 2 933 2025-02-06

Gradient-Descent for Loss (continued)


 Analogously for N training examples (xj ,yj ):
 Definition 26.7.9.

      w0 ←− w0 − α · (Σ_j −2(yj − hw (xj )))        w1 ←− w1 − α · (Σ_j −2(yj − hw (xj )) · xj )

These updates constitute the batch gradient descent learning rule for univariate
linear regression.
 Convergence to the unique global loss minimum is guaranteed (as long as we pick
α small enough) but may be very slow.

 Doing batch gradient descent on random subsets of the examples of fixed batch
size n is called stochastic gradient descent (SGD). (More computationally efficient
than updating for every example)

Michael Kohlhase: Artificial Intelligence 2 934 2025-02-06
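The batch and stochastic update rules can be written down in a few lines of Python; this is an
illustrative sketch (not from the notes), with a made-up data set, a fixed learning rate α, and a
fixed number of epochs as assumptions.

import random

def batch_gd(data, alpha=0.01, epochs=1000):
    """Batch gradient descent for hw(x) = w1*x + w0 with squared error loss."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in data)
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in data)
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
    return w0, w1

def sgd(data, alpha=0.01, epochs=1000, batch_size=2):
    """Stochastic gradient descent: the same updates on random mini-batches."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        batch = random.sample(data, batch_size)
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in batch)
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in batch)
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
    return w0, w1

data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]   # roughly y = 2x + 1
print(batch_gd(data))   # converges towards ~ (1.09, 1.94)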

Multivariate Linear Regression


 Definition 26.7.10. A multivariate or n-ary function is a function with one or
more arguments.
 We can use it for multivariate linear regression.
 Idea: Every example ⃗xj is an n element vector and the hypothesis space is the set
  of functions

      hsw (⃗xj ) = w0 + w1 xj,1 + . . . + wn xj,n = w0 + Σ_i wi xj,i

 Trick: Invent xj,0 := 1 and use matrix notation:

      hsw (⃗xj ) = w⃗ · ⃗xj = w⃗ ^t ⃗xj = Σ_i wi xj,i

 Definition 26.7.11. The best vector of weights, w∗ , minimizes squared-error loss
  over the examples: w∗ := argmin_w (Σ_j L2 (yj , w⃗ · ⃗xj )).
 Gradient descent will reach the (unique) minimum of the loss function; the update
  equation for each weight wi is

      wi ←− wi + α · (Σ_j xj,i (yj − hw (⃗xj )))

Michael Kohlhase: Artificial Intelligence 2 935 2025-02-06

Multivariate Linear Regression (Analytic Solutions)


 We can also solve analytically for the w∗ that minimizes loss.
 Let ⃗y be the vector of outputs for the training examples, and X be the data matrix,
i.e., the matrix of inputs with one n-dimensional example per row.
Then the solution w∗ = (X^T X)^(−1) X^T ⃗y minimizes the squared error.

Michael Kohlhase: Artificial Intelligence 2 936 2025-02-06
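In NumPy the analytic solution is a one-liner; the snippet below (an illustration with random data,
not part of the notes) also checks it against the library least-squares routine. In practice one would
use np.linalg.lstsq or the pseudo-inverse rather than forming (X^T X)^(−1) explicitly, for numerical
stability.

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # x_{j,0} = 1 column
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_star = np.linalg.inv(X.T @ X) @ X.T @ y       # normal equation
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library solution
print(np.allclose(w_star, w_lstsq))             # True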

Multivariate Linear Regression (Regularization)


 Remark: Univariate linear regression does not overfit, but in the multivariate case

there might be “redundant dimensions” that result in overfitting.


 Idea: Use regularization with a complexity function based on weights.
 Definition 26.7.12. Complexity(hw ) = Lq (w) = Σ_i |wi |^q

 Caveat: Do not confuse this with the loss functions L1 and L2 .


 Problem: Which q should we pick? (L1 and L2 minimize sum of absolute
values/squares)
 Answer: It depends on the application.

 Remark: L1 -regularization tends to produce a sparse model, i.e. it sets many


weights to 0, effectively declaring the corresponding attributes to be irrelevant.
Hypotheses that discard attributes can be easier for a human to understand, and
may be less likely to overfit. (see [RN03, Section 18.6.2])

Michael Kohlhase: Artificial Intelligence 2 937 2025-02-06

Linear Classifiers with a hard Threshold


 Idea: The result of linear regression can be used for classification.
 Example 26.7.13 (Nuclear Test Ban Verification).

Plots of seismic data parameters: body wave magnitude x1 vs. surface wave magnitude x2 .
White: earthquakes, black: underground explosions.
Also: hw∗ as a decision boundary x2 = 1.7x1 − 4.9.
[Scatter plot: x2 (2.5–7.5) vs. x1 (4.5–7) with the linear decision boundary.]

 Definition 26.7.14. A decision boundary is a line (or a surface, in higher dimen-


sions) that separates two classes of points. A linear decision boundary is called a
linear separator and data that admits one are called linearly separable.
 Example 26.7.15 (Nuclear Tests continued). The linear separator for ?? is
defined by −4.9 + 1.7x1 − x2 = 0; explosions are characterized by −4.9 + 1.7x1 − x2 > 0,
earthquakes by −4.9 + 1.7x1 − x2 < 0.
 Useful Trick: If we introduce dummy coordinate x0 = 1, then we can write the
classification hypothesis as hw (x) = 1 if w·x > 0 and 0 otherwise.

Michael Kohlhase: Artificial Intelligence 2 938 2025-02-06

Linear Classifiers with a hard Threshold (Perceptron Rule)


 So hw (x) = 1 if w·x > 0 and 0 otherwise is well-defined, how to choose w?

 Think of hw (x) = T (w·x), where T (z) = 1, if z > 0 and T (z) = 0 otherwise.


We call T a threshold function.
 Problem: T is not differentiable and ∂T /∂z = 0 where defined ;
  No closed-form solutions by setting ∂T /∂z = 0 and solving.
 Gradient-descent methods in weight-space do not work either.
 We can learn weights by iterating over the following rule:
 Definition 26.7.16.Given an example (x,y), the perceptron learning rule is

wi ←− wi + α · (y − hw (x)) · xi

 as we are considering 0/1 classification, there are three possibilities:


1. If y = hw (x), then wi remains unchanged.
2. If y = 1 and hw (x) = 0, then wi is in/decreased if xi is positive/negative. (we
want to make w·x bigger so that T (w·x) = 1)
3. If y = 0 and hw (x) = 1, then wi is de/increased if xi is positive/negative. (we
want to make w·x smaller so that T (w·x) = 0)

Michael Kohlhase: Artificial Intelligence 2 939 2025-02-06
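A compact Python sketch of the perceptron learning rule (illustration only; the toy data set,
learning rate, and epoch count are assumptions). It learns weights for a hard-threshold classifier
hw (x) = T (w·x) with the dummy coordinate x0 = 1.

def predict(w, x):
    # hard threshold on w·x, with x[0] assumed to be the dummy input 1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def perceptron(data, alpha=0.1, epochs=100):
    """data: list of (x, y) with x including the dummy coordinate x0 = 1."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)            # perceptron rule
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

# toy, linearly separable data: class 1 iff both inputs are 1 (hypothetical)
data = [([1, 0, 0], 0), ([1, 1, 0], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
w = perceptron(data)
print([predict(w, x) for x, _ in data])   # [0, 0, 0, 1]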

Learning Curves for Linear Classifiers (Perceptron Rule)


 Example 26.7.17. Learning curves (plots of total training set accuracy vs. number of
  iterations) for the perceptron rule on the earthquake/explosions data:

  [Three plots of proportion correct (0.4–1) vs. number of weight updates:]
   original data: messy convergence (700 iterations)
   noisy, non-separable data: convergence failure (100,000 iterations)
   with learning rate decay α(t) = 1000/(1000 + t): slow convergence (100,000 iterations)

 Theorem 26.7.18. Finding the minimal-error hypothesis is NP hard, but possible


with learning rate decay.

Michael Kohlhase: Artificial Intelligence 2 940 2025-02-06

Linear Classification with Logistic Regression



 So far: Passing the output of a linear function through a threshold function T


yields a linear classifier.
 Problem: The hard nature of T brings problems:

 T is not differentiable nor continuous ; learning via perceptron rule becomes


unpredictable.
 T is “overly precise” near the boundary ⇝ need more graded judgments.
 Idea: Soften the threshold, approximate it with a differentiable function.
We use the standard logistic function l(x) = 1/(1 + e^(−x)).
So we have hw (x) = l(w·x) = 1/(1 + e^(−(w·x))).

 Example 26.7.19 (Logistic Regression Hypothesis).
  Plot of a logistic regression hypothesis for the earthquake/explosion data: the value
  at (x1 ,x2 ) is the probability of belonging to the class labeled 1.
  [3D plot of hw over the input space (x1 , x2 ), rising from 0 to 1 across the decision boundary.]
  We speak of the cliff in the classifier intuitively.

Michael Kohlhase: Artificial Intelligence 2 941 2025-02-06

Logistic Regression

 Definition 26.7.20. The process of weight fitting in hw (x) = 1/(1 + e^(−(w·x))) is called
logistic regression.
 There is no easy closed form solution, but gradient descent is straightforward,
 As our hypotheses have continuous output, use the squared error loss function L2 .

 For an example (x,y) we compute the partial derivatives: (via chain rule)

      ∂L2 (w)/∂wi = ∂((y − hw (x))²)/∂wi
                  = 2 · (y − hw (x)) · ∂(y − hw (x))/∂wi
                  = −2 · (y − hw (x)) · l′ (w·x) · ∂(w·x)/∂wi
                  = −2 · (y − hw (x)) · l′ (w·x) · xi

Michael Kohlhase: Artificial Intelligence 2 942 2025-02-06

Logistic Regression (continued)



 The derivative of the logistic function satisfies l′ (z) = l(z)(1 − l(z)), thus

l′ (w·x) = l(w·x)(1 − l(w·x)) = hw (x)(1 − hw (x))

 Definition 26.7.21. The rule for logistic update (weight update for minimizing the
loss) is
wi ←− wi + α · (y − hw (x)) · hw (x) · (1 − hw (x)) · xi

 Example 26.7.22 (Redoing the Learning Curves).

[Three plots of squared error per example (0.4–1) vs. number of weight updates:]
 original data: messy convergence (5000 iterations)
 noisy, non-separable data: convergence failure (100,000 iterations)
 with learning rate decay α(t) = 1000/(1000 + t): slow convergence (100,000 iterations)

 Upshot: Logistic update seems to perform better than perceptron update.

Michael Kohlhase: Artificial Intelligence 2 943 2025-02-06
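For completeness, here is the logistic update rule as a Python sketch (illustration only; the toy
data, learning rate, and epoch count are assumptions), mirroring the perceptron sketch above but
with the soft threshold.

import math

def h(w, x):
    # logistic hypothesis hw(x) = l(w·x); x[0] is the dummy input 1
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def logistic_fit(data, alpha=0.5, epochs=2000):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            p = h(w, x)
            # logistic update: wi <- wi + alpha*(y - hw(x))*hw(x)*(1 - hw(x))*xi
            w = [wi + alpha * (y - p) * p * (1 - p) * xi for wi, xi in zip(w, x)]
    return w

data = [([1, 0, 0], 0), ([1, 1, 0], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
w = logistic_fit(data)
print([round(h(w, x), 2) for x, _ in data])   # hw(x) per example; the positive one gets the largest value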

26.8 Support Vector Machines

Support Vector Machines


Definition 26.8.1. Given a linearly separable data set E, the maximum margin separator
is the linear separator s that maximizes the margin, i.e. the distance of E from s.
Example 26.8.2. All lines on the left are valid linear separators:

[Two plots over the unit square: left – the data with several valid linear separators; right – the maximum margin separator, with the innermost points circled.]

We expect the maximum margin separator on the right to generalize best


Note: To find the maximum margin separator, we only need to consider the innermost
points (circled above).

Michael Kohlhase: Artificial Intelligence 2 944 2025-02-06

Support Vector Machines (contd.)



Definition 26.8.3. Support-vector machines (SVMs; also support-vector networks) are


supervised learning models for classification and regression.
SVMs construct a maximum margin separator by prioritizing critical examples (sup-
port vectors).
SVMs are still one of the most popular approaches for “off-the-shelf” supervised
learning.
Setting:
 We have a training set E = {⟨x1 , y 1 ⟩, . . ., ⟨xn , y n ⟩} where xi ∈ Rp and y i ∈
{ − 1, 1} (instead of {1, 0})

 The goal is to find a hyperplane in Rp that maximally separates the two classes
(i.e. y i = −1 from y i = 1)
Remember: A hyperplane can be represented as the set {x | (w·x) + b = 0} for some
vector w and scalar b. (w is orthogonal to the plane, b determines the offset from
the origin)

Michael Kohlhase: Artificial Intelligence 2 945 2025-02-06

Finding the Maximum Margin Separator (Separable Case)


Idea: The margin is bounded by the two hyperplanes described by {x | (w·x) + b + 1 = 0}
(lower boundary) and {x | (w·x) + b − 1 = 0} (upper boundary).
⇒ The distance between them is 2/∥w∥.
Constraints: To maximize the margin, minimize ∥w∥ while keeping the xi out of the margin:

    (w·xi ) + b ≥ 1 for y i = 1    and    (w·xi ) + b ≤ −1 for y i = −1
    ; y i ((w·xi ) + b) ≥ 1 for 1 ≤ i ≤ n.

; This is an optimization problem.
[Plot: the margin between the two boundary hyperplanes around the separator.]

Theorem 26.8.4 (SVM equation). Let

    α = argmax_α (Σ_j αj − (1/2) · Σ_{j,k} αj αk y j y k (xj ·xk ))

under the constraints αj ≥ 0 and Σ_j αj y j = 0.
The maximum margin separator is given by w = Σ_j αj y j xj and b = w·xi − y i for
any xi where αi ̸= 0.
Proof sketch: By the duality principle for optimization problems

Michael Kohlhase: Artificial Intelligence 2 946 2025-02-06

Finding the Maximum Margin Separator (Separable Case)


    α = argmax_α (Σ_j αj − (1/2) · Σ_{j,k} αj αk y j y k (xj ·xk )),    where αj ≥ 0,  Σ_j αj y j = 0

Important Properties:

 The weights αj associated with each data point are zero except at the support
  vectors (the points closest to the separator),
 The expression is convex ; the single global maximum can be found efficiently,
 Data enter the expression only in the form of dot products of point pairs ; once
  the optimal αi have been calculated, we have h(x) = sign(Σ_j αj yj (x·xj ) − b)
 There are good software packages for solving such quadratic programming opti-
mizations

Michael Kohlhase: Artificial Intelligence 2 947 2025-02-06

Support Vector Machines (Kernel Trick)


What if the data is not linearly separable?
Idea: Transform the data into a feature space where they are.
Definition 26.8.5. A feature for data in Rp is a function Rp → Rq .
Example 26.8.6 (Projecting Up a Non-Separable Data Set).
The true decision boundary is x1 ² + x2 ² ≤ 1.
[Scatter plot over [−1.5, 1.5]²: the two classes are separated by a circle, not by any line.]

; use the feature “distance from center”


Michael Kohlhase: Artificial Intelligence 2 948 2025-02-06

Support Vector Machines (Kernel Trick continued)


Idea: Replace xi ·xj by some other product on the feature space in the SVM equation

Definition 26.8.7. A kernel function is a function K : Rp ×Rp → R of the form


K(x1 , x2 ) = ⟨F (x1 ),F (x2 )⟩ for some feature F and inner product ⟨·, ·⟩ on the codomain
of F .
Smart choices for a kernel function often allow us to compute K(xi , xj ) without
needing to compute F at all.
Example 26.8.8. If we encode the distance from the center as the feature F (x) =
⟨x1 ², x2 ², √2·x1 x2 ⟩ and define the kernel function as K(xi , xj ) = F (xi )·F (xj ), then
this simplifies to K(xi , xj ) = (xi ·xj )².
[3D plot: the same data mapped by F into the feature space ⟨x1 ², x2 ², √2·x1 x2 ⟩, where it becomes linearly separable.]

Michael Kohlhase: Artificial Intelligence 2 949 2025-02-06
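The algebra behind Example 26.8.8 is easy to verify numerically; the following Python snippet
(illustration only, not part of the notes) checks that the explicit feature map and the kernel
K(xi , xj ) = (xi ·xj )² agree on random points.

import math
import random

def F(x):
    # explicit feature map <x1^2, x2^2, sqrt(2)*x1*x2>
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for _ in range(5):
    xi = (random.uniform(-1, 1), random.uniform(-1, 1))
    xj = (random.uniform(-1, 1), random.uniform(-1, 1))
    assert abs(dot(F(xi), F(xj)) - dot(xi, xj) ** 2) < 1e-12
print("F(xi)·F(xj) == (xi·xj)^2 on all samples")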

Support Vector Machines (Kernel Trick continued)


Generally: We can learn non-linear separators by solving
    argmax_α (Σ_j αj − (1/2) · Σ_{j,k} αj αk yj yk K(xj , xk ))

where K is a kernel function


Definition 26.8.9. Let X = {x1 , . . ., xn }. A symmetric function K : X×X → R is
called positive definite iff the matrix Ki,j = K(xi , xj ) is a positive definite matrix.
Theorem 26.8.10 (Mercer’s Theorem). Every positive definite function K on X is
a kernel function on X for some feature F .
Definition 26.8.11. The function K(xj , xk ) = (1 + (xj ·xk ))^d is a kernel function
corresponding to a feature space whose dimension is exponential in d. It is called the
polynomial kernel.

Michael Kohlhase: Artificial Intelligence 2 950 2025-02-06

26.9 Artificial Neural Networks


Outline
 Brains

 Neural networks
 Perceptrons
 Multilayer perceptrons
 Applications of neural networks

Michael Kohlhase: Artificial Intelligence 2 951 2025-02-06

Brains
 Axiom 26.9.1 (Neuroscience Hypothesis). Mental activity consists
primarily of electrochemical activity in networks of brain cells called neurons.

 Definition 26.9.2. The animal brain is a biological neural network
 with 10^11 neurons of > 20 types, 10^14 synapses, and 1ms–10ms cycle time.
 Signals are noisy “spike trains” of electrical potential.

Michael Kohlhase: Artificial Intelligence 2 952 2025-02-06

Neural Networks as an approach to Artificial Intelligence

 One approach to Artificial Intelligence is to model and simulate brains. (and hope
that AI comes along naturally)
 Definition 26.9.3. The AI subfield of neural networks (also called connectionism,
parallel distributed processing, and neural computation) studies computing systems
inspired by the biological neural networks that constitute brains.
 Neural networks are attractive computational devices, since they perform important
AI tasks – most importantly learning and distributed, noise-tolerant computation –
naturally and efficiently.

Michael Kohlhase: Artificial Intelligence 2 953 2025-02-06

Neural Networks – McCulloch-Pitts “unit”


Definition 26.9.4. An artificial neural network is a directed graph such that every edge
ai → aj is associated with a weight wi,j ∈ R, and each node aj with parents a1 , . . ., an
is associated with a function f (w1,j , . . ., wn,j , x1 , . . . , xn ) ∈ R.
We call the output of a node’s function its activation, the matrix wi,j the weight
matrix, the nodes units and the edges links.
In 1943 McCulloch and Pitts proposed a simple model for a neuron/brain:
Definition 26.9.5. A McCulloch-Pitts unit first computes a weighted sum of all inputs
and then applies an activation function g to it.
Using a fixed bias input a0 = 1 with bias weight w0,i , this means

    ini = Σ_j wj,i aj    and    ai ← g(ini ) = g(Σ_j wj,i aj )

[Diagram: input links with weights wj,i feed a weighted sum Σ, whose result ini is passed
through the activation function g to produce the output ai on the output links.]

If g is a threshold function, we call the unit a perceptron unit, if g is a logistic function


a sigmoid perceptron unit.
A McCulloch-Pitts network is a neural network with McCulloch-Pitts units.
Michael Kohlhase: Artificial Intelligence 2 954 2025-02-06

Implementing Logical Functions as Units


 McCulloch-Pitts units are a gross oversimplification of real neurons, but their purpose
is to develop an understanding of what neural networks of simple units can do.
 Theorem 26.9.6 (McCulloch and Pitts). Every Boolean function can be implemented
as a McCulloch-Pitts network.
 Proof: by construction
1. Recall that ai ← g(∑j wj,i · aj). Let g(r) = 1 iff r > 0, else 0.
2. As for linear regression we use a0 = 1 ; w0,i as a bias weight (or intercept).
(determines the threshold)
3. AND: w0 = −1, w1 = w2 = 1;   OR: w0 = −0.5, w1 = w2 = 1;   NOT: w0 = 0.5, w1 = −1.
4. Any Boolean function can be implemented as a DAG of McCulloch-Pitts units.

Michael Kohlhase: Artificial Intelligence 2 955 2025-02-06
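To make the construction concrete, here is a minimal Python sketch (mine, not from the notes) of McCulloch-Pitts units with the bias convention a0 = 1 and the threshold g(r) = 1 iff r > 0, using the weights from the proof above.

def unit(weights, inputs):
    # one McCulloch-Pitts unit: weighted sum of (1, inputs), then hard threshold
    r = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if r > 0 else 0

AND = lambda x1, x2: unit([-1.0, 1, 1], [x1, x2])   # w0 = -1,   w1 = w2 = 1
OR  = lambda x1, x2: unit([-0.5, 1, 1], [x1, x2])   # w0 = -0.5, w1 = w2 = 1
NOT = lambda x1:     unit([0.5, -1], [x1])          # w0 = 0.5,  w1 = -1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NOT(a))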

Network Structures: Feed-Forward Networks


 We have models for neurons ; connect them to neural networks.
 Definition 26.9.7. A neural network is called a feed-forward network, if it is acyclic.

 Intuition: Feed-forward networks implement functions, they have no internal state.


 Definition 26.9.8. Feed-forward networks are usually organized in layers: an n layer
network has a partition {L0 , . . ., Ln } of the nodes, such that edges only connect
nodes from subsequent layers.
L0 is called the input layer and its members input units, and Ln the output layer
and its members output units. Any unit that is not in the input layer or the output
layer is called hidden.

Michael Kohlhase: Artificial Intelligence 2 956 2025-02-06



Network Structures: Recurrent Networks


 Definition 26.9.9. A neural network is called recurrent (an RNN), iff it has cycles.
 Hopfield networks have symmetric weights (wi,j = wj,i ), g(x) = sign(x), and ai =
±1; (holographic associative memory)
 Boltzmann machines use stochastic activation functions.
 Recurrent neural networks have cycles with delay ; have internal state (like flip-
flops), can oscillate etc.

Recurrent neural networks follow largely the same principles as feed-forward networks,
so we will not go into details here.

Michael Kohlhase: Artificial Intelligence 2 957 2025-02-06

Single-layer Perceptrons
 Definition 26.9.10. A perceptron network is a feed-forward network of perceptron
units. A single layer perceptron network is called a perceptron.

 Example 26.9.11.

[Figure: a perceptron network in which all input units are connected directly to the output units via weights wi,j (left); the output of a two-input perceptron unit plotted over its input space (right).]

 All input units are directly connected to output units.


 Output units all operate separately, no shared weights ; treat as the combination
of n perceptron units.
 Adjusting weights moves the location, orientation, and steepness of cliff.

Michael Kohlhase: Artificial Intelligence 2 958 2025-02-06

Feed-forward Neural Networks (Example)


 Feed-forward network =̂ a parameterized family of nonlinear functions:
 Example 26.9.12. We show two feed-forward networks:

[Figure: a) a single layer perceptron network with inputs 1, 2 and outputs 3, 4 (weights w1,3, w1,4, w2,3, w2,4); b) a 2 layer feed-forward network with inputs 1, 2, hidden units 3, 4, and outputs 5, 6 (weights w1,3, w1,4, w2,3, w2,4, w3,5, w3,6, w4,5, w4,6).]

For network b):

a5 = g(w3,5 · a3 + w4,5 · a4) = g(w3,5 · g(w1,3 · a1 + w2,3 · a2) + w4,5 · g(w1,4 · a1 + w2,4 · a2))

 Idea: Adjusting weights changes the function: do learning this way!

Michael Kohlhase: Artificial Intelligence 2 959 2025-02-06
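The view of a feed-forward network as a parameterized nonlinear function can be spelled out directly in code. The short Python sketch below (my own; the weights are arbitrary example values) computes a5 for network b) exactly as in the formula above, using a logistic activation function g.

import math

def g(x):
    return 1 / (1 + math.exp(-x))   # logistic activation

def a5(a1, a2, w):
    a3 = g(w["1,3"] * a1 + w["2,3"] * a2)
    a4 = g(w["1,4"] * a1 + w["2,4"] * a2)
    return g(w["3,5"] * a3 + w["4,5"] * a4)

w = {"1,3": 0.5, "2,3": -0.3, "1,4": 0.8, "2,4": 0.1, "3,5": 1.2, "4,5": -0.7}
print(a5(1.0, 0.0, w))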

Expressiveness of Perceptrons
 Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
 Can represent AND, OR, NOT, majority, etc., but not XOR (and thus no adders)
 Represents a linear separator in input space:
∑j wj xj > 0    or    W · x > 0

[Figure: decision boundaries in input space: (a) x1 and x2 and (b) x1 or x2 are linearly separable; (c) x1 xor x2 is not.]

 Minsky & Papert (1969) pricked the first neural network balloon!

Michael Kohlhase: Artificial Intelligence 2 960 2025-02-06

Perceptron Learning
For learning, we update the weights using gradient descent based on the generaliza-
tion loss function.
Let e.g. L(w) = (y − hw (x))2 (the squared error loss).
We compute the gradient:
∂L(w)/∂wj,k = 2 · (yk − hw(x)k) · ∂(yk − hw(x)k)/∂wj,k = 2 · (yk − hw(x)k) · ∂/∂wj,k (yk − g(∑_{j=0}^{n} wj,k · xj))
            = −2 · (yk − hw(x)k) · g′(ink) · xj

; Replacing the constant factor −2 by a learning rate parameter α we get the


update rule:

wj,k ← wj,k + α · (yk − hw (x)k ) · g ′ (ink ) · xj

Michael Kohlhase: Artificial Intelligence 2 961 2025-02-06
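A small Python sketch of this update rule for a single sigmoid output unit (all names and the learning rate are my own choices; OR is used as training data just for illustration):

import math

def g(x): return 1 / (1 + math.exp(-x))
def g_prime(x): return g(x) * (1 - g(x))

def update(w, x, y, alpha=0.5):
    # one gradient step on the squared error loss for one example (x, y)
    x = [1.0] + list(x)                            # x0 = 1 carries the bias weight w0
    in_ = sum(wj * xj for wj, xj in zip(w, x))
    return [wj + alpha * (y - g(in_)) * g_prime(in_) * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0, 0.0]
for _ in range(2000):                              # learn OR from its four examples
    for x, y in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]:
        w = update(w, x, y)
print([round(g(w[0] + w[1] * x1 + w[2] * x2), 2) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# the outputs approach [0, 1, 1, 1]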

Perceptron learning contd.


The perceptron learning rule converges to a consistent function – for any linearly
separable data set
[Figure: learning curves (proportion correct on the test set vs. training set size) for the majority function (left) and the restaurant data (right), comparing the perceptron with decision tree learning.]

Perceptron learns the majority function easily, where DTL is hopeless.


Conversely, DTL learns the restaurant function easily, where a perceptron is hopeless.
(not representable)

Michael Kohlhase: Artificial Intelligence 2 962 2025-02-06

Multilayer perceptrons
 Definition 26.9.13. In multilayer perceptrons (MLPs), layers are usually fully
connected; the numbers of hidden units are typically chosen by hand.

[Diagram: a multilayer perceptron with an input layer (units ak), a hidden layer (units aj), and an output layer (units ai); subsequent layers are fully connected by weights wk,j and wj,i.]

 Definition 26.9.14. Some MLPs have residual connections, i.e. connections that
skip layers.

Michael Kohlhase: Artificial Intelligence 2 963 2025-02-06

Expressiveness of MLPs
 All continuous functions w/ 2 layers, all functions w/ 3 layers.

[Figure: outputs hW(x1, x2) of small networks of sigmoid units: two opposite-facing soft thresholds combine into a ridge (left); two perpendicular ridges combine into a bump (right).]

 Combine two opposite-facing threshold functions to make a ridge.


 Combine two perpendicular ridges to make a bump.
 Add bumps of various sizes and locations to fit any surface.

 Proof requires exponentially many hidden units. (cf. DTL proof)

Michael Kohlhase: Artificial Intelligence 2 964 2025-02-06

Learning in Multilayer Networks


Note: The output layer of a multilayer neural network is a single-layer perceptron
whose input is the output of the last hidden layer.
; We can use the perceptron learning rule to update the weights of the output layer;
e.g. for a squared error loss function: wj,k ← wj,k + α · (yk − hw (x)k ) · g ′ (ink ) · aj
What about the hidden layers?
Idea: The hidden node j is “responsible” for some fraction of the error proportional to
the weight wj,k .
; Back-propagate the error ∆k = (yk − hw (x)k ) · g′(ink ) from node k in the output
layer to the hidden node j.
Let’s justify this:

∂L(w)k/∂wi,j = −2 · (yk − hw(x)k) · g′(ink) · ∂ink/∂wi,j      (as before; abbreviate ∆k := (yk − hw(x)k) · g′(ink))
             = −2 · ∆k · ∂(∑ℓ wℓ,k · aℓ)/∂wi,j = −2 · ∆k · wj,k · ∂aj/∂wi,j = −2 · ∆k · wj,k · ∂g(inj)/∂wi,j
             = −2 · ∆k · wj,k · g′(inj) · ai                   (abbreviate ∆j,k := ∆k · wj,k · g′(inj))

Michael Kohlhase: Artificial Intelligence 2 965 2025-02-06

Learning in Multilayer Networks (Hidden Layers)


∂L(w)k/∂wi,j = −2 · ∆k · wj,k · g′(inj) · ai      (with ∆j,k := ∆k · wj,k · g′(inj))

Idea: The total “error” of the hidden node j is the sum of the errors ∆k of all the nodes k
in the next layer that j is connected to.
Definition 26.9.15. The back-propagation rule for hidden nodes of a multilayer per-
ceptron is ∆j ← g′(inj) · (∑i wj,i · ∆i), and the update rule for weights in a hidden layer
is wk,j ← wk,j + α · ak · ∆j .

Remark: Most neuroscientists deny that back-propagation occurs in the brain.


The back-propagation process can be summarized as follows:
1. Compute the ∆ values for the output units, using the observed error.
2. Starting with output layer, repeat the following for each layer in the network, until
the earliest hidden layer is reached:

(a) Propagate the ∆ values back to the previous (hidden) layer.


(b) Update the weights between the two layers.

Michael Kohlhase: Artificial Intelligence 2 966 2025-02-06

Backprogagation Learning Algorithm


 Definition 26.9.16. The back-propagation learning algorithm is given the following
pseudocode
function BACK−PROP−LEARNING(examples,network) returns a neural network
inputs: examples, a set of examples, each with input vector x and output vector y
network, a multilayer network with L layers, weights wi,j , activation function g
local variables: ∆, a vector of errors, indexed by network node
foreach weight wi,j in network do
wi,j := a small random number
repeat
foreach example (x, y) in examples do
/∗ Propagate the inputs forward to compute the outputs ∗/
foreach node i in the input layer do ai := xi
for l = 2 to L do
foreach node j in layer l do
inj := ∑i wi,j ai
aj := g(inj)
/∗ Propagate deltas backward from output layer to input layer ∗/
foreach node j in the output layer do ∆[j] := g ′ (inj ) · (yj − aj )
for l = L − 1 to 1 do
foreach node i in layer l do ∆[i] := g′(ini) · (∑j wi,j ∆[j])

/∗ Update every weight in network using deltas ∗/


foreach weight wi,j in network do wi,j := wi,j + α · ai · ∆[j]
until some stopping criterion is satisfied
return network

Michael Kohlhase: Artificial Intelligence 2 967 2025-02-06
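The pseudocode translates quite directly into Python. The sketch below (my own illustration, not code from the notes) implements back-propagation for a single hidden layer of sigmoid units and uses XOR as a test case; the number of hidden units, the learning rate, and the epoch count are arbitrary choices and may need tuning.

import math, random

def g(x): return 1 / (1 + math.exp(-x))
def gp(x): return g(x) * (1 - g(x))

def train(examples, n_in, n_hid, n_out, alpha=0.5, epochs=10000):
    random.seed(1)
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
    W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, y in examples:
            a0 = [1.0] + list(x)                                      # input activations (+ bias)
            in1 = [sum(w * a for w, a in zip(ws, a0)) for ws in W1]
            a1 = [1.0] + [g(v) for v in in1]                          # hidden activations (+ bias)
            in2 = [sum(w * a for w, a in zip(ws, a1)) for ws in W2]
            a2 = [g(v) for v in in2]                                  # output activations
            d2 = [gp(in2[k]) * (y[k] - a2[k]) for k in range(n_out)]  # output deltas
            d1 = [gp(in1[j]) * sum(W2[k][j + 1] * d2[k] for k in range(n_out))
                  for j in range(n_hid)]                              # back-propagated deltas
            for k in range(n_out):
                for j in range(n_hid + 1):
                    W2[k][j] += alpha * a1[j] * d2[k]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    W1[j][i] += alpha * a0[i] * d1[j]
    return W1, W2

def predict(x, W1, W2):
    a0 = [1.0] + list(x)
    a1 = [1.0] + [g(sum(w * a for w, a in zip(ws, a0))) for ws in W1]
    return [g(sum(w * a for w, a in zip(ws, a1))) for ws in W2]

xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
W1, W2 = train(xor, 2, 4, 1)
print([round(predict(x, W1, W2)[0], 2) for x, _ in xor])   # close to [0, 1, 1, 0] for most seeds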

Back-Propagation – Properties
 Sum gradient updates for all examples in some “batch” and apply gradient descent.

 Learning curve for 100 restaurant examples: finds exact fit.


[Figure: total error on the training set as a function of the number of epochs (0–400).]

 Typical problems: slow convergence, local minima.

Michael Kohlhase: Artificial Intelligence 2 968 2025-02-06

Back-Propagation – Properties (contd.)


 Example 26.9.17. Learning curve for MLPs with 4 hidden units:
[Figure: proportion correct on the test set vs. training set size (0–100), comparing the multilayer network with decision tree learning on the restaurant data.]

 Experience shows: MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily.
 This makes MLPs ineligible for some tasks, such as credit card and loan approvals,
where law requires clear unbiased criteria.

Michael Kohlhase: Artificial Intelligence 2 969 2025-02-06



Handwritten digit recognition

 400–300–10 unit MLP = 1.6% error

 LeNet: 768–192–30–10 unit MLP = 0.9% error

 Current best (kernel machines, vision algorithms) ≈ 0.6% error

Michael Kohlhase: Artificial Intelligence 2 970 2025-02-06

Summary
 Neural networks can be extremely powerful (hypothesis space intractably complex)
 Perceptrons (one-layer networks) insufficiently expressive for most applications

 Multi-layer networks are sufficiently expressive; can be trained by gradient descent,


i.e., error back-propagation
 Many applications: speech, driving, handwriting, fraud detection, etc.

 Engineering, cognitive modelling, and neural system modelling subfields have largely
diverged
 Drawbacks: take long to converge, require large amounts of data, and are difficult
to interpret (Why is the output what it is?)

Michael Kohlhase: Artificial Intelligence 2 971 2025-02-06

XKCD on Machine Learning


 A Skeptic's View: see https://xkcd.com/1838/

Michael Kohlhase: Artificial Intelligence 2 972 2025-02-06

Summary of Inductive Learning


 Learning needed for unknown environments, lazy designers.
 Learning agent = performance element + learning element.
 Learning method depends on type of performance element, available feedback, type
of component to be improved, and its representation.

 For supervised learning, the aim is to find a simple hypothesis that is approximately
consistent with training examples
 Decision tree learning using information gain.
 Learning performance = prediction accuracy measured on test set

 PAC learning as a general theory of learning boundaries.


 Linear regression (hypothesis space of univariate linear functions).
 Linear classification by linear regression with hard and soft thresholds.

Michael Kohlhase: Artificial Intelligence 2 973 2025-02-06


Chapter 27

Statistical Learning

In ?? we learned how to reason in non-deterministic, partially observable environments by quantifying uncertainty and reasoning with it. The key resource there were probabilistic models and their efficient representations: Bayesian networks.
In ?? we assumed that these models were given, perhaps designed by the agent developer. We will now learn how these models can – at least partially – be learned from observing the environment.

Statistical Learning: Outline


 Definition 27.0.1. Statistical learning has the goal to learn the correct probability
distribution of a random variable.
 Example 27.0.2.
 Bayesian learning, i.e. learning probabilistic models (e.g. the CPTs in Bayesian
networks) from observations.
 Maximum a posteriori and maximum likelihood learning
 Bayesian network learning
 ML Parameter Learning with Complete Data
 Naive Bayes Models/Learning

Michael Kohlhase: Artificial Intelligence 2 974 2025-02-06

27.1 Full Bayesian Learning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30388.

The Candy Flavors Example


 Example 27.1.1. Suppose there are five kinds of bags of candies:
1. 10% are h1 : 100% cherry candies
2. 20% are h2 : 75% cherry candies + 25% lime candies
3. 40% are h3 : 50% cherry candies + 50% lime candies
4. 20% are h4 : 25% cherry candies + 75% lime candies


5. 10% are h5 : 100% lime candies


Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?

Note: Every hypothesis is itself a probability distribution over the random variable
“flavour”.
Michael Kohlhase: Artificial Intelligence 2 975 2025-02-06

Candy Flavors: Posterior probability of hypotheses


 Example 27.1.2. Let di be the event that the ith drawn candy is a lime candy.
The probability of hypothesis hi after n limes are observed (=̂ d1:n =: d) is
[Figure: the posterior probabilities P(h1 |d), . . ., P(h5 |d) as functions of the number of observations in d (0–10).]

if the observations are IID, i.e. P(d|hi) = ∏j P(dj|hi), and the hypothesis prior is
as advertised. (e.g. P(d|h3) = 0.5^10 ≈ 0.1%)
The posterior probabilities start with the hypothesis priors, change with data.

Michael Kohlhase: Artificial Intelligence 2 976 2025-02-06

Candy Flavors: Prediction Probability


 We calculate the probability that the (n + 1)-st candy is lime:

P(dn+1 = lime|d) = ∑i P(dn+1 = lime|hi) · P(hi|d)

[Figure: the probability P(dn+1 = lime|d) that the next candy is lime, as a function of the number of observations in d (0–10).]

; we compute the expected value of the probability of the next candy being lime
over all hypotheses (i.e. distributions).
; “meta-distribution”
Michael Kohlhase: Artificial Intelligence 2 977 2025-02-06
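The candy example can be recomputed in a few lines of Python (my own sketch; the priors and the per-hypothesis lime probabilities are the ones from Example 27.1.1):

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | h_i)

def posterior(n):
    # P(h_i | d) after observing n lime candies in a row
    unnorm = [p * (l ** n) for p, l in zip(priors, p_lime)]
    alpha = 1 / sum(unnorm)
    return [alpha * u for u in unnorm]

for n in range(11):
    post = posterior(n)
    pred = sum(l * p for l, p in zip(p_lime, post))   # P(d_{n+1} = lime | d)
    print(n, [round(p, 3) for p in post], round(pred, 3))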

Full Bayesian Learning


 Idea: View learning as Bayesian updating of a probability distribution over the
hypothesis space:
 H is the hypothesis variable with values h1 , h2 , . . . and prior P(H).
 jth observation dj gives the outcome of random variable Dj .
 d := d1 , . . . , dN constitutes the training set of an inductive learning problem.

 Definition 27.1.3. Bayesian learning calculates the probability of each hypothesis


and makes predictions based on this:
 Given the data so far, each hypothesis has a posterior probability:

P (hi |d) = α(P (d|hi ) · P (hi ))

where P (d|hi ) is called the likelihood (of the data under each hypothesis) and
P (hi ) the hypothesis prior.
 Bayesian predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = ∑i P(X|d, hi) · P(hi|d) = ∑i P(X|hi) · P(hi|d)

 Observation: No need to pick one best-guess hypothesis for Bayesian predictions!


(and that is all an agent cares about)

Michael Kohlhase: Artificial Intelligence 2 978 2025-02-06

Full Bayesian Learning: Properties


 Observation: The Bayesian prediction eventually agrees with the true hypothesis.
 The probability of generating “uncharacteristic” data indefinitely is vanishingly
small.

 Proof sketch: Argument analogous to PAC learning.


 Problem: Summing over the hypothesis space is often intractable.
 Example 27.1.4. There are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions
of 6 arguments.
 Solution: Approximate the learning methods to simplify.

Michael Kohlhase: Artificial Intelligence 2 979 2025-02-06

27.2 Approximations of Bayesian Learning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30389.

Maximum A Posteriori (MAP) Approximation


 Goal: Get rid of summation over the space of all hypotheses in predictions.
 Idea: Make predictions wrt. the “most probable hypothesis”!

 Definition 27.2.1. For maximum a posteriori learning (MAP learning) choose the
MAP hypothesis hMAP that maximizes P (hi |d).
I.e., maximize P (d|hi ) · P (hi ) or (even better) log2 (P (d|hi )) + log2 (P (hi )).
 Predictions made according to a MAP hypothesis hMAP are approximately Bayesian
to the extent that P(X|d) ≈ P(X|hMAP ).

 Example 27.2.2. In our candy example, hMAP = h5 after three limes in a row
 a MAP learner then predicts that candy 4 is lime with probability 1.
 compare with Bayesian prediction of 0.8. (see prediction curves above)
 As more data arrive, the MAP and Bayesian predictions become closer, because the
competitors to the MAP hypothesis become less and less probable.
 For deterministic hypotheses, P (d|hi ) is 1 if consistent, 0 otherwise
; MAP = simplest consistent hypothesis. (cf. science)
 Remark: Finding MAP hypotheses is often much easier than Bayesian learning,
because it requires solving an optimization problem instead of a large summation
(or integration) problem.

Michael Kohlhase: Artificial Intelligence 2 980 2025-02-06

Digression From MAP-learning to MDL-learning


 Idea: Reinterpret the log terms log2 (P (d|hi )) + log2 (P (hi )) in MAP learning:
 Maximizing P(d|hi) · P(hi) =̂ minimizing −log2(P(d|hi)) − log2(P(hi)).
 −log2(P(d|hi)) =̂ number of bits to encode data given hypothesis.
 −log2(P(hi)) =̂ additional bits to encode hypothesis. (??)

 Indeed if the hypothesis predicts the data exactly – e.g. h5 in the candy example – then
log2 (1) = 0 ; preferred hypothesis.
 This is more directly modeled by the following approximation to Bayesian learning:

 Definition 27.2.3. In minimum description length learning (MDL learning) the


MDL hypothesis hMDL minimizes the information entropy of the hypothesis likeli-
hood.

Michael Kohlhase: Artificial Intelligence 2 981 2025-02-06

Maximum Likelihood (ML) approximation

 Observation: For large data sets, the prior becomes irrelevant. (we might not
trust it anyways)
 Idea: Use this to simplify learning.
 Definition 27.2.4. Maximum likelihood learning (ML learning): choose the ML
hypothesis hML maximizing P (d|hi ). (simply get the best fit to the data)
 Remark: ML learning =̂ MAP learning for a uniform prior. (reasonable if all
 ML learning is the “standard” (non Bayesian) statistical learning method.

Michael Kohlhase: Artificial Intelligence 2 982 2025-02-06

27.3 Parameter Learning for Bayesian Networks


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30390.

ML Parameter Learning in Bayesian Nets


Bayesian networks (with continuous random variables) often feature nodes with a
particular parametric distribution D(θ) (e.g. normal, binomial, Poisson, etc.).
How do we learn the parameters of these distributions from data?
Example 27.3.1. We get a candy bag from a new manufacturer; what is the fraction
θ of cherry candies? (Note: We use the probability itself as the parameter. This is
somewhat boring, but simple.)

[Bayesian network: a single node Flavor with P(F = cherry) = θ]

New Facet: Any θ is possible: continuum of hypotheses hθ


θ is a parameter for this simple (binomial) family of models; We call hθ a MLP hypothesis
and the process of learning θ MLP learning.
Example 27.3.2. Suppose we unwrap N candies, c cherries and ℓ = N − c limes.

These are IID observations, so the likelihood is P(d|hθ) = ∏_{j=1}^{N} P(dj|hθ) = θ^c · (1 − θ)^ℓ

Michael Kohlhase: Artificial Intelligence 2 983 2025-02-06

ML Parameter Learning in Bayes Nets


Trick: When optimizing a product, optimize the logarithm instead! (log2 is
monotone and turns products into sums)
Definition 27.3.3. The log likelihood is the binary logarithm of the likelihood. L(d|h):=log2 (P (d|h))
Example 27.3.4. Compute the log likelihood as (using ??)

L(d|hθ) = log2(P(d|hθ)) = ∑_{j=1}^{N} log2(P(dj|hθ)) = c·log2(θ) + ℓ·log2(1 − θ)

Maximize this w.r.t. θ


∂/∂θ (L(d|hθ)) = c/θ − ℓ/(1 − θ) = 0 ; θ = c/(c + ℓ) = c/N
In English: hθ asserts that the actual proportion of cherries in the bag is equal to the
observed proportion in the candies unwrapped so far! (...exactly what we should
expect!) (⇒ Generalize to more interesting parametric models later)
Warning: This causes problems with 0 counts!

Michael Kohlhase: Artificial Intelligence 2 984 2025-02-06

ML Learning for Multiple Parameters in Bayesian Networks


 Cooking Recipe:

1. Write down an expression for the likelihood of the data as a function of the
parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero

Michael Kohlhase: Artificial Intelligence 2 985 2025-02-06

Multiple Parameters Example


 Example 27.3.5. Red/green wrapper depends probabilistically on flavour:

[Bayesian network: node Flavor with P(F = cherry) = θ, and child node Wrapper with CPT
  F        P(W = red|F)
  cherry   θ1
  lime     θ2 ]

 Likelihood for, e.g., cherry candy in green wrapper:

P (F = cherry, W = green|hθ,θ1 ,θ2 )


= P (F = cherry|hθ,θ1 ,θ2 ) · P (W = green|F = cherry, hθ,θ1 ,θ2 )
= θ · (1 − θ1 )

 Observation: For N candies, rc red-wrapped cherry candies, etc., we have

P(d|hθ,θ1,θ2) = θ^c · (1 − θ)^ℓ · θ1^rc · (1 − θ1)^gc · θ2^rℓ · (1 − θ2)^gℓ

Michael Kohlhase: Artificial Intelligence 2 986 2025-02-06

Multiple Parameters Example (contd.)


 Maximize the log likelihood:

L = clog2 (θ) + ℓlog2 (1 − θ)


+ rc log2 (θ1 ) + gc log2 (1 − θ1 )
+ rℓ log2 (θ2 ) + gℓ log2 (1 − θ2 )

 Derivatives of L contain only the relevant parameter:


∂L/∂θ = c/θ − ℓ/(1 − θ) = 0 ; θ = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0 ; θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0 ; θ2 = rℓ/(rℓ + gℓ)

 Upshot: With complete data, parameters can be learned separately in Bayesian


networks.
 Remaining Problem: Have to be careful with zero values! (division by zero)

Michael Kohlhase: Artificial Intelligence 2 987 2025-02-06
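A small Python sketch of the resulting counting estimates (my own illustration; the counts are made up, the closed forms are the ones derived above):

rc, gc = 60, 40        # red-/green-wrapped cherry candies observed
rl, gl = 10, 90        # red-/green-wrapped lime candies observed
c, l = rc + gc, rl + gl
N = c + l

theta  = c / N               # ML estimate of P(F = cherry)
theta1 = rc / (rc + gc)      # ML estimate of P(W = red | cherry)
theta2 = rl / (rl + gl)      # ML estimate of P(W = red | lime)
print(theta, theta1, theta2)     # 0.5 0.6 0.1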

Example: Linear Gaussian Model


A continuous random variable Y has the linear-Gaussian distribution with respect to
a continuous random variable X, if the outcome of Y is determined by a linear function
of the outcome of X plus Gaussian noise with fixed standard deviation σ, i.e.

P(y1 ≤ Y ≤ y2 | X = x) = ∫_{y1}^{y2} N(y; θ1·x + θ2, σ²) dy = ∫_{y1}^{y2} 1/(σ·√(2π)) · e^{−(1/2)·((y − (θ1·x + θ2))/σ)²} dy

; assuming σ given, we have two parameters θ1 and θ2 ; the hypothesis space is R × R


[Figure: the conditional density P(y|x) of a linear Gaussian model over x and y (left); data points generated from such a model (right).]

Michael Kohlhase: Artificial Intelligence 2 988 2025-02-06

Example: Linear Gaussian Model


P(y1 ≤ Y ≤ y2 | X = x) = ∫_{y1}^{y2} 1/(σ·√(2π)) · e^{−(1/2)·((y − (θ1·x + θ2))/σ)²} dy

; Given observations X = x, Y = y, maximize ∏_{i=1}^{N} 1/(σ·√(2π)) · e^{−(yi − (θ1·xi + θ2))²/(2σ²)} w.r.t.
θ1, θ2. (we can ignore the integral for this)
Using the log likelihood, this is equivalent to minimizing ∑_{i=1}^{N} (yi − (θ1·xi + θ2))²
; minimizing the sum of squared errors gives the ML solution

Michael Kohlhase: Artificial Intelligence 2 989 2025-02-06
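The equivalence can be demonstrated in a few lines of Python (my own sketch; the data are synthetic and the closed-form least-squares solution stands in for any other optimizer):

import random

random.seed(0)
theta1_true, theta2_true, sigma = 2.0, 1.0, 0.3
xs = [i / 50 for i in range(50)]
ys = [theta1_true * x + theta2_true + random.gauss(0, sigma) for x in xs]

# closed-form least-squares solution for y = theta1*x + theta2
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
theta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
theta2 = my - theta1 * mx
print(theta1, theta2)    # close to the true parameters 2.0 and 1.0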

Statistical Learning: Summary


 Full Bayesian learning gives best possible predictions but is intractable.
 MAP learning balances complexity with accuracy on training data.

 Maximum likelihood learning assumes uniform prior, OK for large data sets:
1. Choose a parameterized family of models to describe the data.
; requires substantial insight and sometimes new models.
2. Write down the likelihood of the data as a function of the parameters.
; may require summing over hidden variables, i.e., inference.
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero.
; may be hard/impossible; modern optimization techniques help.
 Naive Bayes models as a fall-back solution for machine learning:

 conditional independence of all attributes as simplifying assumption.

Michael Kohlhase: Artificial Intelligence 2 990 2025-02-06


Chapter 28

Reinforcement Learning

28.1 Reinforcement Learning: Introduction & Motivation


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30399.

Unsupervised Learning
 So far: We have studied “learning from examples”. (functions, logical theories,
probability models)

 Now: How can agents learn “what to do” in the absence of labeled examples of
“what to do”. We call this problem unsupervised learning.
 Example 28.1.1 (Playing Chess). Learn transition models for own moves and
maybe predict opponent’s moves.
 Problem: The agent needs to have some feedback about what is good/bad
; cannot decide “what to do” otherwise. (recall: external performance standard
for learning agents)
 Example 28.1.2. The ultimate feedback in chess is whether you win, lose, or draw.
 Definition 28.1.3. We call a learning situation where there are no labeled examples
unsupervised learning and the feedback involved a reward or reinforcement.
 Example 28.1.4. In soccer, there are intermediate reinforcements in the shape of
goals, penalties, . . .

Michael Kohlhase: Artificial Intelligence 2 991 2025-02-06

Reinforcement Learning as Policy Learning


 Definition 28.1.5. Reinforcement learning is a type of unsupervised learning where
an agent learns how to behave in an environment by performing actions and seeing
the results.
 Recap: In ?? we introduced rewards as parts of MDPs (Markov decision processes)
to define optimal policies.


 an optimal policy maximizes the expected total reward.


 Idea: The task of reinforcement learning is to use observed rewards to come up
with an optimal policy.

 In MDPs, the agent has total knowledge about the environment and the reward
function, in reinforcement learning we do not assume this. (;
POMDPs+reward-learning)
 Example 28.1.6. You play a game without knowing the rules, and at some time
the opponent shouts you lose!

Michael Kohlhase: Artificial Intelligence 2 992 2025-02-06

Scope and Forms of Reinforcement Learning


 Reinforcement Learning solves all of AI: An agent is placed in an environment
and must learn to behave successfully therein.
 KISS: We will only look at simple environments and simple agent designs:
 A utility-based agent learns a utility function on states and uses it to select
actions that maximize the expected outcome utility. (passive learning)
 A Q-learning agent learns an action-utility function, or Q-function, giving the
expected utility of taking a given action in a given state. (active learning)
 A reflex agent learns a policy that maps directly from states to actions.

Michael Kohlhase: Artificial Intelligence 2 993 2025-02-06

28.2 Passive Learning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/30400.

Passive Learning
 Definition 28.2.1 (To keep things simple). Agent uses a state-based represen-
tation in a fully observable environment:
 In passive learning, the agent’s policy π is fixed: in state s, it always executes
the action π(s).
 Its goal is simply to learn how good the policy is – that is, to learn the utility
function U π (s).
 The passive learning task is similar to the policy evaluation task (part of the policy
iteration algorithm) but the agent does not know

 the transition model P (s′ |s, a), which specifies the probability of reaching state
s′ from state s after doing action a,
 the reward function R(s), which specifies the reward for each state.
Michael Kohlhase: Artificial Intelligence 2 994 2025-02-06

Passive Learning by Example

 Example 28.2.2 (Passive Learning). We use the 4 × 3 world introduced above

[Figure: the 4 × 3 world with the optimal policy π (left) and the resulting utilities of the
states (right), calculated with γ = 1 and R(s) = −.04 for nonterminal states.]

 The agent executes a set of trials in the environment using its policy π.
 In each trial, the agent starts in state (1,1) and experiences a sequence of state
transitions until it reaches one of the terminal states, (4,2) or (4,3).
 Its percepts supply both the current state and the reward received in that state.

Michael Kohlhase: Artificial Intelligence 2 995 2025-02-06

Passive Learning by Example

 Example 28.2.3. Typical trials might look like this:
1. (1, 1)−.04 ; (1, 2)−.04 ; (1, 3)−.04 ; (1, 2)−.04 ; (1, 3)−.04 ; (2, 3)−.04 ;
(3, 3)−.04 ; (4, 3)+1
2. (1, 1)−.04 ; (1, 2)−.04 ; (1, 3)−.04 ; (2, 3)−.04 ; (3, 3)−.04 ; (3, 2)−.04 ;
(3, 3)−.04 ; (4, 3)+1
3. (1, 1)−.04 ; (2, 1)−.04 ; (3, 1)−.04 ; (3, 2)−.04 ; (4, 2)−1 .

 Definition 28.2.4. The utility is defined to be the expected sum of (discounted)
rewards obtained if policy π is followed:

U^π(s) := E[∑_{t=0}^∞ γ^t · R(St)]

where R(s) is the reward for a state, St (a random variable) is the state reached at
time t when executing policy π, and S0 = s. (for the 4 × 3 world we take the discount
factor γ = 1)
Michael Kohlhase: Artificial Intelligence 2 996 2025-02-06

Direct Utility Estimation


 A simple method for direct utility estimation was invented in the late 1950s in the
area of adaptive control theory.
 Definition 28.2.5. The utility of a state is the expected total reward from that
state onward (called the expected reward to go).

 Idea: Each trial provides a sample of the reward to go for each state visited.
 Example 28.2.6. The first trial in ?? provides a sample total reward of 0.72 for
state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80 and 0.88
for (1,3), . . .

 Definition 28.2.7. The direct utility estimation algorithm cycles over trials, cal-
culates the reward to go for each state, and updates the estimated utility for that
state by keeping the running average for that for each state in a table.
 Observation 28.2.8. In the limit, the sample average will converge to the true
expectation (utility) from ??.

 Remark 28.2.9. Direct utility estimation is just supervised learning, where each
example has the state as input and the observed reward to go as output.
 Upshot: We have reduced reinforcement learning to an inductive learning problem.

Michael Kohlhase: Artificial Intelligence 2 997 2025-02-06
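A minimal Python sketch of direct utility estimation on the first trial from Example 28.2.3 (my own illustration; it reproduces the sample of 0.72 for state (1,1) mentioned in Example 28.2.6):

from collections import defaultdict

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]

sums, counts = defaultdict(float), defaultdict(int)

def update(trial):
    # add one reward-to-go sample per state visit and keep running averages
    for i, (s, _) in enumerate(trial):
        sums[s] += sum(r for _, r in trial[i:])
        counts[s] += 1

update(trial)
U = {s: sums[s] / counts[s] for s in sums}
print(round(U[(1, 1)], 2))    # 0.72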

Adaptive Dynamic Programming


 Problem: The utilities of states are not independent in direct utility estimation!

 The utility of each state equals its own reward plus the expected utility of its
successor states.
 So: The utility values obey a Bellman equation for a fixed policy π.
U^π(s) = R(s) + γ · (∑_{s′} P(s′|s, π(s)) · U^π(s′))

 Observation 28.2.10. By ignoring the connections between states, direct utility


estimation misses opportunities for learning.

 Example 28.2.11. Recall trial 2 in ??; state (3,2) is new.

2. (1, 1)−.04 ; (1, 2)−.04 ; (1, 3)−.04 ; (2, 3)−.04 ; (3, 3)−.04 ; (3, 2)−.04 ;
(3, 3)−.04 ; (4, 3)+1
 The next transition reaches (3,3). (known high utility from trial 1)
 Bellman equation: ; high U^π(3, 2) because (3, 2)−.04 ; (3, 3)

 But direct utility estimation learns nothing until the end of the trial.
 Intuition: Direct utility estimation searches for U in a hypothesis space that is too
large ⇝ many functions that violate the Bellman equations.

 Thus the algorithm often converges very slowly.

Michael Kohlhase: Artificial Intelligence 2 998 2025-02-06

Adaptive Dynamic Programming


 Idea: Take advantage of the constraints among the utilities of states by
 learning the transition model that connects them,
 solving the corresponding Markov decision process using a dynamic programming
method.
This means plugging the learned transition model P(s′ |s, π(s)) and the observed
rewards R(s) into the Bellman equations to calculate the utilities of the states.

 As above: These equations are linear (no maximization involved). (solve with
any linear algebra package)
 Observation 28.2.12. Learning the model itself is easy, because the environment
is fully observable.

 Corollary 28.2.13. We have a supervised learning task where the input is a


state–action pair and the output is the resulting state.
 In the simplest case, we can represent the transition model as a table of proba-
bilities.
 Count how often each action outcome occurs and estimate the transition prob-
ability P (s′ |s, a) from the frequency with which s′ is reached by action a in
s.
 Example 28.2.14. In the 3 trials from ??, Right is executed 3 times in (1, 3) and
2 times the result is (2, 3), so P ((2, 3)|(1, 3), Right) is estimated to be 2/3.

Michael Kohlhase: Artificial Intelligence 2 999 2025-02-06
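A tiny Python sketch of the counting estimate from Example 28.2.14 (my own illustration; the third observed outcome is invented so that 2 of the 3 tries reach (2,3)):

from collections import defaultdict

N_sa = defaultdict(int)       # how often action a was executed in state s
N_s2_sa = defaultdict(int)    # how often that led to outcome s'

observed = [(((1, 3), "Right"), (2, 3)),
            (((1, 3), "Right"), (2, 3)),
            (((1, 3), "Right"), (1, 3))]     # hypothetical failed move

for (s, a), s2 in observed:
    N_sa[(s, a)] += 1
    N_s2_sa[(s2, s, a)] += 1

P = N_s2_sa[((2, 3), (1, 3), "Right")] / N_sa[((1, 3), "Right")]
print(P)    # 0.666..., i.e. the estimate 2/3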

Passive ADP Learning Algorithm


 Definition 28.2.15. The passive ADP algorithm is given by
function PASSIVE−ADP−AGENT(percept) returns an action
inputs: percept, a percept indicating the current state s′ and reward signal r′
persistent: π a fixed policy
mdp, an MDP with model P , rewards R, discount γ
U , a table of utilities, initially empty
Nsa , a table of frequencies for state−action pairs, initially zero
Ns′ |sa , a table of outcome frequencies given state−action pairs, initially zero
s, a, the previous state and action, initially null
if s′ is new then U [s′ ] := r′ ; R[s′ ] := r′
if s is not null then
increment Nsa [s, a] and Ns′ |sa [s′ , s, a]
for each t such that Ns′|sa [t, s, a] is nonzero do
P (t|s, a) :=Ns′ |sa [t, s, a]/Nsa [s, a]

U := POLICY−EVALUATION(π,mdp)
if s′ .TERMINAL? then s, a := null else s, a := s′ , π[s′ ]
return a
POLICY−EVALUATION computes U^π(s) := E[∑_{t=0}^∞ γ^t · R(St)] in an MDP.

Michael Kohlhase: Artificial Intelligence 2 1000 2025-02-06
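The POLICY-EVALUATION step can be implemented by solving the linear Bellman system directly. The Python sketch below is my own illustration: the three-state transition model and rewards are made up, and γ = 0.9 is used so that the linear system is well-posed without special handling of terminal states.

import numpy as np

R = np.array([-0.04, -0.04, 1.0])      # rewards for the three states
gamma = 0.9
# P[s, s2] = P(s2 | s, pi(s)) under the fixed policy pi (made-up numbers)
P = np.array([[0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])

# U = R + gamma * P U is linear, so solve (I - gamma*P) U = R
U = np.linalg.solve(np.eye(3) - gamma * P, R)
print(U)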

Passive ADP Convergence


 Example 28.2.16 (Passive ADP learning curves for the 4x3 world). Given the
optimal policy from ??

[Figure: the utility estimates for selected states over the number of trials (left); the error in the estimate of U(1, 1), averaged over 20 runs of 100 trials each (right).]

Note the large changes occurring around the 78th trial – this is the first time that
the agent falls into the -1 terminal state at (4,2).

 Observation 28.2.17. The ADP agent is limited only by its ability to learn the
transition model. (intractable for large state spaces)

 Example 28.2.18. In backgammon, roughly 10^50 equations in 10^50 unknowns.

 Idea: Use this as a baseline to compare passive learning algorithms

Michael Kohlhase: Artificial Intelligence 2 1001 2025-02-06

28.3 Active Reinforcement Learning

Active Reinforcement Learning


 Recap: A passive learning agent has a fixed policy that determines its behavior.
 An active agent must also decide what actions to take.
 Idea: Adapt the passive ADP algorithm to handle this new freedom.
 learn a complete model with outcome probabilities for all actions, rather than
just the model for the fixed policy. (use PASSIVE-ADP-AGENT)
 choose actions; the utilities to learn are defined by the optimal policy, they obey

the Bellman equation:


U(s) = R(s) + γ · max_{a∈A(s)} (∑_{s′} P(s′|s, a) · U(s′))

 solve with value/policy iteration techniques from ??.


 choose a good action, e.g.
 by one-step lookahead to maximize expected utility, or
 if agent uses policy iteration and has optimal policy, execute that.
This agent/algorithm is greedy, since it only optimizes the next step!

Michael Kohlhase: Artificial Intelligence 2 1002 2025-02-06

Greedy ADP Learning (Evaluation)


 Example 28.3.1 (Greedy ADP learning curves for the 4x3 world).

[Figure: average error and policy loss over the number of trials (left); the suboptimal policy that the greedy agent converges to (right).]

The agent follows the optimal policy for the learned model at each step.
 It does not learn the true utilities or the true optimal policy!
 instead, in the 39th trial, it finds a policy that reaches the +1 reward along the
lower route via (2,1), (3,1), (3,2), and (3,3).
 After experimenting with minor variations, from the 276th trial onward it sticks
to that policy, never learning the utilities of the other states and never finding
the optimal route via (1,2), (1,3), and (2,3).

Michael Kohlhase: Artificial Intelligence 2 1003 2025-02-06

Exploration in Active Reinforcement Learning


 Observation 28.3.2. Greedy active ADP learning agents very seldom converge
to the optimal solution.

 The learned model is not the same as the true environment,


 What is optimal in the learned model need not be in the true environment.
 What can be done? The agent does not know the true environment.

 Idea: Actions do more than provide rewards according to the learned model
 they also contribute to learning the true model by affecting the percepts received.
 By improving the model, the agent may reap greater rewards in the future.

 Observation 28.3.3. An agent must make a tradeoff between


 exploitation to maximize its reward as reflected in its current utility estimates
and
 exploration to maximize its long term well-being.

Pure exploitation risks getting stuck in a rut. Pure exploration to improve one’s
knowledge is of no use if one never puts that knowledge into practice.
 Compare with the information gathering agent from ??.

Michael Kohlhase: Artificial Intelligence 2 1004 2025-02-06


Chapter 29

Knowledge in Learning

29.1 Logical Formulations of Learning


Video Nuggets covering this section can be found at https://fau.tv/clip/id/30392 and
https://fau.tv/clip/id/30393.

Knowledge in Learning: Motivation


 Recap: Learning from examples. (last chapter)

 Idea: Construct a function with the input/output behavior observed in data.


 Method: Search for suitable functions in the hypothesis space. (e.g. decision
trees)
 Observation 29.1.1. Every learning task begins from zero. (except for the choice
of hypothesis space)

 Problem: We have to forget everything before we can learn something new.


 Idea: Utilize prior knowledge about the world! (represented e.g. in logic)

Michael Kohlhase: Artificial Intelligence 2 1005 2025-02-06

A logical Formulation of Learning


 Recall: Examples are composed of descriptions (of the input sample) and classi-
fications.
 Idea: Represent examples and hypotheses as logical formulae.
 Example 29.1.2. For attribute-based representations, we can use PL1 : we use
predicate constants for Boolean attributes and classification and function constants
for the other attributes.

 Definition 29.1.3. Logic based inductive learning tries to learn an hypothesis h


that explains the classifications of the examples given their description, i.e. h, D ⊨ C
(the explanation constraint), where


 D is the conjunction of the descriptions, and


 C the conjunction of their classifications.
 Idea: We solve the explanation constraint h, D ⊨ C for h where h ranges over
some hypothesis space.

 Refinement: Use Occam’s razor or additional constraints to avoid h = C. (too


easy otherwise/boring; see below)

Michael Kohlhase: Artificial Intelligence 2 1006 2025-02-06

A logical Formulation of Learning (Restaurant Examples)


 Example 29.1.4 (Restaurant Example again). Descriptions are conjunctions of
literals built up from
 predicates Alt, Bar, Fri/Sat, Hun, Rain, and res
 equations about the functions Pat, Price, Type, and Est.

For instance the first example X 1 from ??, can be described as

Alt(X 1 ) ∧ ¬Bar(X 1 ) ∧ Fri/Sat(X 1 ) ∧ Hun(X 1 ) ∧ . . .

The classification is given by the goal predicate WillWait, in this case WillWait(X 1 )
or ¬WillWait(X 1 ).

Michael Kohlhase: Artificial Intelligence 2 1007 2025-02-06

A logical Formulation of Learning (Restaurant Tree)


 Example 29.1.5 (Restaurant Example again; Tree). The induced decision tree
from ??

can be represented as

∀r.WillWait(r) ⇔ Pat(r, Some)


∨ Pat(r, Full) ∧ Hun(r) ∧ Type(r, French)
∨ Pat(r, Full) ∧ Hun(r) ∧ Type(r, Thai) ∧ Fri/Sat(r)
∨ Pat(r, Full) ∧ Hun(r) ∧ Type(r, Burger)

Method: Construct a disjunction of all the paths from the root to the positive
leaves interpreted as conjunctions of the attributes on the path.
Note: The equivalence takes care of positive and negative examples.

Michael Kohlhase: Artificial Intelligence 2 1008 2025-02-06

Cumulative Development
 Example 29.1.6. Learning from very few examples using background knowledge:
1. Caveman Zog and the fish on a stick:

2. Generalizing from one Brazilian:


Upon meeting her first Brazilian – Fernando – who speaks Portugese, Sarah
 learns/generalizes that all Brazilians speak Portugese,
 but not that all Brazilians are called Fernando.

3. General rules about effectiveness of antibiotics:


When Sarah – gifted in diagnostics, but clueless in pharmacology – observes a
doctor prescribing the antibiotic Proxadone for an inflamed foot, she learns/infers
that Proxadone is effective against this ailment.
 Observation: The methods/algorithms from ?? cannot replicate this. (why?)
 Missing Piece: The background knowledge!
 Problem: To use background knowledge, need a method to obtain it. (use
learning)

 Question: How to use knowledge to learn more efficiently?


 Answer: Cumulative development: collect knowledge and use it in learning!

[Diagram: Observations together with Prior Knowledge feed into logic based inductive learning, which produces Hypotheses and, from these, Predictions.]

 Definition 29.1.7. We call the body of knowledge accumulated by (a group of)


agents their background knowledge. It acts as prior knowledge in logic based learning
processes.

Michael Kohlhase: Artificial Intelligence 2 1009 2025-02-06

Adding Background Knowledge to Learning: Overview


 Explanation based learning (EBL)
 Relevance based learning (RBL)
 Knowledge based inductive learning (KBIL)

Michael Kohlhase: Artificial Intelligence 2 1010 2025-02-06

Three Principal Modes of Inference


 Definition 29.1.8. Deduction =̂ knowledge extension
 Example 29.1.9 (D). From rains ⇒ wet_street and rains, infer wet_street.
 Definition 29.1.10. Abduction =̂ explanation
 Example 29.1.11 (A). From rains ⇒ wet_street and wet_street, infer rains.
 Definition 29.1.12. Induction =̂ learning general rules from examples
 Example 29.1.13 (I). From wet_street and rains, infer rains ⇒ wet_street.

Michael Kohlhase: Artificial Intelligence 2 1011 2025-02-06

29.2 Inductive Logic Programming

Knowledge-based Inductive Learning


 Idea: Background knowledge and new hypothesis combine to explain the examples.

 Example 29.2.1. Inferring disease D from the symptoms is not enough to explain
the prescription of medicine M .
Need a new general rule: M is effective against D (induction from example)
 Definition 29.2.2. Knowledge based inductive learning (KBIL) replaces the expla-
nation constraint by the KBIL constraint:

Background ∧ Hypothesis ∧ Descriptions ⊨ Classif ications

Michael Kohlhase: Artificial Intelligence 2 1012 2025-02-06

Inductive Logic Programming


 Definition 29.2.3. Inductive logic programming (ILP) is a logic based inductive
learning method that uses logic programming as a uniform representation for exam-
ples, background knowledge, and hypotheses.
Given an encoding of the known background knowledge and a set of examples repre-
sented as a logical knowledge base of facts, an ILP system will derive a hypothesised
logic program which entails all the positive and none of the negative examples.

 Main field of study for KBIL algorithms.


 Prior knowledge plays two key roles:
1. The effective hypothesis space is reduced to include only those theories that are
consistent with what is already known.
2. Prior knowledge can be used to reduce the size of the hypothesis explaining the
observations.
 Smaller hypotheses are easier to find.
 Observation: ILP systems can formulate hypotheses in first-order logic.

; Can learn in environments not understood by simpler systems.

Michael Kohlhase: Artificial Intelligence 2 1013 2025-02-06

Inductive Logic Programming


 Combines inductive methods with the power of first-order representations.
 Offers a rigorous approach to the general KBIL problem.

 Offers complete algorithms for inducing general, first-order theories from examples.

Michael Kohlhase: Artificial Intelligence 2 1014 2025-02-06

29.2.1 An Example
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30396.

ILP: An example
 General knowledge-based induction problem

Background ∧ Hypothesis ∧ Descriptions ⊨ Classif ications

 Example 29.2.4 (Learning family relations from examples).


 Observations are an extended family tree
 mother, father and married relations
 male and female properties

 Target predicates: grandparent, BrotherInLaw, Ancestor


; The goal is to find a logical formula that serves as a definition of the target
predicates
 equivalently: A Prolog program that computes the value of the target predicate
; We obtain a perfectly comprehensible hypothesis

Michael Kohlhase: Artificial Intelligence 2 1015 2025-02-06

British Royalty Family Tree (not quite up to date)


 The facts about kinship and relations can be visualized as a family tree:

[Figure: family tree over three generations:
George + Mum;
Spencer + Kydd, Elizabeth + Philip, Margaret;
Diana + Charles, Anne + Mark, Andrew + Sarah, Edward;
William, Harry, Peter, Zara, Beatrice, Eugenie]

Michael Kohlhase: Artificial Intelligence 2 1016 2025-02-06

Example
 Descriptions include facts like

 father(Philip, Charles)
 mother(Mum, Margaret)
 married(Diana, Charles)
 male(Philip)
 female(Beatrice)
 Sentences in classifications depend on the target concept being learned (in the
example: 12 positive, 388 negative)

 grandparent(Mum, Charles)
 ¬grandparent(Mum, Harry)
 Goal: Find a set of sentences for hypothesis such that the entailment constraint
is satisfied.
 Example 29.2.5. Without background knowledge, define grandparent in terms of
mother and father.
grandparent(x, y)⇔(∃z.mother(x, z)∧mother(z, y))∨(∃z.mother(x, z)∧father(z, y))∨. . .∨(∃z.father(x, z)∧father(z, y))

Michael Kohlhase: Artificial Intelligence 2 1017 2025-02-06

Why Attribute-based Learning Fails


 Observation: Decision tree learning will get nowhere!

 To express Grandparent as a (Boolean) attribute, pairs of people need to be
objects: Grandparent(⟨Mum, Charles⟩).
 But then the example descriptions can not be represented:

FirstElementIsMotherOfElizabeth(⟨Mum, Charles⟩)

 A large disjunction of specific cases without any hope of generalization to new


examples.

 Generally: Attribute-based learning algorithms are incapable of learning relational


predicates.

Michael Kohlhase: Artificial Intelligence 2 1018 2025-02-06

Background knowledge
 Observation: A little bit of background knowledge helps a lot.
 Example 29.2.6. If the background knowledge contains

parent(x, y)⇔mother(x, y) ∨ father(x, y)

then Grandparent can be reduced to

grandparent(x, y)⇔(∃z.parent(x, z) ∧ parent(z, y))

 Definition 29.2.7. A constructive induction algorithm creates new predicates to


facilitate the expression of explanatory hypotheses.

 Example 29.2.8. Use constructive induction to introduce a predicate parent to


simplify the definitions of the target predicates.

Michael Kohlhase: Artificial Intelligence 2 1019 2025-02-06



29.2.2 Top-Down Inductive Learning: FOIL


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30397.

Top-Down Inductive Learning


 Bottom-up learning; e.g. Decision-tree learning: start from the observations and
work backwards.
 Decision tree is gradually grown until it is consistent with the observations.

 Top-down learning method


 start from a general rule and specialize it on every example.

Michael Kohlhase: Artificial Intelligence 2 1020 2025-02-06

Top-Down Inductive Learning: FOIL


 Split positive and negative examples
 Positive: ⟨George, Anne⟩, ⟨Philip, Peter⟩, ⟨Spencer, Harry⟩
 Negative: ⟨George, Elizabeth⟩, ⟨Harry, Zara⟩, ⟨Charles, Philip⟩
 Construct a set of Horn clauses with head grandfather(x, y) such that the positive
examples are instances of the grandfather relationship.
 Start with a clause with an empty body ⇒grandfather(x, y).
 All examples are now classified as positive, so specialize to rule out the negative
examples: Here are 3 potential additions:
1. father(x, y) ⇒ grandfather(x, y)
2. parent(x, z) ⇒ grandfather(x, y)
3. father(x, z) ⇒ grandfather(x, y)
 The first one incorrectly classifies the 12 positive examples.
 The second one is incorrect on a larger part of the negative examples.
 Prefer the third clause and specialize to father(x, z)∧parent(z, y)⇒grandfather(x, y).

Michael Kohlhase: Artificial Intelligence 2 1021 2025-02-06

FOIL
function Foil(examples,target) returns a set of Horn clauses
inputs: examples, set of examples
target, a literal for the goal predicate
local variables: clauses, set of clauses, initially empty
while examples contains positive examples do
clause := New−Clause(examples,target)
remove examples covered by clause from examples
add clause to clauses
return clauses

Michael Kohlhase: Artificial Intelligence 2 1022 2025-02-06

FOIL
function New−Clause(examples,target) returns a Horn clause
local variables: clause, a clause with target as head and an empty body
l, a literal to be added to the clause
extendedExamples, a set of examples with values for new variables
extendedExamples := examples
while extendedExamples contains negative examples do
l := Choose−Literal(New−Literals(clause),extendedExamples)
append l to the body of clause
extendedExamples := map Extend−Example over extendedExamples
return clause
function Extend−Example(example,literal) returns a new example
if example satisfies literal
then return the set of examples created by extending example with each
possible constant value for each new variable in literal
else return the empty set
function New−Literals(clause) returns a set of possibly ‘‘useful’’ literals
function Choose−Literal(literals) returns the ‘‘best’’ literal from literals

Michael Kohlhase: Artificial Intelligence 2 1023 2025-02-06

FOIL: Choosing Literals


 New-Literals: Takes a clause and constructs all possibly “useful” literals

 father(x, z) ⇒ grandfather(x, y)
 Add literals using predicates
 Negated or unnegated
 Use any existing predicate (including the goal)
 Arguments must be variables
 Each literal must include at least one variable from an earlier literal or from the
head of the clause
 Valid: Mother(z, u), Married(z, z), grandfather(v, x)
 Invalid: Married(u, v)

 Equality and inequality literals

 E.g. z ̸= x, empty list


 Arithmetic comparisons
 E.g. x > y, threshold values

Michael Kohlhase: Artificial Intelligence 2 1024 2025-02-06



FOIL: Choosing Literals


 The way New-Literal changes the clauses leads to a very large branching factor.
 Improve performance by using type information:
 E.g., parent(x, n) where x is a person and n is a number

 Choose-Literal uses a heuristic similar to information gain.


 Ockham’s razor to eliminate hypotheses.
 If the clause becomes longer than the total length of the positive examples that
the clause explains, this clause is not a valid hypothesis.

 Most impressive demonstration


 Learn the correct definition of list-processing functions in Prolog from a small
set of examples, using previously learned functions as background knowledge.

Michael Kohlhase: Artificial Intelligence 2 1025 2025-02-06

29.2.3 Inverse Resolution


A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30398.

Inverse Resolution
 Definition 29.2.9. Inverse resolution in a nutshell
 Classifications follows from Background ∧ Hypothesis ∧ Descriptions.
 This can be proven by resolution.
 Run the proof backwards to find hypothesis.
 Problem: How to run the resolution proof backwards?
 Recap: In ordinary resolution we take two clauses C1 = L ∨ R1 and C2 = ¬L ∨ R2
and resolve them to produce the resolvent C = R1 ∨ R2 .

 Idea: Two possible variants of inverse resolution:


 Take resolvent C and produce two clauses C1 and C2 .
 Take C and C1 and produce C2 .

Michael Kohlhase: Artificial Intelligence 2 1026 2025-02-06

Generating Inverse Proofs (Example)

1. Start with an example classified as both positive and negative (Need a


contradiction)
2. Invent clauses that resolve with a fact in our knowledge base

¬parent(x, z) ∨ ¬parent(z, y) ∨ grandparent(x, y) parent(George, Elizabeth)

[George/x],[Elizabeth/z]

¬parent(Elizabeth, y) ∨ grandparent(George, y) parent(Elizabeth, Anne)

[Anne/y]

grandparent(George, Anne) ¬grandparent(George, Anne)

{}

¬parent(x, z) ∨ ¬parent(z, y) ∨ grandparent(x, y) is equivalent to parent(x, z) ∧


parent(z, y) ⇒ grandparent(x, y)

Michael Kohlhase: Artificial Intelligence 2 1027 2025-02-06

Generating Inverse Proofs


 Inverse resolution is a search algorithm: For any C and C1 there can be several or
even an infinite number of clauses C2 .
 Example 29.2.10. Instead of parent(George, Elizabeth) there were numerous
alternatives we could have picked!
 The clauses C1 that participate in each step can be chosen from Background,
Descriptions, Classifications or from hypothesized clauses already generated.
 ILP needs restrictions to make the search manageable

 Eliminate function symbols


 Generate only the most specific hypotheses
 Use Horn clauses
 All hypothesized clauses must be consistent with each other
 Each hypothesized clause must agree with the observations

Michael Kohlhase: Artificial Intelligence 2 1028 2025-02-06

New Predicates and New Knowledge


 An inverse resolution procedure is a complete algorithm for learning first-order
theories:
 If some unknown hypothesis generates a set of examples, then an inverse reso-
lution procedure can generate hypothesis from the examples.

 Can inverse resolution infer the law of gravity from examples of falling bodies?
 Yes, given suitable background mathematics!
 Monkey and typewriter problem: How to overcome the large branching factor and
the lack of structure in the search space?

Michael Kohlhase: Artificial Intelligence 2 1029 2025-02-06

New Predicates and New Knowledge


 Inverse resolution is capable of generating new predicates:

 Resolution of C1 and C2 into C eliminates a literal that C1 and C2 share.


 This literal might contain a predicate that does not appear in C.
 When working backwards, one possibility is to generate a new predicate from
which to construct the missing literal.

Michael Kohlhase: Artificial Intelligence 2 1030 2025-02-06

New Predicates and New Knowledge


 Example 29.2.11.

Father(George, y) ⇒ P (x, y)        P (George, y) ⇒ Ancestor(George, y)

[George/x]

Father(George, y) ⇒ Ancestor(George, y)

P can be used in later inverse resolution steps.


 Example 29.2.12. mother(x, y) ⇒ P (x, y) or father(x, y) ⇒ P (x, y) leading to
the “Parent” relationship.
 Inventing new predicates is important to reduce the size of the definition of the
goal predicate.
 Some of the deepest revolutions in science come from the invention of new predi-
cates. (e.g. Galileo’s invention of
acceleration)

Michael Kohlhase: Artificial Intelligence 2 1031 2025-02-06

Applications of ILP
 ILP systems have outperformed knowledge free methods in a number of domains.
 Molecular biology: the GOLEM system has been able to generate high-quality
predictions of protein structures and the therapeutic efficacy of various drugs.

 GOLEM is a completely general-purpose program that is able to make use of back-


ground knowledge about any domain.

Michael Kohlhase: Artificial Intelligence 2 1032 2025-02-06


Part VII

Natural Language


A Video Nugget covering this part can be found at https://fau.tv/clip/id/35294.


This part introduces the basics of natural language processing and the use of natural language
for communication with humans.

Fascination of (Natural) Language


 Definition 29.2.13. A natural language is any form of spoken or signed means
of communication that has evolved naturally in humans through use and repetition
without conscious planning or premeditation.

 In other words: the language you use all day long, e.g. English, German, . . .
 Why Should we care about natural language?:
 Even more so than thinking, language is a skill that only humans have.
 It is a miracle that we can express complex thoughts in a sentence in a matter
of seconds.
 It is no less miraculous that a child can learn tens of thousands of words and
complex syntax in a matter of a few years.

Michael Kohlhase: Artificial Intelligence 2 1033 2025-02-06

Natural Language and AI


 Without natural language capabilities (understanding and generation) no AI!
 Ca. 100,000 years ago, humans learned to speak; ca. 7,000 years ago, to write.
 Alan Turing based his test on natural language: (for good reason)
 We want AI agents to be able to communicate with humans.
 We want AI agents to be able to acquire knowledge from written documents.
 In this part, we analyze the problem with specific information-seeking tasks:
 Language models (Which strings are English/Spanish/etc.)
 Text classification (E.g. spam detection)
 Information retrieval (aka. Search Engines)
 Information extraction (finding objects and their relations in texts)

Michael Kohlhase: Artificial Intelligence 2 1034 2025-02-06


Chapter 30

Natural Language Processing

30.1 Introduction to NLP


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35295.
The general context of AI-2 is natural language processing (NLP), and in particular natural
language understanding (NLU). The dual side of NLU, natural language generation (NLG), requires
similar foundations but different techniques; it is less relevant for the purposes of this course.

What is Natural Language Processing?


 Generally: The study of natural languages and the development of systems that can
use/generate them.
 Definition 30.1.1. Natural language processing (NLP) is an engineering field at
the intersection of computer science, artificial intelligence, and linguistics which is
concerned with the interactions between computers and human (natural) languages.
Most challenges in NLP involve:
 Natural language understanding (NLU) that is, enabling computers to derive
meaning (representations) from human or natural language input.
 Natural language generation (NLG) which aims at generating natural language
or speech from meaning representation.
 For communication with/among humans we need both NLU and NLG.

Michael Kohlhase: Artificial Intelligence 2 1035 2025-02-06

Language Technology
 Language Assistance:
 written language: Spell/grammar/style-checking,
 spoken language: dictation systems and screen readers,
 multilingual text: machine-supported text and dialog translation, eLearning.
 Information management:

657

 search and classification of documents, (e.g. Google/Bing)


 information extraction, question answering. (e.g. http://ask.com)
 Dialog Systems/Interfaces:

 information systems: at airport, tele-banking, e-commerce, call centers,


 dialog interfaces for computers, robots, cars. (e.g. Siri/Alexa)
 Observation: The earlier technologies largely rely on pattern matching, the latter
ones need to compute the meaning of the input utterances, e.g. for database lookups
in information systems.

Michael Kohlhase: Artificial Intelligence 2 1036 2025-02-06

30.2 Natural Language and its Meaning


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35295.
Before we embark on the journey into understanding the meaning of natural language, let us get
an overview over what the concept of “semantics” or “meaning” means in various disciplines.

What is (NL) Semantics? Answers from various Disciplines!


 Observation: Different (academic) disciplines specialize the notion of semantics
(of natural language) in different ways.
 Philosophy: has a long history of trying to answer it, e.g.

 Platon ; cave allegory, Aristotle ; Syllogisms.


 Frege/Russell ; sense vs. referent. (Michael Kohlhase vs. Odysseus)
 Linguistics/Language Philosophy: We need semantics e.g. in translation
Der Geist ist willig aber das Fleisch ist schwach! vs.
Der Schnaps ist gut, aber der Braten ist verkocht! (meaning counts)

 Psychology/Cognition: Semantics =
b “what is in our brains” (; mental models)
 Mathematics has driven much of modern logic in the quest for foundations.
 Logic as “foundation of mathematics” solved as far as possible
 In daily practice syntax and semantics are not differentiated (much).

 Logic@AI/CS tries to define meaning and compute with them. (applied


semantics)
 makes syntax explicit in a formal language (formulae, sentences)
 defines truth/validity by mapping sentences into “world” (interpretation)
 gives rules of truth-preserving reasoning (inference)

Michael Kohlhase: Artificial Intelligence 2 1037 2025-02-06

A good probe into the issues involved in natural language understanding is to look at translations
between natural language utterances – a task that arguably involves understanding the utterances
first.

Meaning of Natural Language; e.g. Machine Translation

 Idea: Machine translation is very simple! (we have good lexica)


 Example 30.2.1. Peter liebt Maria. ; Peter loves Mary.
 this only works for simple examples!

 Example 30.2.2. Wirf der Kuh das Heu über den Zaun. ̸;Throw the cow the
hay over the fence. (differing grammar; Google Translate)
 Example 30.2.3. Grammar is not the only problem
 Der Geist ist willig, aber das Fleisch ist schwach!
 Der Schnaps ist gut, aber der Braten ist verkocht!
 Observation 30.2.4. We have to understand the meaning for high-quality trans-
lation!

Michael Kohlhase: Artificial Intelligence 2 1038 2025-02-06

If it is indeed the meaning of natural language, we should look further into how the form of the
utterances and their meaning interact.

Language and Information


 Observation: Humans use words (sentences, texts) in natural languages to rep-
resent and communicate information.
 But: What really counts is not the words themselves, but the meaning information
they carry.

 Example 30.2.5 (Word Meaning).

Newspaper ;

 For questions/answers, it would be very useful to find out what words (sentences/texts)
mean.
 Definition 30.2.6. Interpretation of natural language utterances: three problems

(Schema: semantic interpretation maps a language utterance to its meaning; the three problems
are abstraction, ambiguity, and composition.)
Michael Kohlhase: Artificial Intelligence 2 1039 2025-02-06

Let us support the last claim a couple of initial examples. We will come back to these phenomena
again and again over the course of the course and study them in detail.

Language and Information (Examples)

 Example 30.2.7 (Abstraction).

Car and automobile have the same meaning.

 Example 30.2.8 (Ambiguity).

A bank can be a financial institution or a geographical feature.

 Example 30.2.9 (Composition).

Every student sleeps ; ∀x.student(x) ⇒ sleep(x)

Michael Kohlhase: Artificial Intelligence 2 1040 2025-02-06

But there are other phenomena that we need to take into account when compute the meaning of
NL utterances.

Context Contributes to the Meaning of NL Utterances


 Observation: Not all information conveyed is linguistically realized in an utterance.
 Example 30.2.10. The lecture begins at 11:00 am. What lecture? Today?

 Definition 30.2.11. We call a piece i of information linguistically realized in an


utterance U , iff we can trace i to a fragment of U .
 Definition 30.2.12 (Possible Mechanism). Inferring the missing pieces from the
context and world knowledge:

(Diagram: the utterance is analyzed with grammar and lexicon into the meaning of the utterance;
inference over world knowledge then yields the relevant information.)

We call this process semantic/pragmatic analysis.

Michael Kohlhase: Artificial Intelligence 2 1041 2025-02-06

We will look at another example, that shows that the situation with semantic/pragmatic analysis
is even more complex than we thought. Understanding this is one of the prime objectives of the
AI-2 lecture.

Context Contributes to the Meaning of NL Utterances


 Example 30.2.13. It starts at eleven. What starts?
 Before we can resolve the time, we need to resolve the anaphor it.

 Possible Mechanism: More Inference!

(Diagram: the utterance is analyzed with grammar and lexicon into an utterance-specific semantic
potential; inference over world/context knowledge then yields the relevant information, i.e. the
meaning of the utterance.)

; Semantic/pragmatic analysis is quite complex! (prime topic of AI-2)

Michael Kohlhase: Artificial Intelligence 2 1042 2025-02-06

?? is also a very good example for the claim ?? that even for high-quality (machine) translation
we need semantics.

30.3 Looking at Natural Language


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35296.
The next step will be to make some observations about natural language and its meaning, so that
we get an intuition of what problems we will have to overcome on the way to modeling natural
language.

Fun with Diamonds (are they real?) [Dav67]


 Example 30.3.1. We study the truth conditions of adjectival complexes:
 This is a diamond. (|= diamond)
 This is a blue diamond. (|= diamond, |= blue)
 This is a big diamond. (|= diamond, ̸|= big)
 This is a fake diamond. (|= ¬diamond)
 This is a fake blue diamond. (|= blue?, |= diamond?)
 Mary knows that this is a diamond. (|= diamond)
 Mary believes that this is a diamond. (̸|= diamond)

Michael Kohlhase: Artificial Intelligence 2 1043 2025-02-06

Logical analysis vs. conceptual analysis: These examples — mostly borrowed from Davidson
[Dav67] — help us to see the difference between “logical-analysis” and “conceptual-analysis”.
We observed that from This is a big diamond. we cannot conclude This is big. Now consider the
sentence Jane is a beautiful dancer. Similarly, it does not follow from this that Jane is beautiful,
but only that she dances beautifully. Now, what it is to be beautiful or to be a beautiful dancer

is a complicated matter. To say what these things are is a problem of conceptual analysis. The
job of semantics is to uncover the logical form of these sentences. Semantics should tell us that
the two sentences have the same logical forms; and ensure that these logical forms make the right
predictions about the entailments and truth conditions of the sentences, specifically, that they
don’t entail that the object is big or that Jane is beautiful. But our semantics should provide a
distinct logical form for sentences of the type: This is a fake diamond. From which it follows that
the thing is fake, but not that it is a diamond.

Ambiguity: The dark side of Meaning


 Definition 30.3.2. We call an utterance ambiguous, iff it has multiple meanings,
which we call readings.
 Example 30.3.3. All of the following sentences are ambiguous:
 John went to the bank. (river or financial?)
 You should have seen the bull we got from the pope. (three readings!)
 I saw her duck. (animal or action?)
 John chased the gangster in the red sports car. (three-way too!)

Michael Kohlhase: Artificial Intelligence 2 1044 2025-02-06

One way to think about the examples of ambiguity on the previous slide is that they illustrate a
certain kind of indeterminacy in sentence meaning. But really what is indeterminate here is what
sentence is represented by the physical realization (the written sentence or the phonetic string).
The symbol duck just happens to be associated with two different things, the noun and the verb.
Figuring out how to interpret the sentence is a matter of deciding which item to select. Similarly
for the syntactic ambiguity represented by PP attachment. Once you, as interpreter, have selected
one of the options, the interpretation is actually fixed. (This doesn’t mean, by the way, that as
an interpreter you necessarily do select a particular one of the options, just that you can.) A
brief digression: Notice that this discussion is in part a discussion about compositionality, and
gives us an idea of what a non-compositional account of meaning could look like. The Radical
Pragmatic View is a non-compositional view: it allows the information content of a sentence to
be fixed by something that has no linguistic reflex.
To help clarify what is meant by compositionality, let me just mention a couple of other ways
in which a semantic account could fail to be compositional.
• Suppose your syntactic theory tells you that S has the structure [a[bc]] but your semantics
computes the meaning of S by first combining the meanings of a and b and then combining the
result with the meaning of c. This is non-compositional.
• Recall the difference between:
1. Jane knows that George was late.
2. Jane believes that George was late.
Sentence 1. entails that George was late; sentence 2. doesn’t. We might try to account for
this by saying that in the environment of the verb believe, a clause doesn’t mean what it
usually means, but something else instead. Then the clause that George was late is assumed
to contribute different things to the informational content of different sentences. This is a
non-compositional account.

Quantifiers, Scope and Context

 Example 30.3.4. Every man loves a woman. (Keira Knightley or his mother!)
 Example 30.3.5. Every car has a radio. (only one reading!)
 Example 30.3.6. Some student in every course sleeps in every class at least
some of the time. (how many readings?)
 Example 30.3.7. The president of the US is having an affair with an intern.
(2002 or 2000?)
 Example 30.3.8. Everyone is here. (who is everyone?)

Michael Kohlhase: Artificial Intelligence 2 1045 2025-02-06

Observation: If we look at the first sentence, then we see that it has two readings:
1. there is one woman who is loved by every man.
2. for each man there is one woman whom that man loves.
These correspond to distinct situations (or possible worlds) that make the sentence true.
Observation: For the second example we only get one reading: the analogue of 2. The reason
for this lies not in the logical structure of the sentence, but in concepts involved. We interpret
the meaning of the word has as the relation “has as physical part”, which in our world carries a
certain uniqueness condition: If a is a physical part of b, then it cannot be a physical part of c,
unless b is a physical part of c or vice versa. This makes the structurally possible analogue to 1.
impossible in our world and we discard it.
Observation: In the examples above, we have seen that (in the worst case), we can have one
reading for every ordering of the quantificational phrases in the sentence. So, in the third example,
we have four of them, we would get 4! = 24 readings. It should be clear from introspection that
we (humans) do not entertain 24 readings when we understand and process this sentence. Our
models should account for such effects as well.
Context and Interpretation: It appears that the last two sentences have different informational
content on different occasions of use. Suppose I say Everyone is here. at the beginning of class.
Then I mean that everyone who is meant to be in the class is here. Suppose I say it later in the
day at a meeting; then I mean that everyone who is meant to be at the meeting is here. What
shall we say about this? Here are three different kinds of solution:
Radical Semantic View On every occasion of use, the sentence literally means that everyone
in the world is here, and so is strictly speaking false. An interpreter recognizes that the speaker
has said something false, and uses general principles to figure out what the speaker actually
meant.
Radical Pragmatic View What the semantics provides is in some sense incomplete. What the
sentence means is determined in part by the context of utterance and the speaker’s intentions.
The differences in meaning are entirely due to extra-linguistic facts which have no linguistic
reflex.
The Intermediate View The logical form of sentences with the quantifier every contains a slot
for information which is contributed by the context. So extra-linguistic information is required
to fix the meaning; but the contribution of this information is mediated by linguistic form.
We now come to a phenomenon of natural language, that is a paradigmatic challenge for pragmatic
analysis: anaphora – the practice of replacing a (complex) reference with a mere pronoun.

More Context: Anaphora – Challenge for Pragmatic Analysis


 Example 30.3.9 (Anaphoric References).
 John is a bachelor. His wife is very nice. (Uh, what?, who?)
 John likes his dog Spiff even though he bites him sometimes. (who bites?)
 John likes Spiff. Peter does too. (what does Peter do?)
 John loves his wife. Peter does too. (whom does Peter love?)
 John loves golf, and Mary too. (who does what?)
 Definition 30.3.10. A word or phrase is called anaphoric (or an anaphor), if its
interpretation depends upon another phrase in context. In a narrower sense, an
anaphor refers to an earlier phrase (its antecedent), while a cataphor to a later one
(its postcedent).
Definition 30.3.11. The process of determining the antecedent or postcedent of
an anaphoric phrase is called anaphor resolution.
Definition 30.3.12. An anaphoric connection between anaphor and its antecedent
or postcedent is called direct, iff it can be understood purely syntactically. An
anaphoric connection is called indirect or a bridging reference if additional knowledge
is needed.
 Anaphora are another example, where natural languages use the inferential capa-
bilities of the hearer/reader to “shorten” utterances.

 Anaphora challenge pragmatic analysis, since they can only be resolved from the
context using world knowledge.

Michael Kohlhase: Artificial Intelligence 2 1046 2025-02-06

Anaphora are also interesting for pragmatic analysis, since they introduce (often initially massive
amounts of) ambiguity that needs to be taken care of in the language understanding process.
We now come to another challenge to pragmatic analysis: presuppositions. Instead of just being
subject to the context of the readers/hearers like anaphora, they even have the potential to change
the context itself or even affect their world knowledge.

Context is Personal and Keeps Changing


 Example 30.3.13. Consider the following sentences involving definite description:
1. The king of America is rich. (true or false?)
2. The king of America isn’t rich. (false or true?)
3. If America had a king, the king of America would be rich. (true or false!)
4. The king of Buganda is rich. (Where is Buganda?)
5. . . . Joe Smith. . . The CEO of Westinghouse announced budget cuts.
(CEO=J.S.!)

How do they interact with your context and world knowledge?


 The interpretation, or whether they make sense at all, depends on your context and world
knowledge.
 Note: Last two examples feed back into the context or even world knowledge:

 If 4. is uttered by an Africa expert, we add “Buganda exists and is a monarchy”
to our world knowledge.
 We add “Joe Smith is the CEO of Westinghouse” to the context/world knowledge.
(happens all the time in newspaper articles)

Michael Kohlhase: Artificial Intelligence 2 1047 2025-02-06

30.4 Language Models


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35200.

Natural Languages vs. Formal Language


 Recap: A formal language is a set of strings.
 Example 30.4.1. Programming languages like Java or C++ are formal languages.

 Remark 30.4.2. Natural languages like English, German, or Spanish are not.
 Example 30.4.3. Let us look at concrete examples
 Not to be invited is sad! (definitely English)
 To not be invited is sad! (controversial)

 Idea: Let’s be lenient, instead of a hard set, use a probability distribution.


 Definition 30.4.4. A (statistical) language model is a probability distribution over
sequences of characters or words.
 Idea: Try to learn/derive language models from text corpora.

 Definition 30.4.5. A text corpus (or simply corpus; plural corpora) is a large and
structured collection of natural language texts called documents.
 Definition 30.4.6. In corpus linguistics, corpora are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a
specific natural language.

Michael Kohlhase: Artificial Intelligence 2 1048 2025-02-06

N -gram Character Models


 Written text is composed of characters: letters, digits, punctuation, and spaces.
 Idea: Let’s study language models for sequences of characters.
 As for Markov processes, we write P (c1:N ) for the probability of a character
sequence c1 . . .cN of length N .
 Definition 30.4.7. We call a character sequence of length n an n gram (unigram,
bigram, trigram for n = 1, 2, 3).

 Definition 30.4.8. An n gram model is a Markov process of order n − 1.


 Remark 30.4.9. For a trigram model, P (ci |c1:i−1 ) = P (ci |c(i−2) , c(i−1) ). Factoring
with the chain rule and then using the Markov property, we obtain
P (c1:N ) = ∏i=1..N P (ci |c1:i−1 ) = ∏i=1..N P (ci |ci−2 , ci−1 )

 Thus, a trigram model P(ci |ci−2:i−1 ) for a language with 100 characters has
1,000,000 entries. It can be estimated from a corpus with 10^7 characters.
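To make the counting concrete, here is a minimal sketch (not from the original notes) of estimating such a character trigram model from a corpus string; smoothing is omitted, so unseen contexts simply get probability 0.

from collections import Counter

def char_trigram_model(corpus: str):
    # estimate P(ci | ci-2, ci-1) by counting character pairs and trigrams
    pair_counts, tri_counts = Counter(), Counter()
    for i in range(len(corpus) - 2):
        pair_counts[corpus[i:i+2]] += 1
        tri_counts[corpus[i:i+3]] += 1
    def prob(c, c1, c2):
        # conditional probability P(c | c1 c2); 0.0 if the context never occurred
        pair = c1 + c2
        return tri_counts[pair + c] / pair_counts[pair] if pair_counts[pair] else 0.0
    return prob

p = char_trigram_model("the brown cow jumps over the moon. the cow sleeps.")
print(p("e", "t", "h"))   # estimate of P(e | t, h)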

Michael Kohlhase: Artificial Intelligence 2 1049 2025-02-06

Applications of N -Gram Models of Character Sequences


 What can we do with N gram models?

 Definition 30.4.10. The problem of language identification is: given a text, deter-


mine the natural language it is written in.
 Remark 30.4.11. Current technology can classify even short texts like Hello, world,
or Wie geht es Dir correctly with more than 99% accuracy.
 One approach: Build a trigram language model P(ci |ci−2:i−1 , ℓ) for each candi-
date language ℓ by counting trigrams in a ℓ-corpus.
Apply Bayes’ rule and the Markov property to get the most likely language:

ℓ∗ = argmaxℓ P (ℓ|c1:N )
   = argmaxℓ P (ℓ) · P (c1:N |ℓ)
   = argmaxℓ P (ℓ) · ∏i=1..N P (ci |ci−2:i−1 , ℓ)

The prior probability P (ℓ) can be estimated, it is not a critical factor, since the
trigram language models are extremely sensitive.

Michael Kohlhase: Artificial Intelligence 2 1050 2025-02-06

Other Applications of Character N -Gram Models


 Spelling correction is a direct application of a single-language language model:
Estimate the probability of a word and all off-by-one variants.
 Definition 30.4.12. Genre classification means deciding whether a text is a news
story, a legal document, a scientific article, etc.

 Remark 30.4.13. While many features help make this classification, counts of
punctuation and other character n-gram features go a long way [KNS97].

 Definition 30.4.14. Named entity recognition (NER) is the task of finding names
of things in a document and deciding what class they belong to.
 Example 30.4.15. In Mr. Sopersteen was prescribed aciphex. NER should
recognize that Mr. Sopersteen is the name of a person and aciphex is the name of
a drug.
 Remark 30.4.16. Character-level language models are good for this task because
they can associate the character sequence ex with a drug name and steen with a
person name, and thereby identify words that they have never seen before.

Michael Kohlhase: Artificial Intelligence 2 1051 2025-02-06

N -Grams over Word Sequences


 Idea: n gram models apply to word sequences as well.
 Problems: The method works identically, but
1. There are many more words than characters. (100 vs. 10^5 in English)
2. And what is a word anyways? (space/punctuation-delimited substrings?)
3. Data sparsity: we do not have enough data! For a language model for 10^5 words
in English, we have 10^15 trigrams.
4. Most training corpora do not have all words.

Michael Kohlhase: Artificial Intelligence 2 1052 2025-02-06

Word N -Grams: Out-of-Vocab Words


 Definition 30.4.17. Out of vocabulary (OOV) words are unknown words that
appear in the test corpus but not training corpus.

 Remark 30.4.18. OOV words are usually content words such as names and locations
which contain information crucial to the success of NLP tasks.
 Idea: Model OOV words by
1. adding a new word token, e.g. <UNK> to the vocabulary,
2. in the training corpus, replacing the respective first occurrence of a previously
unknown word by <UNK>,
3. counting n grams as usual, treating <UNK> as a regular word.
This trick can be refined if we have a word classifier, then use a new token per class,
e.g. <EMAIL> or <NUM>.
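The <UNK> trick can be sketched in a few lines (a simplified illustration, assuming a plain token list as training corpus): the first occurrence of every word is replaced by <UNK>, later occurrences are kept, and n grams are then counted as usual.

def mark_unknowns(tokens):
    # replace the first occurrence of every word by <UNK>, keep later occurrences
    seen, out = set(), []
    for tok in tokens:
        if tok in seen:
            out.append(tok)
        else:
            seen.add(tok)
            out.append("<UNK>")
    return out

print(mark_unknowns("the cow jumps over the moon".split()))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>']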

Michael Kohlhase: Artificial Intelligence 2 1053 2025-02-06

What can Word N -Gram Models do?



 Example 30.4.19 (Test n-grams). Build unigram, bigram, and trigram language
models over the words [RN03], randomly sample sequences from the models.
1. Unigram: logical are as are confusion a may right tries agent goal the was . . .
2. Bigram: systems are very similar computational approach would be represented . . .
3. Trigram: planning and scheduling are integrated the success of naive bayes model . . .

 Clearly there are differences, how can we measure them to evaluate the models?
 Definition 30.4.20. The perplexity of a sequence c1:N is defined as
Perplexity(c1:N ) := P (c1:N )^(−1/N )

 Intuition: The reciprocal of probability, normalized by sequence length.

 Example 30.4.21. For a language with n characters or words and a language


model that predicts that all are equally likely, the perplexity of any sequence is n.
If some characters or words are more likely than others, and the model reflects that,
then the perplexity of correct sequences will be less than n.
 Example 30.4.22. In ??, the perplexity was 891 for the unigram model, 142 for
the bigram model and 91 for the trigram model.
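Perplexity is easy to compute in log space; the following is a small sketch, assuming we already have the model probabilities P (ci | context) for each of the N tokens of the sequence.

import math

def perplexity(token_probs):
    # P(c1:N)^(-1/N), computed in log space to avoid underflow
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# a uniform model over 26 characters gives every token probability 1/26,
# so any sequence has perplexity 26 (cf. Example 30.4.21)
print(perplexity([1/26] * 10))   # ~26.0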

Michael Kohlhase: Artificial Intelligence 2 1054 2025-02-06

30.5 Part of Speech Tagging

Language Models and Generalization


 Recall: n-grams can predict that a word sequence like a black cat is more likely
than cat black a. (as trigram 1. appears 0.000014% in a corpus and 2. never)

 Native speakers, however, will tell you that a black cat matches a familiar
pattern: article-adjective-noun, while cat black a does not!
 Example 30.5.1. Consider the fulvous kitten: a native speaker reasons that it
 follows the determiner-adjective-noun pattern
 fulvous (= brownish yellow) ends in ous ; adjective
So by generalization this is (probably) correct English.
 Observation: The order of syntactical categories of words plays a role in English!
 Problem: How can we compute them? (up next)

Michael Kohlhase: Artificial Intelligence 2 1055 2025-02-06

Part-of-Speech Tagging
 Definition 30.5.2. Part-of-speech tagging (also POS tagging, POST, or gram-
matical tagging) is the process of marking up a word in a corpus with tags (called
POS tags) as corresponding to a particular part of speech (a category of words with


similar syntactic properties) based on both its definition and its context.
 Example 30.5.3. A sentence tagged with POS tags from the Penn treebank: (see
below)
From the start , it took a person with great qualities to succeed
IN DT NN , PRP VBD DT NN IN JJ NNS TO VB

1. From is tagged as a preposition (IN)


2. the as a determiner (DT)
3. . . .
 Observation: Even though POS tagging is uninteresting in its own right, it is
useful as a first step in many NLP tasks.

 Example 30.5.4. In text-to-speech synthesis, a POS tag of “noun” for record helps
determine the correct pronunciation (as opposed to the tag “verb”)

Michael Kohlhase: Artificial Intelligence 2 1056 2025-02-06

The Penn Treebank POS tags


 Example 30.5.5. The following 45 POS tags are used by the Penn treebank:

Michael Kohlhase: Artificial Intelligence 2 1057 2025-02-06

Computing Part of Speech Tags


 Idea: Treat the POS tags in a sentence as state variables C1:n in a HMM: the
words are the evidence variables W1:n , use prediction for POS tagging.

 The HMM is a generative model that


 starts in the tag predicted by the prior probability (usually IN) (problematic!)
 and then, for each step makes two choices:
 what word – e.g. From – should be emitted
 what state – e.g. DT – should come next

 This works, but there are problems

 the HMM does not consider context other than the current state (Markov
property)
 it does not have any idea what the sentence is trying to convey
 Idea: Use the Viterbi algorithm to find the most probable sequence of hidden
states (POS tags)
 POS taggers based on the Viterbi algorithm can reach an F1 score of up to 97%.

Michael Kohlhase: Artificial Intelligence 2 1058 2025-02-06

The Viterbi algorithm for POS tagging – Details


 We need a transition model P (Ct |Ct−1 ): the probability of one POS tag following
another.
 Example 30.5.6. P (Ct = VB|Ct−1 = MD) = 0.8 means that given a modal
verb (e.g. would) the following word is a verb (e.g. think) with probability 0.8.

 Question: Where does the number 0.8 come from?


 Answer: From counts in the corpus – with appropriate smoothing!
There are 13124 instances of MD in the Penn treebank and 10471 are followed by
a VB.

 For the sensor model P (Wt = would|Ct = MD) = 0.1 means that if we choose a
modal verb, we will choose would 10% of the time.
 These numbers also come from the corpus with appropriate smoothing.
 Limitations: HMM models only know about the transition and sensor models
In particular, we cannot take into account that e.g. words ending in ous are likely
adjectives.
 We will see methods based on neural networks later.
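To make the HMM view concrete, here is a small, self-contained sketch of Viterbi decoding over POS tags; the toy transition and sensor probabilities are made up for illustration and are not the Penn treebank estimates mentioned above.

def viterbi(words, tags, prior, trans, emit):
    # prior[t] = P(C1 = t), trans[s][t] = P(Ct = t | Ct-1 = s), emit[t][w] = P(Wt = w | Ct = t)
    best = [{t: prior[t] * emit[t].get(words[0], 1e-8) for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            # choose the best previous tag for each current tag t
            prev, score = max(((s, best[-1][s] * trans[s][t]) for s in tags),
                              key=lambda x: x[1])
            scores[t] = score * emit[t].get(w, 1e-8)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # follow the back pointers from the best final tag
    path = [max(best[-1], key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = ["MD", "VB", "PRP"]                       # hypothetical toy tag set
prior = {"MD": 0.1, "VB": 0.2, "PRP": 0.7}
trans = {"MD": {"MD": 0.1, "VB": 0.8, "PRP": 0.1},
         "VB": {"MD": 0.2, "VB": 0.2, "PRP": 0.6},
         "PRP": {"MD": 0.5, "VB": 0.4, "PRP": 0.1}}
emit = {"MD": {"would": 0.1}, "VB": {"think": 0.05}, "PRP": {"i": 0.3}}
print(viterbi(["i", "would", "think"], tags, prior, trans, emit))   # ['PRP', 'MD', 'VB']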

Michael Kohlhase: Artificial Intelligence 2 1059 2025-02-06

30.6 Text Classification


Text Classification as a NLP Task
 Problem: Often we want to (ideally) automatically see who can best deal with a

given document (e.g. e-mails in customer service)


 Definition 30.6.1. Given a set of categories the task of deciding which one a given
document belongs to is called text classification or categorization.

 Example 30.6.2. Language identification and genre classification are examples of


text classification.
 Example 30.6.3. Sentiment analysis – classifying a product review as positive or
negative.
 Example 30.6.4. Spam detection – classifying an email message as spam or ham
(i.e. non-spam).

Michael Kohlhase: Artificial Intelligence 2 1060 2025-02-06

Spam Detection
 Definition 30.6.5. Spam detection – classifying an email message as spam or ham
(i.e. non-spam)
 General Idea: Use NLP/machine learning techniques to learn the categories.

 Example 30.6.6. We have lots of examples of spam/ham, e.g.

Spam (from my spam folder):
 Wholesale Fashion Watches -57% today. Designer watches for cheap ...
 You can buy ViagraFr$1.85 All Medications at unbeatable prices! ...
 WE CAN TREAT ANYTHING YOU SUFFER FROM JUST TRUST US ...
 Sta.rt earn*ing the salary yo,u d-eserve by o’btaining the prope,r crede’ntials!

Ham (in my inbox):
 The practical significance of hypertree width in identifying more ...
 Abstract: We will motivate the problem of social identity clustering: ...
 Good to see you my friend. Hey Peter, It was good to hear from you. ...
 PDS implies convexity of the resulting optimization problem (Kernel Ridge ...

 Specifically: What are good features to classify e-mails by?


 n-grams like for cheap and You can buy indicate spam (but also occur in ham)
 character-level features: capitalization, punctuation (e.g. in yo,u d-eserve)
 Note: We have two complementary ways of talking about classification: (up next)

 using language models


 using machine learning

Michael Kohlhase: Artificial Intelligence 2 1061 2025-02-06

Spam Detection as Language Modeling


 Idea: Define two n-gram language models:

1. one for P(Message|spam) by training on the spam folder


2. one for P(Message|ham) by training on the inbox

Then we can classify a new message m with an application of Bayes’ rule:

argmaxc∈{spam,ham} (P (c|m)) = argmaxc∈{spam,ham} (P (m|c)P (c))

where P (c) is estimated just by counting the total number of spam and ham mes-
sages.

 This approach works well for spam detection, just as it did for language identifi-
cation.
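A minimal sketch of this classification rule, here with unigram models, add-one smoothing, and log probabilities (the tiny training corpora are hypothetical; the n gram models above would be used in the same way).

import math
from collections import Counter

def train_unigram(docs):
    # unigram counts and total token count for one class corpus (list of token lists)
    counts = Counter(tok for doc in docs for tok in doc)
    return counts, sum(counts.values())

def log_prob(message, counts, total, vocab_size):
    # add-one smoothed log P(message | class)
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in message)

def classify(message, spam_model, ham_model, p_spam, vocab_size):
    spam_score = math.log(p_spam) + log_prob(message, *spam_model, vocab_size)
    ham_score = math.log(1 - p_spam) + log_prob(message, *ham_model, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

spam_docs = [["buy", "cheap", "watches"], ["cheap", "medications", "buy"]]
ham_docs = [["meeting", "about", "the", "paper"], ["good", "to", "hear", "from", "you"]]
vocab = {t for d in spam_docs + ham_docs for t in d}
spam_model, ham_model = train_unigram(spam_docs), train_unigram(ham_docs)
print(classify(["buy", "cheap", "medications"], spam_model, ham_model, 0.4, len(vocab)))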

Michael Kohlhase: Artificial Intelligence 2 1062 2025-02-06

Classifier Success Measures: Precision, Recall, and F1 score


 We need a way to measure success in classification tasks.
 Definition 30.6.7. Let fC : S → B be a binary classifier for a class C ⊆ S, then
we call a ∈ S with fC (a) = T a false positive, iff a ̸∈ C, and fC (a) = F a false
negative, iff a ∈ C. False positives and negatives are errors of fC . True positives
and negatives occur when fC correctly indicates actual membership in C.
 Definition 30.6.8. The precision of fC is defined as #(TP)/(#(TP) + #(FP)) and the recall
is #(TP)/(#(TP) + #(FN)), where TP is the set of true positives and FN/FP the sets of
false negatives and false positives of fC .
 Intuitively these measure the rates of:
 true positives in class C. (precision high, iff few false positives)
 true positives in fC^−1 (T). (recall high, iff few true positives forgotten, i.e. few
false negatives)
 Definition 30.6.9. The F1 score combines precision and recall into a single number,
their harmonic mean: F1 = 2 · (precision · recall)/(precision + recall)

 Observation: Classifiers try to reach precision and recall ; F1 score of 1.


 if that is impossible, compromise on one ; Fβ score. (application-dependent)
 The Fβ score generalizes the F1 score by weighing recall β times as important
as precision.
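These measures are straightforward to compute from predicted and actual class membership; a small sketch (boolean lists indicate membership in C):

def precision_recall_f1(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))        # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))    # false positives
    fn = sum(not p and a for p, a in zip(predicted, actual))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 true positives, 1 false positive, 1 false negative -> P = R = F1 = 0.75
print(precision_recall_f1([True, True, True, True, False, False],
                          [True, True, True, False, True, False]))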

Michael Kohlhase: Artificial Intelligence 2 1063 2025-02-06

30.7 Information Retrieval


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35274.

Information Retrieval


 Definition 30.7.1. An information need is an individual or group’s desire to locate


and obtain information to satisfy a conscious or unconscious need.
 Definition 30.7.2. An information object is a medium that is mainly used for its
information content.

 Definition 30.7.3. Information retrieval (IR) deals with the representation, orga-
nization, storage, and maintenance of information objects that provide users with
easy access to the relevant information and satisfy their various information needs.
Observation (Hjørland 1997): Information need is closely related to relevance:
If something is relevant for a person in relation to a given task, we might say that
the person needs the information for that task.
 Definition 30.7.4. Relevance denotes how well an information object meets the
information need of the user. Relevance may include concerns such as timeliness,
authority or novelty of the object.

 Observation: We normally come in contact with IR in the form of web search.


 Definition 30.7.5. Web search is a fully automatic process that responds to a
user query by returning a sorted document list relevant to the user requirements
expressed in the query.
 Example 30.7.6. Google and Bing are web search engines, their query is a bag of
words and documents are web pages, PDFs, images, videos, shopping portals.

Michael Kohlhase: Artificial Intelligence 2 1064 2025-02-06

Vector Space Models for IR


 Idea: For web search, we usually represent documents and queries as bags of
words over a fixed vocabulary V . Given a query Q, we return all documents that
are “similar”.

 Definition 30.7.7. Given a vocabulary (a list) V of words, a word w ∈ V , and


a document d, then we define the raw term frequency (often just called the term
frequency) of w in d as the number of occurrences of w in d.
 Definition 30.7.8. A multiset of words in V = {t1 , . . ., tn } is called a bag of words
(BOW), and can be represented as a word frequency vector in N|V | : the vector of
raw word frequencies.
 Example 30.7.9. If we have two documents: d1 = Have a good day! and d2 =
Have a great day!, then we can use V = {Have, a, good, great, day} and can repre-
sent good as ⟨0, 0, 1, 0, 0⟩, great as ⟨0, 0, 0, 1, 0⟩, and d1 as ⟨1, 1, 1, 0, 1⟩.
Words outside the vocabulary are ignored in the BOW approach. So the document
d3 = What a day, a good day is represented as ⟨0, 2, 1, 0, 2⟩.

Michael Kohlhase: Artificial Intelligence 2 1065 2025-02-06

Vector Space Models for IR



 Idea: Query and document are similar, iff the angle between their word frequency
vectors is small.

(Figure: documents D1 = (t1,1 , t1,2 , t1,3 ) and D2 = (t2,1 , t2,2 , t2,3 ) as vectors in the space
spanned by term 1, term 2, and term 3.)

 Lemma 30.7.10 (Euclidean Dot Product Formula). A·B = ∥A∥2 ∥B∥2 cos θ,
where θ is the angle between A and B.
 Definition 30.7.11. The cosine similarity of A and B is cos θ = (A·B)/(∥A∥2 ∥B∥2 ).
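The bag-of-words representation and cosine similarity fit in a few lines; this sketch reuses the vocabulary of Example 30.7.9 and treats the query as just another word frequency vector.

import math

def bow_vector(tokens, vocab):
    # raw word frequency vector over vocab; out-of-vocabulary words are ignored
    return [tokens.count(w) for w in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["Have", "a", "good", "great", "day"]
d1 = bow_vector("Have a good day !".split(), vocab)    # [1, 1, 1, 0, 1]
d2 = bow_vector("Have a great day !".split(), vocab)   # [1, 1, 0, 1, 1]
query = bow_vector("good day".split(), vocab)
print(cosine_similarity(query, d1), cosine_similarity(query, d2))   # d1 is more similar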

Michael Kohlhase: Artificial Intelligence 2 1066 2025-02-06

TF-IDF: Term Frequency/Inverse Document Frequency


 Problem: Word frequency vectors treat all the words equally.
 Example 30.7.12. In a query the brown cow, the word the is less important than
brown cow. (because the is less specific)
 Idea: Introduce a weighting factor for the word frequency vector that de-emphasizes
the dimension of the more (globally) frequent words.
 We need to normalize the word frequency vectors first:
 Definition 30.7.13. Given a document d and a vocabulary word t ∈ V , the
normalized term frequency (confusingly often called just term frequency) tf(t, d) is
the raw term frequency divided by |d|.
 Definition 30.7.14. Given a document collection D = {d1 , . . ., dN } and a word t
the inverse document frequency is given by idf(t, D) := log10 (N/|{d ∈ D | t ∈ d}|).

 Definition 30.7.15. We define tfidf(t, d, D):=tf(t, d) · idf(t, D).

 Idea: Use the tfidf-vector with cosine similarity for information retrieval instead.
 Definition 30.7.16. Let D be a document collection with vocabulary V =
{t1 , . . ., t|V | }, then the tfidf-vector tfidf(d, D) ∈ R|V | is defined by tfidf(d, D)i :=
tfidf(ti , d, D).

Michael Kohlhase: Artificial Intelligence 2 1067 2025-02-06

TF-IDF Example

 Let D := {d1 , d2 } be a document corpus over the vocabulary

V = {this, is, a, sample, another, example}

with word frequency vectors ⟨1, 1, 1, 2, 0, 0⟩ and ⟨1, 1, 0, 0, 2, 3⟩.

 Then we compute for the word this

 tf(this, d1 ) = 1/5 = 0.2 and tf(this, d2 ) = 1/7 ≊ 0.14,
 idf is constant over D, we have idf(this, D) = log10 (2/2) = 0,
 thus tfidf(this, d1 , D) = 0 = tfidf(this, d2 , D). (this occurs in both)

 The word example is more interesting, since it occurs only in d2 (thrice)

 tf(example, d1 ) = 0/5 = 0 and tf(example, d2 ) = 3/7 ≊ 0.429.
 idf(example, D) = log10 (2/1) ≊ 0.301,
 thus tfidf(example, d1 , D) = 0 · 0.301 = 0 and tfidf(example, d2 , D) ≊ 0.429 ·
0.301 ≊ 0.129.
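The numbers above can be reproduced with a few lines of code; this is only a sketch of Definitions 30.7.13 to 30.7.15, with the two documents written out as token lists.

import math

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(term in doc for doc in docs)                 # document frequency
    return math.log10(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

d1 = "this is a sample sample".split()                               # frequencies <1,1,1,2,0,0>
d2 = "this is another example another example example".split()      # frequencies <1,1,0,0,2,3>
D = [d1, d2]
print(tfidf("this", d1, D), tfidf("this", d2, D))          # 0.0  0.0
print(tfidf("example", d1, D), tfidf("example", d2, D))    # 0.0  ~0.129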

Michael Kohlhase: Artificial Intelligence 2 1068 2025-02-06

Once an answer set has been determined, the results have to be sorted, so that they can be
presented to the user. As the user has a limited attention span – users will look at most at three
to eight results before refining a query – it is important to rank the results, so that the hits that
contain information relevant to the user’s information need come early. This is a very difficult problem,
as it involves guessing the intentions and information context of users, to which the search engine
has no access.

Ranking Search Hits: e.g. Google’s Page Rank

 Problem: There are many hits, we need to sort them (e.g. by importance)
 Idea: A web site is important, . . . if many other web pages hyperlink to it.

 Refinement: . . . , if many important web pages hyperlink to it.

 Definition 30.7.17. Let A be a web page that is hyperlinked from web pages
S1 , . . . , Sn , then the page rank PR of A is defined as
 
PR(A) = 1 − d + d · (PR(S1 )/C(S1 ) + · · · + PR(Sn )/C(Sn ))

where C(W ) is the number of outgoing links on page W and d = 0.85.

 Remark 30.7.18. PR(A) is the probability of reaching A by random browsing.
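The page rank equation can be solved by simple fixed-point iteration; the following sketch uses a small hypothetical link graph (links[p] is the set of pages that p links to).

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum over all pages S that hyperlink to `page`
            incoming = sum(pr[s] / len(links[s]) for s in pages if page in links[s])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# hypothetical link graph: A <-> B, C -> A, C -> B
links = {"A": {"B"}, "B": {"A"}, "C": {"A", "B"}}
print(pagerank(links))   # A and B end up with a higher page rank than C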

Michael Kohlhase: Artificial Intelligence 2 1069 2025-02-06



Getting the ranking right is a determining factor for the success of a search engine. In fact, the early
success of Google was based on the pagerank algorithm discussed above (and the fact that they figured
out a revenue stream using text ads to monetize searches).

30.8 Information Extraction


Information Extraction
 Definition 30.8.1. Information extraction is the process of acquiring information
by skimming a text and looking for occurrences of a particular class of object and
for relationships among objects.

 Example 30.8.2. Extracting instances of addresses from web pages, with attributes
for street, city, state, and zip code;
 Example 30.8.3. Extracting instances of storms from weather reports, with at-
tributes for temperature, wind speed, and precipitation.

 Observation: In a limited domain, this can be done with high accuracy.

Michael Kohlhase: Artificial Intelligence 2 1070 2025-02-06

Attribute-Based Information Extraction


 Definition 30.8.4. In attribute-based information extraction we assume that the
text refers to a single object and the task is to extract a factored representation.
 Example 30.8.5 (Computer Prices). Extracting from the text IBM ThinkBook
970. Our price: $399.00 the attribute-based representation
{Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
 Idea: Try a template-based approach for each attribute.
 Definition 30.8.6. A template is a finite automaton that recognizes the information
to be extracted. The template often consists of three sub-automata per attribute:
the prefix pattern followed by the target pattern (it matches the attribute value)
and the postfix pattern.
 Example 30.8.7 (Extracing Prices with Regular Expressions).
When we want to extract computer price information, we could use regular expres-
sions for the automata, concretely, the

 prefix pattern: .*price[:]?
 target pattern: [$][0-9]+([.][0-9][0-9])?
 postfix pattern: + shipping|
 Alternative: take all the target matches and choose among them.

 Example 30.8.8. For List price $99.00, special sale price $78.00, shipping $3.00.
take the lowest price that is within 50% of the highest price. ; $78.00
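To make the template idea concrete, here is a sketch using a regular expression for the target pattern and the heuristic from Example 30.8.8; the exact patterns are illustrative, not those of any particular extraction system.

import re

PRICE = re.compile(r"[$][0-9]+(?:[.][0-9][0-9])?")   # target pattern for a dollar amount

def extract_price(text):
    # collect all target matches, then take the lowest price within 50% of the highest
    candidates = [float(m.group()[1:]) for m in PRICE.finditer(text)]
    if not candidates:
        return None
    highest = max(candidates)
    return min(p for p in candidates if p >= highest / 2)

text = "List price $99.00, special sale price $78.00, shipping $3.00."
print(extract_price(text))   # 78.0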

Michael Kohlhase: Artificial Intelligence 2 1071 2025-02-06



Relational Information Extraction


 Question: Can we also do structured representations?
 Answer: That is the next step up from attribute-based information extraction.

 Definition 30.8.9. The task of a relational extraction system is to extract multiple


objects and the relationships among them from a text.
 Example 30.8.10. When these systems see the text $249.99, they need to deter-
mine not just that it is a price, but also which object has that price.

 Example 30.8.11. FASTUS is a typical relational extraction system, which handles


news stories about corporate mergers and acquisitions. It can read the story
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan
with a local concern and a Japanese trading house to produce golf clubs to be
shipped to Japan.

and extract the relations:

e ∈ JointVentures ∧ Product(e, ”golf clubs”) ∧ Date(e, ”Friday”)
∧ Member(e, ”Bridgestone Sports Co”) ∧ Member(e, ”a local concern”)
∧ Member(e, ”a Japanese trading house”)

Michael Kohlhase: Artificial Intelligence 2 1072 2025-02-06

Advertisement: Logic-Based Natural Language Semantics


 Advanced Course: “Logic-Based Natural Language Semantics” (next semester)

 Wed. 10:15-11:50 and Thu 12:15-13:50 (expected: ≤ 10 Students)


 Contents: (Alternating Lectures and hands-on Lab Sessions)
 Foundations of Natural Language Semantics (NLS)
 Montague’s Method of Fragments (Grammar, Semantics Constr., Logic)
 Implementing Fragments in GLF (Grammatical Framework and MMT)
 Inference Systems for Natural Language Pragmatics (tableau machine)
 Advanced logical systems for NLS (modal, higher-order, dynamic Logics)
 Grading: Attendance & Wakefulness, Project/Homework, Oral Exam.

 Course Intent: Groom students for bachelor/master theses and as KWARC re-
search assistants.

Michael Kohlhase: Artificial Intelligence 2 1073 2025-02-06


Chapter 31

Deep Learning for NLP

Deep Learning for NLP: Agenda


 Observation: Symbolic and statistical systems have demonstrated success on
many NLP tasks, but their performance is limited by the endless complexity of
natural language.
 Idea: Given the vast amount of text in machine-readable form, can data-driven
machine-learning based approaches do better?

 In this chapter, we explore this idea, using – and extending – the methods from ??.
 Overview:
1. Word embeddings
2. Recurrent neural networks for NLP
3. Sequence-to-sequence models
4. Transformer Architecture
5. Pretraining and transfer learning.

Michael Kohlhase: Artificial Intelligence 2 1074 2025-02-06

31.1 Word Embeddings


A Video Nugget covering this section can be found at https://fau.tv/clip/id/35276.

Word Embeddings
 Problem: For ML methods in NLP, we need numerical data. (not words)
 Idea: Embed words or word sequences into real vector spaces.

 Definition 31.1.1. A word embedding is a mapping from words in context into a


real vector space Rn used for natural language processing.
 Definition 31.1.2. A vector is called one hot, iff all components are 0 except for
one 1. We call a word embedding one hot, iff all of its vectors are.


One hot word embeddings are rarely used for actual tasks, but often used as a
starting point for better word embeddings.
 Example 31.1.3 (Vector Space Methods in Information Retrieval).
Word frequency vectors are induced by adding up one hot word embeddings.

 Example 31.1.4. Given a corpus D – the context – the tf-idf word embedding
is given by tfidf(t, d, D) := tf(t, d) · log10 (|D|/|{d ∈ D | t ∈ d}|), where tf(t, d) is the term
frequency of word t in document d.
frequency of word t in document d.
 Intuition behind these two: Words that occur in similar documents are similar.

Michael Kohlhase: Artificial Intelligence 2 1075 2025-02-06

Word2Vec
Idea: Use feature extraction to map words to vectors in RN :
Train a neural network on a “dummy task”, throw away the output layer, use the
previous layer’s output (of size N ) as the word embedding
First Attempt: Dimensionality Reduction: Train to predict the original one hot
vector:
 For a vocabulary size V , train a network with a single hidden layer; i.e. three layers
of sizes (V, N, V ). The first two layers will compute our embeddings.

 Feed the one hot encoded input word into the network, and train it on the one hot
vector itself, using a softmax activation function at the output layer. (softmax
normalizes a vector into a probability distribution)

Michael Kohlhase: Artificial Intelligence 2 1076 2025-02-06

Word2Vec: The Continuous Bag Of Words (CBOW) Algorithm


Distributional Semantics: “a word is characterized by the company it keeps”.

Better Idea: Predict a word from its context:


 For a context window size n, take all se-
quences of 2n + 1 words in our corpus (e.g.
the brown cow jumps over the moon for n = 3)
as training data. We call the word at the cen-
ter (jumps) the target word, and the remaining
words the context words.
 For every such sentence, pass all context words
(one-hot encoded) through the first layer of the
network, yielding 2n vectors.

 Pass their average into the output layer (av-


erage pooling layer ) with a softmax activation
function, and train the network to predict the
target word. (sum pooling also works)

Michael Kohlhase: Artificial Intelligence 2 1077 2025-02-06

Properties
Vector embeddings like CBOW have interesting properties:
 Similarity: Using e.g. cosine similarity (cos θ = (A · B)/(∥A∥∥B∥)) to compare vectors, we can
find words with similar meanings.
 Semantic and syntactic relationships emerge as arithmetic relations:

king − man + woman ≈ queen

germany − country + capital ≈ berlin
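Such analogies can be checked with plain vector arithmetic and cosine similarity; the four-dimensional vectors below are made up just to illustrate the mechanics, real embeddings (word2vec, GloVe) have hundreds of dimensions.

import numpy as np

emb = {  # hypothetical toy embeddings
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)   # queen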

Michael Kohlhase: Artificial Intelligence 2 1078 2025-02-06

Common Word Embeddings


 Observation: Word embeddings are crucial as first steps in any NN-based NLP
methods.

 In practice it is often sufficient to use generic, pretrained word embeddings


 Definition 31.1.5. Common pretrained – i.e. trained for generic NLP applications –
word embeddings include

 Word2vec: the original system that established the concept (see above)
 GloVe (Global Vectors)
 fastText (embeddings for 157 languages)

 But we can also train our own word embedding (together with main task) (up
next)

Michael Kohlhase: Artificial Intelligence 2 1079 2025-02-06

Learning POS tags and Word embeddings simultaneously


Specific word embeddings are trained on a carefully selected corpus and tend to
emphasize the characteristics of the task.
Example 31.1.6. POS tagging – even though simple – is a good but non-trivial
example.
Recall that many words can have multiple POS tags, e.g. cut can be
 a present tense verb (transitive or intransitive)
 a past tense verb
 an infinitive verb

 a past participle
 an adjective
 a noun.
If a nearby temporal adverb refers to the past ; this occurrence may be a past tense
verb.
Note: CBOW treats all context words identically regardless of order, but in POS
tagging the exact positions of the words matter.

Michael Kohlhase: Artificial Intelligence 2 1080 2025-02-06

POS/Embedding Network
Idea: Start with a random (or pretrained) embedding of the words in the corpus and
just concatenate them over some context window size
(Figure: feedforward part-of-speech tagging model, taking a 5-word window as input and predicting
the tag of the word in the middle, here cut; the word embeddings and the three layers are all
learned simultaneously during training.)

 Layer 1 has (in this case) 5 · N inputs, Output layer is one hot over POS classes.

 The embedding layers treat all words the same, but the first hidden layer will treat
them differently depending on the position.
 The embeddings will be finetuned for the POS task during training.

Note: Better positional encoding techniques exist (e.g. sinusoidal), but for fixed small
context window sizes, this works well.
Michael Kohlhase: Artificial Intelligence 2 1081 2025-02-06

31.2 Recurrent Neural Networks


Recurrent Neural Networks in NLP
 Word embeddings give a good representation of words in isolation.

 But natural language consists of word sequences ⇝ surrounding words provide context!


 For simple tasks like POS tagging, a fixed-size window of e.g. 5 words is sufficient.
 Observation: For advanced tasks like question answering we need more context!

 Example 31.2.1. In the sentence Eduardo told me that Miguel was very sick so
I took him to the hospital, the pronoun him refers to Miguel and not Eduardo.
(14 words of context)
 Observation: Language models with n-grams or n-word feed-forward networks
have problems:
Either the context is too small or the model has too many parameters! (or both)
 Observation: Feed-forward networks N also have the problem of asymmetry:
whatever N learns about a word w at position n, it has to relearn about w at
position m ̸= n.
 Idea: What about recurrent neural networks – nets with cycles? (up next)

Michael Kohlhase: Artificial Intelligence 2 1082 2025-02-06

RNNs for Time Series


 Idea: RNNs – neural networks with cycles – have memory
; use that for more context in neural NLP.

 Example 31.2.2 (A simple RNN).



It has an input layer x, a hidden layer z with recurrent


connections and delay ∆, and an output layer y as
shown on the right.
Defining Equations for time step t:

zt = gz (Wz,z zt−1 + Wx,z xt )


yt = gy (Wz,y zt )

where gz and gy are the activation functions for the


hidden and output layers.
 Intuition: RNNs are a bit like HMMs and dynamic Bayesian Networks:
They make a Markov assumption: the hidden state z suffices to capture the input
from all previous inputs.

 Side Benefit: RNNs solve the asymmetry problem ⇝ the Wz,z are the same at
every step.
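
A minimal numpy sketch of the defining equations above; the dimensions, random weights,
and random inputs are made-up illustrations.

  # One forward pass of the simple RNN: z_t = g_z(W_zz z_{t-1} + W_xz x_t), y_t = g_y(W_zy z_t)
  import numpy as np

  def softmax(v):
      e = np.exp(v - v.max())
      return e / e.sum()

  d_in, d_hidden, d_out = 50, 64, 17
  rng = np.random.default_rng(1)
  W_xz = rng.normal(scale=0.1, size=(d_hidden, d_in))
  W_zz = rng.normal(scale=0.1, size=(d_hidden, d_hidden))   # recurrent weights, shared over time
  W_zy = rng.normal(scale=0.1, size=(d_out, d_hidden))

  def rnn_run(xs):
      z = np.zeros(d_hidden)               # hidden state carries context from all previous inputs
      ys = []
      for x in xs:                         # one step per input (e.g. per word embedding)
          z = np.tanh(W_zz @ z + W_xz @ x)         # z_t = g_z(W_zz z_{t-1} + W_xz x_t)
          ys.append(softmax(W_zy @ z))             # y_t = g_y(W_zy z_t)
      return ys

  outputs = rnn_run([rng.normal(size=d_in) for _ in range(6)])
  print(len(outputs), outputs[0].shape)    # 6 (17,)
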
Michael Kohlhase: Artificial Intelligence 2 1083 2025-02-06

In an RNN language model each input word is encoded as a word embedding vector. There is
a hidden layer which gets passed as input from one time step to the next. We are interested
in doing multiclass classification: the classes are the words of the vocabulary. Thus the
output will be a softmax probability distribution over the possible values of the next word
in the sentence.

The RNN architecture solves the problem of too many parameters: the number of parameters
in the weight matrices Wx,z , Wz,z , and Wz,y stays constant, regardless of the number of
words – in contrast to feedforward networks and n-gram models, whose parameter counts grow
with the size of the vocabulary. It also solves the problem of asymmetry, because the weights
are the same for every word position. It can sometimes even solve the limited context problem:
there is no limit to how far back in the input the model can look, since the hidden layer has
access to both the current input word and the previous hidden layer; information about any
word in the input can in principle be kept indefinitely, copied over (or modified as appropriate)
from one time step to the next. Of course, there is a limited amount of storage in z, so it
cannot remember everything about all the previous words.

Training RNNs for NLP

 Idea: For training, unroll a RNN into a feed-forward network ; back-propagation.

 Example 31.2.3. The RNN from ?? unrolled three times.

(a) Schematic diagram of an RNN where the hidden layer has recurrent connections; the symbol
∆ indicates a delay. Each input is the word embedding vector of the next word in the sentence.
Each output is the output for that time step. (b) The same network unrolled over three
timesteps to create a feedforward network. Note that the weights are shared across all
timesteps.

 Problem: The weight matrices Wx,z , Wz,z , and Wz,y are shared over all time
steps.

 Definition 31.2.4. The back-propagation through time algorithm carefully main-
tains the identity of Wz,z over all steps.

Michael Kohlhase: Artificial Intelligence 2 1084 2025-02-06

Bidirectional RNN for more Context

 Observation: RNNs only take left context – i.e. words before – into account, but
we may also need right context – the words after.

 Example 31.2.5. For Eduardo told me that Miguel was very sick so I took him
to the hospital the pronoun him resolves to Miguel with high probability.
If the sentence ended with to see Miguel, then it should be Eduardo.
The training data for such classification tasks will require labels – part of speech tags or
reference indications. That makes the data much harder to collect than for a language model,
where unlabelled text is all we need.

In a language model we want to predict the nth word given the previous words. But for
classification, there is no reason we should limit ourselves to looking at only the previous
words: it can be very helpful to look ahead in the sentence. In our coreference example, the
referent of him would be different if the sentence concluded “to see Miguel” rather than “to
the hospital”, so looking ahead is crucial. We know from eye-tracking experiments that human
readers do not go strictly left-to-right.

 Definition 31.2.6. A bidirectional RNN concatenates a separate right-to-left model
onto a left-to-right model.

 Example 31.2.7. Bidirectional RNNs can be used for POS tagging, extending the
network from ?? (an example is shown in Figure 24.5).

Michael Kohlhase: Artificial Intelligence 2 1085 2025-02-06

Long Short-Term Memory RNNs


 Problem: When training a vanilla RNN using back-propagation through time, the
long-term gradients which are back-propagated can “vanish” – tend to zero – or
“explode” – tend to infinity.
 Definition 31.2.8. LSTMs provide a short-term memory for RNNs that can last
thousands of time steps, thus the name “long short-term memory”. An LSTM can
learn when to remember and when to forget pertinent information.
 Example 31.2.9. In NLP LSTMs can learn grammatical dependencies.
An LSTM might process the sentence Dave, as a result of his controversial claims,
is now a pariah by

 remembering the (statistically likely) grammatical gender and number of the
subject Dave,

 noting that this information is pertinent for the pronoun his, and
 noting that this information is no longer important after the verb is.

Michael Kohlhase: Artificial Intelligence 2 1086 2025-02-06

LSTM: Idea
Introduce a memory vector c in addition to the recurrent (short-term memory) vector
z

 c is essentially copied from the previous time step, but can be modified by the forget
gate f , the input gate i, and the output gate o.
 the forget gate f decides which components of c to retain or discard

 the input gate i decides which components of the current input to add to c
(additive, not multiplicative ; no vanishing gradients)
 the output gate o decides which components of c to output as z
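
A minimal numpy sketch of one LSTM step; the slide describes the gates only informally,
so the standard LSTM parametrization (one weight matrix per gate, acting on the concatenated
input and previous short-term state) is assumed here.

  # One LSTM step: forget/input/output gates act on the memory vector c and short-term state z.
  import numpy as np

  def sigmoid(v):
      return 1.0 / (1.0 + np.exp(-v))

  d_in, d_h = 50, 64
  rng = np.random.default_rng(2)
  W = {g: rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for g in ("f", "i", "o", "c")}
  b = {g: np.zeros(d_h) for g in ("f", "i", "o", "c")}

  def lstm_step(x, z_prev, c_prev):
      u = np.concatenate([x, z_prev])
      f = sigmoid(W["f"] @ u + b["f"])          # forget gate: which components of c to retain
      i = sigmoid(W["i"] @ u + b["i"])          # input gate: which components of the input to add
      o = sigmoid(W["o"] @ u + b["o"])          # output gate: which components of c to expose as z
      c = f * c_prev + i * np.tanh(W["c"] @ u + b["c"])   # additive update of the memory vector
      z = o * np.tanh(c)
      return z, c

  z, c = np.zeros(d_h), np.zeros(d_h)
  for x in [rng.normal(size=d_in) for _ in range(4)]:
      z, c = lstm_step(x, z, c)
  print(z.shape, c.shape)   # (64,) (64,)
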

Michael Kohlhase: Artificial Intelligence 2 1087 2025-02-06

31.3 Sequence-to-Sequence Models

Neural Machine Translation


 Question: Machine translation (MT) is an important task in NLP, can we do it
with neural networks?
 Observation: If there were a one-to-one correspondence between source words
and target words MT would be a simple tagging task. But

 the three Spanish words caballo de mar translate to the English seahorse and
 the two Spanish words perro grande translate to English as big dog.
 in English, the subject is usually first and in Fijian last.
 Idea: For MT, generate one word at a time, but keep track of the context, so that

 we can remember parts of the source we have not translated yet


 we remember what we already translated so we do not repeat ourselves.
We may have to process the whole source sentence before generating the target!
 Remark: This smells like we need LSTMs.

Michael Kohlhase: Artificial Intelligence 2 1088 2025-02-06

Sequence-To-Sequence Models
 Idea: Use two coupled RNNs, one for the source, and one for the target. The
input for the target is the output of the last hidden layer of the source RNN.
 Definition 31.3.1. A sequence-to-sequence (seq2seq) model is a neural model for
translating an input sequence x into an output sequence y by an encoder, followed
by a decoder that generates y.

input → Encoder → hi → Decoder → output

 Example 31.3.2. A simple seq2seq model (without embedding and output layers)

 Each block represents one LSTM time step; inputs are fed successively, followed by
the token <start> to start the decoder.

Figure 24.6: Basic sequence-to-sequence model. Each block represents one LSTM timestep.
(For simplicity, the embedding and output layers are not shown.) On successive steps we
feed the network the words of the source sentence “The man is tall”, followed by the <start>
tag to indicate that the network should start producing the target sentence. The final
hidden state at the end of the source sentence is used as the hidden state for the start of
the target sentence. After that, each target sentence word is used as input at the next time
step, until the network produces the <end> tag to indicate that sentence generation is
finished.

This neural network architecture is called a basic sequence-to-sequence model. Sequence-to-
sequence models are most commonly used for machine translation, but can also be used for a
number of other tasks, like automatically generating a text caption from an image, or
summarization: rewriting a long text into a shorter one that maintains the same meaning.
Michael Kohlhase: Artificial Intelligence 2 1089 2025-02-06

Seq2Seq Evaluation
 Remark: Seq2seq models were a major breakthrough in NLP and MT. But they
have two major shortcomings:
 nearby context bias: RNNs remember with their hidden state, which has more
information about a word in – say – step 56 than in step 5. BUT long-distance
context can also be important.
 fixed context size: the entire information about the source sentence must be
compressed into the fixed-dimensional – typically 1024 – vector. Larger vectors
; slow training and overfitting.
 Idea: Concatenate all source RNN hidden vectors to use all of them to mitigate
the nearby context bias.

 Problem: Huge increase of weights ; slow training and overfitting.

Michael Kohlhase: Artificial Intelligence 2 1090 2025-02-06

Attention
 Bad Idea: Concatenate all source RNN hidden vectors to use all of them to
mitigate the nearby context bias.
 Better Idea: The decoder generates the target sequence one word at a time. ;
Only a small part of the source is actually relevant at each step:
the decoder must focus on different parts of the source for every word.
 Idea: We need a neural component that does context-free summarization.
 Definition 31.3.3. An attentional seq2seq model is a seq2seq that passes along a
context vector ci in the decoder. If hi = RNN(hi−1 , xi ) is the standard decoder,
then the decoder with attention is given by hi = RNN(hi−1 , xi + ci ), where xi + ci
is the concatenation of the input xi and context vector ci with

    rij = hi−1 · sj                         raw attention score
    aij = exp(rij )/∑k exp(rik )            attention probability matrix
    ci = ∑j aij · sj                        context vector

where hi−1 is the target RNN vector that is going to be used for predicting the word at
timestep i, and sj is the output of the source RNN for the source word (or timestep) j.
Both are d-dimensional vectors, where d is the hidden size. The value rij is therefore the
raw “attention score” between the current target state and the source word. These scores
are then normalized into probabilities aij using a softmax over all source words. Finally,
these probabilities are used to generate the context vector ci : a weighted average of the
source RNN vectors (another d-dimensional vector).
An example of an attentional sequence-to-sequence model is given in Figure 24.7(a).
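
A minimal numpy sketch of these equations, with random stand-ins for the decoder state
hi−1 and the source RNN outputs sj.

  # Attention: raw scores, softmax over source positions, and the resulting context vector.
  import numpy as np

  def softmax(v):
      e = np.exp(v - v.max())
      return e / e.sum()

  d, n_src = 8, 5
  rng = np.random.default_rng(3)
  h_prev = rng.normal(size=d)                 # target RNN state before predicting word i
  S = rng.normal(size=(n_src, d))             # source RNN outputs s_1 .. s_n

  r = S @ h_prev                              # r_ij = h_{i-1} · s_j   (raw attention scores)
  a = softmax(r)                              # a_ij: attention probabilities over source words
  c = a @ S                                   # c_i = sum_j a_ij * s_j (context vector)
  print(a.round(2), c.shape)                  # probabilities sum to 1; c has hidden size d
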
Michael Kohlhase: Artificial Intelligence 2 1091 2025-02-06
Attention: English to Spanish Translation


 Example 31.3.4. An attentional seq2seq model for English-to-Spanish translation
(dashed lines represent attention; darker colors in the attention probability matrix
correspond to higher probabilities)

Figure 24.7: (a) Attentional sequence-to-sequence model for English-to-Spanish translation.
The dashed lines represent attention. (b) Example of an attention probability matrix for a
bilingual sentence pair, with darker boxes representing higher values of aij . The attention
probabilities sum to one over each column.
 Remarks: The attention component
  learns no weights and supports variable-length sequences on both the source and
target side.
  is entirely latent – the developer does not dictate what information gets used;
the model learns that.
  can be combined with multilayer RNNs; attention is then typically applied at
each layer.

Michael Kohlhase: Artificial Intelligence 2 1092 2025-02-06

Attention: Greedy Decoding


 During training, a seq2seq model tries to maximize the probability of each word in
the training sequence, conditioned on the source and the previous target words.
 Definition 31.3.5. The procedure that generates the target one word at a time
and feeds it back at the next time step is called decoding.

 Definition 31.3.6. Always selecting the highest probability word is called greedy
decoding.
 Problem: This may not always maximize the probability of the whole sequence
 Example 31.3.7. Let’s use a greedy decoder on The front door is red.

 The correct translation is La puerta de entrada es roja.


 Suppose we have generated the first word La for The.
 A greedy decoder might propose entrada for front.
 Greedy decoding is fast, but has no mechanism for correcting mistakes.

 Solution: Use an optimizing search algorithm (e.g. local beam search)

Michael Kohlhase: Artificial Intelligence 2 1093 2025-02-06

Decoding with Beam Search


 Recall: Greedy decoding is not optimal!
 Idea: Search for an optimal decoding (or at least a good one) using one of the
search algorithms from ??.
 Local beam search is a common choice in machine translation. Concretely:
 keep the top k hypotheses at each stage,
 extend each by one word using the top k choices of words,
 then choose the best k of the resulting k² new hypotheses.
When all hypotheses in the beam generate the special <end> token, the algorithm
outputs the highest scoring hypothesis.
 Observation: The better the seq2seq models get, the smaller we can keep the beam
size: today beams of b = 4 are sufficient, compared to b = 100 a decade ago.
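
A minimal Python sketch of this procedure; the scoring function next_log_probs is a
hypothetical stand-in for the decoder softmax, with made-up numbers chosen to mirror the
La puerta example from above.

  import math

  def beam_search(next_log_probs, k=2, max_len=10):
      beam = [([], 0.0)]                                   # (hypothesis, sum of word log-probabilities)
      for _ in range(max_len):
          candidates = []
          for words, score in beam:
              if words and words[-1] == "<end>":           # finished hypotheses are kept as they are
                  candidates.append((words, score))
                  continue
              top_k = sorted(next_log_probs(words).items(), key=lambda x: -x[1])[:k]
              for w, lp in top_k:                          # extend each hypothesis by one word
                  candidates.append((words + [w], score + lp))
          beam = sorted(candidates, key=lambda x: -x[1])[:k]   # best k of the <= k*k new hypotheses
          if all(ws and ws[-1] == "<end>" for ws, _ in beam):
              break
      return max(beam, key=lambda x: x[1])

  def next_log_probs(prefix):
      """Hypothetical stand-in for the decoder softmax (made-up numbers)."""
      table = {
          (): {"la": 0.9, "el": 0.1},
          ("la",): {"entrada": 0.6, "puerta": 0.4},
          ("la", "puerta"): {"de": 0.9, "<end>": 0.1},
          ("la", "entrada"): {"<end>": 0.2, "es": 0.1},    # dead end: only low-probability continuations
      }
      probs = table.get(tuple(prefix), {"<end>": 1.0})
      return {w: math.log(p) for w, p in probs.items()}

  print(beam_search(next_log_probs, k=2))    # recovers "la puerta de ...", where greedy would commit to "la entrada"
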

Michael Kohlhase: Artificial Intelligence 2 1094 2025-02-06

Decoding with Beam Search


 Example 31.3.8. A local beam search with beam size b = 2

 Word scores are log-probabilities generated by the decoder softmax,
 hypothesis scores are the sums of the word scores.

 At time step 3, the highest scoring hypothesis La entrada can only generate low-
probability continuations, so it “falls off the beam”.          (as intended)

Michael Kohlhase: Artificial Intelligence 2 1095 2025-02-06

31.4 The Transformer Architecture



Self-Attention
 Idea: “Attention is all you need!” (see [Vas+17])
 So far, attention was used from the encoder to the decoder.

 Self-attention extends this so that each hidden state sequence also attends to itself.
(*coder to *coder)
 Idea: Just use the dot product of the input vectors.
 Problem: The dot product of a vector with itself is always high, so each hidden
state will be biased towards attending to itself.
 Self-attention solves this by first projecting the input into three different represen-
tations using three different weight matrices:
 the query vector qi = Wq xi =b standard attention
 key vector ki = Wk xi =b the source in seq2seq
 value vector vi = Wv xi is the context being generated


    rij = (qi · kj )/√d
    aij = exp(rij )/∑k exp(rik )
    ci = ∑j aij · vj

where d is the dimension of k and q.
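
A minimal numpy sketch of these self-attention equations; the projection matrices are random
stand-ins and the sizes are made up.

  # Scaled dot-product self-attention over a sequence of n input vectors.
  import numpy as np

  def softmax_rows(M):
      e = np.exp(M - M.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  n, d_in, d = 6, 16, 8                      # sequence length, input size, query/key size
  rng = np.random.default_rng(4)
  X = rng.normal(size=(n, d_in))             # one input vector x_i per position
  W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_in, d)) for _ in range(3))

  Q, K, V = X @ W_q, X @ W_k, X @ W_v        # q_i = W_q x_i, k_i = W_k x_i, v_i = W_v x_i
  R = Q @ K.T / np.sqrt(d)                   # r_ij = (q_i · k_j) / sqrt(d)
  A = softmax_rows(R)                        # a_ij: each position attends over all positions
  C = A @ V                                  # c_i = sum_j a_ij v_j
  print(A.shape, C.shape)                    # (6, 6) (6, 8)
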

Michael Kohlhase: Artificial Intelligence 2 1096 2025-02-06

The Transformer Architecture


 Definition 31.4.1. The transformer architecture uses neural blocks called trans-
formers, which are built up from multiple transformer layers.
 Remark: The context modeled in self-attention is agnostic to word order ⇝
transformers use positional embeddings to cope with that.

 Example 31.4.2. Figure 24.9: A single-layer transformer consists of self-attention,
a feed-forward network, and residual connections to cope with the vanishing gradient
problem.

 In practice transformers consist of 6-7 transformer layers; as with the other models
we have seen, the output of one layer is used as the input to the next.

Positional embedding: The transformer architecture does not explicitly capture the order
of words in the sequence, since context is modeled only through self-attention, which is
agnostic to word order. To capture the ordering of the words, the transformer uses positional
embeddings.

Michael Kohlhase: Artificial Intelligence 2 1097 2025-02-06

A Transformer for POS tagging


 Example 31.4.3. A transformer for POS tagging:

Figure 24.10: Using the transformer architecture for POS tagging, applied to the same
sentence as in Figure 24.3. At the bottom, the word embeddings and the positional
embeddings are summed to form the input for a three-layer transformer. The transformer
produces one vector per word, as in RNN-based POS tagging. Each vector is fed into a final
output layer and softmax layer to produce a probability distribution over the tags.

Michael Kohlhase: Artificial Intelligence 2 1098 2025-02-06


In this section, we have actually only told half the transformer story: the model we
described here is called the transformer encoder. It is useful for text classification
tasks. The full transformer architecture was originally designed as a sequence-to-sequence
model for machine translation. Therefore, in addition to the encoder, it also includes a
transformer decoder. The encoder and decoder are nearly identical, except that the decoder
uses a version of self-attention where each word can only attend to the words before it,
since text is generated left-to-right. The decoder also has a second attention module in
each transformer layer that attends to the output of the transformer encoder.

31.5 Large Language Models


Pretraining and Transfer Learning


 Getting enough data to build a robust model can be a challenge.

 In NLP we often work with unlabeled data


 syntactic/semantic labeling is much more difficult and costly than image labeling.
 the Internet has lots of texts (adds ∼ 10¹¹ words/day)
 Idea: Why not let others do this work and re-use their training efforts.

 Definition 31.5.1. In pretraining we use


 a large amount of shared general-domain language data to train an initial version
of an NLP model.
 a smaller amount of domain-specific data (perhaps labeled) to finetune it to the
vocabulary, idioms, syntactic structures, and other linguistic phenomena that are
specific to the new domain.
 Pretraining is a form of transfer learning:
 Definition 31.5.2. In Transfer learning (TL), knowledge learned from a task is
re-used in order to boost performance on a related task.

 Idea: Take a pretrained neural network, replace the last layer(s), and then train
those on your own corpus.
 Observation: Simple but surprisingly efficient!
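
A minimal numpy sketch of this idea: a frozen, pretrained encoder (here just a stand-in
function) provides features, and only a freshly initialized last layer is trained on the small
domain corpus. All names and data are made up for illustration.

  import numpy as np

  rng = np.random.default_rng(5)
  d_feat, n_classes = 32, 2

  def pretrained_encoder(text):
      """Stand-in for a frozen pretrained model mapping text to a feature vector."""
      rs = np.random.default_rng(sum(map(ord, text)))
      return rs.normal(size=d_feat)

  # small labeled domain corpus (hypothetical)
  data = [("great movie", 1), ("terrible plot", 0), ("loved it", 1), ("boring film", 0)]
  X = np.stack([pretrained_encoder(t) for t, _ in data])   # frozen features, never updated
  y = np.array([label for _, label in data])

  W = np.zeros((n_classes, d_feat))                        # new last layer, trained from scratch
  for _ in range(500):                                     # plain softmax regression on the head
      logits = X @ W.T
      P = np.exp(logits - logits.max(axis=1, keepdims=True))
      P /= P.sum(axis=1, keepdims=True)
      G = P.copy()
      G[np.arange(len(y)), y] -= 1                         # gradient of cross-entropy w.r.t. logits
      W -= 0.1 * (G.T @ X) / len(y)                        # only the head is updated

  print((X @ W.T).argmax(axis=1), y)                       # predictions on the small corpus vs. gold labels
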

Michael Kohlhase: Artificial Intelligence 2 1099 2025-02-06

Large Language Models


Definition 31.5.3. A Large Language Model (LLM) is a generic pretrained neural
network, providing embeddings for sentences or entire documents for NLP tasks. In
practice, they (usually) combine the following components:

 Tokenization: Splitting text into tokens (characters, words, punctuation,...)


 embeddings for these tokens, (e.g., Word2vec – or we let the transformer learn
them)
 positional embeddings of tokens (encodes where in a sentence a token is)

 a transformer architecture, trained on


 a masked token prediction task.
LLMs can be used for a variety of tasks.

 classification (e.g., sentiment analysis, POS-tagging),


 translation (between languages, styles, etc.),
 generation (e.g., text completion, summarization, chatbots),
 ...

Michael Kohlhase: Artificial Intelligence 2 1100 2025-02-06

Tokenization - Byte Pair Encodings


So far: we have encoded text either as sequences of characters (non-semantic) or as
sequences of words (semantic, but virtually unlimited vocabulary, OOV-problems).
Idea: Find a middle ground: Learn an optimal vocabulary of tokens from data and
split text into a sequence of tokens.
Definition 31.5.4. The Byte Pair Encoding (BPE) algorithm learns a vocabulary of
tokens of given size N > 256 from a corpus C, by doing the following:
 Let ℓ = 256 and set BPE(⟨b⟩) = b for every byte 0 ≤ b ≤ 255.
 While ℓ < N , find the most common pair of tokens (a, b) and let BPE(⟨a, b⟩) = ℓ+1
(and increase ℓ by 1).

 Repeat until ℓ = N .
; we obtain a one-hot encoding of tokens of size N , where the most common sequences
of bytes are represented by a single token. By retaining BPE(⟨b⟩) = b, we avoid OOV
problems.
; We can then train a word embedding on the resulting tokens

Alternative techniques include WordPiece and SentencePiece.
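
A minimal Python sketch of the BPE merge loop from Definition 31.5.4; it operates on the
characters of a toy corpus instead of raw bytes so that the learned tokens stay readable.

  from collections import Counter

  def learn_bpe(corpus, n_merges):
      seqs = [list(word) for word in corpus.split()]     # start from single characters ("bytes")
      merges = []
      for _ in range(n_merges):
          pairs = Counter()
          for s in seqs:
              pairs.update(zip(s, s[1:]))                # count adjacent token pairs
          if not pairs:
              break
          (a, b), _ = pairs.most_common(1)[0]            # most common pair becomes a new token
          merges.append((a, b))
          new_seqs = []
          for s in seqs:                                 # replace every occurrence of the pair
              merged, i = [], 0
              while i < len(s):
                  if i + 1 < len(s) and (s[i], s[i + 1]) == (a, b):
                      merged.append(a + b)
                      i += 2
                  else:
                      merged.append(s[i])
                      i += 1
              new_seqs.append(merged)
          seqs = new_seqs
      return merges, seqs

  merges, tokenized = learn_bpe("low lower lowest slow slower", n_merges=4)
  print(merges)        # e.g. [('l', 'o'), ('lo', 'w'), ...]
  print(tokenized)
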

Michael Kohlhase: Artificial Intelligence 2 1101 2025-02-06

Tokenization - Example
https://huggingface.co/spaces/Xenova/the-tokenizer-playground

Michael Kohlhase: Artificial Intelligence 2 1102 2025-02-06

Positional encodings
Definition 31.5.5. Let ⟨w1 , . . . , wn ⟩ be a sequence of tokens. A positional encoding
PEi (wi ) is a vector that retains the position of wi in the sequence alongside the word
embedding of wi .
We want positional encodings to satisfy the following properties:

1. PEi (w) ̸= PEj (w) for i ̸= j,


2. PE should retain distances: if i1 −i2 = j1 −j2 , then given the embeddings for w1 , w2 ,
we should be able to linearly transform ⟨PEi1 (w1 ), PEi2 (w2 )⟩ into ⟨PEj1 (w1 ), PEj2 (w2 )⟩.
; no entirely separate embeddings for w1 , w2 depending on positions
; learning from short sentences generalizes (ideally) to longer ones

Michael Kohlhase: Artificial Intelligence 2 1103 2025-02-06

Sinusoidal positional encoding


Idea: Let PEt (w) = E(w)+pt , for some suitable pt (where E(w) is the word embedding
for token w).
; pt has the same dimensionality as our embedding E.
Idea: Use a combination of sine and cosine functions with different frequencies for
each dimension of the embedding.
Attention is all you need: For embedding dimension d, we define

    (pt )i := sin(t/c^(2k/d))   if i = 2k
    (pt )i := cos(t/c^(2k/d))   if i = 2k + 1

for some constant c.                                 (10000 in the paper)


; works for arbitrary sequence lengths and vocabulary sizes.
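
A minimal numpy sketch of this encoding (with c = 10000 as in the paper); the embedding
dimension and sequence length are made-up illustrations.

  import numpy as np

  def positional_encoding(t, d, c=10000.0):
      p = np.zeros(d)
      for k in range(d // 2):
          angle = t / c ** (2 * k / d)
          p[2 * k] = np.sin(angle)         # even dimensions: sine
          p[2 * k + 1] = np.cos(angle)     # odd dimensions: cosine
      return p

  d = 16
  PE = np.stack([positional_encoding(t, d) for t in range(50)])   # works for any sequence length
  # in practice the input to the transformer is E(w_t) + p_t for each token w_t
  print(PE.shape)                          # (50, 16)
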

Michael Kohlhase: Artificial Intelligence 2 1104 2025-02-06

Training Large Language Models


Three strategies for training LLMs:

 Masked Token Prediction: Given a sentence (e.g. “The river rose five feet”), ran-
domly replace tokens by a special mask token (e.g. “The river [MASK] five feet”).
The LLM should predict the masked tokens (e.g. “rose”). (BERT et al; well suited
for generic tasks)
 Discrimination: Train a small masked token prediction model M . Given a masked
sentence, let M generate possible completions. Train the actual model to distin-
guish between tokens generated by M and the original tokens. (Google Electra et
al; well suited for generic tasks)
 Next Token Prediction: Given the (beginning of) a sentence, predict the next token
in the sequence. (GPT et al; well suited for generative tasks)

; All techniques turn an unlabelled corpus into a supervised learning task.
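
A minimal Python sketch of how masked token prediction turns unlabelled text into supervised
training examples; the masking rate and the [MASK] symbol are assumptions in the spirit of
BERT-style training.

  import random

  def make_masked_example(tokens, mask_prob=0.15, rng=random.Random(0)):
      inputs, labels = [], []
      for tok in tokens:
          if rng.random() < mask_prob:
              inputs.append("[MASK]")
              labels.append(tok)           # the model must predict the original token here
          else:
              inputs.append(tok)
              labels.append(None)          # no loss on unmasked positions
      return inputs, labels

  print(make_masked_example("the river rose five feet".split(), mask_prob=0.3))
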

Michael Kohlhase: Artificial Intelligence 2 1105 2025-02-06

Deep Learning for NLP: Evaluation

 Deep learning methods are currently dominant in NLP! (think ChatGPT)


 Data-driven methods are easier to develop and maintain than symbolic ones
 also perform better than models crafted by humans (with reasonable effort)

 But problems remain:


 DL methods work best on immense amounts of data. (small languages?)
 LLM contain knowledge, but integration with symbolic methods elusive.

 DL4NLP methods do very well, but only after processing orders of magnitude more
data than humans do for learning language.
 This suggests that there is scope for new insights from all areas.

Michael Kohlhase: Artificial Intelligence 2 1106 2025-02-06


Chapter 32

What did we learn in AI 1/2?

Topics of AI-1 (Winter Semester)


 Getting Started
 What is Artificial Intelligence? (situating ourselves)
 Logic programming in Prolog (An influential paradigm)
 Intelligent Agents (a unifying framework)
 Problem Solving
 Problem Solving and search (Black Box World States and Actions)
 Adversarial search (Game playing) (A nice application of search)
 constraint satisfaction problems (Factored World States)
 Knowledge and Reasoning
 Formal Logic as the mathematics of Meaning
 Propositional logic and satisfiability (Atomic Propositions)
 First-order logic and theorem proving (Quantification)
 Logic programming (Logic + Search; Programming)
 Description logics and semantic web

 Planning
 Planning Frameworks
 Planning Algorithms
 Planning and Acting in the real world

Michael Kohlhase: Artificial Intelligence 2 1107 2025-02-06

Rational Agents as an Evaluation Framework for AI


 Agents interact with the environment

General agent schema

Figure 2.1: Agents interact with environments through sensors and actuators.

Simple Reflex Agents

Figure 2.9: Schematic diagram of a simple reflex agent.

  function SIMPLE-REFLEX-AGENT(percept) returns an action
    persistent: rules, a set of condition–action rules
    state ← INTERPRET-INPUT(percept)
    rule ← RULE-MATCH(state, rules)
    action ← rule.ACTION
    return action

Figure 2.10: A simple reflex agent. It acts according to a rule whose condition matches
the current state, as defined by the percept.

Reflex Agents with State

Figure 2.11: A model-based reflex agent.

  function MODEL-BASED-REFLEX-AGENT(percept) returns an action
    persistent: state, the agent’s current conception of the world state
                model, a description of how the next state depends on current state and action
                rules, a set of condition–action rules
                action, the most recent action, initially none
    state ← UPDATE-STATE(state, action, percept, model)
    rule ← RULE-MATCH(state, rules)
    action ← rule.ACTION
    return action

Figure 2.12: A model-based reflex agent. It keeps track of the current state of the world,
using an internal model. It then chooses an action in the same way as the reflex agent.

Goal-Based Agents

Figure 2.13: A model-based, goal-based agent. It keeps track of the world state as well as
a set of goals it is trying to achieve, and chooses an action that will (eventually) lead to
the achievement of its goals.

Utility-Based Agent

Figure 2.14: A model-based, utility-based agent. It uses a model of the world, along with
a utility function that measures its preferences among states of the world. Then it chooses
the action that leads to the best expected utility, where expected utility is computed by
averaging over all possible outcome states, weighted by the probability of the outcome.

Learning Agents

Figure 2.15: A general learning agent.

Michael Kohlhase: Artificial Intelligence 2 1108 2025-02-06

Rational Agent

 Idea: Try to design agents that are successful (do the right thing)

 Definition 32.0.1. An agent is called rational, if it chooses whichever action max-
imizes the expected value of the performance measure given the percept sequence
to date. This is called the MEU principle.

 Note: A rational agent need not be perfect

  only needs to maximize expected value (rational ̸= omniscient)
  need not predict e.g. very unlikely but catastrophic events in the future
  percepts may not supply all relevant information (rational ̸= clairvoyant)
  if we cannot perceive things we do not need to react to them.
  but we may need to try to find out about hidden dangers (exploration)
  action outcomes may not be as expected (rational ̸= successful)
  but we may need to take action to ensure that they do (more often) (learning)

 Rational ; exploration, learning, autonomy

Michael Kohlhase: Artificial Intelligence 2 1109 2025-02-06

Symbolic AI: Adding Knowledge to Algorithms

 Problem Solving (Black Box States, Transitions, Heuristics)


 Framework: Problem Solving and Search (basic tree/graph walking)
 Variant: Game playing (Adversarial search) (minimax + αβ-Pruning)

 Constraint Satisfaction Problems (heuristic search over partial assignments)



 States as partial variable assignments, transitions as assignment


 Heuristics informed by current restrictions, constraint graph
 Inference as constraint propagation (transferring possible values across arcs)

 Describing world states by formal language (and drawing inferences)


 Propositional logic and DPLL (deciding entailment efficiently)
 First-order logic and ATP (reasoning about infinite domains)
 Digression: Logic programming (logic + search)
 Description logics as moderately expressive, but decidable logics

 Planning: Problem Solving using white-box world/action descriptions


 Framework: describing world states in logic as sets of propositions and actions
by preconditions and add/delete lists
 Algorithms: e.g. heuristic search by problem relaxations

Michael Kohlhase: Artificial Intelligence 2 1110 2025-02-06

Topics of AI-2 (Summer Semester)


 Uncertain Knowledge and Reasoning
 Uncertainty
 Probabilistic reasoning
 Making Decisions in Episodic Environments
 Problem Solving in Sequential Environments
 Foundations of machine learning
 Learning from Observations
 Knowledge in Learning
 Statistical Learning Methods
 Communication (If there is time)
 Natural Language Processing
 Natural Language for Communication

Michael Kohlhase: Artificial Intelligence 2 1111 2025-02-06

Statistical AI: Adding uncertainty and Learning

 Problem Solving under uncertainty(non-observable environment, stochastic states)


 Framework: Probabilistic Inference: Conditional Probabilities/Independence
 Intuition: Reasoning in Belief Space instead of State Space!
 Implementation: Bayesian Networks (exploit conditional independence)

 Extension: Utilities and Decision Theory (for static/episodic environments)


 Problem Solving in Sequential Worlds:
 Framework: Markov Processes, transition models
 Extension: MDPs, POMDPs (+ utilities/decisions)
 Implementation: Dynamic Bayesian Networks
 Machine learning: adding optimization in changing environments (unsupervised)
 Framework: Learning from Observations (positive/negative examples)
 Intuitions: finding consistent/optimal hypotheses in a hypothesis space
 Problems: consistency, expressivity, under/overfitting, computational/data re-
sources.
 Extensions
 knowledge in learning (based on logical methods)
 statistical learning (optimizing the probability distribution over the hypothesis
space, learning BNs)
 Communication
 Phenomena of natural language (NL is interesting/complex)
 symbolic/statistical NLP (historic/as a backup)
 Deep Learning for NLP (the current hype/solution)

Michael Kohlhase: Artificial Intelligence 2 1112 2025-02-06

Topics of AI-3 – A Course not taught at FAU /


 Machine Learning

 Theory and Practice of Deep Learning


 More Reinforcement Learning
 Communicating, Perceiving, and Acting

 More NLP, dialogue, speech acts, ...


 Natural Language Semantics/Pragmatics
 Perception
 Robotics
 Emotions, Sentiment Analysis

 The Good News: All is not lost


 There are tons of specialized courses at FAU (more as we speak)
 Russell/Norvig’s AIMA [RN09] covers some of them as well!

Michael Kohlhase: Artificial Intelligence 2 1113 2025-02-06


Bibliography

[Bac00] Fahiem Bacchus. Subset of PDDL for the AIPS2000 Planning Competition. The AIPS-
00 Planning Competition Comitee. 2000.
[BF95] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Proceedings of the 14th International Joint Conference on Artificial Intelligence
(IJCAI). Ed. by Chris S. Mellish. Montreal, Canada: Morgan Kaufmann, San Mateo,
CA, 1995, pp. 1636–1642.
[BF97] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Artificial Intelligence 90.1-2 (1997), pp. 279–298.
[BG01] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search”. In: Artificial Intelli-
gence 129.1–2 (2001), pp. 5–33.
[BG99] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search: New Results”. In:
Proceedings of the 5th European Conference on Planning (ECP’99). Ed. by S. Biundo
and M. Fox. Springer-Verlag, 1999, pp. 60–72.
[BKS04] Paul Beame, Henry A. Kautz, and Ashish Sabharwal. “Towards Understanding and
Harnessing the Potential of Clause Learning”. In: Journal of Artificial Intelligence
Research 22 (2004), pp. 319–351.
[Bon+12] Blai Bonet et al., eds. Proceedings of the 22nd International Conference on Automated
Planning and Scheduling (ICAPS’12). AAAI Press, 2012.
[Bro90] Rodney Brooks. In: Robotics and Autonomous Systems 6.1–2 (1990), pp. 3–15. doi:
10.1016/S0921-8890(05)80025-9.
[Cho65] Noam Chomsky. Syntactic structures. Den Haag: Mouton, 1965.
[CKT91] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. “Where the Really Hard
Problems Are”. In: Proceedings of the 12th International Joint Conference on Artificial
Intelligence (IJCAI). Ed. by John Mylopoulos and Ray Reiter. Sydney, Australia:
Morgan Kaufmann, San Mateo, CA, 1991, pp. 331–337.
[CM85] Eugene Charniak and Drew McDermott. Introduction to Artificial Intelligence. Ad-
dison Wesley, 1985.
[CQ69] Allan M. Collins and M. Ross Quillian. “Retrieval time from semantic memory”. In:
Journal of verbal learning and verbal behavior 8.2 (1969), pp. 240–247. doi: 10.1016/
S0022-5371(69)80069-1.
[Dav67] Donald Davidson. “Truth and Meaning”. In: Synthese 17 (1967).
[DCM12] DCMI Usage Board. DCMI Metadata Terms. DCMI Recommendation. Dublin Core
Metadata Initiative, June 14, 2012. url: http : / / dublincore . org / documents /
2012/06/14/dcmi-terms/.
[DF31] B. De Finetti. “Sul significato soggettivo della probabilita”. In: Fundamenta Mathe-
maticae 17 (1931), pp. 298–329.


[DHK15] Carmel Domshlak, Jörg Hoffmann, and Michael Katz. “Red-Black Planning: A New
Systematic Approach to Partial Delete Relaxation”. In: Artificial Intelligence 221
(2015), pp. 73–114.
[Ede01] Stefan Edelkamp. “Planning with Pattern Databases”. In: Proceedings of the 6th Eu-
ropean Conference on Planning (ECP’01). Ed. by A. Cesta and D. Borrajo. Springer-
Verlag, 2001, pp. 13–24.
[FD14] Zohar Feldman and Carmel Domshlak. “Simple Regret Optimization in Online Plan-
ning for Markov Decision Processes”. In: Journal of Artificial Intelligence Research
51 (2014), pp. 165–205.
[Fis] John R. Fisher. prolog :- tutorial. url: https : / / saksagan . ceng . metu . edu .
tr/courses/ceng242/documents/prolog/jrfisher/contents.html (visited on
10/29/2024).
[FL03] Maria Fox and Derek Long. “PDDL2.1: An Extension to PDDL for Expressing Tem-
poral Planning Domains”. In: Journal of Artificial Intelligence Research 20 (2003),
pp. 61–124.
[Fla94] Peter Flach. Simply Logical: Intelligent Reasoning by Example. Wiley, 1994. isbn: 0471
94152 2. url: https://github.com/simply-logical/simply-logical/releases/download/v1.0/SL.pdf.
[FN71] Richard E. Fikes and Nils Nilsson. “STRIPS: A New Approach to the Application of
Theorem Proving to Problem Solving”. In: Artificial Intelligence 2 (1971), pp. 189–
208.
[Gen34] Gerhard Gentzen. “Untersuchungen über das logische Schließen I”. In: Mathematische
Zeitschrift 39.2 (1934), pp. 176–210.
[Ger+09] Alfonso Gerevini et al. “Deterministic planning in the fifth international planning
competition: PDDL3 and experimental evaluation of the planners”. In: Artificial In-
telligence 173.5-6 (2009), pp. 619–668.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability—A Guide to
the Theory of NP-Completeness. BN book: Freeman, 1979.
[Glo] Grundlagen der Logik in der Informatik. Course notes at https://www8.cs.fau.de/
_media/ws16:gloin:skript.pdf. url: https://www8.cs.fau.de/_media/ws16:
gloin:skript.pdf (visited on 10/13/2017).
[GNT04] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and
Practice. Morgan Kaufmann, 2004.
[GS05] Carla Gomes and Bart Selman. “Can get satisfaction”. In: Nature 435 (2005), pp. 751–
752.
[GSS03] Alfonso Gerevini, Alessandro Saetti, and Ivan Serina. “Planning through Stochas-
tic Local Search and Temporal Action Graphs”. In: Journal of Artificial Intelligence
Research 20 (2003), pp. 239–290.
[Hau85] John Haugeland. Artificial intelligence: the very idea. Massachusetts Institute of Tech-
nology, 1985.
[HD09] Malte Helmert and Carmel Domshlak. “Landmarks, Critical Paths and Abstractions:
What’s the Difference Anyway?” In: Proceedings of the 19th International Conference
on Automated Planning and Scheduling (ICAPS’09). Ed. by Alfonso Gerevini et al.
AAAI Press, 2009, pp. 162–169.
[HE05] Jörg Hoffmann and Stefan Edelkamp. “The Deterministic Part of IPC-4: An Overview”.
In: Journal of Artificial Intelligence Research 24 (2005), pp. 519–579.
[Hel06] Malte Helmert. “The Fast Downward Planning System”. In: Journal of Artificial In-
telligence Research 26 (2006), pp. 191–246.

[Her+13a] Ivan Herman et al. RDF 1.1 Primer (Second Edition). Rich Structured Data Markup
for Web Documents. W3C Working Group Note. World Wide Web Consortium (W3C),
2013. url: http://www.w3.org/TR/rdfa-primer.
[Her+13b] Ivan Herman et al. RDFa 1.1 Primer – Second Edition. Rich Structured Data Markup
for Web Documents. W3C Working Goup Note. World Wide Web Consortium (W3C),
Apr. 19, 2013. url: http://www.w3.org/TR/xhtml-rdfa-primer/.
[HG00] Patrik Haslum and Hector Geffner. “Admissible Heuristics for Optimal Planning”. In:
Proceedings of the 5th International Conference on Artificial Intelligence Planning
Systems (AIPS’00). Ed. by S. Chien, R. Kambhampati, and C. Knoblock. Brecken-
ridge, CO: AAAI Press, Menlo Park, 2000, pp. 140–149.
[HG08] Malte Helmert and Hector Geffner. “Unifying the Causal Graph and Additive Heuris-
tics”. In: Proceedings of the 18th International Conference on Automated Planning
and Scheduling (ICAPS’08). Ed. by Jussi Rintanen et al. AAAI Press, 2008, pp. 140–
147.
[HHH07] Malte Helmert, Patrik Haslum, and Jörg Hoffmann. “Flexible Abstraction Heuristics
for Optimal Sequential Planning”. In: Proceedings of the 17th International Conference
on Automated Planning and Scheduling (ICAPS’07). Ed. by Mark Boddy, Maria
Fox, and Sylvie Thiebaux. Providence, Rhode Island, USA: Morgan Kaufmann, 2007,
pp. 176–183.
[Hit+12] Pascal Hitzler et al. OWL 2 Web Ontology Language Primer (Second Edition). W3C
Recommendation. World Wide Web Consortium (W3C), 2012. url: http://www.
w3.org/TR/owl-primer.
[HN01] Jörg Hoffmann and Bernhard Nebel. “The FF Planning System: Fast Plan Generation
Through Heuristic Search”. In: Journal of Artificial Intelligence Research 14 (2001),
pp. 253–302.
[Hof11] Jörg Hoffmann. “Everything You Always Wanted to Know about Planning (But
Were Afraid to Ask)”. In: Proceedings of the 34th Annual German Conference on
Artificial Intelligence (KI’11). Ed. by Joscha Bach and Stefan Edelkamp. Vol. 7006.
Lecture Notes in Computer Science. Springer, 2011, pp. 1–13. url: http://fai.cs.
uni-saarland.de/hoffmann/papers/ki11.pdf.
[How60] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[ILD] 7. Constraints: Interpreting Line Drawings. url: https://www.youtube.com/watch?
v=l-tzjenXrvI&t=2037s (visited on 11/19/2019).
[JN33] J. Neyman and E. S. Pearson. “IX. On the problem of the most efficient tests of statis-
tical hypotheses”. In: Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences 231.694-706 (1933), pp. 289–337.
doi: 10.1098/rsta.1933.0009.
[KC04] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Con-
cepts and Abstract Syntax. W3C Recommendation. World Wide Web Consortium
(W3C), Feb. 10, 2004. url: http://www.w3.org/TR/2004/REC- rdf- concepts-
20040210/.
[KD09] Erez Karpas and Carmel Domshlak. “Cost-Optimal Planning with Landmarks”. In:
Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJ-
CAI’09). Ed. by C. Boutilier. Pasadena, California, USA: Morgan Kaufmann, July
2009, pp. 1728–1733.
[Kee74] R. L. Keeney. “Multiplicative utility functions”. In: Operations Research 22 (1974),
pp. 22–34.

[KHD13] Michael Katz, Jörg Hoffmann, and Carmel Domshlak. “Who Said We Need to Relax
all Variables?” In: Proceedings of the 23rd International Conference on Automated
Planning and Scheduling (ICAPS’13). Ed. by Daniel Borrajo et al. Rome, Italy: AAAI
Press, 2013, pp. 126–134.
[KHH12a] Michael Katz, Jörg Hoffmann, and Malte Helmert. “How to Relax a Bisimulation?”
In: Proceedings of the 22nd International Conference on Automated Planning and
Scheduling (ICAPS’12). Ed. by Blai Bonet et al. AAAI Press, 2012, pp. 101–109.
[KHH12b] Emil Keyder, Jörg Hoffmann, and Patrik Haslum. “Semi-Relaxed Plan Heuristics”.
In: Proceedings of the 22nd International Conference on Automated Planning and
Scheduling (ICAPS’12). Ed. by Blai Bonet et al. AAAI Press, 2012, pp. 128–136.
[KNS97] B. Kessler, G. Nunberg, and H. Schütze. “Automatic detection of text genre”. In:
CoRR cmp-lg/9707002 (1997).
[Koe+97] Jana Koehler et al. “Extending Planning Graphs to an ADL Subset”. In: Proceedings
of the 4th European Conference on Planning (ECP’97). Ed. by S. Steel and R. Alami.
Springer-Verlag, 1997, pp. 273–285. url: ftp://ftp.informatik.uni- freiburg.
de/papers/ki/koehler-etal-ecp-97.ps.gz.
[Koh08] Michael Kohlhase. “Using LATEX as a Semantic Markup Format”. In: Mathematics in
Computer Science 2.2 (2008), pp. 279–304. url: https://kwarc.info/kohlhase/
papers/mcs08-stex.pdf.
[Kow97] Robert Kowalski. “Algorithm = Logic + Control”. In: Communications of the Asso-
ciation for Computing Machinery 22 (1997), pp. 424–436.
[KS00] Jana Köhler and Kilian Schuster. “Elevator Control as a Planning Problem”. In: AIPS
2000 Proceedings. AAAI, 2000, pp. 331–338. url: https://www.aaai.org/Papers/
AIPS/2000/AIPS00-036.pdf.
[KS06] Levente Kocsis and Csaba Szepesvári. “Bandit Based Monte-Carlo Planning”. In:
Proceedings of the 17th European Conference on Machine Learning (ECML 2006). Ed.
by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou. Vol. 4212. LNCS.
Springer-Verlag, 2006, pp. 282–293.
[KS92] Henry A. Kautz and Bart Selman. “Planning as Satisfiability”. In: Proceedings of the
10th European Conference on Artificial Intelligence (ECAI’92). Ed. by B. Neumann.
Vienna, Austria: Wiley, Aug. 1992, pp. 359–363.
[KS98] Henry A. Kautz and Bart Selman. “Pushing the Envelope: Planning, Propositional
Logic, and Stochastic Search”. In: Proceedings of the Thirteenth National Conference
on Artificial Intelligence AAAI-96. MIT Press, 1998, pp. 1194–1201.
[Kur90] Ray Kurzweil. The Age of Intelligent Machines. MIT Press, 1990. isbn: 0-262-11121-7.
[LPN] Learn Prolog Now! url: http://lpn.swi-prolog.org/ (visited on 10/10/2019).
[LS93] George F. Luger and William A. Stubblefield. Artificial Intelligence: Structures and
Strategies for Complex Problem Solving. World Student Series. The Benjamin/Cum-
mings, 1993. isbn: 9780805347852.
[Luc96] Peter Lucas. “Knowledge Acquisition for Decision-theoretic Expert Systems”. In:
AISB Quarterly 94 (1996), pp. 23–33. url: https : / / www . researchgate . net /
publication/2460438_Knowledge_Acquisition_for_Decision-theoretic_Expert_
Systems.
[McD+98] Drew McDermott et al. The PDDL Planning Domain Definition Language. The AIPS-
98 Planning Competition Comitee. 1998.
[Met+53] N. Metropolis et al. “Equations of state calculations by fast computing machines”. In:
Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[Min] Minion - Constraint Modelling. System Web page at http://constraintmodelling.
org/minion/. url: http://constraintmodelling.org/minion/.

[MSL92] David Mitchell, Bart Selman, and Hector J. Levesque. “Hard and Easy Distributions
of SAT Problems”. In: Proceedings of the 10th National Conference of the American
Association for Artificial Intelligence (AAAI’92). San Jose, CA: MIT Press, 1992,
pp. 459–465.
[NHH11] Raz Nissim, Jörg Hoffmann, and Malte Helmert. “Computing Perfect Heuristics in
Polynomial Time: On Bisimulation and Merge-and-Shrink Abstraction in Optimal
Planning”. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence (IJCAI’11). Ed. by Toby Walsh. AAAI Press/IJCAI, 2011, pp. 1983–
1990.
[Nor+18a] Emily Nordmann et al. Lecture capture: Practical recommendations for students and
lecturers. 2018. url: https://osf.io/huydx/download.
[Nor+18b] Emily Nordmann et al. Vorlesungsaufzeichnungen nutzen: Eine Anleitung für Studierende.
2018. url: https://osf.io/e6r7a/download.
[NS63] Allen Newell and Herbert Simon. “GPS, a program that simulates human thought”.
In: Computers and Thought. Ed. by E. Feigenbaum and J. Feldman. McGraw-Hill,
1963, pp. 279–293.
[NS76] Alan Newell and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols
and Search”. In: Communications of the ACM 19.3 (1976), pp. 113–126. doi: 10.
1145/360018.360022.
[OWL09] OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C
Recommendation. World Wide Web Consortium (W3C), Oct. 27, 2009. url: http:
//www.w3.org/TR/2009/REC-owl2-overview-20091027/.
[PD09] Knot Pipatsrisawat and Adnan Darwiche. “On the Power of Clause-Learning SAT
Solvers with Restarts”. In: Proceedings of the 15th International Conference on Princi-
ples and Practice of Constraint Programming (CP’09). Ed. by Ian P. Gent. Vol. 5732.
Lecture Notes in Computer Science. Springer, 2009, pp. 654–668.
[Pól73] George Pólya. How to Solve it. A New Aspect of Mathematical Method. Princeton
University Press, 1973.
[Pra+94] Malcolm Pradhan et al. “Knowledge Engineering for Large Belief Networks”. In:
Proceedings of the Tenth International Conference on Uncertainty in Artificial In-
telligence. UAI’94. Seattle, WA: Morgan Kaufmann Publishers Inc., 1994, pp. 484–
490. isbn: 1-55860-332-8. url: http://dl.acm.org/citation.cfm?id=2074394.
2074456.
[Pro] Protégé. Project Home page at http://protege.stanford.edu. url: http://protege.stanford.edu.
[PRR97] G. Probst, St. Raub, and Kai Romhardt. Wissen managen. 4 (2003). Gabler Verlag,
1997.
[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. World Wide Web Consortium (W3C), Jan. 15, 2008. url: http:
//www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
[PW92] J. Scott Penberthy and Daniel S. Weld. “UCPOP: A Sound, Complete, Partial Order
Planner for ADL”. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the 3rd International Conference (KR-92). Ed. by B. Nebel, W. Swartout,
and C. Rich. Cambridge, MA: Morgan Kaufmann, Oct. 1992, pp. 103–114. url: ftp:
//ftp.cs.washington.edu/pub/ai/ucpop-kr92.ps.Z.
[Ran17] Aarne Ranta. Automatic Translation for Consumers and Producers. Presentation
given at the Chalmers Initiative Seminar. 2017. url: https://www.grammaticalframework.
org/~aarne/mt-digitalization-2017.pdf.
[RHN06] Jussi Rintanen, Keijo Heljanko, and Ilkka Niemelä. “Planning as satisfiability: parallel
plans and algorithms for plan search”. In: Artificial Intelligence 170.12-13 (2006),
pp. 1031–1080.
[Rin10] Jussi Rintanen. “Heuristics for Planning with SAT”. In: Proceeedings of the 16th In-
ternational Conference on Principles and Practice of Constraint Programming. 2010,
pp. 414–428.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed.
Pearson Education, 2003. isbn: 0137903952.
[RN09] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd.
Prentice Hall Press, 2009. isbn: 0136042597, 9780136042594.
[RN95] Stuart J. Russell and Peter Norvig. Artificial Intelligence — A Modern Approach.
Upper Saddle River, NJ: Prentice Hall, 1995.
[RW10] Silvia Richter and Matthias Westphal. “The LAMA Planner: Guiding Cost-Based
Anytime Planning with Landmarks”. In: Journal of Artificial Intelligence Research
39 (2010), pp. 127–177.
[RW91] S. J. Russell and E. Wefald. Do the Right Thing — Studies in limited Rationality.
MIT Press, 1991.
[She24] Esther Shein. 2024. url: https://cacm.acm.org/news/the-impact-of-ai-on-computer-science-education/.
[Sil+16] David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree
Search”. In: Nature 529 (2016), pp. 484–503. url: http://www.nature.com/nature/
journal/v529/n7587/full/nature16961.html.
[Smu63] Raymond M. Smullyan. “A Unifying Principle for Quantification Theory”. In: Proc.
Nat. Acad Sciences 49 (1963), pp. 828–832.
[SR14] Guus Schreiber and Yves Raimond. RDF 1.1 Primer. W3C Working Group Note.
World Wide Web Consortium (W3C), 2014. url: http://www.w3.org/TR/rdf-
primer.
[sTeX] sTeX: A semantic Extension of TeX/LaTeX. url: https://github.com/sLaTeX/
sTeX (visited on 05/11/2020).
[SWI] SWI Prolog Reference Manual. url: https://www.swi-prolog.org/pldoc/refman/
(visited on 10/10/2019).
[Tur50] Alan Turing. “Computing Machinery and Intelligence”. In: Mind 59 (1950), pp. 433–
460.
[Vas+17] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural Infor-
mation Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc.,
2017. url: https://proceedings.neurips.cc/paper_files/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[Wal75] David Waltz. “Understanding Line Drawings of Scenes with Shadows”. In: The Psy-
chology of Computer Vision. Ed. by P. H. Winston. McGraw-Hill, 1975, pp. 1–19.
[WHI] Human intelligence — Wikipedia The Free Encyclopedia. url: https://en.wikipedia.
org/w/index.php?title=Human_intelligence (visited on 04/09/2018).
Part VIII

Excursions


As this course is predominantly an overview over the topics of Artificial Intelligence, and not
about the theoretical underpinnings, we give the discussion about these as a “suggested readings”
part here.
Appendix A

Completeness of Calculi for Propositional Logic

The next step is to analyze the two calculi for completeness. For that we will first give ourselves
a very powerful tool: the “model existence theorem” (??), which encapsulates the model-theoretic
part of completeness theorems. With that, completeness proofs – which are quite tedious otherwise
– become a breeze.

A.1 Abstract Consistency and Model Existence


We will now come to an important tool in the theoretical study of reasoning calculi: the “abstract
consistency”/“model existence” method. This method for analyzing calculi was developed by Jaakko
Hintikka, Raymond Smullyan, and Peter Andrews in 1950-1970 as an encapsulation of similar
constructions that were used in completeness arguments in the decades before. The basis for
this method is Smullyan’s Observation [Smu63] that completeness proofs based on Hintikka sets
only use certain properties of consistency, and that with little effort one can obtain a generalization,
“Smullyan’s Unifying Principle”.
The basic intuition for this method is the following: typically, a logical system L := ⟨L, K, ⊨⟩ has
multiple calculi, human-oriented ones like the natural deduction calculi and machine-oriented ones
like the automated theorem proving calculi. All of these need to be analyzed for completeness (as
a basic quality assurance measure).
A completeness proof for a calculus C for S typically comes in two parts: one analyzes C-
consistency (sets that cannot be refuted in C), and the other constructs K-models for C-consistent
sets.
In this situation the “abstract consistency”/“model existence” method encapsulates the model
construction process into a meta-theorem: the “model existence” theorem. This provides a set of
syntactic (“abstract consistency”) conditions for calculi that are sufficient to construct models.
With the model existence theorem it suffices to show that C-consistency is an abstract consis-
tency property (a purely syntactic task that can be done by a C-proof transformation argument)
to obtain a completeness result for C.

Model Existence (Overview)


 Definition: Abstract consistency
 Definition: Hintikka set (maximally abstract consistent)
 Theorem: Hintikka sets are satisfiable


 Theorem: If Φ is abstract consistent, then Φ can be extended to a Hintikka set.


 Corollary: If Φ is abstract consistent, then Φ is satisfiable.
 Application: Let C be a calculus, if Φ is C-consistent, then Φ is abstract consistent.

 Corollary: C is complete.

Michael Kohlhase: Artificial Intelligence 2 1114 2025-02-06

The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for calculi;
they form syntactic counterparts of satisfiability.

Consistency
 Let C be a calculus,. . .
 Definition A.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if
there is a refutation, i.e. a derivation of a contradiction from Φ. The act of finding
a refutation for Φ is called refuting Φ.
 Definition A.1.2. We call a pair of formulae A and ¬A a contradiction.
 So a set Φ is C-refutable, if C can derive a contradiction from it.

 Definition A.1.3. Let C be a calculus, then a formula set Φ is called C-consistent,


iff there is a formula B, that is not derivable from Φ in C.
 Definition A.1.4. We call a calculus C reasonable, iff implication elimination and
conjunction introduction are admissible in C and A ∧ ¬A ⇒ B is a C-theorem.

 Theorem A.1.5. C-inconsistency and C-refutability coincide for reasonable calculi.

Michael Kohlhase: Artificial Intelligence 2 1115 2025-02-06

It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where ⟨L, K, ⊨⟩ is the current logical system.
Even the word “contradiction” has a syntactical flavor to it, it translates to “saying against
each other” from its Latin root.

Abstract Consistency
 Definition A.1.6. Let ∇ be a collection of sets. We call ∇ closed under subsets,
iff for each Φ ∈ ∇, all subsets Ψ ⊆ Φ are elements of ∇.
 Definition A.1.7 (Notation). We will use Φ∗A for Φ ∪ {A}.
 Definition A.1.8. A collection ∇ of sets of propositional formulae is called an
abstract consistency class, iff it is closed under subsets, and for each Φ ∈ ∇
∇c ) P ̸∈ Φ or ¬P ̸∈ Φ for P ∈ V0
∇¬ ) ¬¬A ∈ Φ implies Φ∗A ∈ ∇
∇∨ ) A ∨ B ∈ Φ implies Φ∗A ∈ ∇ or Φ∗B ∈ ∇
∇∧ ) ¬(A ∨ B) ∈ Φ implies Φ ∪ {¬A, ¬B} ∈ ∇
 Example A.1.9. The empty set is an abstract consistency class.
 Example A.1.10. The set {∅, {Q}, {P ∨Q}, {P ∨Q, Q}} is an abstract consistency
class.

 Example A.1.11. The family of satisfiable sets is an abstract consistency class.

Michael Kohlhase: Artificial Intelligence 2 1116 2025-02-06

So a family of sets (we call it a family, so that we do not have to say “set of sets” and we can
distinguish the levels) is an abstract consistency class, iff it fulfills five simple conditions, of which
the last three are closure conditions.
Think of an abstract consistency class as a family of “consistent” sets (e.g. C-consistent for some
calculus C), then the properties make perfect sense: They are naturally closed under subsets — if
we cannot derive a contradiction from a large set, we certainly cannot from a subset, furthermore,
∇c ) If both P ∈ Φ and ¬P ∈ Φ, then Φ cannot be “consistent”.
∇¬ ) If we cannot derive a contradiction from Φ with ¬¬A ∈ Φ then we cannot from Φ∗A, since
they are logically equivalent.
The other two conditions are motivated similarly. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
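To see how these conditions play out on a concrete case, consider again the family ∇ := {∅, {Q}, {P ∨ Q}, {P ∨ Q, Q}} from Example A.1.10: it is closed under subsets, since the subsets of {P ∨ Q, Q} are exactly ∅, {Q}, {P ∨ Q}, and {P ∨ Q, Q}, and the subsets of the other members are among these as well. ∇c holds because no member contains an atom together with its negation; ∇¬ and ∇∧ hold vacuously, since no member contains a doubly negated formula or a negated disjunction. For ∇∨ , the only disjunction that occurs is P ∨ Q: for Φ = {P ∨ Q} we have Φ∗Q = {P ∨ Q, Q} ∈ ∇, and for Φ = {P ∨ Q, Q} we have Φ∗Q = Φ ∈ ∇.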
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.

Compact Collections
 Definition A.1.12. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ ∈ ∇, iff Ψ ∈ ∇ for every finite subset Ψ of Φ.
 Lemma A.1.13. If ∇ is compact, then ∇ is closed under subsets.

 Proof:

1. Suppose S ⊆ T and T ∈ ∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A ∈ ∇.
4. Thus S ∈ ∇.

Michael Kohlhase: Artificial Intelligence 2 1117 2025-02-06

The property of being closed under subsets is a “downwards-oriented” property: We go from large
sets to small sets, compactness (the interesting direction anyways) is also an “upwards-oriented”
property. We can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a collection ∇ by testing
all their finite subsets (which is much simpler).

Compact Abstract Consistency Classes


 Lemma A.1.14. Any abstract consistency class can be extended to a compact
one.
 Proof:
1. We choose ∇′ := {Φ ⊆ wff0 (V0 ) | every finite subset of Φ is in ∇}.
2. Now suppose that Φ ∈ ∇. ∇ is closed under subsets, so every finite subset of
Φ is in ∇ and thus Φ ∈ ∇′ . Hence ∇ ⊆ ∇′ .
3. Next let us show that ∇′ is compact.
3.1. Suppose Φ ∈ ∇′ and Ψ is an arbitrary finite subset of Φ.
3.2. By definition of ∇′ all finite subsets of Φ are in ∇ and therefore Ψ ∈ ∇′ .
3.3. Thus all finite subsets of Φ are in ∇′ whenever Φ is in ∇′ .
3.4. On the other hand, suppose all finite subsets of Φ are in ∇′ .
3.5. Then by the definition of ∇′ the finite subsets of Φ are also in ∇, so
Φ ∈ ∇′ . Thus ∇′ is compact.
4. Note that ∇′ is closed under subsets by the Lemma above.
5. Now we show that if ∇ satisfies ∇∗ , then ∇′ satisfies ∇∗ .
5.1. To show ∇c , let Φ ∈ ∇′ and suppose there is an atom A, such that
{A, ¬A} ⊆ Φ. Then {A, ¬A} ∈ ∇ contradicting ∇c .
5.2. To show ∇¬ , let Φ ∈ ∇′ and ¬¬A ∈ Φ, then Φ∗A ∈ ∇′ .
5.2.1. Let Ψ be any finite subset of Φ∗A, and Θ := (Ψ\{A})∗¬¬A.
5.2.2. Θ is a finite subset of Φ, so Θ ∈ ∇.
5.2.3. Since ∇ is an abstract consistency class and ¬¬A ∈ Θ, we get
Θ∗A ∈ ∇ by ∇¬ .
5.2.4. We know that Ψ ⊆ Θ∗A and ∇ is closed under subsets, so Ψ ∈ ∇.
5.2.5. Thus every finite subset Ψ of Φ∗A is in ∇ and therefore by definition
Φ∗A ∈ ∇′ .
5.3. the other cases are analogous to ∇¬ .

Michael Kohlhase: Artificial Intelligence 2 1119 2025-02-06

Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets i.e. sets that already contain everything that can be consistently
added to them.

∇-Hintikka Set

 Definition A.1.15. Let ∇ be an abstract consistency class, then we call a set


H ∈ ∇ a ∇ Hintikka Set, iff H is maximal in ∇, i.e. for all A with H∗A ∈ ∇ we
already have A ∈ H.
 Theorem A.1.16 (Hintikka Properties). Let ∇ be an abstract consistency class
and H be a ∇-Hintikka set, then
Hc ) For all A ∈ wff0 (V0 ) we have A ̸∈ H or ¬A ̸∈ H
H¬ ) If ¬¬A ∈ H then A ∈ H
H∨ ) If A ∨ B ∈ H then A ∈ H or B ∈ H
H∧ ) If ¬(A ∨ B) ∈ H then ¬A, ¬B ∈ H

Michael Kohlhase: Artificial Intelligence 2 1120 2025-02-06

∇-Hintikka Set
 Proof:
We prove the properties in turn
1. Hc by induction on the structure of A
1.1. A ∈ V0 Then A ̸∈ H or ¬A ̸∈ H by ∇c .
1.2. A = ¬B
1.2.1. Let us assume that ¬B ∈ H and ¬¬B ∈ H,
1.2.2. then H∗B ∈ ∇ by ∇¬ , and therefore B ∈ H by maximality.
1.2.3. So both B and ¬B are in H, which contradicts the induction hy-
pothesis.
1.3. A = B ∨ C similar to the previous case
2. We prove H¬ by maximality of H in ∇.
2.1. If ¬¬A ∈ H, then H∗A ∈ ∇ by ∇¬ .
2.2. The maximality of H now gives us that A ∈ H.
Proof sketch: other H∗ are similar

Michael Kohlhase: Artificial Intelligence 2 1121 2025-02-06

The following theorem is one of the main results in the “abstract consistency”/”model existence”
method. For any abstract consistent set Φ it allows us to construct a Hintikka set H with Φ ∈ H.

Extension Theorem
 Theorem A.1.17. If ∇ is an abstract consistency class and Φ ∈ ∇, then there is
a ∇-Hintikka set H with Φ ⊆ H.
 Proof:
1. Wlog. we assume that ∇ is compact (otherwise pass to compact extension)
2. We choose an enumeration A1 , . . . of the set wff0 (V0 )
3. and construct a sequence of sets Hi with H0 := Φ and

Hn+1 := Hn          if Hn ∗An ̸∈ ∇
Hn+1 := Hn ∗An      if Hn ∗An ∈ ∇
4. Note that all Hi ∈ ∇, choose H := ⋃i∈N Hi

5. Ψ ⊆ H finite implies there is a j ∈ N such that Ψ ⊆ Hj ,


6. so Ψ ∈ ∇ as ∇ is closed under subsets and H ∈ ∇ as ∇ is compact.
7. Let H∗B ∈ ∇, then there is a j ∈ N with B = Aj , so that B ∈ Hj+1 and
Hj+1 ⊆ H
8. Thus H is ∇-maximal

Michael Kohlhase: Artificial Intelligence 2 1122 2025-02-06

Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of wff0 (V0 ). If we pick a
different enumeration, we will end up with a different H. Say if A and ¬A are both ∇-consistent1
with Φ, then depending on which one comes first in the enumeration, H will contain that one, with all
the consequences for subsequent choices in the construction process.
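As a toy illustration of this limit construction, the following Python sketch (the encodings and helper names are ad hoc, chosen only for this illustration) uses satisfiability over the two propositional variables P and Q as the abstract consistency class and extends a set greedily along a fixed enumeration of formulae:

from itertools import product

P, Q = "P", "Q"

def formulas(depth):
    """All formulae over P, Q up to the given nesting depth: variables,
    negations ("not", A), and disjunctions ("or", A, B)."""
    fs = [P, Q]
    for _ in range(depth):
        new = [("not", a) for a in fs] + [("or", a, b) for a in fs for b in fs]
        fs = fs + [f for f in new if f not in fs]
    return fs

def evaluate(f, val):
    if isinstance(f, str):
        return val[f]
    if f[0] == "not":
        return not evaluate(f[1], val)
    return evaluate(f[1], val) or evaluate(f[2], val)

def satisfiable(Phi):
    """Brute force: is there a variable assignment making all of Phi true?"""
    return any(all(evaluate(f, {"P": p, "Q": q}) for f in Phi)
               for p, q in product([True, False], repeat=2))

def extend(Phi, enumeration):
    """The limit construction: add A_n whenever the result stays 'consistent'."""
    H = set(Phi)
    for a in enumeration:
        if satisfiable(H | {a}):
            H.add(a)
    return H

H = extend({("or", P, Q)}, formulas(2))
print(P in H, ("not", P) in H)   # True False: the enumeration settled on P

With an enumeration that lists ¬P before P, the same construction would return a maximal set containing ¬P instead, which is exactly the dependence on the enumeration discussed above.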

Valuation
 Definition A.1.18. A function ν : wff0 (V0 ) → Do is called a (propositional) valua-
tion, iff
 ν(¬A) = T, iff ν(A) = F
 ν(A ∧ B) = T, iff ν(A) = T and ν(B) = T

 Lemma A.1.19. If ν : wff0 (V0 ) → Do is a valuation and Φ ⊆ wff0 (V0 ) with ν(Φ) =
{T}, then Φ is satisfiable.
 Proof sketch: ν|V0 : V0 → Do is a satisfying variable assignment.
 Lemma A.1.20. If φ : V0 → Do is a variable assignment, then I φ : wff0 (V0 ) → Do
is a valuation.

Michael Kohlhase: Artificial Intelligence 2 1123 2025-02-06

Now, we only have to put the pieces together to obtain the model existence theorem we are after.

Model Existence
 Lemma A.1.21 (Hintikka-Lemma). If ∇ is an abstract consistency class and H
a ∇-Hintikka set, then H is satisfiable.
 Proof:
1. We define ν(A) := T, iff A ∈ H
2. then ν is a valuation by the Hintikka properties
3. and thus ν|V0 is a satisfying assignment.

 Theorem A.1.22 (Model Existence). If ∇ is an abstract consistency class and


Φ ∈ ∇, then Φ is satisfiable.
Proof:
 1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)

1 EdNote: introduce this above



3. In particular, Φ ⊆ H is satisfiable.

Michael Kohlhase: Artificial Intelligence 2 1124 2025-02-06

A.2 A Completeness Proof for Propositional Tableaux


With the model existence proof we have introduced in the last section, the completeness proof for
propositional tableaux is rather simple: we only have to check that tableau consistency is
an abstract consistency property.
We encapsulate all of the technical difficulties of the problem in a technical Lemma. From that,
the completeness proof is just an application of the high-level theorems we have just proven.

Abstract Completeness for T0

 Lemma A.2.1. {Φ | ΦT has no closed tableau} is an abstract consistency class.


 Proof: Let’s call the set above ∇
We have to convince ourselves of the abstract consistency properties
1. ∇c P , ¬P ∈ Φ implies P F , P T ∈ ΦT .
2. ∇¬ Let ¬¬A ∈ Φ.
2.1. For the proof of the contrapositive we assume that Φ∗A has a closed
tableau T and show that already Φ has one:
2.2. applying each of T0 ¬T and T0 ¬F once allows to extend any tableau with
¬¬Bα by Bα .
2.3. any path in T that is closed with ¬¬Aα , can be closed by Aα .
3. ∇∨ Suppose A ∨ B ∈ Φ and both Φ∗A and Φ∗B have closed tableaux
3.1. Consider the tableaux: the closed tableaux for Φ∗A and Φ∗B consist of ΦT followed by AT and Rest1 , resp. ΦT followed by BT and Rest2 . Writing Φ = Ψ∗(A ∨ B), they combine into a closed tableau for Φ that starts with ΨT and (A ∨ B)T and then branches into AT with Rest1 and BT with Rest2 .

4. ∇∧ Suppose ¬(A ∨ B) ∈ Φ and Φ ∪ {¬A, ¬B} has a closed tableau T .
4.1. We consider the closed tableau for Φ ∪ {¬A, ¬B}: it consists of ΦT followed by AF , BF , and Rest. Writing Φ = Ψ∗¬(A ∨ B), it becomes a closed tableau for Φ that starts with ΨT and (A ∨ B)F , followed by AF , BF , and Rest.

Michael Kohlhase: Artificial Intelligence 2 1126 2025-02-06

Observation: If we look at the completeness proof below, we see that the Lemma above is the
only place where we had to deal with specific properties of the T0 .
So if we want to prove completeness of any other calculus with respect to propositional logic,
then we only need to prove an analogon to this lemma and can use the rest of the machinery we
have already established “off the shelf”.
This is one great advantage of the “abstract consistency method”; the other is that the method
can be extended transparently to other logics.

Completeness of T0
 Corollary A.2.2. T0 is complete.
 Proof: by contradiction
1. We assume that A ∈ wff0 (V0 ) is valid, but there is no closed tableau for AF .
2. We have {¬A} ∈ ∇ as ¬AT = AF .
3. so ¬A is satisfiable by the model existence theorem (which is applicable as ∇
is an abstract consistency class by our Lemma above)
4. this contradicts our assumption that A is valid.

Michael Kohlhase: Artificial Intelligence 2 1127 2025-02-06


Appendix B

Conflict Driven Clause Learning

B.1 Why Did Unit Propagation Yield a Conflict?


A Video Nugget covering this section can be found at https://fau.tv/clip/id/27026.

DPLL: Example (Redundance1)


 Example B.1.1. We introduce some nasty redundance to make DPLL slow.
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F

[Search tree: the root splits on P ; in the P T subtree DPLL splits on X 1 , . . ., X n (both values) and then on Q, and in every leaf UP sets RT and derives the empty clause 2.]

Michael Kohlhase: Artificial Intelligence 2 1128 2025-02-06

How To Not Make the Same Mistakes Over Again?


 It’s not that difficult, really:

(A) Figure out what went wrong.


(B) Learn to not do that again in the future.
 And now for DPLL:

(A) Why did unit propagation yield a Conflict?


 This Section. We will capture the “what went wrong” in terms of graphs
over literals set during the search, and their dependencies.
 What can we learn from that information?:

 A new clause! Next section.

Michael Kohlhase: Artificial Intelligence 2 1129 2025-02-06

Implication Graphs for DPLL


 Definition B.1.2. Let β be a branch in a DPLL derivation and P a variable on β
then we call
 P α a choice literal if its value is set to α by the splitting rule.
 P α an implied literal, if the value of P is set to α by the UP rule.
 P α a conflict literal, if it contributes to a derivation of the empty clause.
 Definition B.1.3 (Implication Graph).
Let ∆ be a clause set, β a DPLL search branch on ∆. The implication graph Gimpl β
is the directed graph whose vertices are labeled with the choice and implied literals
along β, as well as a separate conflict vertex 2C for every clause C that became
empty on β.
Wherever a clause l1 ∨ . . . ∨ lk ∨ l′ ∈ ∆ became unit with implied literal l′ , Gimpl β includes the edges (li ,l′ ).
Where C = l1 ∨ . . . ∨ lk ∈ ∆ became empty, Gimpl
β includes the edges (li ,2C ).

 Question: How do we know that li are vertices in Gimpl


β ?
 Answer: Because l1 ∨ . . . ∨ lk ∨ l′ became unit/empty.

 Observation B.1.4. Gimpl


β is acyclic.
 Proof sketch: UP can’t derive l′ whose value was already set beforehand.

 Intuition: The initial vertices are the choice literals and unit clauses of ∆.

Michael Kohlhase: Artificial Intelligence 2 1130 2025-02-06

Implication Graphs: Example (Vanilla1) in Detail

 Example B.1.5. Let ∆ := P T ∨ QT ∨ RF ; P F ∨ QF ; RT ; P T ∨ QF .


We look at the left branch of the derivation from ??:

1. UP Rule: R 7→ T. Implied literal RT ; the remaining clauses are P T ∨ QT ; P F ∨ QF ; P T ∨ QF , and the implication graph so far consists of the vertex RT .
2. Splitting Rule:
2a. P 7→ F. Choice literal P F ; the remaining clauses are QT ; QF , and P F is added as a new vertex.
3a. UP Rule: Q 7→ T. Implied literal QT with edges (RT ,QT ) and (P F ,QT ). The clause P T ∨ QF becomes empty: conflict vertex 2P T ∨QF with edges (P F ,2P T ∨QF ) and (QT ,2P T ∨QF ).

Michael Kohlhase: Artificial Intelligence 2 1131 2025-02-06
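As a concrete illustration of this bookkeeping, the following Python sketch runs unit propagation from a set of choice literals and records, for every implied literal, the literals that made its clause unit; this parent map is exactly the edge information of the implication graph, and the clause that becomes empty plays the role of the conflict vertex. The encoding of literals and clauses and the function names are ad-hoc choices for this illustration, not taken from the course materials.

# Literals are pairs (variable, value); a clause is a frozenset of literals.
def neg(lit):
    var, val = lit
    return (var, not val)

def unit_propagate(clauses, choices):
    """Run UP from the choice literals (a dict variable -> value).
    Returns (assignment, parents, conflict): 'parents' maps each implied
    literal l' to the literals that falsified the rest of its clause, i.e.
    the sources of the implication graph edges (l_i, l'); 'conflict' is the
    clause that became empty (the conflict vertex), or None."""
    assignment = dict(choices)
    parents = {}
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(v) == b for (v, b) in clause):
                continue                              # clause is satisfied
            undecided = [l for l in clause if l[0] not in assignment]
            if not undecided:                         # clause became empty
                return assignment, parents, clause
            if len(undecided) == 1:                   # clause became unit
                (v, b) = undecided[0]
                assignment[v] = b
                parents[(v, b)] = [neg(l) for l in clause if l != (v, b)]
                changed = True
                break                                 # restart propagation
    return assignment, parents, None

# Example B.1.5, left branch: Delta = R^T; P^T v Q^T v R^F; P^F v Q^F; P^T v Q^F
delta = [frozenset({("R", True)}),
         frozenset({("P", True), ("Q", True), ("R", False)}),
         frozenset({("P", False), ("Q", False)}),
         frozenset({("P", True), ("Q", False)})]
assignment, parents, conflict = unit_propagate(delta, {"P": False})
print(parents)    # R^T with no parents (unit clause), Q^T with parents P^F and R^T
print(conflict)   # the clause P^T v Q^F, i.e. the conflict vertex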

Implication Graphs: Example (Redundance1)

 Example B.1.6. Continuing from ??: ∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨


QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literal: RT .

[Search tree as above: the root splits on P ; in the P T subtree DPLL splits on X 1 , . . ., X n and then on Q, and every leaf sets RT by UP and reaches the empty clause 2.]

[Implication graph: choice literal vertices P T , X 1 T , . . ., X n T , QT ; edges from P T and QT to the implied literal RT , and from P T , QT , and RT to the conflict vertex 2.]

Michael Kohlhase: Artificial Intelligence 2 1132 2025-02-06



Implication Graphs: Example (Redundance2)


 Example B.1.7. Continuing from ??:

∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F

DPLL on ∆ ; Θ ; Φ with Φ := QF ∨ S T ; QF ∨ S F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT and S T .
[Implication graph: P T and QT imply RT , which leads to a conflict vertex; QT also implies S T , which leads to a second conflict vertex; the X i T remain isolated choice vertices.]

Michael Kohlhase: Artificial Intelligence 2 1133 2025-02-06

Implication Graphs: A Remark


 The implication graph is not uniquely determined by the Choice literals.
 It depends on “ordering decisions” during UP: Which unit clause is picked first.
 Example B.1.8. ∆ = P F ∨ QF ; QT ; P T

[Two implication graphs for ∆, one for each choice of which unit clause UP processes first.]

Michael Kohlhase: Artificial Intelligence 2 1134 2025-02-06

Conflict Graphs
 A conflict graph captures “what went wrong” in a failed node.
 Definition B.1.9 (Conflict Graph). Let ∆ be a clause set, and let Gimpl
β be the
implication graph for some search branch β of DPLL on ∆. A subgraph C of Gimpl
β
is a conflict graph if:
(i) C contains exactly one conflict vertex 2C .

(ii) If l′ is a vertex in C, then all parents of l′ , i.e. vertices li with an edge (li ,l′ ),
are vertices in C as well.
(iii) All vertices in C have a path to 2C .
 Conflict graph ≙ starting at a conflict vertex, backchain through the implication graph until reaching choice literals.

Michael Kohlhase: Artificial Intelligence 2 1135 2025-02-06

Conflict-Graphs: Example (Redundance1)

 Example B.1.10. Continuing from ??: ∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨


QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X 100 T ; X 1 F ∨ . . . ∨ X 100 F
Choice literals: P T , (X 1 T ), . . ., (X 100 T ), QT . Implied literals: RT .

[Conflict graph within the implication graph: P T and QT feed the implied literal RT , and P T , QT , and RT feed the conflict vertex 2; the vertices X 1 T , . . ., X n T do not belong to the conflict graph.]

Michael Kohlhase: Artificial Intelligence 2 1136 2025-02-06

Conflict Graphs: Example (Redundance2)


 Example B.1.11. Continuing from ?? and ??:

∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F

DPLL on ∆ ; Θ ; Φ with Φ := QF ∨ S T ; QF ∨ S F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT .

[The same implication graph shown twice, once for each conflict: the conflict graph for the conflict below RT consists of P T , QT , and RT ; the conflict graph for the conflict below S T consists of QT and S T .]

Michael Kohlhase: Artificial Intelligence 2 1137 2025-02-06

B.2 Clause Learning

Clause Learning
 Observation: Conflict graphs encode the entailment relation.
 Definition B.2.1. Let ∆ be a clause set, C be a conflict graph at some time point during a run of DPLL on ∆, and L be the choice literals in C, then we call the clause c consisting of the complements of the literals in L the learned clause for C.
 Theorem B.2.2. Let ∆, C, and c as in ??, then ∆ ⊨ c.
 Idea: We can add learned clauses to DPLL derivations at any time without losing
soundness. (maybe this helps, if we have a good notion of learned clauses)

 Definition B.2.3. Clause learning is the process of adding learned clauses to DPLL
clause sets at specific points. (details coming up)

Michael Kohlhase: Artificial Intelligence 2 1138 2025-02-06

Clause Learning: Example (Redundance1)


 Example B.2.4. Continuing from ??:

∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT .

PT

QT X1 T ... Xn T

RT 2P T ∨QF ∨RT

Learned clause: P F ∨ QF

Michael Kohlhase: Artificial Intelligence 2 1139 2025-02-06
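The extraction of the learned clause from a conflict graph can be sketched in a few lines of Python (again with ad-hoc encodings; the parent map is what a unit propagation routine like the one sketched above would produce): starting from the conflict vertex we follow the implication edges backwards until we reach choice literals, and return the clause consisting of their complements.

def neg(lit):
    return (lit[0], not lit[1])

def learned_clause(conflict_clause, parents, choices):
    """Backchain from the conflict vertex through the implication graph,
    collect the choice literals reached, and return the learned clause,
    i.e. the set of their complements."""
    frontier = [neg(l) for l in conflict_clause]    # literals feeding the conflict
    reached, seen = set(), set()
    while frontier:
        lit = frontier.pop()
        if lit in seen:
            continue
        seen.add(lit)
        if lit in choices:
            reached.add(lit)                         # stop at choice literals
        else:
            frontier.extend(parents.get(lit, []))    # follow implication edges
    return frozenset(neg(l) for l in reached)

# Conflict graph of the Redundance1 example: P^T and Q^T imply R^T, and the
# clause P^F v Q^F v R^F becomes empty.
parents = {("R", True): [("P", True), ("Q", True)]}
conflict = {("P", False), ("Q", False), ("R", False)}
choices = {("P", True), ("Q", True)}
print(learned_clause(conflict, parents, choices))
# frozenset({('P', False), ('Q', False)}), i.e. the learned clause P^F v Q^F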

The Effect of Learned Clauses (in Redundance1)


 What happens after we learned a new clause C?
1. We add C into ∆. e.g. C = P F ∨ QF .
2. We retract the last choice l′ . e.g. the choice l′ = Q.
 Observation: Let C be a learned clause, i.e. the disjunction of the complements of the choice literals L in a conflict graph G.
Before we learn C, G must contain the most recent choice l′ : otherwise, the conflict
would have occurred earlier on.
So C = l1 T ∨ . . . ∨ lk T ∨ l′ where l1 , . . ., lk are earlier choices.

 Example B.2.5. l1 = P , C = P F ∨ QF , l′ = Q.
 Observation: Given the earlier choices l1 , . . . , lk , after we learned the new clause
C = l1 ∨ . . . ∨ lk ∨ l′ , the value of l′ is now set by UP!
 So we can continue:

3. We set the opposite choice l′ as an implied literal.


e.g. QF as an implied literal.
4. We run UP and analyze conflicts.
Learned clause: earlier choices only! e.g. C = P F , see next slide.

Michael Kohlhase: Artificial Intelligence 2 1140 2025-02-06

The Effect of Learned Clauses: Example (Redundance1)



 Example B.2.6. Continuing from ??:

∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X 100 T ; X 1 F ∨ . . . ∨ X 100 F

DPLL on ∆ ; Θ ; Φ with Φ := P F ∨ QF
Choice literals: P T , (X 1 T ), . . ., (X 100 T ). Implied literals: QF , RT .

[Implication graph: P T implies QF via the learned clause P F ∨ QF ; P T and QF imply RT ; and the conflict vertex 2 is fed by P T , QF , and RT .]

Learned clause: P F

Michael Kohlhase: Artificial Intelligence 2 1141 2025-02-06

NOT the same Mistakes over Again: (Redundance1)


 Example B.2.7. Continuing from ??:

∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF

DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F

[Search tree: split on P T , then X 1 T , . . ., X n T , then QT ; UP sets RT and reaches 2, and we learn P F ∨ QF ; QF is then set by UP, UP sets RT again and reaches 2, and we learn P F .]

 Note: Here, the problem could be avoided by splitting over different variables.
 Problem: This is not so in general! (see next slide)

Michael Kohlhase: Artificial Intelligence 2 1142 2025-02-06

Clause Learning vs. Resolution


 Recall: DPLL ≙ tree resolution (from slide 400)
1. in particular: each derived clause C (not in ∆) is derived anew every time it is
used.
2. Problem: there are ∆ whose shortest tree resolution proof is exponentially longer
than their shortest (general) resolution proof.
 Good News: This is no longer the case with clause learning!
1. We add each learned clause C to ∆, can use it as often as we like.
2. Clause learning renders DPLL equivalent to full resolution [BKS04; PD09]. (To what extent exactly this is the case was an open question for ca. 10 years, so it’s not as easy as I made it look here . . . )
 In particular: Selecting different variables/values to split on can provably not bring
DPLL up to the power of DPLL+Clause Learning. (cf. slide 1142, and previous
slide)

Michael Kohlhase: Artificial Intelligence 2 1143 2025-02-06

“DPLL + Clause Learning”?


 Disclaimer: We have only seen how to learn a clause from a conflict.
 We will not cover how the overall DPLL algorithm changes, given this learning.
Slides 1140 – 1142 are merely meant to give a rough intuition on “backjumping”.

 Definition B.2.8 (Just for the record). (not exam or exercises relevant)
 One could run “DPLL + Clause Learning” by always backtracking to the maximal-
level choice variable contained in the learned clause.
 The actual algorithm is called Conflict Directed Clause Learning (CDCL), and
differs from DPLL more radically:
let L := 0; I := ∅
repeat
    execute UP
    if a conflict was reached then   /∗ learned clause C = l1 ∨ . . . ∨ lk ∨ l′ ∗/
        if L = 0 then return UNSAT
        L := max i=1..k level(li ); erase I below L
        add C into ∆; add l′ to I at level L
    else
        if I is a total interpretation then return I
        choose a new decision literal l; add l to I at level L
        L := L + 1

Michael Kohlhase: Artificial Intelligence 2 1144 2025-02-06



Remarks
 Which clause(s) to learn?:
 While we only select choice literals, much more can be done.
 For any cut through the conflict graph, with Choice literals on the “left hand”
side of the cut and the conflict literals on the right-hand side, the literals on the
left border of the cut yield a learnable clause.
 Must take care to not learn too many clauses . . .

 Origins of clause learning:


 Clause learning originates from “explanation-based (no-good) learning” devel-
oped in the CSP community.
 The distinguishing feature here is that the “no-good” is a clause:
 The exact same type of constraint as the rest of ∆.

Michael Kohlhase: Artificial Intelligence 2 1145 2025-02-06

B.3 Phase Transitions: Where the Really Hard Problems Are
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25088.

Where Are the Hard Problems?


 SAT is NP hard. Worst case for DPLL is O(2n ), with n propositions.
 Imagine I gave you as homework to make a formula family {φ} where DPLL running
time necessarily is in the order of O(2n ).
 I promise you’re not gonna find this easy . . . (although it is of course possible:
e.g., the “Pigeon Hole Problem”).
 People noticed by the early 90s that, in practice, the DPLL worst case does not
tend to happen.

 Modern SAT solvers successfully tackle practical instances where n > 1.000.000.

Michael Kohlhase: Artificial Intelligence 2 1146 2025-02-06

Where Are the Hard Problems?


 So, what’s the problem: Science is about understanding the world.
 Are “hard cases” just pathological outliers?
 Can we say something about the typical case?
 Difficulty 1: What is the “typical case” in applications? E.g., what is the “average”
hardware verification instance?

 Consider precisely defined random distributions instead.


 Difficulty 2: Search trees get very complex, and are difficult to analyze math-
ematically, even in trivial examples. Never mind examples of practical relevance
...

 The most successful works are empirical. (Interesting theory is mainly concerned
with hand-crafted formulas, like the Pigeon Hole Problem.)

Michael Kohlhase: Artificial Intelligence 2 1147 2025-02-06

Phase Transitions in SAT [MSL92]


 Fixed clause length model: Fix clause length k; n variables.
Generate m clauses, by uniformly choosing k variables P for each clause C, and for each variable P deciding uniformly whether to add P T or P F into C.
 Order parameter: Clause/variable ratio m/n.

 Phase transition: (Fixing k = 3, n = 50)

Michael Kohlhase: Artificial Intelligence 2 1148 2025-02-06
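The phase transition is easy to reproduce empirically. The following Python sketch generates random instances in the fixed clause length model and estimates the fraction of satisfiable ones with a bare-bones DPLL procedure; the parameters (n = 20, k = 3) are deliberately small so that it runs in seconds, and the sharp drop should show up around m/n ≈ 4.3. The code is an illustration only, not the setup behind the plot referred to above.

import random

def generate(n, m, k=3):
    """Fixed clause length model: m clauses of length k over n variables,
    each literal's sign chosen uniformly."""
    return [frozenset((v, random.random() < 0.5)
                      for v in random.sample(range(n), k))
            for _ in range(m)]

def assign(clauses, lit):
    """Simplify the clause set under the assumption that lit is true."""
    neg = (lit[0], not lit[1])
    return [c - {neg} for c in clauses if lit not in c]

def dpll(clauses):
    """Bare-bones DPLL: unit propagation plus splitting."""
    while clauses and frozenset() not in clauses:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        clauses = assign(clauses, next(iter(unit)))
    if not clauses:
        return True
    if frozenset() in clauses:
        return False
    lit = next(iter(clauses[0]))
    return dpll(assign(clauses, lit)) or dpll(assign(clauses, (lit[0], not lit[1])))

random.seed(0)
n, trials = 20, 30
for ratio in [2.0, 3.0, 3.5, 4.0, 4.3, 4.6, 5.0, 6.0, 8.0]:
    sat = sum(dpll(generate(n, int(ratio * n))) for _ in range(trials))
    print(f"m/n = {ratio:3.1f}: {100 * sat / trials:5.1f}% satisfiable")

Timing the dpll calls instead of counting satisfiable instances should also exhibit the running time peak near the critical ratio that the next slide talks about.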

Does DPLL Care?


 Oh yes, it does: Extreme running time peak at the phase transition!

Michael Kohlhase: Artificial Intelligence 2 1149 2025-02-06

Why Does DPLL Care?


 Intuition:

Under-Constrained: Satisfiability likelihood close to 1. Many solutions, first


DPLL search path usually successful. (“Deep but narrow”)
Over-Constrained: Satisfiability likelihood close to 0. Most DPLL search paths
short, conflict reached after few applications of splitting rule. (“Broad but shal-
low”)
Critically Constrained: At the phase transition, many almost-successful DPLL
search paths. (“Close, but no cigar”)

Michael Kohlhase: Artificial Intelligence 2 1150 2025-02-06

The Phase Transition Conjecture


 Definition B.3.1. We say that a class P of problems exhibits a phase transition, if
there is an order parameter o, i.e. a structural parameter of P , so that almost all the
hard problems of P cluster around a critical value c of o and c separates one region
of the problem space from another, e.g. over-constrained and under-constrained
regions.
 All NP-complete problems exhibit at least one phase transition.

 [CKT91] confirmed this for Graph Coloring and Hamiltonian Circuits. Later work
confirmed it for SAT (see previous slides), and for numerous other NP-complete
problems.

Michael Kohlhase: Artificial Intelligence 2 1151 2025-02-06



Why Should We Care?


 Enlightenment:
 Phase transitions contribute to the fundamental understanding of the behavior
of search, even if it’s only in random distributions.
 There are interesting theoretical connections to phase transition phenomena in
physics. (See [GS05] for a short summary.)
 Ok, but what can we use these results for?:
 Benchmark design: Choose instances from phase transition region.
 Commonly used in competitions etc. (In SAT, random phase transition
formulas are the most difficult for DPLL style searches.)
 Predicting solver performance: Yes, but very limited because:
 All this works only for the particular considered distributions of instances! Not
meaningful for any other instances.

Michael Kohlhase: Artificial Intelligence 2 1152 2025-02-06


Appendix C

Completeness of Calculi for First-Order Logic

We will now analyze the first-order calculi for completeness. Just as in the case of the propositional
calculi, we prove a model existence theorem for the first-order model theory and then use that
for the completeness proofs2 . The proof of the first-order model existence theorem is completely analogous to the propositional one; indeed, apart from the model construction itself, it is just an extension by a treatment for the first-order quantifiers.3

C.1 Abstract Consistency and Model Existence


We will now come to an important tool in the theoretical study of reasoning calculi: the “abstract
consistency”/“model existence” method. This method for analyzing calculi was developed by Jaakko
Hintikka, Raymond Smullyan, and Peter Andrews in 1950-1970 as an encapsulation of similar
constructions that were used in completeness arguments in the decades before. The basis for
this method is Smullyan’s Observation [Smu63] that completeness proofs based on Hintikka sets
only use certain properties of consistency, and that with little effort one can obtain a generalization,
“Smullyan’s Unifying Principle”.
The basic intuition for this method is the following: typically, a logical system L := ⟨L, K, ⊨⟩ has
multiple calculi, human-oriented ones like the natural deduction calculi and machine-oriented ones
like the automated theorem proving calculi. All of these need to be analyzed for completeness (as
a basic quality assurance measure).
A completeness proof for a calculus C for S typically comes in two parts: one analyzes C-
consistency (sets that cannot be refuted in C), and the other constructs K-models for C-consistent
sets.
In this situation the “abstract consistency”/“model existence” method encapsulates the model
construction process into a meta-theorem: the “model existence” theorem. This provides a set of
syntactic (“abstract consistency”) conditions for calculi that are sufficient to construct models.
With the model existence theorem it suffices to show that C-consistency is an abstract consis-
tency property (a purely syntactic task that can be done by a C-proof transformation argument)
to obtain a completeness result for C.

Model Existence (Overview)


 Definition: Abstract consistency

2 EdNote: reference the theorems


3 EdNote: MK: what about equality?


 Definition: Hintikka set (maximally abstract consistent)


 Theorem: Hintikka sets are satisfiable
 Theorem: If Φ is abstract consistent, then Φ can be extended to a Hintikka set.

 Corollary: If Φ is abstract consistent, then Φ is satisfiable.


 Application: Let C be a calculus, if Φ is C-consistent, then Φ is abstract consistent.
 Corollary: C is complete.

Michael Kohlhase: Artificial Intelligence 2 1153 2025-02-06

The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated calculus dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus
independent parts of the Hintikka set construction. His technique allows to reformulate Hintikka
sets as maximal elements of abstract consistency classes and interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness for calculi;
they form syntactic counterparts of satisfiability.

Consistency
 Let C be a calculus,. . .
 Definition C.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if
there is a refutation, i.e. a derivation of a contradiction from Φ. The act of finding
a refutation for Φ is called refuting Φ.

 Definition C.1.2. We call a pair of formulae A and ¬A a contradiction.


 So a set Φ is C-refutable, if C can derive a contradiction from it.
 Definition C.1.3. Let C be a calculus, then a formula set Φ is called C-consistent,
iff there is a formula B, that is not derivable from Φ in C.

 Definition C.1.4. We call a calculus C reasonable, iff implication elimination and


conjunction introduction are admissible in C and A ∧ ¬A ⇒ B is a C-theorem.
 Theorem C.1.5. C-inconsistency and C-refutability coincide for reasonable calculi.

Michael Kohlhase: Artificial Intelligence 2 1154 2025-02-06

It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where ⟨L, K, ⊨⟩ is the current logical system.

Even the word “contradiction” has a syntactical flavor to it, it translates to “saying against
each other” from its Latin root.
The notion of an “abstract consistency class” provides a calculus-independent notion of con-
sistency: A set Φ of sentences is considered “consistent in an abstract sense”, iff it is a member of
an abstract consistency class ∇.

Abstract Consistency
 Definition C.1.6. Let ∇ be a collection of sets. We call ∇ closed under subsets,
iff for each Φ ∈ ∇, all subsets Ψ ⊆ Φ are elements of ∇.

 Notation: We will use Φ∗A for Φ ∪ {A}.


 Definition C.1.7. A family ∇ ⊆ wff o (Σι , Vι ) of sets of formulae is called a (first-
order) abstract consistency class, iff it is closed under subsets, and for each Φ ∈ ∇
∇c ) A ̸∈ Φ or ¬A ̸∈ Φ for atomic A ∈ wff o (Σι , Vι ).
∇¬ ) ¬¬A ∈ Φ implies Φ∗A ∈ ∇
∇∧ ) A ∧ B ∈ Φ implies Φ ∪ {A, B} ∈ ∇
∇∨ ) ¬(A ∧ B) ∈ Φ implies Φ∗¬A ∈ ∇ or Φ∗¬B ∈ ∇
∇∀ ) If ∀X.A ∈ Φ, then Φ∗([B/X](A)) ∈ ∇ for each closed term B.
∇∃ ) If ¬(∀X.A) ∈ Φ and c is an individual constant that does not occur in Φ,
then Φ∗¬([c/X](A)) ∈ ∇

Michael Kohlhase: Artificial Intelligence 2 1155 2025-02-06

The conditions are very natural: Take for instance ∇c , it would be foolish to call a set Φ of
sentences “consistent under a complete calculus”, if it contains an elementary contradiction. The
next condition ∇¬ says that if a set Φ that contains a sentence ¬¬A is “consistent”, then we should
be able to extend it by A without losing this property; in other words, a complete calculus should
be able to recognize A and ¬¬A to be equivalent. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
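For instance, if ¬(∀X.p(X)) ∈ Φ, then ∇∃ allows us to extend Φ by the witness statement ¬(p(c)) for a constant c that is fresh for Φ: if it is “consistent” to claim that not all objects have the property p, then it must be “consistent” to give a name to an offending object, as long as that name carries no other commitments. Dually, ∇∀ says that a “consistent” set containing ∀X.p(X) can be extended by any instance p(B) for a closed term B.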
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.

Compact Collections
 Definition C.1.8. We call a collection ∇ of sets compact, iff for any set Φ we have
Φ ∈ ∇, iff Ψ ∈ ∇ for every finite subset Ψ of Φ.
 Lemma C.1.9. If ∇ is compact, then ∇ is closed under subsets.
 Proof:
1. Suppose S ⊆ T and T ∈ ∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A ∈ ∇.
4. Thus S ∈ ∇.

Michael Kohlhase: Artificial Intelligence 2 1156 2025-02-06

The property of being closed under subsets is a “downwards-oriented” property: We go from large
sets to small sets, compactness (the interesting direction anyways) is also an “upwards-oriented”
property. We can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a collection ∇ by testing
all their finite subsets (which is much simpler).

Compact Abstract Consistency Classes


 Lemma C.1.10. Any first-order abstract consistency class can be extended to a
compact one.
 Proof:
1. We choose ∇′ := {Φ ⊆ cwff o (Σι ) | every finite subset of Φ is in ∇}.
2. Now suppose that Φ ∈ ∇. ∇ is closed under subsets, so every finite subset of
Φ is in ∇ and thus Φ ∈ ∇′ . Hence ∇ ⊆ ∇′ .
3. Let us now show that ∇′ is compact.
3.1. Suppose Φ ∈ ∇′ and Ψ is an arbitrary finite subset of Φ.
3.2. By definition of ∇′ all finite subsets of Φ are in ∇ and therefore Ψ ∈ ∇′ .
3.3. Thus all finite subsets of Φ are in ∇′ whenever Φ is in ∇′ .
3.4. On the other hand, suppose all finite subsets of Φ are in ∇′ .
3.5. Then by the definition of ∇′ the finite subsets of Φ are also in ∇, so
Φ ∈ ∇′ . Thus ∇′ is compact.
4. Note that ∇′ is closed under subsets by the Lemma above.
5. Next we show that if ∇ satisfies ∇∗ , then ∇′ satisfies ∇∗ .
5.1. To show ∇c , let Φ ∈ ∇′ and suppose there is an atom A, such that
{A, ¬A} ⊆ Φ. Then {A, ¬A} ∈ ∇ contradicting ∇c .
5.2. To show ∇¬ , let Φ ∈ ∇′ and ¬¬A ∈ Φ, then Φ∗A ∈ ∇′ .
5.2.1. Let Ψ be any finite subset of Φ∗A, and Θ := (Ψ\{A})∗¬¬A.
5.2.2. Θ is a finite subset of Φ, so Θ ∈ ∇.
5.2.3. Since ∇ is an abstract consistency class and ¬¬A ∈ Θ, we get
Θ∗A ∈ ∇ by ∇¬ .
5.2.4. We know that Ψ ⊆ Θ∗A and ∇ is closed under subsets, so Ψ ∈ ∇.
5.2.5. Thus every finite subset Ψ of Φ∗A is in ∇ and therefore by definition
Φ∗A ∈ ∇′ .
5.3. the other cases are analogous to ∇¬ .

Michael Kohlhase: Artificial Intelligence 2 1158 2025-02-06

Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets i.e. sets that already contain everything that can be consistently
added to them.

∇-Hintikka Set
 Definition C.1.11. Let ∇ be an abstract consistency class, then we call a set
H ∈ ∇ a ∇ Hintikka Set, iff H is maximal in ∇, i.e. for all A with H∗A ∈ ∇ we
already have A ∈ H.
 Theorem C.1.12 (Hintikka Properties). Let ∇ be an abstract consistency class
and H be a ∇-Hintikka set, then

Hc ) For all A ∈ wff o (Σι , Vι ) we have A ̸∈ H or ¬A ̸∈ H.


H¬ ) If ¬¬A ∈ H then A ∈ H.
H∧ ) If A ∧ B ∈ H then A, B ∈ H.
H∨ ) If ¬(A ∧ B) ∈ H then ¬A ∈ H or ¬B ∈ H.
H∀ ) If ∀X.A ∈ H, then [B/X](A) ∈ H for each closed term B.
H∃ ) If ¬(∀X.A) ∈ H then ¬([B/X](A)) ∈ H for some closed term B.
 Proof:
We prove the properties in turn Hc goes by induction on the structure of A
1. A atomic
1.1. Then A ̸∈ H or ¬A ̸∈ H by ∇c .
2. A = ¬B
2.1. Let us assume that ¬B ∈ H and ¬¬B ∈ H,
2.2. then H∗B ∈ ∇ by ∇¬ , and therefore B ∈ H by maximality.
2.3. So {B, ¬B} ⊆ H, which contradicts the induction hypothesis.
3. A = B ∨ C similar to the previous case
4. We prove H¬ by maximality of H in ∇.
4.1. If ¬¬A ∈ H, then H∗A ∈ ∇ by ∇¬ .
4.2. The maximality of H now gives us that A ∈ H.
5. The other H∗ are similar

Michael Kohlhase: Artificial Intelligence 2 1160 2025-02-06

The following theorem is one of the main results in the “abstract consistency”/“model existence”
method. For any abstract consistent set Φ it allows us to construct a Hintikka set H with Φ ∈ H.

Extension Theorem
 Theorem C.1.13. If ∇ is an abstract consistency class and Φ ∈ ∇ finite, then
there is a ∇-Hintikka set H with Φ ⊆ H.
 Proof:
1. Wlog. assume that ∇ compact (else use compact extension)
2. Choose an enumeration A1 , . . . of cwff o (Σι ) and c1 , . . . of Σsk
0 .
3. and construct a sequence of sets Hi with H0 := Φ and

Hn+1 := Hn                            if Hn ∗An ̸∈ ∇
Hn+1 := Hn ∪ {An , ¬([cn /X](B))}     if Hn ∗An ∈ ∇ and An = ¬(∀X.B)
Hn+1 := Hn ∗An                        otherwise
4. Note that all Hi ∈ ∇, choose H := ⋃i∈N Hi
5. Ψ ⊆ H finite implies there is a j ∈ N such that Ψ ⊆ Hj ,
6. so Ψ ∈ ∇ as ∇ closed under subsets and H ∈ ∇ as ∇ is compact.
7. Let H∗B ∈ ∇, then there is a j ∈ N with B = Aj , so that B ∈ Hj+1 and
Hj+1 ⊆ H
8. Thus H is ∇-maximal

Michael Kohlhase: Artificial Intelligence 2 1161 2025-02-06

Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably

extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of cwff o (Σι ). If
we pick a different enumeration, we will end up with a different H. Say if A and ¬A are both
∇-consistent4 with Φ, then depending on which one comes first in the enumeration, H will contain
that one, with all the consequences for subsequent choices in the construction process.

Valuations
 Definition C.1.14. A function ν : cwff o (Σι )→D0 is called a (first-order) valuation,
iff ν is a propositional valuation and
 ν(∀X.A) = T, iff ν([B/X](A)) = T for all closed terms B.

 Lemma C.1.15. If φ : Vι → U is a variable assignment, then I φ : cwff o (Σι ) → D0


is a valuation.
 Proof sketch: Immediate from the definitions

Michael Kohlhase: Artificial Intelligence 2 1162 2025-02-06

Thus a valuation is a weaker notion of evaluation in first-order logic; the other direction is also
true, even though the proof of this result is much more involved: The existence of a first-order
valuation that makes a set of sentences true entails the existence of a model that satisfies it.5

Valuation and Satisfiability


 Lemma C.1.16. If ν : cwff o (Σι ) → D0 is a valuation and Φ ⊆ cwff o (Σι ) with
ν(Φ) = {T}, then Φ is satisfiable.
 Proof: We construct a model for Φ.
1. Let Dι := cwff ι (Σι ), and
 I(f ) : Dι k → Dι ; ⟨A1 , . . ., Ak ⟩ 7→ f (A1 , . . ., Ak ) for f ∈ Σf k
 I(p) : Dι k → D0 ; ⟨A1 , . . ., Ak ⟩ 7→ ν(p(A1 , . . ., Ak )) for p ∈ Σp k .

2. Then variable assignments into Dι are ground substitutions.


3. We show I φ (A) = φ(A) for A ∈ wff ι (Σι , Vι ) by induction on A:
3.1. A = X
3.1.1. then I φ (A) = φ(X) by definition.
3.2. A = f (A1 , . . ., Ak )
3.2.1. then I φ (A) = I(f )(I φ (A1 ), . . . , I φ (An )) = I(f )(φ(A1 ), . . . , φ(An )) =
f (φ(A1 ), . . . , φ(An )) = φ(f (A1 , . . ., Ak )) = φ(A)
We show I φ (A) = ν(φ(A)) for A ∈ wff o (Σι , Vι ) by induction on A.
3.3. A = p(A1 , . . ., Ak )
3.3.1. then I φ (A) = I(p)(I φ (A1 ), . . . , I φ (An )) = I(p)(φ(A1 ), . . . , φ(An )) =
ν(p(φ(A1 ), . . . , φ(An ))) = ν(φ(p(A1 , . . ., Ak ))) = ν(φ(A))
3.4. A = ¬B
3.4.1. then I φ (A) = T, iff I φ (B) = ν(φ(B)) = F, iff ν(φ(A)) = T.
3.5. A = B ∧ C
3.5.1. similar
3.6. A = ∀X.B

4 EdNote: introduce this above


5 EdNote: I think that we only get a semivaluation, look it up in Andrews.

3.6.1. then I φ (A) = T, iff I ψ (B) = ν(ψ(B)) = T, for all C ∈ Dι , where


ψ = φ,[C/X]. This is the case, iff ν(φ(A)) = T.
4. Thus I φ (A) = ν(φ(A)) = ν(A) = T for all A ∈ Φ.
5. Hence M ⊨ A for M := ⟨Dι , I⟩.

Michael Kohlhase: Artificial Intelligence 2 1164 2025-02-06
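For illustration: if the signature contains just an individual constant a and a unary function f , then the construction uses the set of closed terms {a, f (a), f (f (a)), . . .} as the universe Dι , interprets f as the syntactic operation A 7→ f (A), and reads the truth value of an atom p(A) directly off the valuation ν; the variable assignments into this universe are exactly the ground substitutions used in the proof above.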

Now, we only have to put the pieces together to obtain the model existence theorem we are after.

Model Existence
 Theorem C.1.17 (Hintikka-Lemma). If ∇ is an abstract consistency class and
H a ∇-Hintikka set, then H is satisfiable.
 Proof:
1. we define ν(A):=T, iff A ∈ H,
2. then ν is a valuation by the Hintikka set properties.
3. We have ν(H) = {T}, so H is satisfiable.
 Theorem C.1.18 (Model Existence). If ∇ is an abstract consistency class and
Φ ∈ ∇, then Φ is satisfiable.
Proof:
 1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.

Michael Kohlhase: Artificial Intelligence 2 1165 2025-02-06

C.2 A Completeness Proof for First-Order ND


With the model existence proof we have introduced in the last section, the completeness proof
for first-order natural deduction is rather simple: we only have to check that ND-consistency is an
abstract consistency property.

Consistency, Refutability and Abstract Consistency


 Theorem C.2.1 (Non-Refutability is an Abstract Consistency Property). Γ :=
{Φ ⊆ cwff o (Σι ) | Φ not ND1 −refutable} is an abstract consistency class.

 Proof: We check the properties of an ACC


1. If Φ is non-refutable, then any subset is as well, so Γ is closed under subsets.
We show the abstract consistency conditions ∇∗ for Φ ∈ Γ.
2. ∇c
2.1. We have to show that A ̸∈ Φ or ¬A ̸∈ Φ for atomic A ∈ wff o (Σι , Vι ).
2.2. Equivalently, we show the contrapositive: If {A, ¬A} ⊆ Φ, then Φ ̸∈ Γ.
2.3. So let {A, ¬A} ⊆ Φ, then Φ is ND1 -refutable by construction.
2.4. So Φ ̸∈ Γ.
3. ∇¬ We show the contrapositive again
3.1. Let ¬¬A ∈ Φ and Φ∗A ̸∈ Γ
3.2. Then we have a refutation D : Φ∗A⊢ND1 F

3.3. By prepending an application of ND0¬E for ¬¬A to D, we obtain a refutation D′ : Φ⊢ND1 F .
3.4. Thus Φ ̸∈ Γ.
Proof sketch: other ∇∗ similar

Michael Kohlhase: Artificial Intelligence 2 1167 2025-02-06

This directly yields two important results that we will use for the completeness analysis.

Henkin’s Theorem
 Corollary C.2.2 (Henkin’s Theorem). Every ND1 -consistent set of sentences has
a model.
 Proof:
1. Let Φ be a ND1 -consistent set of sentences.
2. The class of sets of ND1 -consistent propositions constitute an abstract consis-
tency class.
3. Thus the model existence theorem guarantees a model for Φ.
 Corollary C.2.3 (Löwenheim-Skolem Theorem). Every satisfiable set Φ of first-order
sentences has a countable model.
Proof sketch: The model we constructed is countable, since the set of ground terms
is.

Michael Kohlhase: Artificial Intelligence 2 1168 2025-02-06

Now, the completeness result for first-order natural deduction is just a simple argument away.
We also get a compactness theorem (almost) for free: logical systems with a complete calculus are
always compact.

 Completeness and Compactness


 Theorem C.2.4 (Completeness Theorem for ND1 ). If Φ ⊨ A, then Φ⊢ND1 A.
 Proof: We prove the result by playing with negations.
1. If A is valid in all models of Φ, then Φ∗¬A has no model
2. Thus Φ∗¬A is inconsistent by (the contrapositive of) Henkin’s Theorem.
3. So Φ⊢ND1 ¬¬A by ND0¬I and thus Φ⊢ND1 A by ND0¬E.
 Theorem C.2.5 (Compactness Theorem for first-order logic). If Φ ⊨ A, then
there is already a finite set Ψ ⊆ Φ with Ψ ⊨ A.
Proof: This is a direct consequence of the completeness theorem
 1. We have Φ ⊨ A, iff Φ⊢ND1 A.
2. As a proof is a finite object, only a finite subset Ψ ⊆ Φ can appear as leaves
in the proof.
3. Thus Ψ⊢ND1 A, and hence Ψ ⊨ A by the soundness of ND1 .
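For instance (with a made-up signature 0, s, p, q): for the infinite set Φ := {p(0), ∀X.p(X) ⇒ p(s(X))} ∪ {q(s n (0)) | n ∈ ℕ} we have Φ ⊨ p(s(s(0))), and already the finite subset Ψ := {p(0), ∀X.p(X) ⇒ p(s(X))} entails it.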




C.3 Soundness and Completeness of First-Order Tableaux


The soundness of the first-order free-variable tableau calculus can be established by a simple in-
duction over the size of the tableau.

Soundness of T1f
 Lemma C.3.1. Tableau rules transform satisfiable tableaux into satisfiable ones.
 Proof:
We examine the tableau rules in turn:
1. The propositional rules are handled as in propositional tableaux.
2. T1f ∃ by ??
3. T1f⊥ by ?? (substitution value lemma)
4. T1f ∀
4.1. I φ (∀X.A) = T, iff I φ,[a/X] (A) = T for all a ∈ Dι ,
4.2. so in particular for some a ∈ Dι , since Dι ̸= ∅.
 Corollary C.3.2. T1f is correct.


The only interesting steps are the cut rule, which can be directly handled by the substitution
value lemma, and the rule for the existential quantifier, which we do in a separate lemma.

Soundness of T1f ∃

 Lemma C.3.3. T1f ∃ transforms satisfiable tableaux into satisfiable ones.

 Proof: Let T ′ be obtained by applying T1f ∃ to (∀X.A)^F in T , extending it with
([f (X1, . . ., Xk)/X](A))^F , where W := free(∀X.A) = {X1, . . ., Xk}.
1. Let T be satisfiable in M := ⟨D, I⟩, then I φ (∀X.A) = F.
We need to find a model M′ that satisfies T ′ (i.e. find an interpretation for f ).
2. By definition I φ,[a/X] (A) = F for some a ∈ D (which depends only on φ|W ).
3. Let g : D^k → D be defined by g(a1, . . ., ak) := a, if φ(Xi) = ai.
4. Choose M′ := ⟨D, I ′⟩ with I ′ := I,[g/f ], then by the substitution value lemma

   I′φ ([f (X1, . . ., Xk)/X](A))
      = I′φ,[I′φ (f (X1 ,...,Xk ))/X] (A)
      = I φ,[a/X] (A) = F

5. So ([f (X1, . . ., Xk)/X](A))^F is satisfiable in M′ .
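For instance (a made-up instance): if a branch contains (∀Y.p(X, Y))^F with the single free variable X, the rule adds (p(X, f (X)))^F , and the proof above interprets f by a function g that maps each value a of X to a witness b with I φ,[a/X],[b/Y] (p(X, Y)) = F.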


This proof is paradigmatic for soundness proofs for calculi with Skolemization. We use the axiom
of choice at the meta-level to choose a meaning for the Skolem constant. Armed with the Model
Existence Theorem for first-order logic (??), the completeness of first-order tableaux is similarly
straightforward. We just have to show that the collection of tableau-irrefutable sentences is an
abstract consistency class, which is a simple proof-transformation exercise in all but the universal
quantifier case, which we postpone to its own Lemma (??).

Completeness of (T1f )

 Theorem C.3.4. T1f is refutation complete.


 Proof: We show that ∇ := {Φ | Φ^T has no closed tableau} is an abstract consis-
tency class.
1. Most conditions are handled as in the propositional case.
2. The universal quantifier condition follows by the lifting lemma below.
3. For the existential condition, let T be a closed tableau for Φ^T ∗([c/X](A))^F with ¬(∀X.A) ∈ Φ.
We transform it (left) into a closed tableau for Φ^T (right) by replacing the fresh
constant c with the Skolem term that T1f ∃ introduces:

   Ψ^T                              Ψ^T
   (∀X.A)^F                         (∀X.A)^F
   ([c/X](A))^F                     ([f (X1, . . ., Xk)/X](A))^F
   Rest                             [f (X1, . . ., Xk)/c](Rest)


So we only have to treat the case for the universal quantifier. This is what we usually call a
“lifting argument”, since we have to transform (“lift”) a proof for a formula θ(A) to one for A. In
the case of tableaux we do that by an induction on the tableau refutation for θ(A) which creates
a tableau-isomorphism to a tableau refutation for A.

Tableau-Lifting
 Theorem C.3.5. If Tθ is a closed tableau for a set θ(Φ) of formulae, then there is
a closed tableau T for Φ.
 Proof: By induction over the structure of Tθ we build an isomorphic tableau T , and
a tableau-isomorphism ω : T → Tθ , such that ω(A) = θ(A).
Only the tableau-substitution (cut) rule is interesting.
1. Let (θ(Ai ))^T and (θ(Bi ))^F be the cut formulae in the branch Θiθ of Tθ .
2. There is a joint unifier σ of (θ(A1 ))=?(θ(B1 )) ∧ . . . ∧ (θ(An ))=?(θ(Bn )).
3. Thus σ ◦ θ is a unifier of Ai and Bi .
4. Hence there is a most general unifier ρ of A1=?B1 ∧ . . . ∧ An=?Bn .
5. So the corresponding branch Θi of T can be closed (using ρ).
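For instance (a made-up instance): if θ = [a/X] and a branch of Tθ closes on the literals (p(a))^T and (p(a))^F , then the corresponding branch of T might contain (p(X))^T and (p(a))^F , and the cut rule closes it with the most general unifier [a/X].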


Again, the “lifting lemma for tableaux” is paradigmatic for lifting lemmata for other refutation
calculi.

C.4 Soundness and Completeness of First-Order Resolution

Correctness (CNF)
 Lemma C.4.1. A set Φ of sentences is satisfiable, iff CNF1 (Φ) is.

 Proof: The propositional rules and the ∀-rule are trivial; we treat the ∃-rule.
1. Let (∀X.A)^F be satisfiable in M := ⟨D, I⟩ and free(A) = {X1, . . ., Xn}.
2. I φ (∀X.A) = F, so there is an a ∈ D with I φ,[a/X] (A) = F (a only depends on
φ|free(A) ).
3. Let g : D^n → D be defined by g(a1, . . ., an) := a, iff φ(Xi) = ai.
4. Choose M′ := ⟨D, I ′⟩ with I ′(f ) := g, then I′φ ([f (X1, . . ., Xn)/X](A)) = F.
5. Thus ([f (X1, . . ., Xn)/X](A))^F is satisfiable in M′ .
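The ∃-rule is exactly the Skolemization step used in the proof above. The following Python sketch is only my own illustration of that single step (the tuple representation of terms, the convention that variables are capitalized strings, and all function names are made up, not part of CNF1 as defined in the notes):

def free_vars(term):
    """Free variables of a nested-tuple term/atom; variables are capitalized strings."""
    if isinstance(term, str):
        return {term} if term[0].isupper() else set()
    return set().union(*(free_vars(arg) for arg in term[1:]))

def substitute(term, var, replacement):
    """Replace every occurrence of the variable var by the given replacement term."""
    if isinstance(term, str):
        return replacement if term == var else term
    return (term[0],) + tuple(substitute(arg, var, replacement) for arg in term[1:])

def skolem_step(var, body, skolem_symbol):
    """Compute [f(X1,...,Xk)/X](A) for (∀X.A)^F: X = var, A = body, f = skolem_symbol."""
    frees = sorted(free_vars(body) - {var})
    witness = (skolem_symbol,) + tuple(frees) if frees else skolem_symbol
    return substitute(body, var, witness)

# (∀Y. p(X, Y))^F  becomes  (p(X, f(X)))^F:
print(skolem_step("Y", ("p", "X", "Y"), "f"))   # ('p', 'X', ('f', 'X'))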


Resolution (Correctness)
 Definition C.4.2. A clause is called satisfiable, iff I φ (A) = α for one of its literals
A^α .

 Lemma C.4.3. The empty clause 2 is unsatisfiable.


 Lemma C.4.4. CNF transformations preserve satisfiability (see above)
 Lemma C.4.5. Resolution and factorization too!
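For instance (a made-up instance): resolving (p(a))^T ∨ (q(b))^T against (p(a))^F yields the resolvent (q(b))^T ; any model of both parent clauses makes p(a) false because of the second clause, hence it satisfies (q(b))^T in the first clause, and thus the resolvent.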


Completeness (R1 )
 Theorem C.4.6. R1 is refutation complete.
 Proof: We show that ∇ := {Φ | CNF1 (Φ^T ) has no R1 -refutation} is an abstract consistency class.
1. Most conditions are handled as in the propositional case.
2. The universal quantifier condition follows by the lifting lemma below.
3. For the existential condition, let D be an R1 -refutation of CNF1 (Φ^T ∗([c/X](A))^F ) with ¬(∀X.A) ∈ Φ.
4. CNF1 (Φ^T ) = CNF1 (Ψ^T ) ∪ CNF1 (([f (X1, . . ., Xk)/X](A))^F )
5. [f (X1, . . ., Xk)/c](CNF1 (Φ^T ∗([c/X](A))^F )) = CNF1 (Φ^T )
6. So R1 : CNF1 (Φ^T )⊢D′ 2, where D′ := [f (X1, . . ., Xk)/c](D).


Clause Set Isomorphism


 Definition C.4.7. Let C and D be clauses, then a clause isomorphism ω : C → D
is a bijection of the literals of C and D that conserves labels, i.e. ω(L^α ) = M^α for
some literal M.
We call ω θ-compatible, iff ω(L^α ) = (θ(L))^α .

 Definition C.4.8. Let Φ and Ψ be clause sets, then we call a bijection Ω : Φ → Ψ
a clause set isomorphism, iff there is a clause isomorphism ω : C → Ω(C) for each
C ∈ Φ.
 Lemma C.4.9. If θ(Φ) is a set of formulae, then there is a θ-compatible clause set
isomorphism Ω : CNF1 (Φ) → CNF1 (θ(Φ)).

 Proof sketch: by induction on the CNF derivation of CNF1 (Φ).


Lifting for R1
 Theorem C.4.10. If R1 : (θ(Φ))⊢Dθ 2 for a set θ(Φ) of formulae, then there is an
R1 -refutation for Φ.
 Proof: By induction over Dθ we construct an R1 -derivation R1 : Φ⊢D C and a θ-
compatible clause set isomorphism Ω : D → Dθ .
1. If Dθ ends in the resolution step

        Dθ′                            Dθ′′
   (θ(A))^T ∨ (θ(C))          (θ(B))^F ∨ (θ(D))
   -------------------------------------------- res
             (σ(θ(C))) ∨ (σ(θ(D)))

   then by inductive hypothesis we have clause isomorphisms ω′ : A^T ∨ C → (θ(A))^T ∨ (θ(C))
   and ω″ : B^F ∨ D → (θ(B))^F ∨ (θ(D)).
2. Thus we can perform the resolution step

   A^T ∨ C          B^F ∨ D
   ------------------------- res
        (ρ(C)) ∨ (ρ(D))

   where ρ = mgu(A, B) (it exists, as σ ◦ θ is a unifier of A and B).
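The computational core of both lifting lemmata is that whenever σ ◦ θ unifies two literals, a most general unifier exists and can be computed syntactically. The following Python sketch is only an illustration of standard first-order unification with occurs check (the tuple term representation and all names are my own, not the R1 machinery of the notes):

def is_var(t):
    """Variables are capitalized strings in this toy representation."""
    return isinstance(t, str) and t[0].isupper()

def walk(t, subst):
    """Chase variable bindings in the substitution."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    """Occurs check: does the variable v occur in the term t (modulo subst)?"""
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, arg, subst) for arg in t[1:])

def unify(s, t, subst=None):
    """Return a most general unifier of s and t as a substitution dict, or None."""
    subst = dict(subst or {})
    s, t = walk(s, subst), walk(t, subst)
    if s == t:
        return subst
    if is_var(s):
        return None if occurs(s, t, subst) else {**subst, s: t}
    if is_var(t):
        return None if occurs(t, s, subst) else {**subst, t: s}
    if isinstance(s, tuple) and isinstance(t, tuple) and s[0] == t[0] and len(s) == len(t):
        for a, b in zip(s[1:], t[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None

print(unify(("p", "X"), ("p", "a")))          # {'X': 'a'}: the lifted literal p(X) generalizes p(a)
print(unify(("p", "X"), ("p", ("f", "X"))))   # None: the occurs check fails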
