Notes
2025-02-06
0.1 Preface
0.1.1 Course Concept
Objective: The course aims at giving students a solid (and often somewhat theoretically ori-
ented) foundation of the basic concepts and practices of artificial intelligence. The course will
predominantly cover symbolic AI – also sometimes called “good old-fashioned AI (GofAI)” – in
the first semester and offers the very foundations of statistical approaches in the second. Indeed, a
full account of subsymbolic, machine-learning-based AI deserves its own specialization courses and
requires many more mathematical prerequisites than we can assume in this course.
Context: The course “Artificial Intelligence” (AI 1 & 2) at FAU Erlangen is a two-semester
course in the “Wahlpflichtbereich” (specialization phase) in semester 5/6 of the bachelor program
“Computer Science” at FAU Erlangen. It is also available as a (somewhat remedial) course in the
“Vertiefungsmodul Künstliche Intelligenz” in the Computer Science Master’s program.
Prerequisites: AI-1 & 2 builds on the mandatory courses in the FAU bachelor’s program, in
particular the course “Grundlagen der Logik in der Informatik” [Glo], which already covers a lot
of the materials usually presented in the “knowledge and reasoning” part of an introductory AI
course. The AI 1 & 2 course also minimizes overlap with that course.
The course is relatively elementary; we expect that any student who has attended the mandatory
CS courses at FAU Erlangen can follow it.
Open to external students: Other bachelor programs are increasingly co-opting the course as a
specialization option. There is no inherent restriction to computer science students in this course.
Students with other study biographies – e.g. students from other bachelor programs or external
Master's students – should be able to pick up the prerequisites when needed.
0.1.4 Acknowledgments
Materials: Most of the materials in this course are based on Russell/Norvig's book “Artificial
Intelligence — A Modern Approach” (AIMA [RN95]). Even the slides are based on a LaTeX-based
slide set, but heavily edited. The section on search algorithms is based on materials obtained from
Bernhard Beckert (then Uni Koblenz), which is in turn based on AIMA. Some extensions have
been inspired by an AI course by Jörg Hoffmann and Wolfgang Wahlster at Saarland University
in 2016. Finally Dennis Müller suggested and supplied some extensions on AGI. Florian Rabe,
Max Rapp and Katja Berčič have carefully re-read the text and pointed out problems.
All course materials have been restructured and semantically annotated in the STEX format,
so that we can base additional semantic services on them.
AI Students: The following students have submitted corrections and suggestions to this and
earlier versions of the notes: Rares Ambrus, Ioan Sucan, Yashodan Nevatia, Dennis Müller, Si-
mon Rainer, Demian Vöhringer, Lorenz Gorse, Philipp Reger, Benedikt Lorch, Maximilian Lösch,
Luca Reeb, Marius Frinken, Peter Eichinger, Oskar Herrmann, Daniel Höfer, Stephan Mattejat,
Matthias Sonntag, Jan Urfei, Tanja Würsching, Adrian Kretschmer, Tobias Schmidt, Maxim On-
ciul, Armin Roth, Liam Corona, Tobias Völk, Lena Voigt, Yinan Shao, Michael Girstl, Matthias
Vietz, Anatoliy Cherepantsev, Stefan Musevski, Matthias Lobenhofer, Philipp Kaludercic, Di-
warkara Reddy, Martin Helmke, Stefan Müller, Dominik Mehlich, Paul Martini, Vishwang Dave,
Arthur Miehlich, Christian Schabesberger, Vishaal Saravanan, Simon Heilig, Michelle Fribrance,
Wenwen Wang, Xinyuan Tu, Lobna Eldeeb.
0.1 Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.1 Course Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.2 Course Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.3 This Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
0.1.4 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
0.1.5 Recorded Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
1 Preliminaries 1
1.1 Administrative Ground Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Getting Most out of AI-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Learning Resources for AI-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 AI-Supported Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Preliminaries
In this chapter, we want to get all the organizational matters out of the way, so that we can get into
the discussion of artificial intelligence content unencumbered. We will talk about the necessary
administrative details, go into how students can get the most out of the course, talk about where the
various resources provided with the course can be found, and finally introduce the ALeA system,
an experimental learning support system (using AI methods) for the AI course.
Note: I do not literally presuppose the courses on the slide above – most of you do not have a
bachelor's degree from FAU, so you cannot have taken them. And indeed some of the content of
these courses is irrelevant for AI-1. Stating these courses is just the easiest way of specifying what
content I will be building on – and any graduate course has to build on something.
Many of you will have taken the moral equivalent of these courses in your undergraduate studies
at your home university. If you did not, you will have to somehow catch up on the content as we
go along in AI-1. This should be possible with enough motivation.
There are essentially three skillsets that are essential for AI-1:
1. A solid understanding and practical skill in programming (whatever programming language)
2. A good understanding and practice in using mathematical language to represent complex struc-
tures
3. A solid understanding of formal languages and grammars, as well as applied complexity theory
(basics of theoretical computer science).
Without (catching up on) these the AI-1 course will be quite frustrating and hard.
We will briefly go over the most important topics in ?? to synchronize concepts and notation.
Note that if you do not have a formal education in courses like the ones mentioned above you will
very probably have to do significant remedial work.
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Overall (Module) Grade:
Preparedness Quizzes
PrepQuizzes: Every Tuesday at 16:15 we start the lecture with a 10 min online quiz
– the PrepQuiz – about the material from the previous week. (starts in week 2)
Motivations: We do this to
https://courses.voll-ki.fau.de/quiz-dash/ai-1
You have to be logged into ALeA! (via FAU IDM)
This Thursday we will try out the PrepQuiz infrastructure with a pretest!
The test will also be used to refine the ALeA learner model, which may make
the learning experience in ALeA better. (see below)
Due to the current AI hype, the course Artificial Intelligence is very popular and thus many degree
programs at FAU have adopted it for their curricula. Sometimes the course setup that fits the
CS program does not fit the others very well; therefore there are some special conditions, which I
want to state here.
Just send me an e-mail and come to the exam, (we do the necessary admin)
Tell your program coordinator about AI-1/2 so that they remedy this situation
In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
Didactic Intuition: Homework assignments give you material to test your under-
standing and show you how to apply it.
Homeworks give no points, but without trying you are unlikely to pass the exam.
Homeworks will be mainly peer-graded in the ALeA system.
Didactic Motivation: Through peer grading students are able to see mistakes
in their thinking and can correct any problems in future assignments. By grading
assignments, students may learn how to complete assignments more accurately and
how to improve their future results. (not just us being lazy)
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and you will take very little home from the course. Just sitting in the course and nodding is not
enough!
Homework/Tutorial Discipline:
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study groups help)
Humans will be trying to understand the text/code/math when grading it.
Go to the tutorials, discuss with your TA! (they are there for you!)
If you have questions please make sure you discuss them with the instructor, the teaching assistants,
or your fellow students. There are three sensible venues for such discussions: online in the lectures,
in the tutorials, which we discuss now, or in the course forum – see below. Finally, it is always a
very good idea to form study groups with your friends.
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA: Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, [email protected]
Tutorials: One each taught by Florian Rabe (lead); Yasmeen Shawat, Hatem
Mousa, Xinyuan Tu, and Florian Guthmann.
Collaboration
Definition 1.2.1. Collaboration (or cooperation) is the process of groups of agents
acting together for common, mutual benefit, as opposed to acting in competition
for selfish benefit. In a collaboration, every agent contributes to the common goal
and benefits from the contributions of others.
As we said above, almost all of the components of the AI-1 course are optional. That even applies
to attendance. But make no mistake, attendance is important to most of you. Let me explain, . . .
Note: There are two ways of learning: (both are OK, your mileage may vary)
Approach B: Read a book/papers (here: lecture notes)
Approach I: come to the lectures, be involved, interrupt the instructor whenever
you have a question.
The only advantage of I over B is that books/papers do not answer questions
FAU has issued a very insightful guide on using lecture videos. It is a good idea to heed these
recommendations, even if they seem annoying at first.
Catch up.
You will only pass the exam if you can do AI-1 yourself!
Intuition: AI tools like ChatGPT, Copilot, etc. (see also [She24])
can help you solve problems, (valuable tools in production situations)
but hinder learning if used for homeworks/quizzes, etc. (like driving instead of
jogging)
What (not) to do: (to get most of the brave new AI-supported world)
try out these tools to get a first-hand intuition what they can/cannot do
challenge yourself while learning so that you can also do it (mind over matter!)
The central idea in the AI4AI approach – using AI to support learning AI – and thus the ALeA
system is that we want to make course materials – i.e. what we give to students for preparing and
postparing lectures – more like teachers and study groups (only available 24/7) than like static
books.
ALeA Status: The ALeA system is deployed at FAU for over 1000 students
taking eight courses
(some) students use the system actively (our logs tell us)
reviews are mostly positive/enthusiastic (error reports pour in)
The ALeA AI-1 page is the central entry point for working with the ALeA system. You can get
to all the components of the system, including two presentations of the course contents (notes-
and slides-centric ones), the flashcards, the localized forum, and the quiz dashboard.
We now come to the heart of the ALeA system: its learning support services, which we will now
briefly introduce. Note that this presentation is not really sufficient to understand what you may
be getting out of them; you will have to try them and interact with them sufficiently that the
learner model can get a good estimate of your competencies to adapt the results to you.
Example 1.4.5 (Guided Tour). A guided tour for a concept c assembles defini-
tions/etc. into a self-contained mini-course culminating at c.
c = countable
Note that this is only an initial collection of learning support services; we are constantly working
on additional ones. Look out for feature notifications on the upper right hand of
the ALeA screen.
While the learning support services up to now have been addressed to individual learners, we
now turn to services addressed to communities of learners, ranging from study groups with three
learners, to whole courses, and even – eventually – all the alumni of a course, if they have not
de-registered from ALeA.
Currently, the community aspect of ALeA only consists in localized interactions with the course
materials.
The ALeA system uses the semantic structure of the course materials to localize some interactions
that otherwise often come from separate applications. Here we see two:
1. one for reporting content errors – and thus making the material better for all learners – and
2. a localized course forum, where forum threads can be attached to learning objects.
Localized comments induce a thread in the ALeA forum (like the StudOn
Forum, but targeted towards specific learning objects.)
Let us briefly look into how the learning support services introduced above might work, focusing
on where the necessary information might come from. Even though some of the concepts in the
discussion below may be new to AI-1 students, it is worth looking into them. Bear with us as we
try to explain the AI components of the ALeA system.
ALeA ≙ Data-Driven & AI-enabled Learning Assistance
[Diagram: Learner Model, Rhetoric/Didactic Model]
understand the objects and their properties they are talking about
have readymade formulations for how to convey them best
the domain model and didactic relations to determine the order of LOs
the learner model to determine what to show
the rhetoric relations to make the dialogue coherent
We can use the same four models discussed in the space of guided tours to deploy additional
learning support services, which we now discuss.
We have already seen above how the learner model can drive the drilling with flashcards. It can
also be used for the configuration of card stacks by configuring a domain (e.g. a section in the
course materials) and a competency threshold. We now come to a very important issue that we
always face when we build AI systems that interface with humans. Most web technology companies
either take the approach "the user pays for the services with their personal data, which is sold
on" or integrate advertising for remuneration. Neither is acceptable in a university setting.
But abstaining from monetizing personal data still leaves the problem how to protect it from
intentional or accidental misuse. Even though the GDPR has quite extensive exceptions for
research, the ALeA system – a research prototype – adheres to the principles and mandates of
the GDPR. In particular it makes sure that personal data of the learners is only used in learning
support services directly or indirectly initiated by the learners themselves.
ALeA Promise: The ALeA team does the utmost to keep your personal data
safe. (SSO via FAU IDM/eduGAIN, ALeA trust zone)
ALeA Privacy Axioms:
1. ALeA only collects learner model data about logged-in users.
2. Personally identifiable learner model data is only accessible to its subject
(delegation possible)
3. Learners can always query the learner model about its data.
4. All learner model data can be purged without negative consequences (except
usability deterioration)
5. Logging into ALeA is completely optional.
Observation: Authentication for bonus quizzes is somewhat less optional, but
you can always purge the learner model later.
So, now that you have an overview over what the ALeA system can do for you, let us see what
you have to concretely do to be able to use it.
To use the ALeA system, you will have to log in via SSO: (do it now)
go to https://courses.voll-ki.fau.de/course-home/ai-1,
Even if you did not understand some of the AI jargon or the underlying methods (yet), you should
be good to go for using the ALeA system in your day-to-day work.
Chapter 2
AI – Who?, What?, When?, Where?, and Why?
We start the course by giving an overview of (the problems, methods, and issues of ) Artificial
Intelligence, and what has been achieved so far.
Naturally, this will dwell mostly on philosophical aspects – we will try to understand what
the important issues might be, what questions we should even be asking, what the most
important avenues of attack may be, and where AI research is being carried out.
In particular the discussion will be very non-technical – we have very little basis to discuss
technicalities yet. But stay with me, this will drastically change very soon. A Video Nugget
covering the introduction of this chapter can be found at https://fau.tv/clip/id/21467.
Maybe we can get around the problems of defining “what artificial intelligence is”, by just describing
the necessary components of AI (and how they interact). Let’s have a try to see whether that is
more informative.
Inference
Perception
Language understanding
Emotion
Note that this list of components is controversial as well. Some say that it lumps together cognitive
capacities that should be distinguished or forgets others, . . . . We state it here much more to get
AI-1 students to think about the issues than to make it normative.
in outer space: outer space systems need autonomous control, since remote control is impossible
due to the time lag.
in artificial limbs: the user controls the prosthesis via existing nerves and can e.g. grip a sheet
of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks,
charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a
cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would
solve/reach AI in two/three decades.
Consequence: AI still asks the big questions. (and still promises answers soon)
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
AI Conundrum: Once AI solves a subfield it is called “computer science”.
(becomes a separate subfield of CS)
Example 2.2.1. Functional/Logic Programming, automated theorem proving,
Planning, machine learning, Knowledge Representation, . . .
Still another Consequence: AI research was alternately flooded with money and cut off
brutally.
All of these phenomena can be seen in the growth of AI as an academic discipline over the course
of its now over 70 year long history.
[Timeline figure: Turing Test; Dartmouth Conference; AI Winter 1 (1974–1980), Lighthill report;
AI Winter 2 (1987–1994); WWW ↝ data/computing explosion; AI becomes scarily effective and
ubiquitous; AI consequences, biases, regulation; excitement fades, some applications profit a lot;
the AI bubble bursts and the next AI winter comes.]
Of course, the future of AI is still unclear, we are currently in a massive hype caused by the advent
of deep neural networks being trained on all the data of the Internet, using the computational
power of huge compute farms owned by an oligopoly of massive technology companies – we are
definitely in an AI summer.
But AI as an academic community and the tech industry also make outrageous promises, and
the media pick them up and distort them out of proportion, . . . So public opinion could flip again, sending
AI into the next winter.
interact with it via sensors and actuators. Here, the main method for realizing
intelligent behavior is by learning from the world.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersection
of computer science (logic, programming, applied statistics), Cognitive Science (psychology, neu-
roscience), philosophy (can machines think, what does that mean?), linguistics (natural language
understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We combine the topics in this way in this course, not only because this reproduces the histor-
ical development but also as the methods of statistical and subsymbolic AI share a common
basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI
have their respective fortes correspond to natural application areas.
Consumer tasks: consumer grade applications have tasks that must be fully
generic and wide coverage. ( e.g. machine translation like Google Translate)
Producer tasks: producer grade applications must be high-precision, but can be
restricted to a narrow domain. (see the machine tool example below)
[Diagram: producer tasks require near-100% precision.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
A domain of producer tasks I am interested in: mathematical/technical documents.
An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also provide comprehensive
machine operation manuals, a non-trivial undertaking, since no two machines are identical and
they must be translated into many languages, leading to hundreds of documents. As those manuals
share a lot of semantic content, their management should be supported by AI techniques. It is
critical that these methods maintain a high precision, since operation errors can easily lead to very
costly machine damage and loss of production. On the other hand, the domain of these manuals is
quite restricted: a machine tool has only a couple of hundred components, which can be described
by a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
One can usually defuse public worries about “is AI going to take control over the world” by just
explaining the difference between strong AI and weak AI clearly.
I would like to add a few words on AGI that – if you adopt them; they are not universally accepted
– will strengthen the arguments differentiating between strong and weak AI.
Kohlhase's View: Weak AI is here, strong AI is very far off. (not in my lifetime)
But even if that is true, weak AI will affect all of us deeply in everyday life.
Example 2.4.4. You should not train to be an accountant or truck driver!
(bots will replace you soon)
I want to conclude this section with an overview over the recent protagonists – both personal and
institutional – of AGI.
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
Observation: The ability to represent knowledge about the world and to draw
logical inferences is one of the central components of intelligent behavior.
Thus: reasoning components of some form are at the heart of many AI systems.
Research in the KWARC group covers a variety of topics, ranging from the foundations of
mathematics to relatively applied web information systems. I will try to organize them into three
pillars here.
For all of these areas, we are looking for bright and motivated students to work with us. This
can take various forms: theses, internships, and paid student assistantships.
Sciences like physics or geology, and engineering need high-powered equipment to perform mea-
surements or experiments. Computer science and in particular the KWARC group needs high-
powered human brains to build systems and conduct thought experiments.
The KWARC group may not always have as much funding as other AI research groups, but
we are very dedicated to giving the best possible research guidance to the students we supervise.
So if this appeals to you, please come by and talk to us.
Part I
This part of the lecture notes sets the stage for the technical parts of the course by establishing
a common framework (Rational Agents) that gives context and ties together the various methods
discussed in the course.
After having seen what AI can do and where AI is being employed today (see ??), we will now
Logic Programming
We will now learn a new programming paradigm: logic programming, which is one of the most
influential paradigms in AI. We are going to study Prolog (the oldest and most widely used) as a
concrete example of ideas behind logic programming and use it for our homeworks in this course.
As Prolog is a representative of a programming paradigm that is new to most students, pro-
gramming will feel weird and tedious at first. But once we subtract the unusual syntax and program
organization, logic programming really only amounts to recursive programming, just as in func-
tional programming (the other declarative programming paradigm). So the usual advice applies:
keep staring at it and practice on easy examples until the pain goes away.
Logic Programming
Idea: Use logic as a programming language!
We state what we know about a problem (the program) and then ask for results
(what the program would compute).
Example 3.1.1.
How to achieve this? Restrict a logic calculus sufficiently that it can be used as a
computational procedure.
Remark: This idea leads to a totally new programming paradigm: logic programming.
We now formally define the language of Prolog, starting off with the atomic building blocks.
human(leibniz).
human(sokrates).
greek(sokrates).
fallible(X):−human(X).
The first three lines are Prolog facts and the last a rule.
The whole point of writing down a knowledge base (a Prolog program with knowledge about the
situation) is that we do not have to write down all the knowledge, but only a (small) subset, from which
the rest follows. We have already seen how this can be done: with logic. For logic programming
we will use a logic called “first-order logic”, which we will not formally introduce here.
Definition 3.1.8. The knowledge base given by a Prolog program is that set of facts
that can be derived from it by Modus Ponens (MP), conjunction introduction (∧I), and instantiation (Subst):
MP: from A and A ⇒ B, derive B.
∧I: from A and B, derive A ∧ B.
Subst: from A, derive [B/X](A).
?? introduces a very important distinction: that between a Prolog program and the knowledge
base it induces. Whereas the former is a finite, syntactic object (essentially a string), the latter
may be an infinite set of facts, which represents the totality of knowledge about the world or the
aspects described by the program.
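As a small worked illustration (mine, not from the original notes): with the “fallible Greeks” program above, the fact fallible(sokrates) belongs to the induced knowledge base, since it can be derived using the rules of ??:
  human(X) ⇒ fallible(X)                          (rule from the program)
  human(sokrates) ⇒ fallible(sokrates)            (Subst with [sokrates/X])
  human(sokrates)                                 (fact from the program)
  fallible(sokrates)                              (MP)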
As knowledge bases can be infinite, we cannot pre-compute them. Instead, logic programming
languages compute fragments of the knowledge base by need, i.e. whenever a user wants to check
membership; we call this approach querying: the user enters a query expression and the system
answers yes or no. This answer is computed in a depth-first search process.
Problem: Knowledge bases can be big and even infinite. (cannot pre-compute)
Example 3.1.10. The knowledge base induced by the Prolog program
nat(zero).
nat(s(X)) :− nat(X).
Definition 3.1.12. The Prolog interpreter keeps backchaining from the top to the
bottom of the program until the query
succeeds, i.e. contains no more goals, or (answer: true)
fails, i.e. backchaining becomes impossible. (answer: false)
Note that backchaining replaces the current query with the body of the rule suitably instantiated.
For rules with a long body this extends the list of current goals, but for facts (rules without a
body), backchaining shortens the list of current goals. Once there are no goals left, the Prolog
interpreter finishes and signals success by issuing the string true.
If no rules match the current subgoal, then the interpreter terminates and signals failure with the
string false.
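For instance (a small sketch of my own, assuming the nat program from above has been consulted), the query nat(s(s(zero))) succeeds after backchaining twice over the rule and once over the fact:
?- nat(s(s(zero))).
% goal matches the head nat(s(X)) with X = s(zero); new goal: nat(s(zero))
% goal matches nat(s(X)) again with X = zero;       new goal: nat(zero)
% nat(zero) matches the fact nat(zero); no goals remain
true.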
We can extend querying from simple yes/no answers to programs that return values by simply
using variables in queries. In this case, the Prolog interpreter returns a substitution.
3.2. PROGRAMMING AS SEARCH 41
In ?? the first backchaining step binds the variable X to the query variable Y, which gives us the
two subgoals has_wheels(Y,4),has_motor(Y). which again have the query variable Y. The next
backchaining step binds this to mybmw, and the third backchaining step exhausts the subgoals.
So the query succeeds with the (overall) answer substitution Y = mybmw. With this setup, we
can already do the “fallible Greeks” example from the introduction.
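A minimal sketch (assuming the facts and rule shown at the beginning of this chapter have been consulted): the “fallible Greeks” question is a conjunctive query with a variable.
?- fallible(X), greek(X).
X = sokrates
true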
We will introduce working with the interpreter using unary natural numbers as examples: we
first add the fact to the knowledge base
unat(zero).
which asserts that the predicate unat is true on the term zero. Generally, we can add a fact to
the knowledge base either by writing it into a file (e.g. example.pl) and then “consulting it” by
writing one of the following three commands into the interpreter:
[example]
consult(’example.pl’).
consult(’example’).
or by directly typing
assert(unat(zero)).
into the Prolog interpreter. Next tell Prolog about the following rule
assert(unat(suc(X)) :− unat(X)).
which gives the Prolog runtime an initial (infinite) knowledge base, which can be queried by
?− unat(suc(suc(zero))).
Even though we can use any text editor to program Prolog, running Prolog in a modern
editor with language support is incredibly much nicer than at the command line, because you can see
the whole history of what you have done. It is better for debugging too.
We say that a goal G matches a head H, iff we can make them equal by replacing
variables in H with terms.
We can force backtracking to compute more answers by typing ;.
Note: With the Prolog search procedure detailed above, computation can easily go into infinite
loops, even though the knowledge base could provide the correct answer. Consider for instance
the simple program
(Footnote: unat is short for “unary natural numbers”; we cannot use the predicate nat and the
constructor function s here, since they are already used above.)
p(X):− p(X).
p(X):− q(X).
q(X).
If we query this with ?− p(john), then DFS will go into an infinite loop because Prolog by default
expands the first clause. However, we can conclude that p(john) is true if we start with
the second clause.
In fact this is a necessary feature and not a bug for a programming language: we need to
be able to write non-terminating programs, since the language would not be Turing complete
otherwise. The argument can be sketched as follows: we have seen that for Turing machines the
halting problem is undecidable. So if all Prolog programs were terminating, then Prolog would be
weaker than Turing machines and thus not Turing complete.
We will now fortify our intuition about the Prolog search procedure by an example that extends
the setup from ?? by a new choice of a vehicle that could be a car (if it had a motor).
Backtracking by Example
Example 3.2.2. We extend ??:
has_wheels(mytricycle,3).
has_wheels(myrollerblade,3).
has_wheels(mybmw,4).
has_motor(mybmw).
car(X):-has_wheels(X,3),has_motor(X). % cars sometimes have three wheels
car(X):-has_wheels(X,4),has_motor(X). % and sometimes four.
?- car(Y).
?- has_wheels(Y,3),has_motor(Y). % backtrack point 1
Y = mytricycle % backtrack point 2
?- has_motor(mytricycle).
FAIL % fails, backtrack to 2
Y = myrollerblade % backtrack point 2
?- has_motor(myrollerblade).
FAIL % fails, backtrack to 1
?- has_wheels(Y,4),has_motor(Y).
Y = mybmw
?- has_motor(mybmw).
Y=mybmw
true
In general, a Prolog rule of the form A:−B,C reads as A, if B and C. If we want to express A if
B or C, we have to express this as two separate rules A:−B and A:−C and leave the choice of which
one to use to the search procedure.
In ?? we indeed have two clauses for the predicate car/1; one each for the cases of cars with three
and four wheels. As the three-wheel case comes first in the program, it is explored first in the
search process.
Recall that at every point where the Prolog interpreter has the choice between two clauses for a
predicate, it chooses the first and leaves a backtrack point. In ?? this happens first for the predicate
car/1, where we explore the case of three-wheeled cars. The Prolog interpreter immediately has
to choose again – between the tricycle and the rollerblade, which both have three wheels. Again,
it chooses the first and leaves a backtrack point. But as tricycles do not have motors, the subgoal
has_motor(mytricycle) fails and the interpreter backtracks to the chronologically nearest backtrack
point (the second one) and tries to fulfill has_motor(myrollerblade). This fails again, and the next
backtrack point is point 1 – note the stack-like organization of backtrack points which is in keeping
with the depth-first search strategy – which chooses the case of four-wheeled cars. This ultimately
succeeds as before with Y = mybmw.
Now we can directly write the recursive equations X + 0 = X (base case) and
X + s(Y ) = s(X + Y ) into the knowledge base.
add(X,zero,X).
add(X,s(Y),s(Z)) :− add(X,Y,Z).
expt(X,zero,s(zero)).
expt(X,s(Y),Z) :− expt(X,Y,W), mult(X,W,Z).
Note: Viewed through the right glasses logic programming is very similar to functional program-
ming; the only difference is that we are using (n + 1)-ary relations rather than n-ary functions. To see
how this works let us consider the addition function/relation example above: instead of a binary
function + we program a ternary relation add, where relation add(X,Y,Z) means X + Y = Z. We
start with the same defining equations for addition, rewriting them to relational style.
The first equation is straight-forward via our correspondence and we get the Prolog fact
add(X,zero,X). For the equation X + s(Y ) = s(X + Y ) we have to work harder, the straight-
forward relational translation add(X,s(Y),s(X+Y)) is impossible, since we have only partially
replaced the function + with the relation add. Here we take refuge in a very simple trick that we
can always do in logic (and mathematics of course): we introduce a new name Z for the offending
expression X + Y (using a variable) so that we get the fact add(X,s(Y ),s(Z)). Of course this is
not universally true (remember that this fact would say that “X + s(Y ) = s(Z) for all X, Y , and
Z”), so we have to extend it to a Prolog rule add(X,s(Y),s(Z)):−add(X,Y,Z). which relativizes to
mean “X + s(Y ) = s(Z) for all X, Y , and Z with X + Y = Z”.
Indeed the rule implements addition as a recursive predicate; we can see that the recursion
relation is terminating, since the left hand sides have one more constructor for the successor
function. The examples for multiplication and exponentiation can be developed analogously, but
we have to use the naming trick twice.
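The multiplication relation used by expt above is not shown in this excerpt; a sketch of my own, applying the naming trick to the equations X · 0 = 0 and X · s(Y) = X · Y + X, could look like this:
mult(X,zero,zero).
mult(X,s(Y),Z) :- mult(X,Y,W), add(X,W,Z).   % W names X·Y, then Z = X + W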
We now apply the same principle of recursive programming with predicates to other examples
to reinforce our intuitions about the principles.
Example 3.2.4. We can also use the add relation for subtraction without changing
the implementation. We just use variables in the “input positions” and ground terms
in the other two. (possibly very inefficient “generate and test approach”)
?−add(s(zero),X,s(s(s(zero)))).
X = s(s(zero))
true
Note: Note that the is relation does not allow “generate and test” inversion as it insists on the
right hand side being ground. In our example above, this is not a problem if we call fib with
the first (“input”) argument a ground term. Indeed, if it matches the last rule with a goal ?− fib(g,Y),
where g is a ground term, then g−1 and g−2 are ground and thus D and E are bound to the
(ground) result terms. This makes the input arguments in the two recursive calls ground, and we
get ground results for Z and W, which allows the last goal to succeed with a ground result for
Y. Note as well that re-ordering the body literals of the rule so that the recursive calls are made
before the computation literals will lead to failure.
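The fib program discussed here is not reproduced at this point in the notes; the following is a sketch under my own assumptions that matches the description (D and E name the decremented arguments, Z and W the recursive results):
fib(0,1).
fib(1,1).
fib(X,Y) :- X > 1, D is X - 1, E is X - 2, fib(D,Z), fib(E,W), Y is Z + W.
% ?- fib(5,Y).  gives  Y = 8
% as noted above, moving the recursive calls before the "is" literals breaks this:
% D and E would then still be unbound when they are needed.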
We will now add the primitive data structure of lists to Prolog; they are constructed by prepending
an element (the head) to an existing list (which becomes the rest list or “tail” of the constructed
one).
append([],L,L).
append([X|R],L,[X|S]):−append(R,L,S).
46 CHAPTER 3. LOGIC PROGRAMMING
reverse([],[]).
reverse([X|R],L):−reverse(R,S),append(S,[X],L).
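A short usage sketch (queries of my own): thanks to the missing input/output distinction, append can also be run “backwards” to enumerate all ways of splitting a list.
?- append([1,2],[3],L).
L = [1,2,3]
?- append(X,Y,[1,2]).
X = [], Y = [1,2] ;
X = [1], Y = [2] ;
X = [1,2], Y = []
?- reverse([1,2,3],R).
R = [3,2,1]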
Logic programming is the third large programming paradigm (together with functional program-
ming and imperative programming).
From a programming practice point of view it is probably best understood as “relational program-
ming” in analogy to functional programming, with which it shares a focus on recursion.
The major difference to functional programming is that “relational programming” does not have
a fixed input/output distinction; it is this fixed distinction that makes the control flow in functional
programs very direct and predictable. Thanks to the underlying search procedure, we can sometimes
make use of the flexibility afforded by logic programming.
If the problem solution involves search (and depth first search is sufficient), we can just get by
with specifying the problem and letting the Prolog interpreter do the rest. In ?? we just specify
that list Xs can be sorted into Ys, iff Ys is a permutation of Xs and Ys is ordered. Given a concrete
(input) list Xs, the Prolog interpreter will generate all permutations Ys of Xs via the predicate
perm/2 and then test whether they are ordered.
This is a paradigmatic example of logic programming. We can (sometimes) directly use the
specification of a problem as a program. This makes the argument for the correctness of the
program immediate, but may make the program execution non-optimal.
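The program referenced in ?? is not shown here; a sketch under my own naming assumptions (mysort, perm, and ordered are hypothetical names; select/3 is the standard list predicate) could be:
mysort(Xs,Ys) :- perm(Xs,Ys), ordered(Ys).    % the specification, used as a program
perm([],[]).
perm(L,[X|R]) :- select(X,L,T), perm(T,R).    % pick any element X, permute the rest T
ordered([]).
ordered([_]).
ordered([X,Y|R]) :- X =< Y, ordered([Y|R]).
% ?- mysort([3,1,2],Ys).  gives  Ys = [1,2,3]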
We “define” the computational behavior of the predicate rev, but the list constructors
[. . .] are just used to construct lists from arguments.
Example 3.2.16 (Trees and Leaf Counting). We represent (unlabelled) trees via
the function t from tree lists to trees. For instance, a balanced binary tree of depth
2 is t([t([t([]),t([])]),t([t([]),t([])])]). We count leaves by
leafcount(t([]),1).
leafcount(t([V]),W) :− leafcount(V,W).
leafcount(t([X|R]),Y) :− leafcount(X,Z), leafcount(t(R),W), Y is Z + W.
RTFM (≙ “read the fine manuals”)
RTFM Resources: There are also lots of good tutorials on the web,
I personally like [Fis; LPN],
[Fla94] has a very thorough logic-based introduction,
In this chapter we will briefly recap some of the prerequisites from theoretical computer science
that are needed for understanding Artificial Intelligence 1.
                     performance
size        linear       quadratic    exponential
n           100n µs      7n² µs       2ⁿ µs
1           100 µs       7 µs         2 µs
5           .5 ms        175 µs       32 µs
10          1 ms         .7 ms        1 ms
45          4.5 ms       14 ms        1.1 Y
100         ...          ...          ...
1 000       ...          ...          ...
10 000      ...          ...          ...
1 000 000   ...          ...          ...
The last number in the rightmost column may surprise you. Does the run time really grow that
fast? Yes, as a quick calculation shows; and it becomes much worse, as we will see.
                     performance
size        linear       quadratic    exponential
n           100n µs      7n² µs       2ⁿ µs
1           100 µs       7 µs         2 µs
5           .5 ms        175 µs       32 µs
10          1 ms         .7 ms        1 ms
45          4.5 ms       14 ms        1.1 Y
100         100 ms       7 s          10¹⁶ Y
1 000       1 s          12 min       −
10 000      10 s         20 h         −
1 000 000   1.6 min      2.5 mon      −
So it does make a difference for larger computational problems what algorithm we choose. Consid-
erations like the one we have shown above are very important when judging an algorithm. These
evaluations go by the name of “complexity theory”.
Let us now recapitulate some notions of elementary complexity theory: we are interested in the
worst-case growth of the resources (time and space) required by an algorithm in terms of the sizes
of its arguments. Mathematically we look at the functions from input size to resource size and
classify them into “big-O” classes, abstracting from constant factors (which depend on the machine
the algorithm runs on and which we cannot control) and initial (algorithm startup) factors.
Definition 4.1.3. We say that an algorithm α that terminates in time t(n) for all
inputs of size n has running time T (α) := t.
Let S ⊆ N → N be a set of natural number functions, then we say that α has time
complexity in S (written T (α)∈S or colloquially T (α)=S), iff t∈S. We say α has
space complexity in S, iff α uses only memory of size s(n) on inputs of size n and
s∈S.
Time/space complexity depends on size measures. (no canonical one)
Definition 4.1.4. The following sets are often used for S in T (α):
Landau set     class name     rank       Landau set     class name     rank
O(1)           constant       1          O(n²)          quadratic      4
O(log₂(n))     logarithmic    2          O(nᵏ)          polynomial     5
O(n)           linear         3          O(kⁿ)          exponential    6
For AI-1: I expect that given an algorithm, you can determine its complexity class.
(next)
These are not all of the “big-O calculation rules”, but they are enough for most purposes.
Applications: Convince yourselves using the result above that
O(4n³ + 3n + 7^(1000n)) = O(2ⁿ)
O(n) ⊂ O(n·log₂(n)) ⊂ O(n²)
OK, that was the theory, . . . but how do we use that in practice?
What I mean by this is that given an algorithm, we have to determine the time complexity.
This is by no means a trivial enterprise, but we can do it by analyzing the algorithm instruction
by instruction as shown below.
As instructions in imperative programs can introduce new variables, which have their own time
complexity, we have to carry them around via the introduced context, which has to be defined
co-recursively with the time complexity. This makes ?? rather complex. The main two cases to
note here are
• the variable case, which “uses” the context Γ and
• the assignment case, which extends the introduced context by the time complexity of the value.
The other cases just pass around the given context and the introduced context systematically.
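As a small worked example of my own (not tied to the specific rules referenced above): consider a program that, for an input of size n, compares every element against every other element, each comparison costing a constant c. Summing the cost instruction by instruction gives T(n) = Σᵢ₌₁..ₙ Σⱼ₌₁..ₙ c = c·n², so T(n) ∈ O(n²) and the algorithm has quadratic time complexity.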
Let us now put one motivation for knowing about complexity theory into the perspective of the
job market; here the job as a scientist.
Please excuse the chemistry pictures; public imagery for CS is really just quite boring, and this is
what people think of when they say “scientist”. So, imagine that instead of a chemist in a lab, it is
me sitting in front of a computer.
But my 2nd attempt didn’t work either, which got me a bit agitated.
Ta-da . . . when, for once, I turned around and looked in the other direction – CAN one actually
solve this efficiently? – NP-hardness was there to rescue me.
The meat of the story is that there is no profit in trying to invent an algorithm that we could
have known cannot exist. Here is another image that may be familiar to you.
Example 4.1.9. Trying to find a sea route east to India (from Spain) (does not
exist)
Observation: Complexity theory saves you from spending lots of time trying to
invent algorithms that cannot exist.
It’s like, you’re trying to find a route to India (from Spain), and you presume it’s somewhere to
the east, and then you hit a coast, but no; try again, but no; try again, but no; ... if you don’t
have a map, that’s the best you can do. But NP hardness gives you the map: you can check
that there actually is no way through here. But what is this notion of NP-completeness alluded
to above? We observe that we can analyze the complexity of problems by the complexity of the
algorithms that solve them. This gives us a notion of what to expect from solutions to a given
problem class, and thus whether efficient (i.e. polynomial time) algorithms can exist at all.
to predict effects of its observations and actions to obtain a world model. In this section we recap
the basics of formal languages and grammars that form the basis of a compositional theory for
them.
Definition 4.2.2. Note that A⁰ = {⟨⟩}, where ⟨⟩ is the (unique) 0-tuple. With
the definition above we consider ⟨⟩ as the string of length 0, call it the empty
string, and denote it with ϵ.
Note: Sets ≠ strings, e.g. {1, 2, 3} = {3, 2, 1}, but ⟨1, 2, 3⟩ ≠ ⟨3, 2, 1⟩.
Notation: We will often write a string ⟨c1 , . . ., cn ⟩ as ”c1 . . .cn ”, for instance
”abc” for ⟨a, b, c⟩
Example 4.2.3. Take A = {h, 1, /} as an alphabet. Each of the members h, 1,
and / is a character. The vector ⟨/, /, 1, h, 1⟩ is a string of length 5 over A.
Definition 4.2.4 (String Length). Given a string s we denote its length with |s|.
We have multiple notations for concatenation, since it is such a basic operation, which is used
so often that we will need very short notations for it, trusting that the reader can disambiguate
based on the context.
Now that we have defined the concept of a string as a sequence of characters, we can go on to
give ourselves a way to distinguish between good strings (e.g. programs in a given programming
language) and bad strings (e.g. such with syntax errors). The way to do this is via the concept of a
formal language, which we are about to define.
Formal Languages
Definition 4.2.7. Let A be an alphabet, then we define the sets A⁺ := ⋃_{i∈ℕ⁺} Aⁱ
of nonempty strings and A∗ := A⁺ ∪ {ϵ} of strings.
Example 4.2.8. If A = {a, b, c}, then A∗ = {ϵ, a, b, c, aa, ab, ac, ba, . . . , aaa, . . . }.
Definition 4.2.9. A set L ⊆ A∗ is called a formal language over A.
Definition 4.2.10. We use c[n] for the string that consists of the character c
repeated n times.
Example 4.2.11. #[5] = ⟨#, #, #, #, #⟩
Example 4.2.12. The set M := {ba[n] | n ∈ ℕ} of strings that start with the character
b followed by an arbitrary number of a's is a formal language over A = {a, b}.
There is a common misconception that a formal language is something that is difficult to under-
stand as a concept. This is not true; the only thing a formal language does is separate the “good”
from the bad strings. Thus we simply model a formal language as a set of strings: the “good”
strings are members, and the “bad” ones are not.
Of course this definition only shifts complexity to the way we construct specific formal languages
(where it actually belongs), and we have learned two (simple) ways of constructing them: by
repetition of characters, and by concatenation of existing languages. As mentioned above,
the purpose of a formal language is to distinguish “good” from “bad” strings. It is maximally
general, but not helpful, since it does not support computation and inference. In practice we
will be interested in formal languages that have some structure, so that we can represent formal
languages in a finite manner (recall that a formal language is a subset of A∗ , which may be infinite
and even undecidable – even though the alphabet A is finite).
To remedy this, we will now introduce phrase structure grammars (or just grammars), the stan-
dard tool for describing structured formal languages.
We fortify our intuition about these – admittedly very abstract – constructions by an example
and introduce some more vocabulary.
S → NP Vi
NP → Article N
Article → the | a | an
N → dog | teacher | . . .
Vi → sleeps | smells | . . .
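As a side remark connecting this to the Prolog used earlier in these notes (my own sketch, not part of the original example): the same grammar can be written as a Prolog definite clause grammar (DCG) and queried with phrase/2.
s --> np, vi.
np --> article, n.
article --> [the] ; [a] ; [an].
n --> [dog] ; [teacher].
vi --> [sleeps] ; [smells].
% ?- phrase(s, [the, teacher, sleeps]).   succeeds (true)
% ?- phrase(s, [teacher, the, sleeps]).   fails (false)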
Now we look at just how a grammar helps in analyzing formal languages. The basic idea is that
a grammar accepts a word, iff the start symbol can be rewritten into it using only the rules of the
grammar.
A →G B →G C →G D
S →G2 aSb →G2 aaSbb →G2 aaaSbbb →G2 aaaaSbbbb →G2 aaaabbbb
S →G NP Vi →G Article N Vi →G Article teacher Vi
2. The teacher sleeps is a sentence:
S →∗G Article teacher Vi →G the teacher Vi →G the teacher sleeps
(using the grammar from above: S → NP Vi, NP → Article N, Article → the | a | an | . . . ,
N → dog | teacher | . . . , Vi → sleeps | smells | . . . )
Note that this process indeed defines a formal language given a grammar, but does not provide
an efficient algorithm for parsing, even for the simpler kinds of grammars we introduce below.
Observation: The shape of the grammar determines the “size” of its language.
Definition 4.2.26. We call a grammar:
1. context-sensitive (or type 1), if the bodies of production rules have no less symbols
than the heads,
2. context-free (or type 2), if the heads have exactly one symbol,
3. regular (or type 3), if additionally the bodies are empty or consist of a nonterminal,
optionally followed by a terminal symbol.
By extension, a formal language L is called context-sensitive/context-free/regular
(or type 1/type 2/type 3 respectively), iff it is the language of a respective grammar.
Context-free grammars are sometimes called CFGs and context-free languages CFLs.
Example 4.2.27 (Context-sensitive). The language {a[n]b[n]c[n]} is accepted by
S → abc | A
A → aABc | abc
cB → Bc
bB → bb
While the presentation of grammars from above is sufficient in theory, in practice the various
grammar rules are difficult and inconvenient to write down. Therefore computer science – where
grammars are important to e.g. specify parts of compilers – has developed extensions – notations
that can be expressed in terms of the original grammar rules – that make grammars more readable
(and writable) for humans. We introduce an important set now.
We will now build on the notion of BNF grammar notations and introduce a way of writing
down the (short) grammars we need in AI-1 that gives us even more of an overview over what is
happening.
In AI-1 we will only use context-free grammars (simpler, but problem still applies)
in AI-1: I will try to give “grammar overviews” that combine those, e.g. the
grammar of first-order logic.
variables            X ∈ V₁
function constants   f^k ∈ Σ^f_k
predicate constants  p^k ∈ Σ^p_k
terms      t ::= X                       variable
                | f⁰                      constant
                | f^k(t₁, . . ., t_k)     application
formulae   A ::= p^k(t₁, . . ., t_k)      atomic
                | ¬A                      negation
                | A₁ ∧ A₂                 conjunction
                | ∀X.A                    quantifier
We will generally get by with context-free grammars, which have highly efficient parsing
algorithms, for the formal languages we use in this course, but we will not cover the algorithms in
AI-1.
Mathematical Structures
Observation: Mathematicians often cast classes of complex objects as mathemat-
ical structures.
Note that the idea of mathematical structures has been picked up by most programming lan-
guages in various ways and you should therefore be quite familiar with it once you realize the
parallelism.
Even if the idea of mathematical structures may be familiar from programming, it may be quite
intimidating to some students in the mathematical notation we will use in this course. Therefore
we will – when we get around to it – use a special overview notation in AI-1. We introduce it below.
Example 4.3.5.
grammar = ⟨ N : Set (nonterminal symbols),
            Σ : Set (terminal symbols),
            P : {h → b | . . . } (production rules),
            S : N (start symbol) ⟩
production rule h → b = ⟨ h : (Σ ∪ N)∗, N, (Σ ∪ N)∗ (head),
                          b : (Σ ∪ N)∗ (body) ⟩
Read the first line “N : Set (nonterminal symbols)” in the structure above as “N is in
an (unspecified) set and is a nonterminal symbol”.
Here – and in the future – we will use Set for the class of sets; “N is a set”.
In this chapter, we introduce a framework that gives a comprehensive conceptual model for the
multitude of methods and algorithms we cover in this course. The framework of rational agents
accommodates two traditions of AI.
Initially, the focus of AI research was on symbolic methods concentrating on the mental processes
of problem solving, starting from Newell/Simon’s “physical symbol hypothesis”:
A physical symbol system has the necessary and sufficient means for general intelligent action.
[NS76]
Here a symbol is a representation of an idea, object, or relationship that is physically manifested in
(the brain of) an intelligent agent (human or artificial).
Later – in the 1980s – the proponents of embodied AI posited that most features of cognition,
whether human or otherwise, are shaped – or at least critically influenced – by aspects of the
entire body of the organism. The aspects of the body include the motor system, the perceptual
system, bodily interactions with the environment (situatedness) and the assumptions about the
world that are built into the structure of the organism. They argue that symbols are not always
necessary since
The world is its own best model. It is always exactly up to date. It always has every detail
there is to be known. The trick is to sense it appropriately and often enough. [Bro90]
The framework of rational agents – initially introduced by Russell and Wefald in [RW91] – ac-
commodates both: it situates agents with percepts and actions in an environment, but does not
preclude physical symbol systems (i.e. systems that manipulate symbols) as agent functions. Rus-
sell and Norvig make it the central metaphor of their book “Artificial Intelligence – A modern
approach” [RN03], which we follow in this course.
Thinking humanly: “The exciting new effort to make computers think . . . machines with
human-like minds” [Hau85]
Thinking rationally: “The formalization of mental faculties in terms of computational
models” [CM85]
Acting humanly: “The art of creating machines that perform actions requiring intelligence
when performed by people” [Kur90]
Acting rationally: “The branch of CS concerned with the automation of appropriate
behavior in complex situations” [LS93]
≙ building pigeons that can fly so much like real pigeons that they can fool pigeons
Not reproducible, not amenable to mathematical analysis
Thinking Humanly: ↝ Cognitive Science.
We now discuss all of the four facets in a bit more detail, as they all either contribute directly
to our discussion of AI methods or characterize neighboring disciplines.
It was predicted that by 2000, a machine might have a 30% chance of fooling a lay
person for 5 minutes.
Note: In [Tur50], Alan Turing proposed what is now called the Turing test and made the prediction above.
Acting Rationally
Idea: Rational behavior ≙ doing the right thing!
Definition 5.1.4. Rational behavior consists of always doing what is expected to
maximize goal achievement given the available information.
Rational behavior does not necessarily involve thinking e.g., blinking reflex — but
thinking should be in the service of rational action.
Aristotle: Every art and every inquiry, and similarly every action and pursuit, is
thought to aim at some good. (Nicomachean Ethics)
One possible objection to this is that the agent and the environment are conceptualized as separate
entities; in particular, that the image suggests that the agent itself is not part of the environment.
Indeed that is intended, since it makes thinking about agents and environments easier and is of
little consequence in practice. In particular, the offending separation is relatively easily fixed if
needed.
Let us now try to express the agent/environment ideas introduced above in mathematical language
to add the precision we need to start the process towards the implementation of rational agents.
Definition 5.2.6. The agent function f_a of an agent a maps from percept histories
to actions:
f_a : P∗ → A
We assume that agents can always perceive their own actions. (but not necessarily
their consequences)
Problem: Agent functions can become very big and may be uncomputable.
(theoretical tool only)
Here we already see a problem that will recur often in this course: The mathematical formulation
gives us an abstract specification of what we want (here the agent function), but not directly a
way of how to obtain it. Here, the solution is to choose a computational model for agents (an
agent architecture) and see how the agent function can be implemented in an agent program.
[Figure 2.1 and accompanying text from [RN03], Chapter 2: Agents interact with environments
through sensors (receiving percepts) and actuators (performing actions); different agents differ on
the contents of the “white box” in the center. Mathematically speaking, an agent's behavior is
described by the agent function that maps any given percept sequence to an action. We can imagine
tabulating the agent function that describes any given agent; for most agents, this would be a very
large table – infinite, in fact, unless we place a bound on the length of percept sequences we want
to consider. Given an agent to experiment with, we can, in principle, construct this table by trying
out all possible percept sequences and recording which actions the agent does in response. The
table is an external characterization of the agent; internally, the agent function for an artificial
agent will be implemented by an agent program. The agent function is an abstract mathematical
description; the agent program is a concrete implementation, running within some physical system.
The book illustrates these ideas with a very simple example – the vacuum-cleaner world with just
two locations, squares A and B. The vacuum agent perceives which square it is in and whether
there is dirt in the square. It can choose to move left, move right, suck up the dirt, or do nothing.
One very simple agent function: if the current square is dirty, then suck; otherwise, move to the
other square.]

Let us fortify our intuition about all of this with an example, which we will use often in the course
of AI-1.

Example: Vacuum-Cleaner World and Agent
The first implementation idea inspired by the table on the last slide would be a simple table lookup algorithm.
Table-Driven Agents
Idea: We can just implement the agent function as a lookup table and lookup
actions.
We can directly implement this:
  function Table-Driven-Agent(percept) returns an action
    persistent table /* a table of actions indexed by percept sequences */
    var percepts     /* a sequence, initially empty */
    append percept to the end of percepts
    action := lookup(percepts, table)
    return action
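To make the lookup idea concrete, here is a minimal Python sketch of a table-driven agent for the two-square vacuum world; the percept encoding and the (tiny) excerpt of the table are illustrative assumptions, not part of the course material.

  # Minimal sketch of a table-driven agent (assumption: percepts are (location, status) pairs).
  from typing import Dict, List, Optional, Tuple

  Percept = Tuple[str, str]  # e.g. ("A", "Dirty")

  # Hypothetical excerpt of a percept-sequence-indexed action table for the vacuum world.
  TABLE: Dict[Tuple[Percept, ...], str] = {
      (("A", "Clean"),): "Right",
      (("A", "Dirty"),): "Suck",
      (("B", "Clean"),): "Left",
      (("B", "Dirty"),): "Suck",
      (("A", "Dirty"), ("A", "Clean")): "Right",
      (("A", "Clean"), ("B", "Clean")): "Left",
      # ... the full table is huge: one entry per percept sequence.
  }

  class TableDrivenAgent:
      def __init__(self, table: Dict[Tuple[Percept, ...], str]):
          self.table = table
          self.percepts: List[Percept] = []          # persistent percept sequence

      def __call__(self, percept: Percept) -> Optional[str]:
          self.percepts.append(percept)              # append percept to the sequence
          return self.table.get(tuple(self.percepts))  # lookup(percepts, table)

  agent = TableDrivenAgent(TABLE)
  print(agent(("A", "Dirty")))   # -> "Suck"
  print(agent(("A", "Clean")))   # -> "Right"

The sketch also illustrates the problem stated above: the table grows with every percept, so only the shortest percept sequences can realistically be covered.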
Rationality
Idea: Try to design agents that are successful! (aka. “do the right thing”)
Let us see what consequences the observation has that we only need to maximize the expected value of the performance measure, not its actual value.
For the design of an agent for a specific task – i.e. to choose an agent architecture and design an agent program – we have to take into account the performance measure, the environment, and the characteristics of the agent itself, in particular its actions and sensors.
The PEAS criteria are essentially a laundry list of what an agent design task description should
include.
Agents
Which are agents?
(A) James Bond.
(B) Your dog.
(C) Vacuum cleaner.
(D) Thermometer.
Environment types
Observation 5.4.1. Agent design is largely determined by the type of environment
it is intended for.
1. fully observable, iff the agent's sensors give it access to the complete state of the environment at any point in time, else partially observable.
Note: Take the example above with a grain of salt. There are often multiple
interpretations that yield different classifications and different agents. (agent
designer’s choice)
Example 5.4.4. Seen as a multi-agent game, chess is deterministic; seen as a single-agent game, it is stochastic.
Observation 5.4.5. The real world is (of course) a partially observable, stochastic,
sequential, dynamic, continuous, and multi-agent environment. (worst case for AI)
Preview: We will concentrate on the “easy” environment types (fully observable, deterministic, episodic, static, and single-agent) in AI-1 and extend them to “real-world”-compatible ones in AI-2.
In the AI-1 course we will work our way from the simpler environment types to the more general ones. Each environment type will need its own agent types, specialized to surviving and doing well in it.
We will now discuss the basic agent types in turn: we start with simple reflex agents, add internal state, goals, and utility, and finally add learning. A Video Nugget covering this section can be found at https://fau.tv/clip/id/21926.
Agent Types
Observation: So far we have described (and analyzed) agents only by their behavior (cf. agent function f : P* → A).
Problem: This does not help us to build agents. (the goal of AI)
To build an agent, we need to fix an agent architecture and come up with an agent
program that runs on it.
Preview: Four basic types of agent architectures in order of increasing generality:
1. simple reflex agents
2. model-based agents
3. goal-based agents
4. utility-based agents
All these can be turned into learning agents.
Simple Reflex Agents

Agent Schema:
[Figure 2.10 (AIMA): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept. Rectangles denote the current internal state of the agent's decision process, ovals the background information used in the process.]
In the corresponding agent program, an INTERPRET-INPUT function generates an abstracted description of the current state from the percept, and a RULE-MATCH function returns the first rule in the set of rules that matches this state description.
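A small Python sketch of such a simple reflex agent program for the vacuum world; the rule set and the state-description format are illustrative assumptions.

  # Sketch of a simple reflex agent: condition-action rules applied to the current percept only.
  from typing import Tuple

  Percept = Tuple[str, str]  # (location, status), e.g. ("A", "Dirty")

  RULES = [
      # (condition on the abstracted state, action)
      (lambda state: state["status"] == "Dirty", "Suck"),
      (lambda state: state["location"] == "A",   "Right"),
      (lambda state: state["location"] == "B",   "Left"),
  ]

  def interpret_input(percept: Percept) -> dict:
      """Generate an abstracted state description from the current percept."""
      location, status = percept
      return {"location": location, "status": status}

  def rule_match(state: dict, rules) -> str:
      """Return the action of the first rule whose condition matches the state."""
      for condition, action in rules:
          if condition(state):
              return action
      return "NoOp"

  def simple_reflex_vacuum_agent(percept: Percept) -> str:
      return rule_match(interpret_input(percept), RULES)

  print(simple_reflex_vacuum_agent(("A", "Dirty")))  # -> "Suck"
  print(simple_reflex_vacuum_agent(("B", "Clean")))  # -> "Left"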
Problem: Simple reflex agents can only react to the perceived state of the envi-
ronment, not to changes.
Example 5.5.3. Automobile tail lights signal braking by brightening. A simple reflex agent would have to compare subsequent percepts to realize this.
Problem: Partially observable environments get simple reflex agents into trouble.
Example 5.5.4. A vacuum cleaner robot with a defective location sensor ⇝ infinite loops.
Model-based Reflex Agents

Agent Schema:
[Figure 2.12 (AIMA): A model-based reflex agent. It keeps track of the current state of the world using an internal model (“how the world evolves”, “what my actions do”) and then chooses an action in the same way as the reflex agent.]
Definition 5.5.5. A model-based agent is an agent whose agent program maintains a world model via
a sensor model S that, given a state s and a percept p, determines a new state S(s, p),
a transition model T that predicts a new state T(s, a) from a state s and an action a, and
an action function f that maps (new) states to actions.
If the world model of a model-based agent A is in state s and A has taken action a after perceiving p, then A transitions to state s′ = T(S(s, p), a) and takes action a′ = f(s′).
Note: As different percept sequences lead to different states, the agent function f_a : P* → A no longer depends only on the last percept.
Example 5.5.6 (Tail Lights Again). Model-based agents can handle Example 5.5.3 if the states include a concept of tail light brightness.
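A minimal Python sketch of a model-based agent along the lines of the definition above; the concrete sensor model, transition model, and action function (a tail-light-watching car) are illustrative assumptions.

  # Sketch of a model-based agent: s' = T(S(s, p), a), a' = f(s').
  class ModelBasedAgent:
      def __init__(self, sensor_model, transition_model, action_fn, initial_state):
          self.S = sensor_model        # S(state, percept) -> updated state
          self.T = transition_model    # T(state, action)  -> predicted state
          self.f = action_fn           # f(state)          -> action
          self.state = initial_state
          self.last_action = None

      def __call__(self, percept):
          s = self.S(self.state, percept)          # fold in the new percept
          if self.last_action is not None:
              s = self.T(s, self.last_action)      # account for the action we just took
          self.state = s
          self.last_action = self.f(s)
          return self.last_action

  # Toy instantiation (assumed): track the brightness of the car ahead's tail lights.
  agent = ModelBasedAgent(
      sensor_model=lambda s, p: {"prev": s["brightness"], "brightness": p},
      transition_model=lambda s, a: s,              # our actions don't change the lights
      action_fn=lambda s: "brake" if s["brightness"] > s["prev"] else "keep_speed",
      initial_state={"brightness": 0, "prev": 0},
  )
  print(agent(3))   # lights brighten -> "brake"
  print(agent(3))   # unchanged      -> "keep_speed"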
Problem: Having a world model does not always determine what to do (rationally).
Example 5.5.8. Coming to an intersection, where the agent has to decide between
going left and right.
Goal-based Agents
Problem: A world model does not always determine what to do (rationally).
Observation: Having a goal in mind does! (determines future actions)
Agent Schema:
[Figure 2.13 (AIMA): A model-based, goal-based agent. It keeps track of the world state as well as a set of goals it is trying to achieve, and chooses an action that will (eventually) lead to the achievement of its goals.]
Goal-based Agents (continued)

Definition 5.5.9. A goal-based agent is a model-based agent with transition model T that deliberates actions based on goals and a world model: it employs a set G of goals and a goal function f that, given a (new) state s′, selects an action a to best reach G.
The action function is then s ↦ f(T(s), G).
Observation: A goal-based agent is more flexible in the knowledge it can utilize.
Example 5.5.10. A goal-based agent can easily be changed to go to a new destination, while a model-based agent's rules make it go to exactly one destination.

[RN03] motivates goal-based agents as follows: Knowing something about the current state of the environment is not always enough to decide what to do. For example, at a road junction, a taxi can turn left, turn right, or go straight on; the correct decision depends on where the taxi is trying to get to. In other words, as well as a current state description, the agent needs some sort of goal information that describes situations that are desirable, e.g. being at the passenger's destination. (The taxi may, for instance, be driving back home, with a rule to fill up with gas on the way home unless it has at least half a tank; “driving back home” is an aspect of the agent's internal state rather than the world state – the same taxi in the same place at the same time could be intending a different destination.) The agent program can combine the goal information with the model to choose actions that achieve the goal. Sometimes goal-based action selection is straightforward, when goal satisfaction results immediately from a single action; sometimes it is more tricky, when the agent has to consider long sequences of twists and turns to find a way to achieve the goal – search and planning are the subfields of AI devoted to finding such action sequences. Decision making of this kind is fundamentally different from the condition-action rules described earlier, in that it involves consideration of the future: both “What will happen if I do such-and-such?” and “Will that make me happy?”. In the reflex agent designs, this information is not explicitly represented, because the built-in rules map directly from percepts to actions.

Utility-based Agents

Definition 5.5.11. A utility-based agent uses a world model along with a utility function that models its preferences among the states of that world. It chooses the action that leads to the best expected utility.
Agent Schema:
[Figure 2.14 (AIMA): A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. It then chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.]
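To make “best expected utility” concrete, here is a tiny Python sketch of the action-selection step of a utility-based agent; the outcome model P(o | s, a) and the utility values are illustrative assumptions.

  # Sketch: choose the action with the highest expected utility
  # EU(a) = sum over outcomes o of P(o | s, a) * U(o).
  def expected_utility(action, state, outcome_dist, utility):
      return sum(p * utility(o) for o, p in outcome_dist(state, action))

  def choose_action(state, actions, outcome_dist, utility):
      return max(actions, key=lambda a: expected_utility(a, state, outcome_dist, utility))

  # Toy model (assumed): a risky shortcut vs. a safe detour.
  def outcome_dist(state, action):
      if action == "shortcut":
          return [("arrive_early", 0.6), ("stuck_in_mud", 0.4)]
      return [("arrive_on_time", 1.0)]

  utility = {"arrive_early": 10, "arrive_on_time": 7, "stuck_in_mud": -20}.get

  print(choose_action("start", ["shortcut", "detour"], outcome_dist, utility))  # -> "detour"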
Learning Agents
Agent Schema:
[Learning agent schema (following [RN03]): the performance element (the agent proper) is complemented by a critic, which compares the agent's behavior against a performance standard and provides feedback; a learning element, which uses this feedback to make changes to the performance element's knowledge; and a problem generator, which proposes exploratory actions according to the learning goals.]
Solver specific to a particular problem (“domain”)   vs.   Solver based on a description in a general problem-description language (e.g., the rules of any board game).
More efficient.   vs.   Much less design/maintenance work.
Also: The additional internal structure will make the algorithms more complex.
Example 5.6.2. Consider the problem of finding a driving route from one end of
a country to the other via some sequence of cities.
In an atomic representation the state is represented by the name of a city.
In a factored representation we may have attributes “gps-location”, “gas”,. . .
(allows information sharing between states and uncertainty)
But how do we represent a situation where a large truck is blocking the road, because it is trying to back into a driveway, but a loose cow is blocking its path? (An attribute “TruckAheadBackingIntoDairyFarmDrivewayBlockedByLooseCow” is unlikely.)
In a structured representation, we can have objects for trucks, cows, etc. and
their relationships. (at “run-time”)
Note: The set of states in atomic representations and attributes in factored ones is determined
at design time, while the objects and their relationships in structured ones are discovered at
“runtime”.
Here – as always when we evaluate representations – the crucial aspect to look out for is the identity conditions: when do we consider two representations equal, and when can we (or, more crucially, algorithms) distinguish them?
For instance, factored representations consider two world representations equal iff the values of all attributes – which are determined at agent design time and thus immutable by the agent – are equal. So the agent designer has to make sure to add all the attributes to the chosen representation that are necessary to distinguish environments that the agent program needs to treat differently.
It is tempting to think that the situation with atomic representations is easier, since we can “simply” add enough states for the necessary distinctions, but in practice this set of states may have to be infinite, while in factored or structured representations we can keep representations finite.
Summary
Agents interact with environments through actuators and sensors.
The agent function describes what the agent does in all circumstances.
The performance measure evaluates the environment sequence.
A perfectly rational agent maximizes expected performance.
We have discussed several dimensions of agent design:
agent architectures,
corresponding agent programs and algorithms, and
world representation paradigms.
Problem: Which one is the best?
Answer: That really depends on the environment type they have to survive/thrive
in! The agent designer – i.e. you – has to choose!
The course gives you the necessary competencies.
Consequence: The rational agents paradigm used in this course challenges you
to become a good agent designer.
This part introduces search-based methods for general problem solving using atomic and factored
representations of states.
Concretely, we discuss the basic techniques of search-based symbolic AI: first in the shape of the classical, heuristic, and adversarial search paradigms, and then in constraint propagation, where we see the first instances of inference-based methods.
Chapter 6

Problem Solving and Search

In this chapter, we will look at a class of algorithms called search algorithms. These are algorithms that help in quite general situations where there is a precisely described problem that needs to be solved. Hence the name “General Problem Solving” for the area.
We will use the following problem as a running example. It is simple enough to fit on one slide
and complex enough to show the relevant features of the problem solving algorithms we want to
talk about.
[Figure (cf. [RN03]): a simplified road map of part of Romania, with driving distances between adjacent cities – our running example for problem solving and search.]
it also limits the objectives by specifying goal states. (This excludes, e.g., staying another couple of weeks.)
A solution is a sequence of actions that leads from the initial state to a goal state.
Problem solving computes solutions from problem formulations.
Finding the right level of abstraction and the required (not more!) information is
often the key to success.
Definition 6.1.7. The graph ⟨S, T_A⟩ is called the state space induced by Π.
Definition 6.1.8. A solution for Π consists of a sequence a_1, ..., a_n of actions such that for all 1 < i ≤ n
Observation: The formulation of problems from ?? uses an atomic (black-box) state represen-
tation. It has enough functionality to construct the state space but nothing else. We will come
back to this in slide ??.
Remark 6.1.10. Note that search problems formalize problem formulations by making many of
the implicit constraints explicit.
search problem = ⟨S, A, T, I, G⟩, where S is a set of states, A a set of actions, T : A×S → P(S) the transition model, I ∈ S the initial state, and G ∈ P(S) the set of goal states.
We will now specialize ?? to deterministic, fully observable environments, i.e. environments where
actions only have one – assured – outcome state.
Definition 6.1.13. The predicate that tests for goal states is called a goal test.
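As a concrete (illustrative) data structure, the components ⟨S, A, T, I, G⟩ of a deterministic search problem and the goal test can be bundled as follows; the encoding of actions and the small Romania fragment are assumptions of the sketch, the distances are the road distances from the map above.

  # Sketch: a search problem as an explicit data structure (cf. the tuple <S, A, T, I, G>).
  from dataclasses import dataclass
  from typing import Callable, Dict, FrozenSet, List, Tuple

  State = str
  Action = Tuple[str, str]          # here: an action is "drive from X to Y"

  @dataclass(frozen=True)
  class SearchProblem:
      initial: State                               # I
      goals: FrozenSet[State]                      # G
      actions: Callable[[State], List[Action]]     # applicable actions in a state
      result: Callable[[State, Action], State]     # deterministic transition model
      step_cost: Callable[[State, Action], float]

      def goal_test(self, s: State) -> bool:       # the goal test of Definition 6.1.13
          return s in self.goals

  # A tiny fragment of the Romania example (road distances as on the map above).
  ROADS: Dict[str, Dict[str, int]] = {
      "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
      "Sibiu": {"Fagaras": 99, "Rimnicu Vilcea": 80, "Arad": 140},
      "Fagaras": {"Bucharest": 211, "Sibiu": 99},
      "Rimnicu Vilcea": {"Pitesti": 97, "Sibiu": 80},
      "Pitesti": {"Bucharest": 101, "Rimnicu Vilcea": 97},
  }

  romania = SearchProblem(
      initial="Arad",
      goals=frozenset({"Bucharest"}),
      actions=lambda s: [(s, t) for t in ROADS.get(s, {})],
      result=lambda s, a: a[1],
      step_cost=lambda s, a: ROADS[a[0]][a[1]],
  )
  print(romania.goal_test("Bucharest"))   # -> True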
Problem types
Definition 6.2.1. A search problem is called a single state problem, iff it is
fully observable (at least the initial state)
deterministic (unique successor states)
static (states do not change other than by our own actions)
discrete (a countable number of states)
Definition 6.2.2. A search problem is called a multi-state problem, iff the states are only partially observable (e.g. multiple initial states), but it is deterministic, static, and discrete.
We will explain these problem types with another example. The problem P is very simple: We
have a vacuum cleaner and two rooms. The vacuum cleaner is in one room at a time. The floor
can be dirty or clean.
The possible states are determined by the position of the vacuum cleaner and the information,
whether each room is dirty or not. Obviously, there are eight states: S = {1, 2, 3, 4, 5, 6, 7, 8} for
simplicity.
The goal is to have both rooms clean; the vacuum cleaner can be anywhere. So the set G of goal states is {7, 8}. In the single-state version of the problem, [right, suck] is the shortest solution, but [suck, right, suck] is also one. In the multiple-state version we have, e.g.:
left → {3, 7}, suck → {7}, ...
[Figure 3.3 (AIMA): The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.]
The single-state problem formulation (cf. [RN03]):
• Initial state: Any state can be designated as the initial state.
• Actions: In this simple environment, each state has just three actions: Left, Right, and Suck.
• Transition model: The actions have their expected effects, except that moving Left in the leftmost square, moving Right in the rightmost square, and Sucking in a clean square have no effect.
• Goal test: This checks whether all the squares are clean.
• Path cost: Each step costs 1, so the path cost is the number of steps in the path.
Compared with the real world, this toy problem has discrete locations, discrete dirt, reliable cleaning, and it never gets any dirtier.

Example: Vacuum-Cleaner World (continued)

Contingency Problem:
(The outcome of suck is uncertain – it may fail to clean – and sensing is local: dirt at the current location only.)
Solution: suck → {5, 7}, right → {6, 8}, suck → ... (recall that a vacuum world with n locations has n · 2^n states)
[Figure 3.3 (AIMA) again: the state space for the vacuum world.]
Of course, local sensing can help: it narrows {6, 8} down to {6} or {8}; if we are in the first, then suck.
“Path cost”: There may be more than one solution and we might want to have the “best” one in
a certain sense.
“State”: e.g., we don’t care about tourist attractions found in the cities along the way. But this is
problem dependent. In a different problem it may well be appropriate to include such information
in the notion of state.
“Realizability”: one could also say that the abstraction must be sound wrt. reality.
Example: The 8-puzzle
[Figure (cf. [RN03], Section 3.2): start state 7 2 4 / 5 _ 6 / 8 3 1 (left) and goal state _ 1 2 / 3 4 5 / 6 7 8 (right).]
States? Actions? ...
States: real-valued coordinates of robot joint angles and of the parts of the object to be assembled
Actions: continuous motions of robot joints
Goal test: assembly complete?
Path cost: time to execute
General Problems
Question: Which are “Problems”?
6.3 Search
A Video Nugget covering this section can be found at https://fau.tv/clip/id/21956.
[Tree search example: the search tree for the Romania problem is unfolded step by step, starting from Arad.]
Let us now think a bit more about the implementation of tree search algorithms based on the ideas discussed above. The abstract, mathematical notions of a search problem and the induced tree search algorithm get further refined here.
[Figure 3.10 (AIMA): Nodes are the data structures from which the search tree is constructed. Each has a parent, a state, and various bookkeeping fields. Arrows point from child to parent.]
Observation: A set of search tree nodes that can all (recursively) reach a single initial node forms a search tree.
Observation: Paths in the search tree correspond to paths in the state space.
Definition 6.3.3. We define the path cost of a node n in a search tree T to be the sum of the step costs on the path from n to the root of T.
Observation: As a search tree node has access to its parent, we can read off the solution from a goal node.
Given the components of a parent node, it is easy to compute those of a child node. The function CHILD-NODE takes a parent node and an action and returns the resulting child node:

  function CHILD-NODE(problem, parent, action) returns a node
    return a node with
      STATE = problem.RESULT(parent.STATE, action),
      PARENT = parent, ACTION = action,
      PATH-COST = parent.PATH-COST + problem.STEP-COST(parent.STATE, action)

The PARENT pointers string the nodes together into a tree structure and allow the solution path to be extracted when a goal node is found: the SOLUTION function returns the sequence of actions obtained by following parent pointers back to the root. Note that nodes are distinct from states: a node is a bookkeeping data structure used to represent the search tree, while a state corresponds to a configuration of the world. Thus nodes lie on particular paths, as defined by the PARENT pointers, whereas states do not; two different nodes can contain the same world state if that state is generated via two different search paths. The frontier of nodes not yet expanded is stored in a queue with operations EMPTY?(queue), POP(queue), and INSERT(element, queue), so that the search algorithm can easily choose the next node to expand according to its preferred strategy. (cf. [RN03])
It is very important to understand the fundamental difference between a state in a search problem, a node in the search tree employed by the tree search algorithm, and its implementation as a search tree node. The implementation above is faithful in the sense that the implemented data structures contain all the information needed in the tree search algorithm.
So we can use it to refine the idea of a tree search algorithm into an implementation.
The fringe is the set of search tree nodes not yet expanded in tree search.
Idea: We treat the fringe as an abstract data type with three accessors:
the binary function first, which retrieves an element from the fringe according to a strategy,
the binary function insert, which adds a (set of) search tree nodes into a fringe, and
the unary predicate empty, which determines whether a fringe is the empty set.
The strategy determines the behavior of the fringe (data structure). (see below)
Search strategies
Definition 6.3.5. A strategy is a function that picks a node from the fringe of a
search tree. (equivalently, orders the fringe and picks the first.)
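The following Python sketch puts the pieces together: search tree nodes as above, and a generic tree search whose behavior is determined entirely by the fringe discipline (shallowest first ⇝ breadth-first, deepest first ⇝ depth-first, least path cost ⇝ uniform-cost). It assumes the SearchProblem encoding sketched earlier and is an illustration, not the course's reference implementation.

  # Generic tree search: only the fringe ordering (the strategy) differs between algorithms.
  import heapq
  from itertools import count

  class Node:
      def __init__(self, state, parent=None, action=None, path_cost=0.0):
          self.state, self.parent, self.action, self.path_cost = state, parent, action, path_cost
          self.depth = 0 if parent is None else parent.depth + 1

      def solution(self):
          """Follow parent pointers back to the root and return the action sequence."""
          actions, n = [], self
          while n.parent is not None:
              actions.append(n.action)
              n = n.parent
          return list(reversed(actions))

  def child_node(problem, parent, action):
      state = problem.result(parent.state, action)
      return Node(state, parent, action, parent.path_cost + problem.step_cost(parent.state, action))

  def tree_search(problem, priority):
      """priority(node) orders the fringe; ties are broken by insertion order."""
      root, tick = Node(problem.initial), count()
      fringe = [(priority(root), next(tick), root)]
      while fringe:                                    # not empty(fringe)
          _, _, node = heapq.heappop(fringe)           # first(fringe): node with minimal priority
          if problem.goal_test(node.state):
              return node.solution()
          for action in problem.actions(node.state):   # expand node, insert its children
              child = child_node(problem, node, action)
              heapq.heappush(fringe, (priority(child), next(tick), child))
      return None

  bfs = lambda n: n.depth          # shallowest node first (breadth-first)
  dfs = lambda n: -n.depth         # deepest node first (depth-first)
  ucs = lambda n: n.path_cost      # least accumulated path cost first (uniform-cost)

  # with the `romania` problem from the earlier sketch:
  # tree_search(romania, ucs)  ->  the cheapest route from Arad to Bucharest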
Note that there can be infinite branches, see the search tree for Romania.
The opposite of uninformed search is informed or heuristic search that uses a heuristic function
that adds external guidance to the search process. In the Romania example, one could add the
heuristic to prefer cities that lie in the general direction of the goal (here SE).
Even though heuristic search is usually much more efficient, uninformed search is important nonetheless, because many problems do not allow us to extract good heuristics.
Breadth-First Search
Idea: Expand the shallowest unexpanded node.
Definition 6.4.2. The breadth first search (BFS) strategy treats the fringe as a
FIFO queue, i.e. successors go in at the end of the fringe.
Example 6.4.3 (Synthetic).
[Breadth-first expansion of a synthetic binary tree with root A, children B and C, grandchildren D–G, and leaves H–O, shown step by step: the fringe is expanded level by level.]
We will now apply the breadth-first search strategy to our running example: traveling in Romania. Note that we leave out the green dashed nodes that would give a preview of what the search tree will look like (if expanded). This gives a much cleaner picture; we assume that the readers have already grasped the mechanism sufficiently.
[Breadth-first search tree for the Romania example, starting from Arad and expanding level by level.]
An alternative is to generate all solutions and then pick an optimal one. This works only if m is finite.
The next idea is to let cost drive the search. For this, we will need a non-trivial cost function: we
will take the distance between cities, since this is very natural. Alternatives would be the driving
time, train ticket cost, or the number of tourist attractions along the way.
Of course we need to update our problem formulation with the necessary information.
[Map of Romania with road distances (repeated for reference).]
Uniform-cost search

Idea: Expand the least-cost unexpanded node.
Definition 6.4.5. Uniform-cost search (UCS) is the strategy where the fringe is ordered by increasing path cost.
Note: Equivalent to breadth-first search if all step costs are equal.
Recall (cf. [RN03]) that the path cost of a path is the sum of the step costs c(s, a, s′) of the individual actions along it, and that an optimal solution has the lowest path cost among all solutions.
Synthetic Example:
[Uniform-cost search tree for the Romania example starting in Arad: the root expands to Sibiu (140), Timisoara (118), and Zerind (75); at each step the fringe node with the least accumulated path cost is expanded next.]
Note that we must sum the distances to each leaf. That is, we go back to the first level after the
third step.
If step cost is negative, the same situation as in breadth first search can occur: later solutions may
be cheaper than the current one.
If step cost is 0, one can run into infinite branches. UCS then degenerates into depth first
search, the next kind of search algorithm we will encounter. Even if we have infinite branches,
where the sum of step costs converges, we can get into trouble, since the search is forced down
these infinite paths before a solution can be found.
Worst case is often worse than BFS, because large trees with small steps tend to be searched
first. If step costs are uniform, it degenerates to BFS.
Depth-first Search
Idea: Expand deepest unexpanded node.
Definition 6.4.6. Depth-first search (DFS) is the strategy where the fringe is organized as a (LIFO) stack, i.e. successors go in at the front of the fringe.
Definition 6.4.7. Every node that is pushed to the stack is called a backtrack
point. The action of popping a non-goal node from the stack and continuing the
search with the new top element of the stack (a backtrack point by construction)
is called backtracking, and correspondingly the DFS algorithm backtracking search.
Depth-First Search
Example 6.4.8 (Synthetic).
[Depth-first expansion of the same synthetic tree (root A, children B and C, leaves H–O), shown step by step: the deepest unexpanded node is expanded next, and the search backtracks at leaves.]
Definition 6.4.11. Iterative deepening search (IDS) is depth limited search with
ever increasing depth limits. We call the difference between successive depth limits
the step size.
[Iterative deepening search on the synthetic tree: the tree is re-explored with depth limits 0, 1, 2, ...]
Properties of iterative deepening search:
Completeness: Yes
Time complexity: (d+1)·b^0 + d·b^1 + (d−1)·b^2 + ... + b^d ∈ O(b^(d+1))
Space complexity: O(b·d)
Optimality: Yes (if step cost = 1)
Consequence: IDS used in practice for search spaces of large, infinite, or unknown
depth.
Note: To find a solution (at depth d) we have to search the whole tree up to d. Of course since
we do not save the search state, we have to re-compute the upper part of the tree for the next
level. This seems like a great waste of resources at first, however, IDS tries to be complete without
the space penalties.
However, the space complexity is as good as DFS, since we are using DFS along the way. Like
in BFS, the whole tree on level d (of optimal solution) is explored, so optimality is inherited from
there. Like BFS, one can modify this to incorporate uniform cost search behavior.
As a consequence, variants of IDS are the method of choice if we do not have additional
information.
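A compact Python sketch of depth-limited search wrapped into iterative deepening (step size 1); it reuses the Node/child_node helpers from the tree search sketch above and is an illustration, not the official pseudocode.

  # Sketch: depth-limited DFS, re-run with increasing depth limits (iterative deepening).
  def depth_limited_search(problem, node, limit):
      if problem.goal_test(node.state):
          return node.solution()
      if limit == 0:
          return "cutoff"                      # the depth limit was hit below this node
      cutoff_occurred = False
      for action in problem.actions(node.state):
          result = depth_limited_search(problem, child_node(problem, node, action), limit - 1)
          if result == "cutoff":
              cutoff_occurred = True
          elif result is not None:
              return result
      return "cutoff" if cutoff_occurred else None

  def iterative_deepening_search(problem, max_depth=50):
      for depth in range(max_depth + 1):       # depth limits 0, 1, 2, ... (step size 1)
          result = depth_limited_search(problem, Node(problem.initial), depth)
          if result != "cutoff":
              return result                    # either a solution or definite failure (None)
      return None

  # with the `romania` problem from earlier: iterative_deepening_search(romania)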
Graph versions of all the tree search algorithms considered here exist, but are more
difficult to understand (and to prove properties about).
The (time complexity) properties are largely stable under duplicate pruning. (no
gain in the worst case)
Summary: Search algorithms systematically explore the state space induced by a search problem in search of a goal state. Search strategies only differ by the treatment of the fringe.
Search Strategies and their Properties: We have discussed
Best-first search
Idea: Order the fringe by estimated “desirability” (Expand most desirable
unexpanded node)
Definition 6.5.2. An evaluation function assigns a desirability value to each node
of the search tree.
Note: An evaluation function is not part of the search problem, but must be added externally.
Definition 6.5.3. In best first search, the fringe is a queue sorted in decreasing
order of desirability.
This is like UCS, but with an evaluation function related to problem at hand replacing the path
cost function.
If the heuristic is arbitrary, we expect incompleteness!
Depends on how we measure “desirability”.
Concrete examples follow.
Greedy search
Idea: Expand the node that appears to be closest to the goal.
Definition 6.5.4. A heuristic is an evaluation function h on states that estimates the cost from the current state to the nearest goal state. We speak of heuristic search if the search algorithm uses a heuristic in some way.
Note: All nodes for the same state must have the same h-value!
Definition 6.5.5. Given a heuristic h, greedy search is the strategy where the
fringe is organized as a queue sorted by increasing h value.
Note: Unlike in uniform cost search, the node evaluation function has nothing to do with the nodes expanded so far.
In greedy search we replace the objective cost to construct the current solution with a heuristic or subjective measure which we think gives a good idea of how far we are from a solution. Two things have shifted:
• we went from an internal cost (determined only by features inherent in the search space) to an external/heuristic cost, and
• instead of measuring the cost to build the current partial solution, we estimate how far we are from the desired goal.
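In code, greedy search is just best-first search with the heuristic h as the fringe-ordering function; the minimal sketch below reuses the generic tree_search from above, with an assumed straight-line-distance table for a few Romanian cities as the heuristic.

  # Sketch: greedy (best-first) search = order the fringe by h(state) only.
  SLD_TO_BUCHAREST = {   # assumed straight-line distances used as heuristic h
      "Arad": 366, "Sibiu": 253, "Timisoara": 329, "Zerind": 374,
      "Fagaras": 176, "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0,
  }

  def greedy_priority(node):
      return SLD_TO_BUCHAREST[node.state]      # ignores the path cost g accumulated so far

  # with the `romania` problem and tree_search from earlier:
  # tree_search(romania, greedy_priority)  ->  Arad, Sibiu, Fagaras, Bucharest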
[Map of Romania (repeated); for greedy search we use the straight-line distance to Bucharest as the heuristic h.]
[Greedy search tree for the Romania example with h = straight-line distance to Bucharest: Arad (366) expands to Sibiu (253), Timisoara (329), and Zerind (374); Sibiu is expanded next, yielding Arad, Fagaras, Oradea, and Rimnicu Vilcea; from Fagaras, Sibiu (253) and Bucharest (0) are generated, and the goal Bucharest is reached.]
Let us fortify our intuitions with another example: navigation in a simple maze. Here the states
are the cells in the grid underlying the maze and the actions navigating to one of the adjoining
cells. The initial and goal states are the left upper and right lower corners of the grid. To see the
influence of the chosen heuristic (indicated by the red number in the cell), we compare the search
induced goal distance function with a heuristic based on the Manhattan distance. Just follow the
greedy search by following the heuristic gradient.
Heuristic Functions in Path Planning

Example 6.5.8 (The maze solved). We indicate h* by giving the goal distance:
[a 15×5 maze grid, each free cell annotated with its goal distance; the goal G is in the lower right corner]

Example 6.5.9 (Maze Heuristic: The good case). We use the Manhattan distance to the goal as a heuristic:
[the same maze, each cell annotated with its Manhattan distance to the goal, decreasing from 18 in the upper left corner to 0 at the goal]
Not surprisingly, the first maze is solved without search, since we are guided by the perfect heuristic. Where there is a choice, it has no influence on the length (or, in other cases, cost) of the solution.
In the “good case” example, greedy search performs well, but there is some limited backtracking
needed, for instance when exploring the left lower corner 3×3 area before climbing over the second
wall.
In the “bad case”, greedy search is led down the lower garden path, which has a dead end and does not lead to the goal. This suggests that we can construct adversarial examples, i.e. mazes where we can force greedy search into arbitrarily bad performance.
Example 6.5.11. Greedy search can get stuck going from Iasi to Oradea:
Iasi → Neamt → Iasi → Neamt → · · ·
[Map of Romania with road distances (repeated).]
Note that nothing prevents all nodes from being searched in the worst case; e.g. if the heuristic function gives us the same (low) estimate on all nodes except where the heuristic mis-estimates the distance to be high. So in the worst case, greedy search is even worse than BFS, where d (depth of the first solution) replaces m.
The search procedure cannot be optimal, since the actual cost of the solution is not considered.
For both, completeness and optimality, therefore, it is necessary to take the actual cost of
partial solutions, i.e. the path cost, into account. This way, paths that are known to be expensive
are avoided.
Heuristic Functions
Definition 6.5.13. Let Π be a search problem with states S. A heuristic function (or, short, heuristic) for Π is a function h : S → R₀⁺ ∪ {∞} so that h(s) = 0 whenever s is a goal state.
h(s) is intended as an estimate of the distance between state s and the nearest goal state.
Definition 6.5.14. Let Π be a search problem with states S, then the function h* : S → R₀⁺ ∪ {∞}, where h*(s) is the cost of a cheapest path from s to a goal state, or ∞ if no such path exists, is called the goal distance function for Π.
Notes:
h(s) = 0 on goal states: If your estimator returns “I think it’s still a long way”
on a goal state, then its intelligence is, um . . .
Return value ∞: To indicate dead ends, from which the goal state can’t be
reached anymore.
The distance estimate depends only on the state s, not on the node (i.e., the
path we took to reach s).
This works, provided that h does not overestimate the true cost to achieve the goal. In other
words, h must be optimistic wrt. the real cost h∗ . If we are too pessimistic, then non-optimal
solutions have a chance.
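In code, A* is best-first search with the evaluation function f(n) = g(n) + h(n), where g is the path cost accumulated so far and h an admissible heuristic; a minimal sketch reusing tree_search and the straight-line-distance table from the greedy search sketch:

  # Sketch: A* = order the fringe by f(n) = g(n) + h(n.state).
  def astar_priority(node):
      return node.path_cost + SLD_TO_BUCHAREST[node.state]

  # with the `romania` problem and tree_search from earlier:
  # tree_search(romania, astar_priority)
  # -> Arad, Sibiu, Rimnicu Vilcea, Pitesti, Bucharest  (cost 418, the cheaper of the two routes)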
A∗ Search: Optimality
Theorem 6.5.22. A∗ search with admissible heuristic is optimal.
A* Search Example
[A* search tree for the Romania example: each node is annotated with f = g + h, e.g. the root Arad with 366 = 0 + 366; after expanding Sibiu, its successors carry values such as 646 = 280 + 366 and 671 = 291 + 380.]
To extend our intuitions about informed search algorithms to A∗ -search, we take up the maze
examples from above again. We first show the good maze with Manhattan distance again.
[the “good” maze annotated with the Manhattan distance heuristic]
We will find a solution with little search.
To compare it to A∗ -search, here is the same maze but now with the numbers in red for the
evaluation function f where h is the Manhattan distance.
[the same maze, now annotated with the evaluation function f = g + h in each visited cell]
In A* with a consistent heuristic, g + h always increases monotonically (h cannot decrease more than g increases).
We need more search in the “right upper half”. This is typical: greedy best-first search tends to be faster than A*.
Let’s now consider the “bad maze” with Manhattan distance again.
[the “bad” maze annotated with the Manhattan distance heuristic]
And we compare it to A∗ -search; again the numbers in red are for the evaluation function f .
[the “bad” maze annotated with the evaluation function f = g + h]
We will search less of the “dead-end street”. Sometimes g + h gives better search guidance than h. (⇝ A* is faster there)
[the maze annotated with the corresponding node values]
In A*, node values always increase monotonically (with any heuristic). If the heuristic is perfect, they remain constant on optimal paths.
A* search: f-contours
Intuition: A*-search gradually adds “f-contours” (areas of the same f-value) to the search.
[Figure (cf. [RN03]): map of Romania with contours at f = 380, 400, and 420 around the start state Arad.]
Admissible heuristics: Example 8-puzzle
[8-puzzle figure again: start state 7 2 4 / 5 _ 6 / 8 3 1 and goal state _ 1 2 / 3 4 5 / 6 7 8.]
We now try to generalize these insights into (the beginnings of) a general method for obtaining admissible heuristics.
Relaxed problems
Observation: Finding good admissible heuristics is an art!
Idea: Admissible heuristics can be derived from the exact solution cost of a relaxed
version of the problem.
Example 6.5.33. If the rules of the 8-puzzle are relaxed so that a tile can move
anywhere, then we get heuristic h1 .
Example 6.5.34. If the rules are relaxed so that a tile can move to any adjacent
square, then we get heuristic h2 . (Manhattan distance)
Definition 6.5.35. Let Π := ⟨S, A, T, I, G⟩ be a search problem, then we call a search problem P^r := ⟨S, A^r, T^r, I^r, G^r⟩ a relaxed problem (wrt. Π; or simply relaxation of Π), iff A ⊆ A^r, T ⊆ T^r, I ⊆ I^r, and G ⊆ G^r.
Lemma 6.5.36. If P^r relaxes Π, then every solution for Π is one for P^r.
Key point: The optimal solution cost of a relaxed problem is not greater than the
optimal solution cost of the real problem.
Relaxation means to remove some of the constraints or requirements of the original problem,
so that a solution becomes easy to find. Then the cost of this easy solution can be used as an
optimistic approximation of the problem.
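The two relaxation-based 8-puzzle heuristics can be written down directly; a small Python sketch (states are assumed to be 9-tuples listing the tile in each cell, 0 for the blank):

  # Sketch: the two classic admissible 8-puzzle heuristics obtained from relaxations.
  GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)   # goal layout with the blank (0) in the top-left corner

  def h1_misplaced(state, goal=GOAL):
      """h1: number of misplaced tiles (relaxation: a tile may move anywhere)."""
      return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

  def h2_manhattan(state, goal=GOAL):
      """h2: sum of the Manhattan distances of the tiles from their goal squares
      (relaxation: a tile may move to any adjacent square)."""
      pos = {tile: divmod(i, 3) for i, tile in enumerate(state)}
      goal_pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
      return sum(abs(pos[t][0] - goal_pos[t][0]) + abs(pos[t][1] - goal_pos[t][1])
                 for t in range(1, 9))

  start = (7, 2, 4, 5, 0, 6, 8, 3, 1)   # the instance from the figure above
  print(h1_misplaced(start), h2_manhattan(start))   # -> 8 18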
See http://qiao.github.io/PathFinding.js/visual/
Difference to Breadth-first Search?: That would explore all grid cells in a circle
around the initial state!
In order to understand the procedure on a more intuitive level, let us consider the following
scenario: We are in a dark landscape (or we are blind), and we want to find the highest hill. The
search procedure above tells us to start our search anywhere, and for every step first feel around,
and then take a step into the direction with the steepest ascent. If we reach a place, where the
next step would take us down, we are finished.
Of course, this will only get us into local maxima, and has no guarantee of getting us into
global ones (remember, we are blind). The solution to this problem is to re-start the search at
random (we do not have any information) places, and hope that one of the random jumps will get
us to a slope that leads to a global maximum.
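The intuition translates almost directly into code; here is a minimal sketch of steepest-ascent hill climbing with random restarts (the objective function, neighbor relation, and restart count are assumptions of the sketch):

  # Sketch: steepest-ascent hill climbing with random restarts.
  import random

  def hill_climb(start, neighbors, value):
      current = start
      while True:
          best = max(neighbors(current), key=value, default=None)
          if best is None or value(best) <= value(current):
              return current                    # local maximum (no uphill neighbor): stop
          current = best                        # take the steepest uphill step

  def random_restart_hill_climbing(random_state, neighbors, value, restarts=20):
      # restart from random states and keep the best local maximum found
      return max((hill_climb(random_state(), neighbors, value) for _ in range(restarts)), key=value)

  # Toy objective (assumed): a bumpy 1-dimensional landscape over the integers 0..100.
  value = lambda x: -(x - 63) ** 2 + 30 * abs((x * 7) % 11 - 5)
  neighbors = lambda x: [y for y in (x - 1, x + 1) if 0 <= y <= 100]
  random_state = lambda: random.randint(0, 100)
  print(random_restart_hill_climbing(random_state, neighbors, value))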
Local search algorithms (cf. [RN03]) operate using a single current node (rather than multiple paths) and generally move only to neighbors of that node; typically, the paths followed by the search are not retained. Although local search algorithms are not systematic, they have two key advantages: (1) they use very little memory, usually a constant amount, and (2) they can often find reasonable solutions in large or infinite (continuous) state spaces for which systematic algorithms are unsuitable. In addition to finding goals, local search algorithms are useful for solving pure optimization problems, in which the aim is to find the best state according to an objective function.
Example: [an 8-queens state with heuristic cost estimate h = 17, showing the h-values for moving a queen within its column]
Problem: The state space has local maxima, where hill climbing can get stuck.
Recent work on hill climbing algorithms tries to combine complete search with randomization to escape certain odd phenomena occurring in the statistical distribution of solutions.
[Figure 4.4 (AIMA): Illustration of why ridges cause difficulties for hill climbing. The grid of states (dark circles) is superimposed on a ridge rising from left to right, creating a sequence of local maxima that are not directly connected to each other. From each local maximum, all the available actions point downhill.]

Simulated annealing (Idea)
Annealing is the process of heating steel and letting it cool gradually, to give it time to grow an optimal crystal structure.
If the temperature T is decreased slowly enough, the best state x* is always reached, because
e^(E(x*)/kT) / e^(E(x)/kT) = e^((E(x*)−E(x))/kT) ≫ 1
for small T.
Observation: Local beam search is not the same as k searches run in parallel!
(Searches that find good states recruit other searches to join them)
Problem: Quite often, all k searches end up on the same local hill!
Idea: Choose k successors randomly, biased towards good ones. (Observe the
close analogy to natural selection!)
[Figure 4.6 (AIMA): The genetic algorithm, illustrated for digit strings representing 8-queens states. The initial population in (a) is ranked by the fitness function in (b), resulting in pairs for mating in (c). They produce offspring in (d), which are subject to mutation in (e).]
[Figure 4.7 (AIMA): The 8-queens states corresponding to the first two parents in Figure 4.6(c) and the first offspring in Figure 4.6(d). The shaded columns are lost in the crossover step and the unshaded columns are retained.]

Note: Genetic algorithms ≠ evolution: e.g., real genes also encode the replication machinery!

Like beam searches, genetic algorithms begin with a set of k randomly generated states, called the population. Each state, or individual, is represented as a string over a finite alphabet, most commonly a string of 0s and 1s. For example, an 8-queens state must specify the positions of 8 queens, each in a column of 8 squares, and so requires 8 × log₂ 8 = 24 bits; alternatively, the state could be represented as 8 digits, each in the range from 1 to 8. To produce the next generation, each state is rated by the objective function or (in GA terminology) fitness function. A fitness function should return higher values for better states, so for the 8-queens problem we use the number of nonattacking pairs of queens, which has a value of 28 for a solution. The probability of being chosen for reproduction is directly proportional to the fitness score, and pairs are selected at random for reproduction in accordance with these probabilities. (cf. [RN03])
Chapter 7

Adversarial Search for Game Playing
7.1 Introduction
Video Nuggets covering this section can be found at https://fau.tv/clip/id/22060 and
https://fau.tv/clip/id/22061.
The Problem
The Problem of Game-Play: cf. ??
Example 7.1.1.
An Example Game
Definition 7.1.6. Let Θ be an adversarial search problem, and let X ∈ {Max, Min}. A strategy for X is a function σ_X : S^X → A^X so that a is applicable to s whenever σ_X(s) = a.
We don’t know how the opponent will react, and need to prepare for all possibilities.
Definition 7.1.7. A strategy is called optimal if it yields the best possible utility
for X assuming perfect opponent play (not formalized here).
Problem: In (almost) all games, computing an optimal strategy is infeasible.
(state/search tree too huge)
Solution: Compute the next move “on demand”, given the current state instead.
It’s even worse: Our algorithms here look at search trees (game trees), no
duplicate pruning.
Example 7.1.10.
Chess without duplicate pruning: 35^100 ≃ 10^154.
Go without duplicate pruning: 200^300 ≃ 10^690.
“Minimax”?
We want to compute an optimal strategy for player “Max”.
In other words: We are Max, and our opponent is Min.
Max attempts to maximize û(s) of the terminal states reachable during play.
Min attempts to minimize û(s).
The computation alternates between minimization and maximization ⇝ hence “minimax”.
Example Tic-Tac-Toe

Example 7.2.1. A full game tree for tic-tac-toe:
[Figure 5.1 (AIMA): A (partial) game tree for the game of tic-tac-toe. The top node is the initial state, and MAX moves first, placing an X in an empty square. We show part of the tree, giving alternating moves by MIN (O) and MAX (X), until we eventually reach terminal states, which can be assigned utilities according to the rules of the game.]
The current player and action are marked on the left; the last row shows terminal positions with their utility. The number on each leaf indicates the utility value of the terminal state from the point of view of MAX; high values are good for MAX and bad for MIN (which is how the players get their names).
Note: For tic-tac-toe the game tree is relatively small – fewer than 9! = 362,880 terminal nodes. But for chess there are over 10^40 nodes, so the game tree is best thought of as a theoretical construct that we cannot realize in the physical world. Regardless of the size of the game tree, it is MAX's job to search for a good move; we use the term search tree for a tree that is superimposed on the full game tree and examines enough nodes to allow a player to determine what move to make. (cf. [RN03])
Minimax: Outline
We max, we min, we max, we min . . .
1. Depth-first search in the game tree, with Max at the root.
In a normal search problem, the optimal solution would be a sequence of actions leading to a goal state, i.e. a terminal state that is a win. In adversarial search, MIN has something to say about it: MAX therefore must find a contingent strategy, which specifies MAX's move in the initial state, and then MAX's moves in the states resulting from every possible response by MIN. (cf. [RN03])
Minimax: Example
[Step-by-step minimax computation on a depth-2 game tree with leaf utilities 3, 12, 8 | 2, 4, 6 | 14, 5, 2: the three Min nodes evaluate to 3, 2, and 2 respectively, so the Max root evaluates to 3. The running values start at −∞ for Max nodes and ∞ for Min nodes and are updated as the children are explored depth first.]
(If any of you sat down, prior to this lecture, to implement a Tic-Tac-Toe player,
chances are you either looked this up on Wikipedia, or invented it in the process.)
Returns an optimal action, assuming perfect opponent play.
No matter how the opponent plays, the utility of the terminal state reached
will be at least the value computed for the root.
If the opponent plays perfectly, exactly that value will be reached.
There’s no need to re-run minimax for every game state: Run it once, offline
before the game starts. During the actual game, just follow the branches taken
in the tree. Whenever it’s your turn, choose an action maximizing the value of
the successor states.
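For reference, a minimal Python sketch of minimax on an explicit game tree (the nested-list tree encoding is an assumption; leaves are utilities from Max's point of view):

  # Sketch: minimax on an explicit game tree; internal nodes are lists of subtrees, leaves are utilities.
  def minimax(node, maximizing=True):
      if not isinstance(node, list):              # terminal state: return its utility u(s)
          return node
      values = [minimax(child, not maximizing) for child in node]
      return max(values) if maximizing else min(values)

  # The example tree used above: three Min nodes with leaves (3, 12, 8), (2, 4, 6), (14, 5, 2).
  tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
  print(minimax(tree))   # -> 3 (Max picks the first subtree, whose Min value is 3)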
Minimax disadvantages: It’s completely infeasible in practice.
When the search tree is too large, we need to limit the search depth and apply
an evaluation function to the cut off states.
Solution: We impose a search depth limit (also called horizon) d, and apply an
evaluation function to the cut-off states, i.e. states s with dp(s) = d.
Definition 7.3.1. An evaluation function f maps game states to numbers:
f (s) is an estimate of the actual value of s (as would be computed by unlimited-
depth minimax for s).
If cut-off state is terminal: Just use û instead of f .
Analogy to heuristic functions (cf. ??): We want f to be both (a) accurate and
(b) fast.
Another analogy: (a) and (b) are in contradiction ; need to trade-off accuracy
against overhead.
In typical game playing algorithms today, f is inaccurate but very fast.
(usually no good methods known for computing accurate f )
Example Chess
This assumes that the features (their contribution towards the actual value of the state) are independent. That's usually not the case (e.g. the value of a rook depends on the pawn structure).
Definition 7.3.4 (Better Solution). The quiescent search algorithm uses a dynamically adapted search depth d: it searches more deeply in unquiet positions, where the value of the evaluation function changes a lot between neighboring states.
Example 7.3.5. In quiescent search for chess:
piece exchange situations (“you take mine, I take yours”) are very unquiet
; Keep searching until the end of the piece exchange is reached.
[Minimax example again: after the first Min subtree (3, 12, 8) evaluates to 3, the Max root is known to be ≥ 3; as soon as the second Min subtree encounters the leaf 2, that subtree can be at most 2 and cannot influence the root value.]
Idea: We can use this to prune the search tree ⇝ better algorithm
Alpha Pruning
Definition 7.4.1. For each node n in a minimax search tree, the alpha value α(n)
is the highest Max-node utility that search has encountered on its path from the
root to n.
Example 7.4.2 (Computing alpha values).
[Step-by-step computation of the alpha values on the example tree: while the first Min subtree (leaves 3, 12, 8) is explored, the root has value −∞ and α = −∞; after it evaluates to 3, the root becomes Max 3 with α = 3. In the second Min subtree the leaf 2 is encountered with α = 3, so that subtree's value is ≤ 2 < α and its remaining successors are pruned.]
How to use α?: In a Min-node n, if û(n′ ) ≤ α(n) for one of the successors, then
stop considering n. (pruning out its remaining successors)
Alpha-Beta Pruning
Recall:
What is α: For each search node n, the highest Max-node utility that search
has encountered on its path from the root to n.
How to use α: In a Min-node n, if one of the successors already has utility
≤ α(n), then stop considering n. (Pruning out its remaining successors)
Note: α only gets assigned a value in Max-nodes, and β only gets assigned a value in Min-nodes.
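A minimal Python sketch of depth-limited alpha-beta search (an illustration, not the official algorithm statement of the notes); it reuses the hypothetical game interface from the minimax sketch above and assumes an evaluation function game.evaluate for cut-off states.

```python
def alphabeta(game, state, depth, alpha=float("-inf"), beta=float("inf")):
    """Depth-limited alpha-beta value of `state` from Max's perspective.
    `game.evaluate` is the (assumed) evaluation function f applied at cut-off states."""
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return game.evaluate(state)
    if game.to_move(state) == "Max":
        value = float("-inf")
        for a in game.actions(state):
            value = max(value, alphabeta(game, game.result(state, a), depth - 1, alpha, beta))
            alpha = max(alpha, value)
            if value >= beta:          # the Min node above will never allow this branch
                break
        return value
    else:
        value = float("inf")
        for a in game.actions(state):
            value = min(value, alphabeta(game, game.result(state, a), depth - 1, alpha, beta))
            beta = min(beta, value)
            if value <= alpha:         # the Max node above already has something better
                break
        return value
```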
[Figure: alpha-beta search on the example tree, annotating each node with its [α, β] interval: the first Min subtree yields 3, the second is cut off after its first leaf 2, and the third is searched completely (leaves 14, 5, 2); the root value is 3.]
Note: We could have saved work by choosing the opposite order for the successors
[Figure: the same alpha-beta search with the successors of the last Min node considered in the opposite order; that subtree is then cut off after fewer leaf evaluations.]
Observation: Assuming a game tree with branching factor b and depth limit d:
Minimax would have to search b^d nodes.
Best case: If we always choose the best moves first, then the search tree is reduced to b^(d/2) nodes!
Practice: It is often possible to get very close to the best case by simple move-
ordering methods.
Example 7.4.5 (Chess).
Move ordering: Try captures first, then threats, then forward moves, then back-
ward moves.
From 35^d to 35^(d/2). E.g., if we have the time to search a billion (10^9) nodes, then minimax looks ahead d = 6 moves, i.e., 3 rounds (white-black) of the game. Alpha-beta search looks ahead 6 rounds.
And now . . .
AlphaGo = Monte Carlo tree search (AI-1) + neural networks (AI-2)
Definition 7.5.2. For the Monte Carlo tree search algorithm (MCTS) we maintain
a search tree T , the MCTS tree.
This looks only at a fraction of the search tree, so it is crucial to have good guidance where to go,
i.e. which part of the search tree to look at.
[Figure: six MCTS iterations on a small example tree with leaf rewards 70, 50, 30, 100, 10, 40; the root stores, per action, the number of expansions and the average reward, ending with expansions 2, 2, 2 and average rewards 60, 55, 35.]
The sampling goes middle, left, right, right, left, middle. Then it stops and selects the action with the highest average reward (60, i.e. left). After the first sample, when the values in the initial state are updated, we have the following "expansions" and "avg. reward" fields: a small number of expansions is favored for exploration (visit parts of the tree rarely visited before – what is out there?), and a high average reward is favored for exploitation (focus on promising parts of the search tree).
[Figure: the same six MCTS iterations, this time incrementally building the MCTS tree; every node added to the tree keeps its own expansions and average-reward counters for every applicable action.]
This is the exact same search as on the previous slide, but incrementally building the search tree by always keeping the first state of the sample. The first three iterations (middle, left, right) show the tree extension; note that, like the root node, the nodes added to the tree have expansions and average-reward counters for every applicable action. In the next iteration (right), after the leaf node with value 30 was found, the important point is that the averages get updated along the entire path, i.e. not only in the root as we did before, but also in the nodes along the way. After all six iterations have been done, we select the action left (value 60) as before; but we keep the part of the tree below that action, saving relevant work already done before.
Exploitation: Prefer moves that have high average already (interesting regions
of state space)
Exploration: Prefer moves that have not been tried a lot yet (don’t overlook
other, possibly better, options)
UCT: “Upper Confidence bounds applied to Trees” [KS06].
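As an illustration of how exploitation and exploration are balanced, here is a sketch of the UCB1 selection rule used by UCT; the node bookkeeping (visits, total_reward, children) and the exploration constant C are assumptions of the sketch, not prescribed by the notes.

```python
import math

def ucb1_choice(node, C=1.4):
    """Pick the child action balancing exploitation (average reward) and exploration
    (visit count). `node.children` is assumed to map actions to child nodes that
    carry the fields `visits` (expansions) and `total_reward`."""
    def ucb(child):
        if child.visits == 0:
            return float("inf")                       # always try unvisited children first
        exploit = child.total_reward / child.visits   # average reward so far
        explore = C * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children.items(), key=lambda item: ucb(item[1]))[0]
```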
AlphaGo: Overview
Definition 7.5.5 (Neural Networks in AlphaGo).
[Figure: AlphaGo's neural network training pipeline and architecture; illustration taken from [Sil+16] (Figure 1).]
Rollout policy pπ: Simple but fast, ≈ prior work on Go.
SL policy network pσ: Supervised learning, human-expert data (“learn to choose an expert action”).
RL policy network pρ: Reinforcement learning, self-play (“learn to win”).
Value network vθ : Use self-play games with pρ as training data for game-position
evaluation vθ (“predict which player will win in this state”).
a A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict
human expert moves in a data set of positions. A reinforcement learning (RL) policy network
pρ is initialized to the SL policy network, and is then improved by policy gradient learning to
maximize the outcome (that is, winning more games) against previous versions of the policy
network. A new data set is generated by playing games of self-play with the RL policy network.
Finally, a value network vθ is trained by regression to predict the expected outcome (that is,
whether the current player wins) in positions from the self-play data set.
b Schematic representation of the neural network architecture used in AlphaGo. The policy
network takes a representation of the board position s as its input, passes it through many con-
volutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a
probability distribution pσ (a|s) or pρ (a|s) over legal moves a, represented by a probability map
over the board. The value network similarly uses many convolutional layers with parameters θ,
but outputs a scalar value vθ (s′ ) that predicts the expected outcome in position s′ .
[Figure: Monte Carlo tree search in AlphaGo; illustration taken from [Sil+16] (Figure 3).]
Rollout policy pπ: Action choice in random samples.
SL policy network pσ: Action choice bias within the UCTS tree (stored as “P”, gets smaller to “u(P)” with the number of visits); along with the quality Q.
RL policy network pρ: Not used here (used only to learn vθ).
Value network vθ: Used to evaluate leaf states s, in a linear sum with the value returned by a random sample on s.
Comments on the Figure:
a Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P for that edge.
b The leaf node may be expanded; the new node is processed once by the policy network pσ and
the output probabilities are stored as prior probabilities P for each action.
c At the end of a simulation, the leaf node is evaluated in two ways: using the value network vθ, and by running a rollout to the end of the game with the fast rollout policy pπ, then computing the winner with the function r. Action values Q are updated to track the mean value of all evaluations r(·) and vθ(·) in the subtree below that action.
• The AlphaGo design is quite intricate (architecture, learning workflow, training data design,
neural network architectures, . . . ).
• How much of this is reusable in/generalizes to other problems?
• Still lots of human expertise in here. Not so much about the game itself, as in chess, but rather in the design of the neural networks + learning architecture.
The chess machine is an ideal one to start with, since (Claude Shannon (1949))
1. the problem is sharply defined both in allowed operations (the moves) and in the
ultimate goal (checkmate),
2. it is neither so simple as to be trivial nor too difficult for satisfactory solution,
3. chess is generally considered to require “thinking” for skilful play, [. . . ]
4. the discrete structure of chess fits well into the digital nature of modern comput-
ers.
Chess is the drosophila of Artificial Intelligence. (Alexander Kronrod (1965))
7.7 Conclusion
Summary
Games (2-player turn-taking zero-sum discrete and finite games) can be understood
as a simple extension of classical search problems.
Each player tries to reach a terminal state with the best possible utility (maximal
vs. minimal).
Minimax searches the game depth-first, max’ing and min’ing at the respective turns
of each player. It yields perfect play, but takes time O(b^d) where b is the branching
factor and d the search depth.
Except in trivial games (Tic-Tac-Toe), minimax needs a depth limit and applies an
evaluation function to estimate the value of the cut-off states.
Alpha-beta search remembers the best values achieved for each player elsewhere in
the tree already, and prunes out sub-trees that won’t be reached in the game.
Monte Carlo tree search (MCTS) samples game branches, and averages the findings.
AlphaGo controls this using neural networks: evaluation function (“value network”),
and action filter (“policy network”).
Suggested Reading:
• Chapter 5: Adversarial Search, Sections 5.1 – 5.4 [RN09].
– Section 5.1 corresponds to my “Introduction”, Section 5.2 corresponds to my “Minimax Search”,
Section 5.3 corresponds to my “Alpha-Beta Search”. I have tried to add some additional clarify-
ing illustrations. RN gives many complementary explanations, nice as additional background
reading.
– Section 5.4 corresponds to my “Evaluation Functions”, but discusses additional aspects re-
lating to narrowing the search and look-up from opening/termination databases. Nice as
additional background reading.
– I suppose a discussion of MCTS and AlphaGo will be added to the next edition . . .
Chapter 8
Constraint Satisfaction Problems
In the last chapters we have studied methods for “general problem solving”, i.e. methods that are applicable to all problems that are expressible in terms of states and “actions”. It is crucial to realize that these states were atomic, which makes the algorithms employed (search algorithms) relatively simple and generic, but does not let them exploit any knowledge we might have about the internal structure of states.
In this chapter, we will look into algorithms that do just that by progressing to factored state representations. We will see that this allows for algorithms that are many orders of magnitude more efficient than search algorithms.
To give an intuition for factored state representations, we present some motivational examples in ?? and go into the details of the Waltz algorithm, which gave rise to the main ideas of constraint satisfaction algorithms, in ??. ?? and ?? define constraint satisfaction problems formally and use that to develop a class of backtracking/search based algorithms. The main contribution of factored state representations is that we can formulate advanced search heuristics that guide search based on the structure of the states.
Allows useful general-purpose algorithms with more power than standard tree
search algorithms.
Variables: v_{A vs. B} where A and B are teams, with domains {1, . . . , 34}: for each match, the index of the weekend where it is scheduled.
(Some) constraints: (each is checked in the code sketch after this list)
If {A, B} ∩ {C, D} ≠ ∅: v_{A vs. B} ≠ v_{C vs. D} (each team only one match per day).
If A = C: v_{A vs. B} + 1 ≠ v_{C vs. D} (each team alternates between home matches and away matches).
Leading teams of last season meet
near the end of each half-season.
...
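Here is the promised sketch (not from the notes) that checks the first two constraints on a candidate schedule in plain Python; the schedule representation and the team names are made up for illustration.

```python
from itertools import permutations

def schedule_ok(schedule):
    """Check the two constraints above on a candidate schedule.
    `schedule` is assumed to map match tuples (home, away) to matchdays in 1..34."""
    for (a, b), (c, d) in permutations(schedule, 2):
        day1, day2 = schedule[(a, b)], schedule[(c, d)]
        if {a, b} & {c, d} and day1 == day2:
            return False        # some team would play two matches on the same day
        if a == c and day1 + 1 == day2:
            return False        # team a would play at home on two consecutive matchdays
    return True

# Tiny hypothetical instance with three matches (team names are made up):
example = {("FCB", "BVB"): 1, ("FCB", "S04"): 2, ("BVB", "S04"): 3}
print(schedule_ok(example))     # False: FCB would host on matchdays 1 and 2
```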
Estimated running time: End of this universe, and the next couple billion ones
after it . . .
Directly enumerate all permutations of the numbers 1, . . . , 306, test for each whether
it’s a legal Bundesliga schedule.
Estimated running time: Maybe only the time span of a few thousand uni-
verses.
View this as variables/constraints and use backtracking (this chapter)
Executed running time: About 1 minute.
How do they actually do it?: Modern computers and CSP methods: fractions
of a second. 19th (20th/21st?) century: Combinatorics and manual work.
Try it yourself: with an off-the-shelf CSP solver, e.g. Minion [Min]
1. U.S. Major League Baseball, 30 teams, each 162 games. There’s one crucial additional difficulty,
in comparison to Bundesliga. Which one? Travel is a major issue here!! Hence “Traveling
Tournament Problem” in reference to the TSP.
2. This particular scheduling problem is called “car sequencing”, how to most efficiently get cars
through the available machines when making the final customer configuration (non-standard/flexible/custom
extras).
Simple methods for making backtracking aware of the structure of the problem,
and thereby reducing search.
Idea: Adjacent intersections impose constraints on each other. Use CSP to find a
unique set of labelings.
Observation 8.2.1. Then each line on the images is one of the following:
a boundary line (edge of an object) (<) with right hand of arrow denoting “solid”
and left hand denoting “space”
an interior convex edge (label with “+”)
an interior concave edge (label with “-”)
Waltz’s Examples
In his 1972 dissertation [Wal75], David Waltz used the following examples
Types of CSPs
Definition 8.3.1. We call a CSP discrete, iff all of the variables have countable
domains; we have two kinds:
finite domains (size d ; O(d^n) solutions)
e.g., Boolean CSPs (solvability ≙ Boolean satisfiability ; NP-complete)
infinite domains (e.g. integers, strings, etc.)
e.g., job scheduling, variables are start/end days for each job
need a “constraint language”, e.g., StartJob1 + 5 ≤ StartJob3
linear constraints decidable, nonlinear ones undecidable
Types of Constraints
We classify the constraints by the number of variables they involve.
Definition 8.3.11. Problems like the one in ?? are called crypto-arithmetic puzzles.
S E N D
+ M O R E
= M O N E Y
D + E = Y + 10 · X1
X1 + N + R = E + 10 · X2
X2 + E + O = N + 10 · X3
X3 + S + M = O + 10 · M
Problem: The problem structure gets hidden. (search algorithms can get
confused)
Constraint Graph
Definition 8.3.13. A binary CSP is a CSP where each constraint is unary or binary.
Observation 8.3.14. A binary CSP forms a graph called the constraint graph
whose nodes are variables, and whose edges represent the constraints.
Example 8.3.15. Australia as a binary CSP
[Figure: map of Australia and the corresponding constraint graph over WA, NT, SA, Q, NSW, V, T; cf. Figure 6.1 in [RN09].]
Intuition: General-purpose CSP algorithms use the graph structure to speed up search. (E.g., Tasmania is an independent subproblem!)
Real-world CSPs
Example 8.3.16 (Assignment problems). e.g., who teaches what class
Example 8.3.17 (Timetabling problems). e.g., which class is offered when and
where?
Example 8.3.18 (Hardware configuration).
Note that the ideas are still the same as ??, but in constraint networks we have a
language to formulate things precisely.
Idea: We will explore that idea for algorithms that solve constraint networks.
[Figure: two levels of the backtracking search tree for coloring Australia: WA = red, then NT = green or NT = blue, then Q = red or Q = blue.]
Backtracking Search
Assignments for different variables are independent!
e.g. first WA = red then NT = green vs. first NT = green then WA = red
; we only need to consider assignments to a single variable at each node
; b = d and there are d^n leaves.
Definition 8.5.3. Depth-first search for CSPs with single-variable assignment extensions as actions is called backtracking search.
Backtracking search is the basic uninformed algorithm for CSPs.
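A minimal Python sketch of naive backtracking search (an illustration; the representation of a constraint network as a domains dict plus binary constraint predicates is an assumption of the sketch, not the notes' official formalization):

```python
def backtracking_search(domains, constraints, assignment=None):
    """Naive backtracking for a binary CSP.
    `domains`: dict variable -> list of values.
    `constraints`: dict (u, v) -> predicate on (value_u, value_v); entries are assumed
    to exist for both orderings of each constrained pair."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):
        return assignment
    var = next(v for v in domains if v not in assignment)   # naive variable order
    for value in domains[var]:
        consistent = all(check(value, assignment[other])
                         for (u, other), check in constraints.items()
                         if u == var and other in assignment)
        if consistent:
            assignment[var] = value
            result = backtracking_search(domains, constraints, assignment)
            if result is not None:
                return result
            del assignment[var]
    return None

# Usage sketch: 3-coloring a fragment of the Australia map.
neq = lambda x, y: x != y
doms = {"WA": ["red", "green", "blue"], "NT": ["red", "green", "blue"],
        "SA": ["red", "green", "blue"]}
cons = {}
for u, v in [("WA", "NT"), ("WA", "SA"), ("NT", "SA")]:
    cons[(u, v)] = neq
    cons[(v, u)] = neq
print(backtracking_search(doms, cons))   # e.g. {'WA': 'red', 'NT': 'green', 'SA': 'blue'}
```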
Backtracking in Australia
Example 8.5.5. We apply backtracking search for a map coloring problem:
Step 1:
Step 2:
Step 3:
Step 4:
Where in ?? does the most constraining variable play a role in the choice? SA (only possible choice), NT (all choices possible except WA, V, T). Where in the illustration does the most constrained variable play a role in the choice? NT (all choices possible except T), Q (only Q and WA possible).
By choosing the least constraining value first, we increase the chances of not ruling out the solutions below the current node.
Example 8.5.11.
Suggested Reading:
– Compared to our treatment of the topic “Constraint Satisfaction Problems” (?? and ??),
RN covers much more material, but less formally and in much less detail (in particular, my
slides contain many additional in-depth examples). Nice background/additional reading, can’t
replace the lectures.
– Section 6.1: Similar to our “Introduction” and “Constraint Networks”, less/different examples,
much less detail, more discussion of extensions/variations.
– Section 6.3: Similar to my “Naïve Backtracking” and “Variable- and Value Ordering”, with
less examples and details; contains part of what we cover in ?? (RN does inference first, then
backtracking). Additional discussion of backjumping.
Chapter 9
Constraint Propagation
In this chapter we discuss another idea that is central to symbolic AI as a whole. The first component is that with factored state representations, we need to use a representation language for (sets of) states. The second component is that instead of state-level search, we can graduate to representation-level search (inference), which can be much more efficient than state-level search, as the respective representation-language actions correspond to groups of state-level actions.
9.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/22321.
[Figure: map of Australia and the corresponding constraint graph; cf. Figure 6.1 in [RN09].]
Question: Can we add a constraint without losing any solutions?
Illustration: Decomposition
[Figure: the Australia constraint graph decomposes into the mainland component and Tasmania.]
Tasmania is not adjacent to any other state. Thus we can color Australia first, and assign an arbitrary color to Tasmania afterwards.
[Figure: two constraint networks γ and γ′ over the same variables with domains {red, blue} and ≠ constraints.]
Tightness
Definition 9.2.5 (Tightness). Let γ := ⟨V, D, C⟩ and γ′ = ⟨V, D′, C′⟩ be constraint networks sharing the same set of variables, then γ′ is tighter than γ (write γ′ ⊑ γ), if:
(i) For all v ∈ V: D′_v ⊆ D_v.
(ii) For all u ≠ v ∈ V and C′_uv ∈ C′: either C_uv ∉ C or C′_uv ⊆ C_uv.
[Figure: example constraint networks γ and γ′ with domains {red, blue} and ≠ constraints, illustrating the tightness relation.]
[Figure: remaining domains of WA, NT, Q, NSW, V, SA, T during forward checking on the Australia example.]
Note: It’s a bit strange that we start with d′ here; this is to make the link to arc consistency – coming up next – as obvious as possible (same notations u and d vs. v and d′).
Incremental computation: Instead of the first for-loop in ??, use only the inner one
every time a new assignment a(v) = d′ is added.
Practical Properties:
Cheap but useful inference method.
Rarely a good idea to not use forward checking (or a stronger inference method
subsuming it).
Up next: A stronger inference method (subsuming forward checking).
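A sketch of the forward checking step in the same (assumed) domains/constraints representation as the backtracking sketch above:

```python
def forward_check(domains, constraints, var, value):
    """After the assignment var := value, prune from each neighbour w of var all
    values that are incompatible with `value`; return a pruned copy of `domains`,
    or None if some domain becomes empty (dead end detected early)."""
    pruned = {v: list(vals) for v, vals in domains.items()}
    pruned[var] = [value]
    for (u, w), check in constraints.items():
        if u == var and w != var:
            pruned[w] = [d for d in pruned[w] if check(value, d)]
            if not pruned[w]:
                return None
    return pruned
```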
Example 9.4.1.
[Figure: forward checking on the network with variables v1, v2, v3, domains {1, 2, 3}, and constraints v1 < v2, v2 < v3, after assigning v1 = 1: the domain of v2 shrinks to {2, 3}.]
We could do better here: value 3 for v2 is not consistent with any remaining value
for v3 ; it can be removed!
But forward checking does not catch this.
[Figure: remaining domains of WA, NT, Q, NSW, V, SA, T during forward checking on the Australia example.]
Note: SA is not consistent relative to NT in the 3rd row.
Forward checking makes arc inferences only “from assigned to unassigned” variables.
Lemma 9.4.10. If d is the maximal domain size in γ and the test “(d, d′) ∈ C_uv?” has time complexity O(1), then the running time of Revise(γ, u, v) is O(d^2).
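For illustration, a Python sketch of Revise in the same assumed domains/constraints representation as above; it performs at most d^2 constraint tests, matching the lemma.

```python
def revise(domains, constraints, u, v):
    """Remove from D_u every value without a supporting value in D_v under C_uv.
    Returns True iff D_u was changed. Uses at most d^2 constraint tests."""
    check = constraints[(u, v)]
    supported = [x for x in domains[u] if any(check(x, y) for y in domains[v])]
    changed = len(supported) < len(domains[u])
    domains[u] = supported
    return changed
```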
[Figure: repeated application of Revise on the network with v1 = 1, domains {1, 2, 3}, and constraints v1 < v2, v2 < v3, successively pruning the domains of v2 and v3.]
AC-3: Example
Example 9.4.15. y div x = 0: y modulo x is 0, i.e., y is divisible by x
[Figure: AC-3 on the network with variables v1, v2, v3, domains D_v1 = D_v3 = {2, 5} and D_v2 = {2, 4}, and constraints v2 div v1 = 0 and v3 div v1 = 0; the worklist M initially contains the pairs (v2, v1), (v1, v2), (v3, v1), (v1, v3) and is processed step by step, pruning values without support (e.g. D_v3 ends up as {2}).]
AC-3: Runtime
Theorem 9.4.16 (Runtime of AC-3). Let γ := ⟨V, D, C⟩ be a constraint network with m constraints, and maximal domain size d. Then AC-3(γ) runs in time O(m·d^3).
Proof: by counting how often Revise is called.
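A sketch of AC-3 built on the revise sketch from above (again with the assumed domains/constraints representation):

```python
from collections import deque

def ac3(domains, constraints):
    """Enforce arc consistency: process a worklist M of directed pairs (u, v);
    whenever D_u changes, re-add all pairs (w, u) for neighbours w of u.
    With m constraints and maximal domain size d this runs in O(m * d^3)."""
    queue = deque(constraints.keys())            # all directed constraint pairs
    while queue:
        u, v = queue.popleft()
        if revise(domains, constraints, u, v):
            if not domains[u]:
                return False                     # empty domain: no solution
            for (w, x) in constraints:
                if x == u and w != v:
                    queue.append((w, u))
    return True
```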
Problem Structure
[Figure: the Australia constraint graph; Tasmania is an independent subproblem; cf. Figure 6.1 in [RN09].]
E.g., n = 80, d = 2, c = 20:
2^80 ≙ 4 billion years at 10 million nodes/sec.
4 · 2^20 ≙ 0.4 seconds at 10 million nodes/sec.
Example 9.5.4 (Doing the Numbers).
γ with n = 40 variables, each domain size k = 2. Four separate connected components each of size 10.
Reduction of worst-case when using decomposition:
Tree-structured CSPs
Definition 9.5.6. We call a CSP tree-structured, iff its constraint graph is acyclic.
Theorem 9.5.7. Tree-structured CSPs can be solved in O(n·d^2) time.
Example 9.5.9.
Definition 9.5.10. Cutset conditioning: instantiate (in all ways) a set of variables
such that the remaining constraint graph is a tree.
Cutset size c ; running time O(d^c · (n − c) · d^2), very fast for small c.
Constraint networks with acyclic constraint graphs can be solved in (low order)
polynomial time.
Example 9.5.12. Australia is not acyclic. (But see next section)
[Figure: the Australia constraint graph, which contains cycles.]
a We assume here that γ’s constraint graph is connected. If it is not, do this and the following
AcyclicCG(γ): Example
Example 9.5.16 (AcyclicCG() execution).
Input network γ: variables v1, v2, v3 with domains {1, 2, 3} and constraints v1 < v2, v2 < v3.
Step 1: Directed tree for root v1.
Step 2: Order v1, v2, v3.
[Figure: the remaining steps apply Revise bottom-up along the directed tree and then assign top-down, yielding v1 = 1, v2 = 2, v3 = 3.]
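A Python sketch of the AcyclicCG idea (bottom-up Revise along the directed tree, then backtrack-free top-down assignment); the variable order and parent map are assumed to be given, and revise is the sketch from the arc consistency section above.

```python
def acyclic_cg(domains, constraints, order, parent):
    """Solve a CSP whose constraint graph is a tree.
    `order` lists the variables from the root downwards; `parent` maps each
    non-root variable to its parent in the directed tree (both assumed given)."""
    # Bottom-up: make each parent arc consistent relative to its child.
    for v in reversed(order[1:]):
        if revise(domains, constraints, parent[v], v):
            if not domains[parent[v]]:
                return None                       # inconsistency detected
    # Top-down: greedily pick consistent values; no backtracking is needed.
    assignment = {order[0]: domains[order[0]][0]}
    for v in order[1:]:
        check = constraints[(parent[v], v)]
        assignment[v] = next(d for d in domains[v] if check(assignment[parent[v]], d))
    return assignment

# On the example network above (v1 < v2 < v3 over {1, 2, 3}) this yields
# {'v1': 1, 'v2': 2, 'v3': 3}.
```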
Remark 9.6.5. Finding optimal cutsets is NP hard, but good approximations exist.
Example: 4-Queens
States: 4 queens in 4 columns (4^4 = 256 states)
Actions: Move queen in column
Goal state: No conflicts
Heuristic: h(n) ≙ number of conflicts
Performance of min-conflicts
Given a random initial state, can solve n-queens in almost constant time for
arbitrary n with high probability (e.g., n = 10,000,000)
The same appears to be true for any randomly-generated CSP except in a narrow range of the ratio R = (number of constraints) / (number of variables).
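For illustration, a plain-Python sketch of min-conflicts local search for n-queens (the column-wise representation queens[c] = row is an assumption of the sketch):

```python
import random

def min_conflicts_nqueens(n, max_steps=100_000):
    """Min-conflicts local search for n-queens: queens[c] = row of the queen in column c."""
    queens = [random.randrange(n) for _ in range(n)]

    def conflicts(col, row):
        """Number of other queens attacking square (col, row)."""
        return sum(1 for c in range(n) if c != col and
                   (queens[c] == row or abs(queens[c] - row) == abs(c - col)))

    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(c, queens[c]) > 0]
        if not conflicted:
            return queens                       # solution found
        col = random.choice(conflicted)         # pick a random conflicted variable
        # move its queen to a row minimizing the number of conflicts
        queens[col] = min(range(n), key=lambda r: conflicts(col, r))
    return None
```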
Arc consistency removes values that do not comply with any value still available at
the other end of a constraint. This subsumes forward checking.
The constraint graph captures the dependencies between variables. Separate con-
nected components can be solved independently. Networks with acyclic constraint
graphs can be solved in low order polynomial time.
A cutset is a subset of variables removing which renders the constraint graph acyclic.
Cutset conditioning backtracks only on such a cutset, and solves a sub-problem with
acyclic constraint graph at each search leaf.
Suggested Reading:
• Chapter 6: Constraint Satisfaction Problems in [RN09], in particular Sections 6.2, 6.3.2, and
6.5.
– Compared to our treatment of the topic “constraint satisfaction problems” (?? and ??),
RN covers much more material, but less formally and in much less detail (in particular, our
slides contain many additional in-depth examples). Nice background/additional reading, can’t
replace the lectures.
– Section 6.3.2: Somewhat comparable to our “inference” (except that equivalence and tightness
are not made explicit in RN) together with “forward checking”.
– Section 6.2: Similar to our “arc consistency”, less/different examples, much less detail, addi-
tional discussion of path consistency and global constraints.
– Section 6.5: Similar to our “decomposition” and “cutset conditioning”, less/different examples,
much less detail, additional discussion of tree decomposition.
Part III
Recall: We have used atomic representations in search problems and tree search
algorithms.
But: We already allowed peeking into the state in
informed search to compute heuristics
adversarial search ⇝ too many states!
Recall: We have used factored representations in
backtracking search for CSPs ; universally useful heuristics
constraint propagation: inference ; lifting search to the CSP description level.
219
The Wumpus world is a very simple game modeled after the early text adventure games of the 1960s and 70s, where the player entered a world and was provided with textual information about
percepts and could explore the world via actions. The main difference is that we use it as an agent
environment in this course.
Definition 10.1.2 (Actions). The agent can perform the following actions: goForward,
turnRight (by 90◦ ), turnLeft (by 90◦ ), shoot arrow in direction you’re facing (you
got exactly one arrow), grab an object in current cell, leave cave if you’re in cell
[1, 1].
Definition 10.1.3 (Initial and Terminal States). Initially, the agent is in cell
[1, 1] facing east. If the agent falls down a pit or meets a live Wumpus, it dies.
Definition 10.1.4 (Percepts). The agent can experience the following percepts:
stench, breeze, glitter, bump, scream, none.
Cell adjacent (i.e. north, south, west, east) to Wumpus: stench (else: none).
Cell adjacent to pit: breeze (else: none).
Cell that contains gold: glitter (else: none).
You walk into a wall: bump (else: none).
Wumpus shot by arrow: scream (else: none).
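As an illustration (not part of the game definition), a small Python sketch that computes the percept set of a cell from a hypothetical world description on the usual 4×4 grid:

```python
def adjacent(cell):
    """North, south, west and east neighbours of a cell (x, y) on the 4x4 grid."""
    x, y = cell
    return {(x + dx, y + dy) for dx, dy in [(0, 1), (0, -1), (-1, 0), (1, 0)]
            if 1 <= x + dx <= 4 and 1 <= y + dy <= 4}

def percepts(cell, wumpus, pits, gold, bumped=False, wumpus_killed=False):
    """Percepts in `cell`, given (hypothetical) positions of the Wumpus, pits and gold."""
    p = set()
    if cell in adjacent(wumpus):
        p.add("stench")
    if any(cell in adjacent(pit) for pit in pits):
        p.add("breeze")
    if cell == gold:
        p.add("glitter")
    if bumped:
        p.add("bump")
    if wumpus_killed:
        p.add("scream")
    return p or {"none"}
```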
The game is complex enough to warrant structured state representations and can easily be extended
to include uncertainty and non-determinism later.
As our focus is on inference processes here, let us see how a human player would reason when
entering the Wumpus world. This can serve as a model for designing our artificial agents.
(1) Initial state (2) One step to right (3) Back, and up to [1,2]
Let us now look into what kind of agent we would need to be successful in the Wumpus world:
it seems reasonable that we should build on a model-based agent and specialize it to structured
state representations and inference.
[Figure: model-based agent schema: the agent is connected to the environment through sensors and actuators and maintains a state, a model of how the world evolves, and a model of what its actions do.]
The formal language of the logical system acts as a world description language.
Agent function:
function KB−AGENT(percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))
  action := ASK(KB, MAKE−ACTION−QUERY(t))
  TELL(KB, MAKE−ACTION−SENTENCE(action,t))
  t := t+1
  return action
Its agent function maintains a knowledge base about the environment, which is
updated with percept descriptions (formalizations of the percepts) and action de-
scriptions. The next action is the result of a suitable inference-based query to the
knowledge base.
It is critical to understand that while PL0 as a logical system is given once and for all, the agent
designer still has to formalize the situation (here the Wumpus world) in the world description
language (here PL0; but we will look at more expressive logical systems below). This formalization is the seed of the knowledge base, which the logic-based agent can then extend via its percept and action descriptions, and which also forms the basis of its inferences. We will look at this aspect now.
Syntax: Atomic propositions that can be either true or false, connected by “and,
or, and not”.
Semantics: Assign value to every proposition, evaluate connectives.
Applications: Despite its simplicity, widely applied!
Killing a Wumpus: How can we use all this to figure out where the Wumpus is?
Coming back to our introductory example.
We will now develop the formal theory behind the ideas previewed in the last section and use
that as a prototype for the theory of the more expressive logical systems still to come in AI-1. As
PL0 is a very simple logical system, we could cut some corners in the exposition but we will stick
closer to a generalizable theory.
Definition 10.2.1 (Syntax). The formulae of propositional logic (write PL0 ) are
made up from
propositional variables: V0 := {P , Q, R, P 1 , P 2 , . . .} (countably infinite)
A propositional signature: constants/constructors called connectives: Σ0 :=
{T , F , ¬, ∨, ∧, ⇒, ⇔, . . .}
Propositional logic is a very old and widely used logical system. So it should not be surprising
that there are other notations for the connectives than the ones we are using in AI-1. These notations will not be used in AI-1, but sometimes appear in the literature.
The semantics of PL0 is defined relative to a model, which consists of a universe of discourse and
an interpretation function that we specify now.
Warning: For the official semantics of PL0 we will separate the tasks of giving
meaning to connectives and propositional variables to different mappings.
This will generalize better to other logical systems. (and thus applications)
Definition 10.2.4. A model M := ⟨Do , I⟩ for propositional logic consists of
We call a constant a logical constant, iff its value is fixed by the interpretation.
Treat the other connectives as abbreviations, e.g. A ∨ B= b ¬(¬A ∧ ¬B) and
A ⇒ B= b ¬A ∨ B, and T =b P ∨ ¬P (only need to treat ¬, ∧ directly)
Note: PL0 is a single-model logical system with canonical model ⟨Do , I⟩.
We have a problem in the exposition of the theory here: As PL0 semantics only has a single,
canonical model, we could simplify the exposition by just not mentioning the universe and inter-
pretation function. But we choose to expose both of them in the construction, since other versions of propositional logic – in particular the system PLnq below – have a choice of models, as they use a different distribution of the representation among constants and variables.
In particular, an interpretation-less exposition of propositional logic would have elided the homomorphic construction of the value function and could have simplified the recursive cases in ?? to Iφ(A ∧ B) = T, iff Iφ(A) = T = Iφ(B).
But the homomorphic construction via I(∧) is standard to definitions in other logical systems
and thus generalizes better.
Computing Semantics
Example 10.2.8. Let φ := [T/P 1 ], [F/P 2 ], [T/P 3 ], [F/P 4 ], . . . then
I φ (P 1 ∨ P 2 ∨ ¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 )
= I(∨)(I φ (P 1 ∨ P 2 ), I φ (¬(¬P 1 ∧ P 2 ) ∨ P 3 ∧ P 4 ))
= I(∨)(I(∨)(I φ (P 1 ), I φ (P 2 )), I(∨)(I φ (¬(¬P 1 ∧ P 2 )), I φ (P 3 ∧ P 4 )))
= I(∨)(I(∨)(φ(P 1 ), φ(P 2 )), I(∨)(I(¬)(I φ (¬P 1 ∧ P 2 )), I(∧)(I φ (P 3 ), I φ (P 4 ))))
= I(∨)(I(∨)(T, F), I(∨)(I(¬)(I(∧)(I φ (¬P 1 ), I φ (P 2 ))), I(∧)(φ(P 3 ), φ(P 4 ))))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(I φ (P 1 )), φ(P 2 ))), I(∧)(T, F)))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(φ(P 1 )), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(I(¬)(T), F)), F))
= I(∨)(T, I(∨)(I(¬)(I(∧)(F, F)), F))
= I(∨)(T, I(∨)(I(¬)(F), F))
= I(∨)(T, I(∨)(T, F))
= I(∨)(T, T)
= T
What a mess!
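The same computation can be done mechanically; here is a small Python sketch of the recursive value function (the nested-tuple representation of formulae is an assumption of the sketch, and ∨ is treated directly rather than via the abbreviation):

```python
def value(formula, phi):
    """Recursive evaluation I_phi of a PL0 formula.
    Formulae are nested tuples: ("var", "P1"), ("not", A), ("and", A, B), ("or", A, B);
    `phi` maps propositional variable names to True/False."""
    op = formula[0]
    if op == "var":
        return phi[formula[1]]
    if op == "not":
        return not value(formula[1], phi)
    if op == "and":
        return value(formula[1], phi) and value(formula[2], phi)
    if op == "or":
        return value(formula[1], phi) or value(formula[2], phi)
    raise ValueError(f"unknown connective {op}")

# The example from above: P1 ∨ P2 ∨ ¬(¬P1 ∧ P2) ∨ (P3 ∧ P4) under
# phi = [T/P1], [F/P2], [T/P3], [F/P4].
phi = {"P1": True, "P2": False, "P3": True, "P4": False}
f = ("or", ("or", ("var", "P1"), ("var", "P2")),
           ("or", ("not", ("and", ("not", ("var", "P1")), ("var", "P2"))),
                  ("and", ("var", "P3"), ("var", "P4"))))
print(value(f, phi))  # True
```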
Now we will also review some propositional identities that will be useful later on. Some of them we
have already seen, and some are new. All of them can be proven by simple truth table arguments.
228 CHAPTER 10. PROPOSITIONAL LOGIC & REASONING, PART I: PRINCIPLES
Propositional Identities
Definition 10.2.9. We have the following identities in propositional logic:
We will now use the distribution of values of a propositional formula under all variable assignments
to characterize them semantically. The intuition here is that we want to understand theorems,
examples, counterexamples, and inconsistencies in mathematics and everyday reasoning1 .
The idea is to use the formal language of propositional formulae as a model for mathematical
language. Of course, we cannot express all of mathematics as propositional formulae, but we can
at least study the interplay of mathematical statements (which can be true or false) with the
copula “and”, “or” and “not”.
1 Here (and elsewhere) we will use mathematics (and the language of mathematics) as a test tube for under-
standing reasoning, since mathematics has a long history of studying its own reasoning processes and assumptions.
Let us now see how these semantic properties model mathematical practice.
In mathematics we are interested in assertions that are true in all circumstances. In our model
of mathematics, we use variable assignments to stand for “circumstances”. So we are interested
in propositional formulae which are true under all variable assignments; we call them valid. We
often give examples (or show situations) which make a conjectured formula false; we call such
examples counterexamples, and such assertions falsifiable. We also often give examples for certain
formulae to show that they can indeed be made true (which is not the same as being valid yet);
such assertions we call satisfiable. Finally, if a formula cannot be made true in any circumstances
we call it unsatisfiable; such assertions naturally arise in mathematical practice in the form of
refutation proofs, where we show that an assertion (usually the negation of the theorem we want
to prove) leads to an obviously unsatisfiable conclusion, showing that the negation of the theorem
is unsatisfiable, and thus the theorem valid.
Let us finally test our intuitions about propositional logic with a “real-world example”: a logic
puzzle, as you could find it in a Sunday edition of the local newspaper.
Answer: You can solve this using PL0 , if we accept bla(S), etc. as propositional variables.
We first express what we know: For every x ∈ {S, N , J} (Stefan, Nicole, Jochen) we have
3. 1. together with 2.2a entails that ai(x) ⇒ bla(x) for every x ∈ {S, N , J},
4. thus ¬bla(S) ∧ ¬bla(J) by 2.2c and 2.2b and
5. so ¬ai(S) ∧ ¬ai(J) by 3. and 4.
6. With 2. the latter entails ai(N ).
The example shows that puzzles like that are a bit difficult to solve without writing things down.
But if we formalize the situation in PL0 , then we can solve the puzzle quite handily with inference.
Note that we have been a bit generous with the names of propositional variables; e.g. bla(x),
where x ∈ {S, N , J}, to keep the representation small enough to fit on the slide. This does not
hinder the method in any way.
[Figure: model-based agent schema (as above).]
The formal language of the logical system acts as a world description language.
Agent function:
function KB−AGENT(percept) returns an action
  persistent: KB, a knowledge base
              t, a counter, initially 0, indicating time
  TELL(KB, MAKE−PERCEPT−SENTENCE(percept,t))
  action := ASK(KB, MAKE−ACTION−QUERY(t))
  TELL(KB, MAKE−ACTION−SENTENCE(action,t))
  t := t+1
  return action
Its agent function maintains a knowledge base about the environment, which is updated with percept descriptions (formalizations of the percepts) and action descriptions. The next action is the result of a suitable inference-based query to the knowledge base.
K: P ⇒ Q ⇒ P        S: (P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R
MP: from A ⇒ B and A, derive B        Subst: from A, derive [B/X](A)
This is indeed a very simple formal system, but it has all the required parts:
• A formal language: expressions built up from variables and implications.
• A semantics: given by the obvious interpretation function
• A calculus: given by the two axioms and the two inference rules.
The calculus gives us a set of rules with which we can derive new formulae from old ones. The
axioms are very simple rules, they allow us to derive these two formulae in any situation. The
proper inference rules are slightly more complicated: we read the formulae above the horizontal
line as assumptions and the (single) formula below as the conclusion. An inference rule allows us
to derive the conclusion, if we have already derived the assumptions.
Now, we can use these inference rules to perform a proof – a sequence of formulae that can be
derived from each other. The representation of the proof in the slide is slightly compactified to fit
onto the slide: We will make it more explicit here. We first start out by deriving the formula
(P ⇒ Q ⇒ R) ⇒ (P ⇒ Q) ⇒ P ⇒ R (10.1)
which we can always do, since we have an axiom for this formula, then we apply the rule Subst,
where A is this result, B is C, and X is the variable P to obtain
(C ⇒ Q ⇒ R) ⇒ (C ⇒ Q) ⇒ C ⇒ R (10.2)
Next we apply the rule Subst to this where B is C ⇒ C and X is the variable Q this time to obtain
(C ⇒ (C ⇒ C) ⇒ R) ⇒ (C ⇒ C ⇒ C) ⇒ C ⇒ R (10.3)
And again we apply the rule Subst; this time B is C and X is the variable R, yielding the first formula in our proof on the slide. To conserve space, we have combined these three steps into one
in the slide. The next steps are done in exactly the same way.
In general, formulae can be used to represent facts about the world as propositions; they have a
semantics that is a mapping of formulae into the real world (propositions are mapped to truth
values.) We have seen two relations on formulae: the entailment relation and the derivation
relation. The first one is defined purely in terms of the semantics, the second one is given by a
calculus, i.e. purely syntactically. Is there any relation between these relations?
Goal: Find calculi C, such that ⊢C A iff ⊨ A (provability and validity coincide)
To TRUTH through PROOF (CALCULEMUS [Leibniz ∼1680])
Ideally, both relations would be the same, then the calculus would allow us to infer all facts that
can be represented in the given formal language and that are true in the real world, and only
those. In other words, our representation and inference is faithful to the world.
A consequence of this is that we can rely on purely syntactical means to make predictions
about the world. Computers rely on formal representations of the world; if we want to solve a
problem on our computer, we first represent it in the computer (as data structures, which can be
seen as a formal language) and do syntactic manipulations on these structures (a form of calculus).
Now, if the provability relation induced by the calculus and the validity relation coincide (this will
be quite difficult to establish in general), then the solutions of the program will be correct, and we
will find all possible ones. Of course, the logics we have studied so far are very simple, and not
able to express interesting facts about the world, but we will study them as a simple example of
the fundamental problem of computer science: How do the formal representations correlate with
the real world?
Within the world of logics, one can derive new propositions (the conclusions, here: Socrates is
mortal) from given ones (the premises, here: Every human is mortal and Socrates is human). Such
derivations are proofs.
In particular, logics can describe the internal structure of real-life facts; e.g. individual things,
actions, properties. A famous example, which is in fact as old as it appears, is illustrated in the
slide below.
If a formal system is correct, the conclusions one can prove are true (= hold in the real world)
whenever the premises are true. This is a miraculous fact (think about it!)
We will now introduce the “natural deduction” calculus for propositional logic. The calculus
was created to model the natural mode of reasoning e.g. in everyday mathematical practice. In
particular, it was intended as a counter-approach to the well-known Hilbert style calculi, which
were mainly used as theoretical devices for studying reasoning in principle, not for modeling
particular reasoning styles. We will introduce natural deduction in two styles/notations, both of
which were invented by Gerhard Gentzen in the 1930s and are very much related. The natural
deduction style (ND) uses local hypotheses in proofs for hypothetical reasoning, while the "sequent
style" is a rationalized version and extension of the ND calculus that makes certain meta-proofs
simpler to push through by making the context of local hypotheses explicit in the notation. The
sequent notation also constitutes a more adequate data structure for implementations and user
interfaces.
Rather than using a minimal set of inference rules, we introduce a natural deduction calculus that
provides two/three inference rules for every logical constant, one “introduction rule” (an inference
rule that derives a formula with that logical constant at the head) and one “elimination rule” (an
inference rule that acts on a formula with this head and derives a set of subformulae).
Definition 10.4.1. The propositional natural deduction calculus ND0 has inference
rules for the introduction and elimination of connectives:
ND0⇒I¹: from a derivation of B that uses the local hypothesis [A]¹, infer A ⇒ B, discharging the hypothesis.
ND0⇒E: from A ⇒ B and A, infer B.
The most characteristic rule in the natural deduction calculus is the ND0⇒I rule and the hypothetical
reasoning it introduces. ND0⇒I corresponds to the mathematical way of proving an
implication A ⇒ B: we assume that A is true and show B from this local hypothesis. When we
can do this, we discharge the assumption and conclude A ⇒ B.
Note that the local hypothesis is discharged by the rule ND0⇒I, i.e. it cannot be used in any
other part of the proof. As ND0⇒I rules may be nested, we decorate both the rule and the
corresponding local hypothesis with a marker (here the number 1).
Let us now consider an example of hypothetical reasoning in action.
Left: from the local hypothesis [A ∧ B]¹ we obtain B by ND0∧Er and A by ND0∧El; ND0∧I combines
them into B ∧ A, and ND0⇒I¹ discharges the hypothesis to conclude A ∧ B ⇒ B ∧ A.
Right: from the local hypotheses [A]¹ and [B]², ND0⇒I² discharges [B]² to give B ⇒ A, and
ND0⇒I¹ discharges [A]¹ to conclude A ⇒ B ⇒ A.
Here we see hypothetical reasoning with local hypotheses at work. In the left example, we
assume the formula A ∧ B and can use it in the proof until it is discharged by the rule ND0⇒I¹ at
the bottom – therefore we decorate the hypothesis and the rule with corresponding markers (here
the label "1"). Note that the local hypothesis A ∧ B is local to the proof fragment delineated by the
corresponding (local) hypothesis and the discharging rule, i.e. even if this derivation is only a
fragment of a larger proof, then we cannot use its (local) hypothesis anywhere else.
Note also that we can use as many copies of the local hypothesis as we need; they are all
discharged at the same time.
In the right example we see that local hypotheses can be nested as long as they are kept local.
In particular, we may not use the hypothesis B after it has been discharged by ND0⇒I², e.g. to
continue with an ND0⇒E.
One of the nice things about the natural deduction calculus is that the deduction theorem is
almost trivial to prove. In a sense, the triviality of the deduction theorem is the central idea of
the calculus and the feature that makes it so natural.
Another characteristic of the natural deduction calculus is that it has inference rules (introduction
and elimination rules) for all connectives. So we extend the set of rules from ?? for disjunction,
negation and falsity.
Definition 10.4.5. ND0 has the following additional inference rules for the remaining connectives:
ND0∨Il: from A, infer A ∨ B.   ND0∨Ir: from B, infer A ∨ B.
ND0∨E¹: from A ∨ B, a derivation of C from the local hypothesis [A]¹, and a derivation of C from
the local hypothesis [B]¹, infer C, discharging both hypotheses.
ND0¬I¹: from derivations of C and ¬C from the local hypothesis [A]¹, infer ¬A, discharging the
hypothesis.
ND0¬E: from ¬¬A, infer A.
ND0FI: from ¬A and A, infer F.   ND0FE: from F, infer A.
Definition 10.4.9. The following inference rules make up the propositional sequent
style natural deduction calculus ND⊢0 :
ND⊢0 Ax: Γ, A ⊢ A.
ND⊢0 weaken: from Γ ⊢ B, infer Γ, A ⊢ B.
ND⊢0 TND: Γ ⊢ A ∨ ¬A.
ND⊢0¬I: from Γ, A ⊢ F, infer Γ ⊢ ¬A.
ND⊢0¬E: from Γ ⊢ ¬¬A, infer Γ ⊢ A.
Example: the two proofs from above in sequent style. Starting from the axiom A ∧ B ⊢ A ∧ B
(ND⊢0 Ax), the rules ND⊢0∧Er and ND⊢0∧El give A ∧ B ⊢ B and A ∧ B ⊢ A; ND⊢0∧I combines these
into A ∧ B ⊢ B ∧ A, and ND⊢0⇒I yields ⊢ A ∧ B ⇒ B ∧ A. Similarly, from the axiom A, B ⊢ A,
ND⊢0⇒I gives A ⊢ B ⇒ A, and another ND⊢0⇒I yields ⊢ A ⇒ B ⇒ A.
Each row in the table represents one inference step in the proof. It consists of a line number (for
referencing), a formula for the statement, a justification via an ND inference rule (and the rows this
one is derived from), and finally a sequence of row numbers of proof steps that are local hypotheses
in effect for the current row.
This is essentially the same as PL0 , so we can reuse the calculi. (up next)
Idea: Re-use PL0, but replace propositional variables with something more expressive!
(instead of the fancy variable name trick)
Definition 10.5.1. A first-order signature ⟨Σf, Σp⟩ consists of
Σf := ⋃k∈N Σfk of function constants, where members of Σfk denote k-ary functions on individuals,
Σp := ⋃k∈N Σpk of predicate constants, where members of Σpk denote k-ary relations among individuals,
where Σfk and Σpk are pairwise disjoint, countable sets of symbols for each k ∈ N.
A 0-ary function constant refers to a single individual, therefore we call it an individual
constant.
PLnq Semantics
Definition 10.5.3. Domains: D0 = {T, F} of truth values and Dι ≠ ∅ of individuals.
Definition 10.5.4. An interpretation I assigns values to constants, e.g.
I(¬): D0 → D0 with T ↦ F, F ↦ T, and I(∧) = . . . (as in PL0)
I: Σf0 → Dι (interpret individual constants as individuals)
I: Σfk → Dι^k → Dι (interpret function constants as functions)
I: Σpk → P(Dι^k) (interpret predicate constants as relations)
Definition 10.5.5. The value function I assigns values to formulae: (recursively)
All of the definitions above are quite abstract, we now look at them again using a very concrete –
if somewhat contrived – example: The relevant parts are a universe D with four elements, and an
interpretation that maps the signature into individuals, functions, and predicates over D, which
are given as concrete sets.
The example above also shows how we can compute meaning in a concrete model: we just
follow the evaluation rules to the letter.
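Since the concrete model itself is not reproduced in these notes, here is a minimal Python sketch of the same idea – evaluating PLnq formulae in a small, made-up model by following the evaluation rules; the universe, interpretation, and formula are illustrative assumptions, not the example from the slides.

# Minimal sketch: evaluating PLnq formulae in a concrete model (made-up example).
# Terms: ("f", t1, ..., tn) is a function constant applied to terms (0-ary = individual constant).
# Formulae: ("p", t1, ..., tn) is an atom, ("not", A) and ("and", A, B) are the connectives.

universe = {"a", "b", "c", "d"}                              # four individuals
interp_fun = {
    "j": lambda: "a",                                        # an individual constant
    "mother_of": lambda x: {"a": "b", "b": "c", "c": "d", "d": "a"}[x],
}
interp_pred = {
    "loves": {("a", "b"), ("b", "c")},                       # a binary relation as a set of pairs
}

def eval_term(t):
    """Evaluate a term to an element of the universe."""
    name, *args = t
    return interp_fun[name](*[eval_term(a) for a in args])

def eval_formula(A):
    """The value function: evaluate a PLnq formula to True/False recursively."""
    op, *args = A
    if op == "not":
        return not eval_formula(args[0])
    if op == "and":
        return eval_formula(args[0]) and eval_formula(args[1])
    return tuple(eval_term(t) for t in args) in interp_pred[op]   # atomic case

# loves(j, mother_of(j)) evaluates to True in this model.
print(eval_formula(("loves", ("j",), ("mother_of", ("j",)))))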
We now come to the central technical result about PLnq: it is essentially the same as propositional
logic (PL0). We say that the two logics are isomorphic. Technically, this means that the formulae
of PLnq can be translated to PL0 and there is a corresponding model translation from the models
of PL0 to those of PLnq such that the respective notions of evaluation are mapped to each other.
Corollary 10.5.12. PLnq is isomorphic to PL0, i.e. the following diagram commutes: the formula
translation θΣ maps PLnq(Σ) to PL0(AΣ), the model translation ψ ↦ Mψ maps PLnq models
⟨Dψ, Iψ⟩ to variable assignments VΣ → {T, F}, and evaluation with Iψ(·) on the PLnq side agrees
with evaluation with IφMψ(·) on the PL0 side.
The practical upshot of the commutative diagram from ?? is that if we have a way of computing
evaluation (or entailment for that matter) in PL0 , then we can “borrow” it for PLnq by composing
it with the language and model translations. In other words, we can reuse calculi and automated
3.2. If A = ¬B, then Iψ(A) = T, iff Iψ(B) = F, iff IφMψ(θΣ(B)) = F, iff IφMψ(θΣ(A)) = T.
3.3. If A = B ∧ C, then we argue similarly.
4. Hence Iψ(A) = IφMψ(θΣ(A)) for all PLnq formulae, and we have concluded the proof.
10.6 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25027.
Summary
Sometimes, it pays off to think before acting.
Propositional logic formulae are built from atomic propositions, with the connectives
and, or, not.
Suggested Reading:
We will now take a more abstract view and introduce the necessary prerequisites of abstract rule
systems. We will also take the opportunity to discuss the quality criteria for calculi.
The notion of a logical system is at the basis of the field of logic. In its most abstract form, a logical
system consists of a formal language, a class of models, and a satisfaction relation between models
and expressions of the formal language. The satisfaction relation tells us when an expression is
deemed true in this model.
Logical Systems
Definition 11.0.1. A logical system (or simply a logic) is a triple L := ⟨L, K, ⊨⟩,
where the language L is a formal language, the model class K is a set, and ⊨ ⊆ K×L.
Members of L are called formulae of L, members of K models for L, and ⊨ the
satisfaction relation.
Example 11.0.2 (Propositional Logic). ⟨wff(ΣPL0, VPL0), K0, ⊨⟩ is a logical
system, if we define K0 := V0 ⇀ D0 (the set of variable assignments) and φ ⊨ A
iff Iφ(A) = T.
Let us now turn to the syntactical counterpart of the entailment relation: derivability in a cal-
culus. Again, we take care to define the concepts at the general level of logical systems.
The intuition of a calculus is that it provides a set of syntactic rules that allow us to reason by
considering the form of propositions alone. Such rules are called inference rules, and they can be
strung together into derivations — which can alternatively be viewed either as sequences of formulae
where all formulae are justified by prior formulae or as trees of inference rule applications. But we
can also define a calculus in the more general setting of logical systems as an arbitrary relation on
formulae with some general properties. That allows us to abstract away from the homomorphic
setup of logics and calculi and concentrate on the basics.
H ⊢ A, if A ∈ H (⊢ is proof reflexive),
H ⊢ A and (H′ ∪ {A}) ⊢ B imply (H ∪ H′ ) ⊢ B (⊢ is proof transitive),
H ⊢ A and H ⊆ H′ imply H′ ⊢ A (⊢ is monotonic or admits weakening).
Definition 11.0.5. Let L be a formal language, then an inference rule over L is a
decidable (n+1)-ary relation on L. Inference rules are traditionally written as

A1 … An
———————— N
   C

where A1, …, An and C are schemata for words in L and N is a name. The Ai
are called assumptions of N, and C is called its conclusion.
Any (n+1)-tuple (a1, …, an, c) in N is called an application of N and we say that we apply N to
a set M of words with a1, …, an ∈ M to obtain c.
Definition 11.0.6. An inference rule without assumptions is called an axiom.
Definition 11.0.7. A calculus (or inference system) is a formal language L equipped
with a set C of inference rules over L.
By formula schemata we mean representations of sets of formulae; we use boldface uppercase
letters as (meta-)variables for formulae. For instance, the formula schema A ⇒ B represents the set
of formulae whose head is ⇒.
Derivations
Definition 11.0.8. Let L := ⟨L, K, ⊨⟩ be a logical system and C a calculus for L,
then a C-derivation of a formula C ∈ L from a set H ⊆ L of hypotheses (write
H ⊢C C) is a sequence A1, …, Am of L-formulae, such that
Am = C, (derivation culminates in C)
for all 1 ≤ i ≤ m, either Ai ∈ H, (hypothesis)
or there is an inference rule Al1 … Alk / Ai in C with lj < i for all j ≤ k. (rule application)
We can also see a derivation as a derivation tree, where the Alj are the children of
the node Ai.
Example 11.0.9. In the propositional Hilbert calculus H0 we have the derivation
P ⊢H0 Q ⇒ P: the sequence is P ⇒ Q ⇒ P, P, Q ⇒ P, where the first formula is an instance of
the axiom K and the third is obtained from the first two by MP; the corresponding derivation tree
has Q ⇒ P at the root with children P ⇒ Q ⇒ P and P.
Inference rules are relations on formulae represented by formula schemata (where boldface, uppercase
letters are used as metavariables for formulae). For instance, in ?? the inference rule
A ⇒ B, A / B (MP) was applied in a situation where the metavariables A and B were instantiated by the
formulae P and Q ⇒ P.
As axioms do not have assumptions, they can be added to a derivation at any time. This is just
what we did with the axioms in ??.
Formal Systems
Let ⟨L, K, ⊨⟩ be a logical system and C a calculus for it, then ⊢C is a derivation relation
and thus ⟨L, K, ⊨, ⊢C⟩ a derivation system. Therefore we will sometimes also call ⟨L, C, K, ⊨⟩ a
formal system.
Observation 11.0.14. Derivable inference rules are admissible, but not the other
way around.
The notion of a formal system encapsulates the most general way we can conceptualize a logical
system with a calculus, i.e. a system in which we can do “formal reasoning”.
Chapter 12
Recall: Our knowledge of the cave entails a definite Wumpus position!(slide 316)
Problem: That was human reasoning, can we build an agent function that does
this?
Answer: As for constraint networks, we use inference, here resolution/tableaux.
Unsatisfiability Theorem
Theorem 12.1.1 (Unsatisfiability Theorem). H ⊨ A iff H ∪ {¬A} is unsatisfi-
able.
Idea: Turn the search around – using the unsatisfiability theorem (??).
Definition 12.1.5. For a given conjecture A and hypotheses H, a test calculus T
tries to derive a refutation H, ¬A ⊢T ⊥ instead of H ⊢ A, where ¬A is unsatisfiable iff
A is valid and ⊥ is an "obviously" unsatisfiable formula.
Observation: A test calculus T induces a search problem where the initial state is
H ∪ {¬A} and a state S is a goal state iff ⊥ ∈ S. (proximity of ⊥ is easier for heuristics)
Searching for ⊥ admits simple heuristics, e.g. size reduction. (⊥ is minimal)
The idea about literals is that they are atoms (the simplest formulae) that carry around their
intended truth value.
Normal Forms
There are two quintessential normal forms for propositional formulae: (there are
others as well)
Definition 12.1.12. A formula is in conjunctive normal form (CNF) if it is T or a
conjunction of disjunctions of literals, i.e. if it is of the form ⋀_{i=1}^{n} ⋁_{j=1}^{m_i} l_{i,j}.
Dually, a formula is in disjunctive normal form (DNF) if it is F or a disjunction of conjunctions of literals.
Observation 12.1.14. Every formula has equivalent formulae in CNF and DNF.
Algorithm: Fully expand all possible tableaux. (until no rule can be applied)
Satisfiable, iff there are open branches. (they correspond to models)
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis
on when a formula can be made true (or false). Therefore the formulae are decorated with upper
indices that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is
decorated with the intended truth value T). This tableau uses the same rules as the refutation
tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a
closed branch and an open one. The latter corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
Definition 12.2.2. The propositional tableau calculus T0 has two inference rules
per connective (one for each possible label)
T0∧: from (A ∧ B)^T, add A^T and B^T to the branch.
T0∨: from (A ∧ B)^F, split the branch into one extended with A^F and one extended with B^F.
T0¬T: from ¬A^T, add A^F.   T0¬F: from ¬A^F, add A^T.
T0⊥: from A^α and A^β with α ≠ β, add ⊥ (the branch is closed).
Definition 12.2.4. Call a tableau saturated, iff no rule adds new material and a
branch closed, iff it ends in ⊥, else open. A tableau is closed, iff all of its branches
are.
In analogy to the ⊥ at the end of closed branches, we sometimes decorate open
branches with a 2 symbol.
These inference rules act on tableaux and have to be read as follows: if the formulae over the line
appear in a tableau branch, then the branch can be extended by the formulae or branches below
the line. There are two rules for each primary connective, and a branch closing rule that adds the
special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 12.2.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth
value α; since all branches are closed (contain contradictions) this is impossible.
Definition 12.2.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
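To see how the T0 rules, the saturation convention, and the notion of a tableau proof interact, here is a minimal Python sketch of a validity test for the ∧/¬ fragment; the formula encoding and function names are assumptions made for this note, not part of the course materials.

# Minimal sketch of the T0 tableau procedure for the connectives ∧ and ¬.
# Formulae: a string is an atom, ("and", A, B) a conjunction, ("not", A) a negation.
# A branch is a list of labeled formulae (formula, label) with label True/False.

def branch_closed(branch):
    """T0⊥: a branch closes if it contains A^T and A^F for the same formula A."""
    return any((f, not v) in branch for (f, v) in branch)

def expand(branch):
    """Apply one rule to a formula that still contributes new material; None if saturated."""
    for (f, v) in branch:
        if isinstance(f, str):                          # atoms carry no rule
            continue
        if f[0] == "not":                               # T0¬T / T0¬F: flip the label
            new = (f[1], not v)
            if new not in branch:
                return [branch + [new]]
        if f[0] == "and" and v:                         # T0∧: both conjuncts get label T
            new = [(f[1], True), (f[2], True)]
            if not all(n in branch for n in new):
                return [branch + [n for n in new if n not in branch]]
        if f[0] == "and" and not v:                     # T0∨: split the branch
            if (f[1], False) not in branch and (f[2], False) not in branch:
                return [branch + [(f[1], False)], branch + [(f[2], False)]]
    return None                                         # saturated: no rule adds new material

def all_branches_close(branch):
    """Fully expand the tableau below this branch; True iff every branch closes."""
    if branch_closed(branch):
        return True
    children = expand(branch)
    if children is None:                                # saturated and open: a model exists
        return False
    return all(all_branches_close(child) for child in children)

def valid(A):
    """A is valid iff the tableau with root A^F closes (a tableau proof for A)."""
    return all_branches_close([(A, False)])

# ¬(A ∧ ¬A) is valid; A ∧ B is not (the open branch corresponds to a countermodel).
print(valid(("not", ("and", "A", ("not", "A")))))       # True
print(valid(("and", "A", "B")))                         # False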
We now look at a formulation of propositional logic with fancy variable names. Note that
loves(mary, bill) is just a variable name like P or X, which we have used earlier.
Example 12.2.8. "If Mary loves Bill and John loves Mary, then John loves Mary" is valid:
  (loves(mary, bill) ∧ loves(john, mary) ⇒ loves(john, mary))^F
  ¬(¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^F
  (¬¬(loves(mary, bill) ∧ loves(john, mary)) ∧ ¬loves(john, mary))^T
  ¬¬(loves(mary, bill) ∧ loves(john, mary))^T
  ¬(loves(mary, bill) ∧ loves(john, mary))^F
  (loves(mary, bill) ∧ loves(john, mary))^T
  ¬loves(john, mary)^T
  loves(mary, bill)^T
  loves(john, mary)^T
  loves(john, mary)^F
  ⊥
We could have used the unsatisfiability theorem (??) here to show that If Mary loves Bill and John
loves Mary entails John loves Mary. But there is a better way to show entailment: we directly
use derivability in T0 .
Deriving Entailment in T0
Example 12.2.9. Mary loves Bill and John loves Mary together entail that John
loves Mary
  loves(mary, bill)^T
  loves(john, mary)^T
  loves(john, mary)^F
  ⊥
This is a closed tableau, so {loves(mary, bill), loves(john, mary)}⊢T0 loves(john, mary).
Again, as T0 is sound and complete, we have {loves(mary, bill), loves(john, mary)} ⊨ loves(john, mary).
Note: We can also use the tableau calculus to try and show entailment (and fail). The nice thing
is that from the failed proof we can see what went wrong.
Obviously, the tableau above is saturated, but not closed, so it is not a tableau proof for our initial
entailment conjecture. We have marked the literals on the open branch in green, since they allow us
to read off the conditions of the situation in which the entailment fails to hold. As we intuitively
argued above, this is the situation where Mary loves Bill. In particular, the open branch gives us
a variable assignment (marked in green) that satisfies the initial formula. In this case, Mary loves
Bill, which is a situation where the entailment fails.
Again, the derivability version is much simpler:
We have seen in the examples above that while it is possible to get by with only the connectives
∧ and ¬, it is a bit unnatural and tedious, since we need to eliminate the other connectives first.
In this section, we will make the calculus less frugal by adding rules for the other connectives,
without losing the advantage of dealing with a small calculus, which is good for making statements
about the calculus itself.
The derived rules for the remaining connectives are:
T0⇒T: from (A ⇒ B)^T, split into A^F | B^T.   T0⇒F: from (A ⇒ B)^F, add A^T and B^F.
T0∨T: from (A ∨ B)^T, split into A^T | B^T.   T0∨F: from (A ∨ B)^F, add A^F and B^F.
T0⇔T: from (A ⇔ B)^T, split into A^T, B^T | A^F, B^F.
T0⇔F: from (A ⇔ B)^F, split into A^T, B^F | A^F, B^T.
The first of these is justified by the derivation (A ⇒ B)^T, i.e. (¬A ∨ B)^T, i.e. ¬(¬¬A ∧ ¬B)^T,
which yields (¬¬A ∧ ¬B)^F; this splits into ¬¬A^F (and hence ¬A^T and A^F) and ¬B^F (and hence B^T).
With these derived rules, theorem proving becomes quite efficient; with them, the tableau from
?? takes a considerably simpler form.
Soundness (Tableau)
Idea: A test calculus is refutation sound, iff its inference rules preserve satisfiability
and the goal formulae are unsatisfiable.
Definition 12.2.15. A labeled formula Aα is valid under φ, iff I φ (A) = α.
Proof: by contradiction.
1. Suppose Φ is falsifiable, i.e. not valid.
2. Then the initial tableau is satisfiable. (Φ^F is satisfiable)
3. So T is satisfiable, by ??.
4. Thus there is a satisfiable branch, (by definition)
5. but all branches are closed (T is closed) – a contradiction.
Theorem 12.2.19 (Completeness). T0 is complete, i.e. if Φ ⊆ wff0 (V0 ) is valid,
then there is a closed tableau T for ΦF .
Proof sketch: Proof difficult/interesting; see ??
Thus we only have to prove ??; this is relatively easy to do. For instance for the first rule: if we
have a tableau that contains (A ∧ B)^T and is satisfiable, then it must have a satisfiable branch.
If (A ∧ B)^T is not on this branch, the tableau extension will not change satisfiability, so we can
assume that it is on the satisfiable branch and thus Iφ(A ∧ B) = T for some variable assignment
φ. Thus Iφ(A) = T and Iφ(B) = T, so after the extension (which adds the formulae A^T and B^T
to the branch), the branch is still satisfiable. The cases for the other rules are similar.
The next result is a very important one, it shows that there is a procedure (the tableau procedure)
that will always terminate and answer the question whether a given propositional formula is valid
or not. This is very important, since other logics (like the often-studied first-order logic) do not
enjoy this property.
Note: The proof above only works for the "base T0" because (only) there the rules do not "copy". A
rule like T0⇔T (from (A ⇔ B)^T, split into A^T, B^T | A^F, B^F) does, and in particular the number
of non-worked-off variables below the line is larger than above the line. For such rules, a more
intricate version of µ which – instead of returning a natural number – returns a more complex
object, e.g. a multiset of numbers, would work here. In our proof we are just assuming that the
defined connectives have already been eliminated.
The tableau calculus basically computes the disjunctive normal form: every branch is a disjunct that
is a conjunction of literals. The method relies on the fact that a DNF is unsatisfiable, iff each of its
disjuncts is, i.e. iff each branch contains a contradiction in form of a pair of opposite literals.
We write 2 for the "empty" disjunction (no disjuncts) and call it the empty clause. A clause
with exactly one literal is called a unit clause.
P^T ∨ A    P^F ∨ B
—————————————————— R
      A ∨ B

This rule allows us to add the resolvent (the clause below the line) to a clause set which
contains the two clauses above it. The literals P^T and P^F are called cut literals.
Definition 12.3.3 (Resolution Refutation). Let S be a clause set, then we call
an R0-derivation of 2 from S an R0-refutation and write D: S ⊢R0 2.
Definition 12.3.6. We write CNF0(A^α) for the set of all clauses derivable from
A^α via the rules above.
Note that the C-terms in the definition of the inference rules are necessary, since we assumed that
the assumptions of the inference rule must match full clauses. The C-terms are used with the
convention that they are optional, so that we can also simplify (A ∨ B)^T to A^T ∨ B^T.
Background: The background behind this notation is that A and T ∨ A are equivalent for any
A. That allows us to interpret the C-terms in the assumptions as T and thus leave them out.
The clause normal form translation as we have formulated it here is quite frugal; we have left
out rules for the connectives ∨, ⇒, and ⇔, relying on the fact that formulae containing these
connectives can be translated into ones without before CNF transformation. The advantage of
having a calculus with few inference rules is that we can prove meta properties like soundness and
completeness with less effort (these proofs usually require one case per inference rule). On the
other hand, adding specialized inference rules makes proofs shorter and more readable.
Fortunately, there is a way to have your cake and eat it. Derivable inference rules have the property
that they are formally redundant, since they do not change the expressive power of the calculus.
Therefore we can leave them out when proving meta-properties, but include them when actually
using the calculus.
Example 12.3.8. The derivation
  C ∨ (A ⇒ B)^T  ⟿  C ∨ (¬A ∨ B)^T  ⟿  C ∨ ¬A^T ∨ B^T  ⟿  C ∨ A^F ∨ B^T
justifies the derived rule
  C ∨ (A ⇒ B)^T
  ———————————————
  C ∨ A^F ∨ B^T
With these derivable rules, theorem proving becomes quite efficient. To get a better understanding
of the calculus, we look at an example: we prove an axiom of the Hilbert Calculus we have studied
above.
Result: {P^F ∨ Q^F ∨ R^T, P^F ∨ Q^T, P^T, R^F}
Example 12.3.10. Resolution Proof
1  P^F ∨ Q^F ∨ R^T   initial
2  P^F ∨ Q^T         initial
3  P^T               initial
4  R^F               initial
5  P^F ∨ Q^F         resolve 1.3 with 4.1
6  Q^F               resolve 5.1 with 3.1
7  P^F               resolve 2.2 with 6.1
8  2                 resolve 7.1 with 3.1
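As a usage illustration, here is a minimal Python sketch of a naive saturation-based R0 prover applied to the clause set above; the encoding of literals and clauses is an assumption made for this note.

# Minimal sketch: naive propositional resolution by saturation.
# A literal is a pair (name, label) with label True (P^T) or False (P^F);
# a clause is a frozenset of literals; 2 (the empty clause) is frozenset().

from itertools import combinations

def resolvents(c1, c2):
    """All clauses obtainable from c1 and c2 by one application of the rule R."""
    out = []
    for (name, label) in c1:
        if (name, not label) in c2:                    # cut literals P^T / P^F
            out.append((c1 - {(name, label)}) | (c2 - {(name, not label)}))
    return out

def refutable(clauses):
    """Saturate the clause set; True iff the empty clause 2 is derivable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if r == frozenset():
                    return True
                if r not in clauses:
                    new.add(r)
        if not new:                                    # saturated without deriving 2
            return False
        clauses |= new

# The clause set from the resolution proof above:
S = [frozenset({("P", False), ("Q", False), ("R", True)}),
     frozenset({("P", False), ("Q", True)}),
     frozenset({("P", True)}),
     frozenset({("R", False)})]
print(refutable(S))   # True: the empty clause is derivable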
Before we come to the general mechanism, we will go into how we would convince ourselves that
the Wumpus is in [1, 3].
Idea: We formalize the knowledge about the Wumpus world in PL0 and use a test
calculus to check for entailment.
Simplification: We worry only about the Wumpus and stench:
S_i,j =̂ stench in [i, j], W_i,j =̂ Wumpus in [i, j].
Propositions whose value we know: ¬S_1,1, ¬W_1,1, ¬S_2,1, ¬W_2,1, S_1,2, ¬W_1,2.
Knowledge about the Wumpus and smell:
From Cell adjacent to Wumpus: Stench (else: None), we get
The first step is to compute the clause normal form of the relevant knowledge.
Given this clause normal form, we only need to generate the empty clause via repeated applications
of the resolution rule.
We’ve been to (1, 1), and there’s no Wumpus there, so it can’t be (1, 1).
Parents: W 1,1 F and W 2,2 T ∨ W 1,1 T .
Resolvent: W 2,2 T .
There is no stench in (2, 1) so it can’t be (2, 2) either, in contradiction.
Parents: S 2,1 F and S 2,1 T ∨ W 2,2 F .
Resolvent: W 2,2 F .
Parents: W 2,2 F and W 2,2 T .
Resolvent: 2.
Now that we have seen how we can use propositional inference to derive consequences of the
percepts and world knowledge, let us come back to the question of a general mechanism for agent
functions with propositional inference.
Admittedly, the search framework from ?? does not quite cover the agent function we have here,
since that assumes that the world is fully observable, which the Wumpus world is emphatically not.
But it already gives us a good impression of what would be needed for the “general mechanism”.
12.4 Conclusion
Summary
Every propositional formula can be brought into conjunctive normal form (CNF),
which can be identified with a set of clauses.
The tableau and resolution calculi are deduction procedures based on trying to
derive a contradiction from the negated theorem (a closed tableau or the empty
clause). They are refutation complete, and can be used to prove KB ⊨ A by
showing that KB ∪ {¬A} is unsatisfiable.
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
AI-1, but provide one for the calculi introduced so far in ??.
Chapter 13
13.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25019.
Definition 13.1.2. Tools addressing SAT are commonly referred to as SAT solvers.
Upshot: Anything we can do with CSP, we can (in principle) do with SAT.
2 bits x_1 and x_0; c = 2·x_1 + x_0. (FF =̂ Flip-Flop, D =̂ Data IN, CLK =̂ Clock)
To Verify: If c < 3 in the current clock cycle, then c < 3 in the next clock cycle.
Clauses: y_1^F ∨ x_0^T, y_1^T ∨ x_0^F, y_0^T ∨ x_1^T ∨ x_0^T, y_0^F ∨ x_1^F, y_0^F ∨ x_0^F, x_1^F ∨ x_0^F, y_1^T, y_0^T.
Step 3: Call a SAT solver (up next).
Why Did Unit Propagation Yield a Conflict?: How can we analyze which
mistakes were made in “dead” search branches?
Knowledge is power, see next.
Clause Learning: How can we learn from our mistakes?
One of the key concepts, perhaps the key concept, underlying the success of
SAT.
Phase Transitions – Where the Really Hard Problems Are: Are all formulas
“hard” to solve?
The answer is “no”. And in some cases we can figure out exactly when they
are/aren’t hard to solve.
function DPLL(∆, I) returns a satisfying assignment or ‘‘unsatisfiable’’
   ∆′ := a copy of ∆; I′ := I
   /∗ Unit Propagation (UP) Rule: ∗/
   while ∆′ contains a unit clause {l} do
      extend I′ with the respective truth value for the proposition underlying l
      simplify ∆′ /∗ remove false literals ∗/
   if 2 ∈ ∆′ then return ‘‘unsatisfiable’’
   if ∆′ = {} then return I′
   /∗ Splitting Rule: ∗/
   select some proposition P for which I′ is not defined
   I′′ := I′ extended with one truth value for P; ∆′′ := a copy of ∆′; simplify ∆′′
   if I′′′ := DPLL(∆′′, I′′) ̸= ‘‘unsatisfiable’’ then return I′′′
   I′′ := I′ extended with the other truth value for P; ∆′′ := ∆′; simplify ∆′′
   return DPLL(∆′′, I′′)
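For a concrete (if unoptimized) rendering of the procedure, here is a minimal Python sketch of DPLL with unit propagation and splitting; the clause encoding and helper names are assumptions made for this note, and the clause set of the following example is used as the test input.

# Minimal sketch of DPLL: unit propagation (UP) + splitting on clause sets.
# A literal is (name, value); a clause is a frozenset of literals.

def simplify(clauses, name, value):
    """Assign name := value: drop satisfied clauses, delete false literals."""
    result = []
    for c in clauses:
        if (name, value) in c:                     # clause is satisfied
            continue
        result.append(frozenset(l for l in c if l[0] != name))
    return result

def dpll(clauses, assignment):
    """Return a satisfying assignment (dict) or None (= "unsatisfiable")."""
    clauses, assignment = list(clauses), dict(assignment)
    # UP rule: while there is a unit clause {l}, make l true and simplify.
    unit = next((c for c in clauses if len(c) == 1), None)
    while unit is not None:
        name, value = next(iter(unit))
        assignment[name] = value
        clauses = simplify(clauses, name, value)
        unit = next((c for c in clauses if len(c) == 1), None)
    if frozenset() in clauses:                     # empty clause: conflict
        return None
    if not clauses:                                # empty clause set: satisfied
        return assignment
    # Splitting rule: pick some unassigned proposition and try both truth values.
    name = next(iter(next(iter(clauses))))[0]
    for value in (True, False):
        found = dpll(simplify(clauses, name, value), {**assignment, name: value})
        if found is not None:
            return found
    return None

# The clause set of the following example: P^T ∨ Q^T ∨ R^F; P^F ∨ Q^F; R^T; P^T ∨ Q^F
delta = [frozenset({("P", True), ("Q", True), ("R", False)}),
         frozenset({("P", False), ("Q", False)}),
         frozenset({("R", True)}),
         frozenset({("P", True), ("Q", False)})]
print(dpll(delta, {}))    # e.g. {'R': True, 'P': True, 'Q': False}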
Example 13.2.2 (UP and Splitting). Let ∆ := P^T ∨ Q^T ∨ R^F; P^F ∨ Q^F; R^T; P^T ∨ Q^F.
1. UP Rule: R ↦ T, leaving P^T ∨ Q^T; P^F ∨ Q^F; P^T ∨ Q^F.
2. Splitting Rule:
2a. P ↦ F, leaving Q^T; Q^F.          2b. P ↦ T, leaving Q^F.
3a. UP Rule: Q ↦ T, deriving 2,        3b. UP Rule: Q ↦ F, the clause set becomes empty,
    returning "unsatisfiable".             returning "R ↦ T, P ↦ T, Q ↦ F".

2. UP Rule: Q ↦ T
   P^F; P^T ∨ R^F; R^T
3. UP Rule: R ↦ T
   P^F; P^T
4. UP Rule: P ↦ T
   2
[DPLL search tree over the propositions P, X_1, …, X_n, Q, R: every branch ends in the empty clause 2.]
Properties of DPLL
Unsatisfiable case: What can we say if “unsatisfiable” is returned?
In this case, we know that ∆ is unsatisfiable: Unit propagation is sound, in the
sense that it does not reduce the set of solutions.
DPLL =̂ backtracking with inference, where inference =̂ unit propagation.
Unit propagation is sound: It does not reduce the set of solutions.
Running time is exponential in the worst case; good variable/value selection strategies are required.
UP =̂ Unit Resolution
Observation: The unit propagation (UP) rule corresponds to a calculus:
while ∆′ contains a unit clause {l} do
extend I ′ with the respective truth value for the proposition underlying l
simplify ∆′ /∗ remove false literals ∗/
Definition 13.3.1 (Unit Resolution). Unit resolution (UR) is the test calculus
consisting of the following inference rule:

C ∨ P^α    P^β    (α ≠ β)
————————————————————————— UR
          C
Unit propagation =̂ resolution restricted to cases where one parent is a unit clause.
Observation 13.3.2 (Soundness). UR is refutation sound. (since resolution is)
Observation 13.3.3 (Completeness). UR is not refutation complete (alone).
DPLL (without UP; leaves annotated with the clauses that became empty) and the resolution proof
read off from that DPLL tree. [search tree and resolution derivation not reproduced]
branching over it was completely unnecessary). If so, however, we can simply remove N k and
all its descendants from the tree as well. We attach C(N ) at the L(k−1) branch of N (k−1) |,
in the role of C(N (k−1) , L(k−1) ). If L(k−1) ∈ C(N ) then we have (a) for N ′ := N (k−1) and
can stop. If L(k−1) F ̸∈ C(N ), then we remove N (k−1) and so forth, until either we stop
with (a), or have removed N 1 and thus must already have derived the empty clause (because
C(N ) ⊆ {L1 , . . . , Lk }\{L1 , . . . , Lk }).
8. Unit propagation can be simulated via applications of the splitting rule, choosing a proposi-
tion that is constrained by a unit clause: One of the two truth values then immediately yields
an empty clause.
Definition 13.3.9. In a tree resolution, each derived clause C is used only once
(at its parent).
Problem: The same C must be derived anew every time it is used!
This is a fundamental weakness: There are inputs ∆ whose shortest tree reso-
lution proof is exponentially longer than their shortest (general) resolution proof.
Intuitively: DPLL makes the same mistakes over and over again.
Idea: DPLL should learn from its mistakes on one search branch, and apply the
learned knowledge to other branches.
Excursion: Practical SAT solvers use a technique called CDCL that analyzes failure and learns
from that in terms of inferred clauses. Unfortunately, we cannot cover this in AI-1; see ??.
13.4 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/25090.
Summary
SAT solvers decide satisfiability of CNF formulas. This can be used for deduction,
and is highly successful as a general problem solving technique (e.g., in verification).
DPLL = b backtracking with inference performed by unit propagation (UP), which
iteratively instantiates unit clauses and simplifies the formula.
DPLL proofs of unsatisfiability correspond to a restricted form of resolution. The
restriction forces DPLL to "make the same mistakes over and over again".
Implication graphs capture how UP derives conflicts. Their analysis enables us to
do clause learning. DPLL with clause learning is called CDCL. It corresponds to full
resolution, not tree resolution.
Local search is not as successful in SAT applications, and the underlying ideas are
very similar to those presented in ??. (not covered here)
Proof complexity: Can one resolution special case X simulate another one Y
polynomially? Or is there an exponential separation (example families where X is
exponentially less efficient than Y )?
Suggested Reading:
Let’s
Let’s Talk
Talk About
AboutBlocks,
Blocks,Baby
Baby. .... .
Question: What do you see here?
I Question: What do you see here?
A D B E C
I You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
I And now: Say it in propositional logic!
And now: Say it in propositional logic!
Answer: “isRedA”,“isRedB”, . . . , “onTableA”, “onTableB”, . . . , “isBlockA”, . . .
Wait a sec!: Why don’t we just say, e.g., “AllBlocksAreRed” and “isBlockA”?
Problem: Could we conclude that A is red? (No)
These statements are atomic (just strings); their inner structure (“all blocks”, “is a
block”) is not captured.
Idea: Predicate Logic (PL1 ) extends propositional logic with the ability to explicitly
speak about objects and their properties.
How?: Variables ranging over objects, predicates describing object properties, . . .
Example 14.1.1. “∀x.block(x) ⇒ red(x)”; “block(A)”
Note: Even when we can describe the problem suitably, for the desired reasoning,
the propositional formulation typically is way too large to write (by hand).
PL1 solution: “∀x.Wumpus(x) ⇒ (∀y.adj(x, y) ⇒ stench(y))”
Example 14.1.3.
There is no surjective function from the natural numbers onto the reals.
First-Order Predicate Logic has many good properties (complete calculi,
compactness, unitary, linear unification,. . . )
But too weak for formalizing: (at least directly)
We make the deliberate, but non-standard design choice here to include Skolem constants into
the signature from the start. These are used in inference systems to give names to objects and
construct witnesses. Other than the fact that they are usually introduced by need, they work
exactly like regular constants, which makes the inclusion rather painless. As we can never predict
how many Skolem constants we are going to need, we give ourselves countably infinitely many for
every arity. Our supply of individual variables is countably infinite for the same reason.
The formulae of first-order logic are built up from the signature and variables as terms (to represent
individuals) and proposition (to represent proposition). The latter include the connectives from
PL0 , but also quantifiers.
Note: We only need e.g. conjunction, negation, and the universal quantifier; all other
logical constants can be defined from them (as we will see when we have fixed their interpretations).
Notation: here we write ∀x.A and ∃x.A; elsewhere one also finds ⋀x.A or (x)A for the universal
and ⋁x.A for the existential quantifier.
The introduction of quantifiers to first-order logic brings a new phenomenon: variables that are
in the scope of a quantifier behave very differently from the ones that are not. Therefore
we build up a vocabulary that distinguishes the two.
free(X) := {X}
free(f(A1, …, An)) := ⋃_{1≤i≤n} free(Ai)
free(p(A1, …, An)) := ⋃_{1≤i≤n} free(Ai)
free(¬A) := free(A)
free(A ∧ B) := free(A) ∪ free(B)
free(∀X.A) := free(A)\{X}
We will be mainly interested in (sets of) sentences – i.e. closed propositions – as the representations
of meaningful statements about individuals. Indeed, we will see below that free variables do
not give us additional expressivity, since they behave like constants and could be replaced by them in all
situations, except the recursive definition of quantified formulae. Indeed in all situations where
variables occur freely, they have the character of metavariables, i.e. syntactic placeholders that
can be instantiated with terms when needed in a calculus.
The semantics of first-order logic is a Tarski-style set-theoretic semantics where the atomic syn-
tactic entities are interpreted by mapping them into a well-understood structure, a first-order
universe that is just an arbitrary set.
Definition 14.2.13. We inherit the domain D0 = {T, F} of truth values from PL0
and assume an arbitrary domain Dι ̸= ∅ of individuals. (this choice is a parameter
to the semantics)
Definition 14.2.14. An interpretation I assigns values to constants, e.g.
We do not have to make the domain of truth values part of the model, since it is always the same;
we determine the model by choosing a domain and an interpretation function.
Given a first-order model, we can define the evaluation function as a homomorphism over the
construction of formulae.
The only new (and interesting) case in this definition is the quantifier case, there we define the
value of a quantified formula by the value of its scope – but with an extension of the incoming
variable assignment. Note that by passing to the scope A of ∀x.A, the occurrences of the variable
x in A that were bound in ∀x.A become free and are amenable to evaluation by the variable
assignment ψ := φ,[a/x]. Note that as an extension of φ, the assignment ψ supplies exactly the
right value for x in A. This variability of the variable assignment in the definition of the value
function justifies the somewhat complex setup of first-order evaluation, where we have the (static)
interpretation function for the symbols from the signature and the (dynamic) variable assignment
for the variables.
Note furthermore, that the value Iφ(∃x.A) of ∃x.A, which we have defined to be ¬(∀x.¬A), is
true, iff it is not the case that Iψ(¬A) = T for all a ∈ Dι and ψ := φ,[a/x]. This is
the case, iff Iψ(A) = T for some a ∈ Dι. So our definition of the existential quantifier yields the
appropriate semantics.
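Since the quantifier case is the only new one, here is a minimal Python sketch of first-order evaluation over a small finite universe (the actual definition allows arbitrary, possibly infinite Dι); the universe, interpretation, and formulae are made-up assumptions for this note.

# Minimal sketch: evaluating PL1 formulae over a finite universe.
# Terms: strings are variables, tuples ("f", t1, ..., tn) applications (0-ary = constants).
# Formulae: ("p", t1, ..., tn) atoms, ("not", A), ("and", A, B), ("forall", "X", A).

D = {0, 1, 2}                                          # a finite universe of individuals
I_fun = {"zero": lambda: 0, "succ": lambda n: min(n + 1, 2)}
I_pred = {"leq": {(a, b) for a in D for b in D if a <= b}}

def eval_term(t, phi):
    if isinstance(t, str):
        return phi[t]                                  # variables get their value from phi
    name, *args = t
    return I_fun[name](*[eval_term(a, phi) for a in args])

def eval_formula(A, phi):
    op, *args = A
    if op == "not":
        return not eval_formula(args[0], phi)
    if op == "and":
        return eval_formula(args[0], phi) and eval_formula(args[1], phi)
    if op == "forall":
        X, body = args
        # the quantifier case: evaluate the scope under phi,[a/X] for every a in D
        return all(eval_formula(body, {**phi, X: a}) for a in D)
    return tuple(eval_term(t, phi) for t in args) in I_pred[op]

# ∀X. leq(zero, X) is true in this model; ∀X. leq(succ(X), X) is not.
print(eval_formula(("forall", "X", ("leq", ("zero",), "X")), {}))        # True
print(eval_formula(("forall", "X", ("leq", ("succ", "X"), "X")), {}))    # False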
Substitutions on Terms
Intuition: If B is a term and X is a variable, then we denote the result of
systematically replacing all occurrences of X in a term A by B with [B/X](A).
Problem: What about [Z/Y ], [Y /X](X), is that Y or Z?
The extension of a substitution is an important operation, which you will run into from time
to time. Given a substitution σ, a variable x, and an expression A, σ,[A/x] extends σ with a
new value for x. The intuition is that the values right of the comma overwrite the pairs in the
substitution on the left, which already has a value for x, even though the representation of σ may
not show it.
Substitution Extension
Definition 14.2.23 (Substitution Extension). Let σ be a substitution, then we
denote the extension of σ with [A/X] by σ,[A/X] and define it as {(Y ,B) ∈
σ | Y ̸= X} ∪ {(X,A)}: σ,[A/X] coincides with σ off X, and gives the result A
there.
Note: If σ is a substitution, then σ,[A/X] is also a substitution.
We also need the dual operation: removing a variable from the support:
Note that the use of the comma notation for substitutions defined in ?? is consistent with sub-
stitution extension. We can view a substitution [a/x], [f (b)/y] as the extension of the empty
substitution (the identity function on variables) by [f (b)/y] and then by [a/x]. Note furthermore,
that substitution extension is not commutative in general.
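The following minimal Python sketch illustrates substitution application on terms and substitution extension, including the answer to the [Z/Y], [Y/X](X) question and the overwriting behaviour of extension; the term encoding is an assumption made for this note.

# Minimal sketch: substitutions on first-order terms and substitution extension.
# Terms: uppercase strings are variables, lowercase strings are constants,
# tuples ("f", t1, ..., tn) are function applications; a substitution is a dict.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def subst(sigma, t):
    """Apply sigma to t; all variables are replaced simultaneously."""
    if is_var(t):
        return sigma.get(t, t)                 # unbound variables stay fixed
    if isinstance(t, str):                     # constants stay fixed
        return t
    name, *args = t
    return (name, *[subst(sigma, a) for a in args])

def extend(sigma, var, term):
    """sigma,[term/var]: any previous pair for var is overwritten."""
    return {**sigma, var: term}

# [Z/Y],[Y/X](X) = Y: the whole substitution is applied in one simultaneous step,
# so X goes to Y, and that Y is *not* rewritten further to Z.
sigma = extend({"Y": "Z"}, "X", "Y")
print(subst(sigma, "X"))                       # 'Y'
print(subst(sigma, ("f", "X", "a")))           # ('f', 'Y', 'a')
# Extension is not commutative: later extensions overwrite earlier pairs.
print(extend(extend({}, "X", "a"), "X", ("f", "b")))   # {'X': ('f', 'b')}
print(extend(extend({}, "X", ("f", "b")), "X", "a"))   # {'X': 'a'}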
For first-order substitutions we need to extend the substitutions defined on terms to act on propo-
sitions. This is technically more involved, since we have to take care of bound variables.
Substitutions on Propositions
Problem: We want to extend substitutions to propositions, in particular to quan-
tified formulae: What is σ(∀X.A)?
Otherwise a free variable could be captured, i.e. become bound after instantiation, whereas it was
free before. Solution: Rename away the bound variable X in ∀X.p(X, Y) before applying the substitution.
Definition 14.2.26 (Capture-Avoiding Substitution Application). Let σ be a
substitution, A a formula, and A′ an alphabetic variant of A, such that intro(σ) ∩
BVar(A) = ∅. Then we define capture-avoiding substitution application via
σ(A) := σ(A′ ).
We now introduce a central tool for reasoning about the semantics of substitutions: the “sub-
stitution value Lemma”, which relates the process of instantiation to (semantic) evaluation. This
result will be the motor of all soundness proofs on axioms and inference rules acting on variables
via substitutions. In fact, any logic with variables and substitutions will have (to have) some form
of a substitution value Lemma to get the meta-theory going, so it is usually the first target in any
development of such a logic. We establish the substitution-value Lemma for first-order logic in
two steps, first on terms, where it is very simple, and then on propositions.
by induction hypothesis
2.2. This completes the induction step, and we have proven the assertion.
1. n = 0
1.1. then A is an atomic proposition, and we can argue like in the induction
step of the substitution value lemma for terms.
2. n > 0 and A = ¬B or A = C ◦ D
2.1. Here we argue like in the induction step of the term lemma as well.
3. n > 0 and A = ∀Y .C where (WLOG) X ̸= Y (otherwise rename)
3.1. then I ψ (A) = I ψ (∀Y .C) = T, iff I ψ,[a/Y ] (C) = T for all a ∈ Dι .
3.2. But I ψ,[a/Y ] (C) = I φ,[a/Y ] ([B/X](C)) = T, by induction hypothesis.
3.3. So I ψ (A) = I φ (∀Y .[B/X](C)) = I φ ([B/X](∀Y .C)) = I φ ([B/X](A))
To understand the proof fully, you should think about where the WLOG – it stands for "without
loss of generality" – comes from.
ND1∀I*: from A, infer ∀X.A, provided that (*) A does not depend on any hypothesis in which X is free.
ND1∀E: from ∀X.A, infer [B/X](A).
ND1∃I: from [B/X](A), infer ∃X.A.
ND1∃E¹: from ∃X.A and a derivation of C from the local hypothesis [[c/X](A)]¹, where c ∈ Σsk_0 is
new, infer C, discharging the hypothesis.
The intuition behind the rule ND1 ∀I is that a formula A with a (free) variable X can be generalized
to ∀X.A, if X stands for an arbitrary object, i.e. there are no restricting assumptions about X.
The ND1 ∀E rule is just a substitution rule that allows to instantiate arbitrary terms B for X
in A. The ND1 ∃I rule says if we have a witness B for X in A (i.e. a concrete term B that
makes A true), then we can existentially close A. The ND1 ∃E rule corresponds to the common
mathematical practice, where we give objects we know exist a new name c and continue the proof
by reasoning about this concrete object c. Anything we can prove from the assumption [c/X](A)
we can prove outright if ∃X.A is known.
Now we reformulate the classical formulation of the calculus of natural deduction as a
sequent calculus by lifting it to the “judgments level” as we did for propositional logic. We only
need provide new quantifier rules.
ND⊢1∃I: from Γ ⊢ [B/X](A), infer Γ ⊢ ∃X.A.
ND⊢1∃E: from Γ ⊢ ∃X.A and Γ, [c/X](A) ⊢ C, where c ∈ Σsk_0 is new, infer Γ ⊢ C.
=I: infer A = A (no premises).
=E: from A = B and C[A]_p, infer [B/p]C,
where C[A]_p means that the formula C has a subterm A at position p and [B/p]C is the
result of replacing that subterm with B.
In many ways equivalence behaves like equality; we will use the following rules in ND1.
Definition 14.3.5. ⇔I is derivable and ⇔E is admissible in ND1:
⇔I: infer A ⇔ A (no premises).
⇔E: from A ⇔ B and C[A]_p, infer [B/p]C.
Again, we have two rules that follow the introduction/elimination pattern of natural deduction calculi.
Definition 14.3.6. We have the canonical sequent rules that correspond to them: =I, =E, ⇔I,
and ⇔E
To make sure that we understand the constructions here, let us get back to the “replacement at
position” operation used in the equality rules.
Positions in Formulae
Idea: Formulae are (naturally) trees, so we can use tree positions to talk about
subformulae
Definition 14.3.7. A position p is a tuple of natural numbers that in each node
of an expression (tree) specifies into which child to descend. For an expression A
we denote the subexpression at p with A|p .
We will sometimes write an expression C as C[A]_p to indicate that C has the subexpression
A at position p.
If C[A]_p and A is atomic, then we speak of an occurrence of A in C.
Definition 14.3.8. Let p be a position, then [A/p]C is the expression obtained
from C by replacing the subexpression at p by A.
(Schematically: in C, the subexpression at position p is A = C|_p; in [B/p]C, the subexpression at
position p is B.)
The operation of replacing a subformula at position p is quite different from e.g. (first-order)
substitutions:
• We are replacing subformulae with subformulae instead of instantiating variables with terms.
• Substitutions replace all occurrences of a variable in a formula, whereas formula replacement
only affects the (one) subformula at position p.
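To make positions concrete, here is a minimal Python sketch of subexpression access and replacement on expression trees; the tuple encoding and the example expression are assumptions made for this note.

# Minimal sketch: positions as tuples of child indices in expression trees.
# An expression is either a string (an atom or variable) or a tuple
# (operator, child_0, ..., child_{n-1}).

def subexpr_at(C, p):
    """C|_p: the subexpression of C at position p."""
    for i in p:
        C = C[i + 1]                              # child i sits right after the operator
    return C

def replace_at(C, p, A):
    """[A/p]C: replace the subexpression of C at position p by A."""
    if not p:
        return A
    i, rest = p[0], p[1:]
    return C[:i + 1] + (replace_at(C[i + 1], rest, A),) + C[i + 2:]

# C = (X = Y) ∧ p(X): the subexpression at position (1, 0) is the second occurrence of X,
# and replacing it by Z affects only that one occurrence (unlike a substitution).
C = ("and", ("=", "X", "Y"), ("p", "X"))
print(subexpr_at(C, (1, 0)))                      # 'X'
print(replace_at(C, (1, 0), "Z"))                 # ('and', ('=', 'X', 'Y'), ('p', 'Z'))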
We conclude this section with an extended example: the proof of a classical mathematical result
in the natural deduction calculus with equality. This shows us that we can derive strong properties
about complex situations (here the real numbers; an uncountably infinite set of numbers).
ND1= Example: √2 is Irrational
3. So we know 2q² = p².
4. But 2q² has an odd number of prime factors while p² has an even number.
5. This is a contradiction (since they are equal), so we have proven the assertion.
If we want to formalize this into ND1, we have to write down all the assertions in the proof steps
in PL1 syntax and come up with justifications for them in terms of ND1 inference rules. The next
two slides show such a proof, where we write prime(n) to say that n is prime, use #(n) for the number
of prime factors of a number n, and write irr(r) if r is irrational.
ND1= Example: √2 is Irrational (the Proof)
Lines 6 and 9 are local hypotheses for the proof (they only have an implicit counterpart in the
inference rules as defined above). Finally we have abbreviated the arithmetic simplification of line
9 with the justification “arith” to avoid having to formalize elementary arithmetic.
ND1= Example: √2 is Irrational (the Proof continued)
13        prime(2)                lemma
14   6,9  #(2q²) = #(q²) + 1      ND0⇒E(13, 12)
15   6,9  #(q²) = 2#(q)           ND1∀E²(2)
16   6,9  #(2q²) = 2#(q) + 1      =E(14, 15)
17        #(p²) = #(p²)           =I
18   6,9  #(2q²) = #(p²)          =E(17, 10)
19   6,9  2#(q) + 1 = #(p²)       =E(18, 16)
20   6,9  2#(q) + 1 = 2#(p)       =E(19, 11)
21   6,9  ¬(2#(q) + 1 = 2#(p))    ND1∀E²(1)
22   6,9  F                       ND0FI(20, 21)
23   6    F                       ND1∃E⁶(22)
24        ¬¬irr(√2)               ND0¬I⁶(23)
25        irr(√2)                 ND0¬E(24)
We observe that the ND1 proof is much more detailed, and needs quite a few Lemmata about
# to go through. Furthermore, we have added a definition of irrationality (and treat definitional
equality via the equality rules). Apart from these artefacts of formalization, the two representations
of proofs correspond to each other very directly.
14.4 Conclusion
Summary (Predicate Logic)
First-order logic allows us to explicitly speak about objects and their properties. It is
thus a more natural and compact representation language than propositional logic;
it also enables us to speak about infinite sets of objects.
Logic has thousands of years of history. A major current application in AI is semantic
technology. (up soon)
First-order logic (PL1) allows universal and existential quantification over
individuals.
A PL1 model consists of a universe Dι and a function I mapping individual con-
stants/predicate constants/function constants to elements/relations/functions on
Dι .
First-order natural deduction is a sound and complete calculus for PL1 intended
and optimized for human understanding.
Recap: We can express mathematical theorems in PL1 and prove them in ND1 .
Problem: These proofs can be huge (giga-steps), how can we trust them?
Idea: If the logic can express (safety)-properties of programs, we can use proof
checkers for formal program verification. (there are extensions of PL1 that can)
Problem: These proofs can be humongous, how can humans write them?
Idea: Automate proof construction via
Suggested Reading:
• Chapter 8: First-Order Logic, Sections 8.1 and 8.2 in [RN09]
– A less formal account of what I cover in “Syntax” and “Semantics”. Contains different exam-
ples, and complementary explanations. Nice as additional background reading.
• Sections 8.3 and 8.4 provide additional material on using PL1, and on modeling in PL1, that I
don’t cover in this lecture. Nice reading, not required for exam.
• Chapter 9: Inference in First-Order Logic, Section 9.5.1 in [RN09]
– A very brief (2 pages) description of what I cover in “Normal Forms”. Much less formal; I
couldn’t find where (if at all) RN cover transformation into prenex normal form. Can serve
as additional reading, can’t replace the lecture.
• Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this
in AI-1, but provide one for the calculi introduced so far in ??.
Chapter 15
In this chapter, we take up the machine-oriented calculi for propositional logic from ?? and extend
them to the first-order case. While this has been relatively easy for the natural deduction calculus
– we only had to introduce the notion of substitutions for the elimination rule for the universal
quantifier – we have to work much harder here to make the calculi effective for implementation.
Tableau calculi develop a formula in a tree-shaped arrangement that represents a case analysis
on when a formula can be made true (or false). Therefore the formulae are decorated with upper
indices that hold the intended truth value.
On the left we have a refutation tableau that analyzes a negated formula (it is decorated with the
intended truth value F). Both branches contain an elementary contradiction ⊥.
On the right we have a model generation tableau, which analyzes a positive formula (it is
decorated with the intended truth value T). This tableau uses the same rules as the refutation
tableau, but makes a case analysis of when this formula can be satisfied. In this case we have a
closed branch and an open one. The latter corresponds to a model.
Now that we have seen the examples, we can write down the tableau rules formally.
T0∧: from (A ∧ B)^T, add A^T and B^T to the branch.
T0∨: from (A ∧ B)^F, split the branch into one extended with A^F and one extended with B^F.
T0¬T: from ¬A^T, add A^F.   T0¬F: from ¬A^F, add A^T.
T0⊥: from A^α and A^β with α ≠ β, add ⊥ (the branch is closed).
These inference rules act on tableaux and have to be read as follows: if the formulae over the line
appear in a tableau branch, then the branch can be extended by the formulae or branches below
the line. There are two rules for each primary connective, and a branch closing rule that adds the
special symbol ⊥ (for unsatisfiability) to a branch.
We use the tableau rules with the convention that they are only applied, if they contribute new
material to the branch. This ensures termination of the tableau procedure for propositional logic
(every rule eliminates one primary connective).
Definition 15.1.5. We will call a closed tableau with the labeled formula Aα at the root a
tableau refutation for Aα .
The saturated tableau represents a full case analysis of what is necessary to give A the truth
value α; since all branches are closed (contain contradictions) this is impossible.
Definition 15.1.7. We will call a tableau refutation for AF a tableau proof for A, since it refutes
the possibility of finding a model where A evaluates to F. Thus A must evaluate to T in all
models, which is just our definition of validity.
Thus the tableau procedure can be used as a calculus for propositional logic. In contrast to the
propositional Hilbert calculus it does not prove a theorem A by deriving it from a set of axioms,
but it proves it by refuting its negation. Such calculi are called negative or test calculi. Generally
negative calculi have computational advantages over positive ones, since they have a built-in sense
of direction.
We have rules for all the necessary connectives (we restrict ourselves to ∧ and ¬, since the others
can be expressed in terms of these two via the propositional identities above. For instance, we can
write A ∨ B as ¬(¬A ∧ ¬B), and A ⇒ B as ¬A ∨ B,. . . .)
We will now extend the propositional tableau techniques to first-order logic. We only have to add
two new rules for the universal quantifier (in positive and negative polarity).
The rule T1 ∀ operationalizes the intuition that a universally quantified formula is true, iff all
of the instances of the scope are. To understand the T1 ∃ rule, we have to keep in mind that
∃X.A abbreviates ¬(∀X.¬A), so that we have to read (∀X.A)^F existentially — i.e. as (∃X.¬A)^T,
stating that there is an object with property ¬A. In this situation, we can simply give this
object a name: c, which we take from our (infinite) set of witness constants Σsk_0, which we have
given ourselves expressly for this purpose when we defined first-order syntax. In other words
([c/X](¬A))^T = ([c/X](A))^F holds, and this is just the conclusion of the T1∃ rule.
Note that the T1 ∀ rule is computationally extremely inefficient: we have to guess an (i.e. in a
search setting to systematically consider all) instance C ∈ wff ι (Σι , Vι ) for X. This makes the rule
infinitely branching.
In the next calculus we will try to remedy the computational inefficiency of the T1 ∀ rule. We do
this by delaying the choice in the universal rule.
Definition 15.1.9. The free variable tableau calculus (T1f ) extends T0 (proposi-
tional tableau calculus) with the quantifier rules:
T1f∀: from (∀X.A)^T, add ([Y/X](A))^T, where Y is a new metavariable.
T1f∃: from (∀X.A)^F, add ([f(X1, …, Xk)/X](A))^F, where f is a new Skolem constant and
X1, …, Xk are the metavariables occurring in A.
T1f⊥: from A^α and B^β with α ≠ β, if there is a substitution σ with σ(A) = σ(B), close the
branch with ⊥ : σ, instantiating the whole tableau with σ.
Metavariables: Instead of guessing a concrete instance for the universally quantified variable
as in the T1 ∀ rule, T1f ∀ instantiates it with a new metavariable Y , which will be instantiated by
need in the course of the derivation.
Skolem terms as witnesses: The introduction of metavariables makes it necessary to extend
the treatment of witnesses in the existential rule. Intuitively, we cannot simply invent a new name,
since the meaning of the body A may contain metavariables introduced by the T1f∀ rule. As we
do not know their values yet, the witness for the existential statement in the antecedent of the
T1f∃ rule needs to depend on them. So we represent it by a witness term, concretely by applying a
Skolem function to the metavariables in A.
Instantiating Metavariables: Finally, the T1f⊥ rule completes the treatment of metavariables,
it allows to instantiate the whole tableau in a way that the current branch closes. This leaves us
with the problem of finding substitutions that make two terms equal.
Let’s Talk
Tableau Aboutabout
Reasons Blocks, Baby . . .
Blocks
Example 15.1.11 (Reasoning about Blocks). Returing to slide 405
I Question: What do you see here?
A D B E C
I You say: “All blocks are red”; “All blocks are on the table”; “A is a block”.
Can we prove red(A) from ∀x.block(x) ⇒ red(x) and block(A)?
I And now: Say it in propositional logic!
T
(∀X.block(X) ⇒ red(X))
T
block(A)
F
red(A)
T
(block(Y ) ⇒ red(Y ))
F T
block(Y ) red(A)
⊥ : [A/Y ] ⊥
Unification (Definitions)
Definition 15.1.12. For given terms A and B, unification is the problem of finding
a substitution σ, such that σ(A) = σ(B).
Notation: We write term pairs as A=?B e.g. f (X)=?f (g(Y )).
The idea behind a most general unifier is that all other unifiers can be obtained from it by (further)
instantiation. In an automated theorem proving setting, this means that using most general
unifiers is the least committed choice — any other choice of unifiers (that would be necessary for
completeness) can later be obtained by other substitutions.
Note that there is a subtlety in the definition of the ordering on substitutions: we only compare
on a subset of the variables. The reason for this is that we have defined substitutions to be total
on (the infinite set of) variables for flexibility, but in the applications (see the definition of most
general unifiers), we are only interested in a subset of variables: the ones that occur in the initial
problem formulation. Intuitively, we do not care what the unifiers do off that set. If we did not
have the restriction to the set W of variables, the ordering relation on substitutions would become
much too fine-grained to be useful (i.e. to guarantee unique most general unifiers in our case).
Now that we have defined the problem, we can turn to the unification algorithm itself. We
will define it in a way that is very similar to logic programming: we first define a calculus that
generates “solved forms” (formulae from which we can read off the solution) and reason about
control later. In this case we will reason that control does not matter.
Unification Problems (=̂ Equational Systems)
Idea: Unification is equation solving.
Definition 15.1.16. We call a formula A1=?B1 ∧ . . . ∧ An=?Bn an unification
problem iff Ai , Bi ∈ wff ι (Σι , Vι ).
Note: We consider unification problems as sets of equations (∧ is ACI), and
equations as two-element multisets (=? is C).
In principle, unification problems are sets of equations, which we write as conjunctions, since all of
them have to be solved for finding a unifier. It is not a problem for the “logical view” that the
representation as conjunctions induces an order, since conjunction is associative, commutative and
idempotent, i.e. conjuncts have no intrinsic order or multiplicity; so we consider two unification
problems as equal if they are equivalent as propositional formulae.
In the same way, we will abstract from the order in equations, since we know that the equality
relation is symmetric. Of course we would have to deal with this somehow in the implementation
(typically, we would implement equational problems as lists of pairs), but that belongs into the
“control” aspect of the algorithm, which we are abstracting from at the moment.
Lemma 15.1.19. Solved forms are of the form X1 =? B1 ∧ . . . ∧ Xn =? Bn where
the Xi are distinct and Xi ∉ free(Bj ).
Definition 15.1.20. Any substitution σ = [B1 /X 1 ], . . . ,[Bn /X n ] induces a solved
unification problem E σ :=(X 1=?B1 ∧ . . . ∧ X n=?Bn ).
Lemma 15.1.21. If E = X 1=?B1 ∧ . . . ∧ X n=?Bn is a solved form, then E has
the unique most general unifier σ E :=[B1 /X 1 ], . . . ,[Bn /X n ].
Proof: Let θ ∈ U(E)
1. then θ(X i ) = θ(Bi ) = θ ◦ σ E (X i )
2. and thus θ=θ ◦ σ E [supp(σ)].
It is essential to our “logical” analysis of the unification algorithm that we arrive at unification
problems whose unifiers we can read off easily. Solved forms serve that need perfectly as ??
shows.
Given the idea that unification problems can be expressed as formulae, we can express the algo-
rithm in three simple rules that transform unification problems into solved forms (or unsolvable
ones).
Unification Algorithm
Definition 15.1.22. The inference system U consists of the following rules:
The decomposition rule Udec is completely straightforward, but note that it transforms one unification
pair into multiple argument pairs; this is the reason why we work with whole unification problems
(conjunctions of pairs) rather than with single pairs.
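To make the transformation rules concrete, here is a minimal Python sketch of the inference system U (it is not part of the course materials; the term representation – strings for variables, tuples for applications – and all function names are assumptions made for this illustration):

def is_var(t):
    return isinstance(t, str)

def occurs(x, t):
    # occurs check: does variable x occur in term t?
    if is_var(t):
        return x == t
    return any(occurs(x, s) for s in t[1:])

def subst(t, x, r):
    # apply the substitution [r/x] to the term t
    if is_var(t):
        return r if t == x else t
    return (t[0],) + tuple(subst(s, x, r) for s in t[1:])

def unify(eqs):
    """Transform a list of pairs (a unification problem) towards solved form.
    Returns the most general unifier as a dict, or None if unsolvable."""
    eqs = list(eqs)
    sigma = {}
    while eqs:
        a, b = eqs.pop()
        if is_var(a) or is_var(b):
            if not is_var(a):
                a, b = b, a                      # orient: variable on the left
            if a == b:
                continue                         # Utriv: drop a trivial pair X =? X
            if occurs(a, b):
                return None                      # occurs check fails: unsolvable
            # Uelim: eliminate a in the remaining pairs and in the solution so far
            eqs = [(subst(s, a, b), subst(t, a, b)) for (s, t) in eqs]
            sigma = {x: subst(t, a, b) for x, t in sigma.items()}
            sigma[a] = b
        elif a[0] == b[0] and len(a) == len(b):
            eqs.extend(zip(a[1:], b[1:]))        # Udec: decompose f(s...) =? f(t...)
        else:
            return None                          # clash of function symbols: unsolvable
    return sigma

# e.g. unify([(("f", "X"), ("f", ("g", "Y")))]) yields {"X": ("g", "Y")}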
Unification Examples
Example 15.1.27. Two similar unification problems:
We will now convince ourselves that there cannot be any infinite sequences of transformations in
U. Termination is an important property for an algorithm.
The proof we present here is very typical for termination proofs. We map unification problems
into a partially ordered set ⟨S, ≺⟩ where we know that there cannot be any infinitely descending
sequences (we think of this as measuring the unification problems). Then we show that all trans-
formations in U strictly decrease the measure of the unification problems and argue that if there
were an infinite transformation in U, then there would be an infinite descending chain in S, which
contradicts our choice of ⟨S, ≺⟩.
The crucial step in coming up with such proofs is finding the right partially ordered set.
Fortunately, there are some tools we can make use of. We know that ⟨N, <⟩ is terminating, and
there are some ways of lifting component orderings to complex structures. For instance it is well-
known that the lexicographic ordering lifts a terminating ordering to a terminating ordering on
finite dimensional Cartesian spaces. We show a similar, but less known construction with multisets
for our proof.
Unification (Termination)
Definition 15.1.28. Let S and T be multisets and ≤ a partial ordering on S ∪ T .
Then we define S ≺m T , iff S = C ⊎ S ′ and T = C ⊎ {t}, where s < t for all s ∈ S ′ .
We call ≺m the multiset ordering induced by ≤.
Definition 15.1.29. We call a variable X solved in a unification problem E, iff E
contains a solved pair X =? A.
But it is very simple to create terminating calculi, e.g. by having no inference rules. So there
is one more step to go to turn the termination result into a decidability result: we must make sure
that we have enough inference rules so that any unification problem is transformed into solved
form if it is unifiable.
Proof: We assume that E is unifiable but unsolved and show that a U rule applies.
1. There is an unsolved pair A =? B in E = E ′ ∧ A =? B.
we have two cases
2. A, B ̸∈ Vι
2.1. then A = f (A1 . . . An ) and B = f (B1 . . . Bn ), and thus Udec is appli-
cable
3. A = X ∈ free(E)
3.1. then Uelim (if B ̸= X) or Utriv (if B = X) is applicable.
Corollary 15.1.34. First-order unification is decidable in PL1 .
Proof:
1. U-irreducible unification problems can be reached in finite time by ??.
2. They are either solved or unsolvable by ??, so they provide the answer.
Complexity of Unification
Observation: Naive implementations of unification are exponential in time and
space.
Indeed, the only way to escape this combinatorial explosion is to find representations of substitu-
tions that are more space efficient.
[Figure: the terms s3 and t3 and the instantiated term σ3 (t3 ) drawn as trees over the binary function
symbol f and the variables x0 , . . ., x3 , illustrating the exponential blowup caused by copying subterms]
If we look at the unification algorithm from ?? and the considerations in the termination proof
(??) with a particular focus on the role of copying, we easily find the culprit for the exponential
blowup: Uelim, which applies solved pairs as substitutions.
We will now turn the ideas we have developed in the last couple of slides into a usable func-
tional algorithm. The starting point is treating terms as DAGs. Then we try to conduct the
transformation into solved form without adding new nodes.
Unification by DAG-chase
Idea: Extend the Input-DAGs by edges that represent unifiers.
Definition 15.1.40. Write n.a, if a is the symbol of node n.
Algorithm dag−unify
Input: symmetric pairs of nodes in DAGs
fun dag−unify(n,n) = true                              (∗ identical nodes always unify ∗)
| dag−unify(n.x,m) = if occur(n,m) then false          (∗ occurs check: x must not occur in m ∗)
                     else union(n,m)                   (∗ bind the variable node to m ∗)
| dag−unify(n.f ,m.g) =
    if g!=f then false                                 (∗ clash of function symbols ∗)
    else                                               (∗ recurse on corresponding children ∗)
      forall (i,j) => dag−unify(find(i),find(j)) (chld m,chld n)
end
Observation 15.1.41. dag−unify uses linear space, since no new nodes are created,
and at most one link per variable.
Problem: dag−unify still uses exponential time.
Example 15.1.42. Consider the terms f (sn , f (t′n , xn )) and f (tn , f (s′n , y n )), where s′n =
[y i /xi ](sn ) and t′n = [y i /xi ](tn ).
dag−unify needs exponentially many recursive calls to unify the nodes xn and y n .
(they are unified after n calls, but checking needs the time)
Algorithm uf−unify
Recall: dag−unify still uses exponential time.
Idea: Also bind the function nodes, if the arguments are unified.
uf−unify(n.f ,m.g) =
if g!=f then false
else union(n,m);
forall (i,j) => uf−unify(find(i),find(j)) (chld m,chld n)
end
This only needs linearly many recursive calls as it directly returns with true or makes
a node inaccessible for find.
Linearly many calls to linear procedures give quadratic running time.
Remark: There are versions of uf−unify that are linear in time and space, but for
most purposes, our algorithm suffices.
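For illustration, the idea behind uf−unify can be rendered as a small Python sketch with an explicit union-find structure on DAG nodes; the Node class and all names below are assumptions made for this illustration, not part of the course materials:

class Node:
    """A node in the shared term DAG."""
    def __init__(self, symbol, children=(), is_var=False):
        self.symbol = symbol            # function symbol or variable name
        self.children = list(children)  # argument nodes (empty for variables)
        self.is_var = is_var
        self.parent = self              # union-find pointer

def find(n):
    while n.parent is not n:            # walk to the class representative
        n.parent = n.parent.parent      # shorten the path on the way up
        n = n.parent
    return n

def union(n, m):
    find(n).parent = find(m)            # merge the two equivalence classes
    return True

def occurs(v, n):
    n = find(n)                         # occurs check on representatives
    return n is v or any(occurs(v, c) for c in n.children)

def uf_unify(n, m):
    n, m = find(n), find(m)
    if n is m:
        return True
    if n.is_var:
        return False if occurs(n, m) else union(n, m)
    if m.is_var:
        return uf_unify(m, n)
    if n.symbol != m.symbol or len(n.children) != len(m.children):
        return False                    # clash of function symbols or arities
    union(n, m)                         # also merge the two function nodes
    return all(uf_unify(a, b) for a, b in zip(n.children, m.children))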
After we have used up p(y) by applying [a/y] in T1f⊥, we have to get a new instance p(z) of the
universal formula, which can then be closed with the instantiation [b/z].
Proof sketch: All T1f rules reduce the number of connectives and negative ∀ or the
multiplicity of positive ∀.
Theorem 15.1.47. T1f is only complete with unbounded multiplicities.
The other thing we need to realize is that there may be multiple ways we can use T1f⊥ to close a
branch in a tableau, and – as T1f⊥ instantiates the whole tableau and not just the branch itself –
the choice of instantiation can affect whether the other branches can still be closed.
Treating T1f⊥
Example 15.1.48. Choosing which matters – this tableau does not close!
(∃x.(p(a) ∧ p(b) ⇒ p(x)) ∧ (q(b) ⇒ q(x)))^F
((p(a) ∧ p(b) ⇒ p(y)) ∧ (q(b) ⇒ q(y)))^F
(p(a) ∧ p(b) ⇒ p(y))^F          (q(b) ⇒ q(y))^F
p(a)^T                          q(b)^T
p(b)^T                          q(y)^F
p(y)^F
⊥ : [a/y]
The method of spanning matings follows the intuition that if we do not have good information
on how to decide for a pair of opposite literals on a branch to use in T1f⊥, we delay the choice by
initially disregarding the rule altogether during saturation and then – in a later phase– looking
for a configuration of cuts that have a joint overall unifier. The big advantage of this is that we
only need to know that one exists, we do not need to compute or apply it, which would lead to
exponential blow-up as we have seen above.
Observation 15.1.49. T1f without T1f⊥ is terminating and confluent for given
multiplicities.
Idea: Saturate without T1f⊥ and treat all cuts at the same time (later).
Definition 15.1.50.
Let T be a T1f tableau, then we call a unification problem E := A1 =? B1 ∧ . . . ∧
An =? Bn a mating for T , iff Ai^T and Bi^F occur in the same branch in T .
We say that E is a spanning mating, if E is unifiable and every branch B of T
contains Ai^T and Bi^F for some i.
Theorem 15.1.51. A T1f -tableau with a spanning mating induces a closed T1
tableau.
Excursion: Now that we understand basic unification theory, we can come to the meta-theoretical
properties of the tableau calculus. We delegate this discussion to ??.
Resolution: from A^T ∨ C and B^F ∨ D with σ = mgu(A, B) derive (σ(C)) ∨ (σ(D)).
Factoring: from A^α ∨ B^α ∨ C with σ = mgu(A, B) derive (σ(A))^α ∨ (σ(C)).
Excursion: Again, we relegate the meta-theoretical properties of the first-order resolution calculus
to ??.
Problem: That is only true, if we only give the theorem prover exactly the right
laws and background knowledge. If we give it all of them, it drowns in the combi-
natorial explosion.
Let us build a resolution proof for the claim above.
But first we must translate the situation into first-order logic clauses.
Convention: In what follows, for better readability we will sometimes write impli-
cations P ∧ Q ∧ R ⇒ S instead of clauses P F ∨ QF ∨ RF ∨ S T .
West is an American:
Clause: ami(West)
The country Nono is an enemy of America:
Clause: enmy(NoNo, USA)
Clause: missile(X2 )^F ∨ own(NoNo, X2 )^F ∨ sell(West, X2 , NoNo)^T
Clause: animal(X4 )^F ∨ love(jack, X4 )^T
Cats are animals:
Clause: cat(X5 )^F ∨ animal(X5 )^T
love(g(jack), jack)^T
Excursion: A full analysis of any calculus needs a completeness proof. We will not cover this in
the course, but provide one for the calculi introduced so far in ??.
Definition 15.3.4. A Horn clause is a clause with at most one positive literal.
Recall: Backchaining as search:
state = tuple of goals; goal state = empty list (of goals).
next(⟨G, R1 , . . ., Rl ⟩) := ⟨σ(B 1 ), . . ., σ(B m ), σ(R1 ), . . ., σ(Rl )⟩ if there is a
rule H:−B 1 ,. . ., B m . and a substitution σ with σ(H) = σ(G).
Note: Backchaining becomes resolution:
from P^T ∨ A and P^F ∨ B derive A ∨ B
positive, unit-resulting hyperresolution (PURR)
This observation helps us understand Prolog better, and use implementation techniques from
automated theorem proving.
Definition 15.3.6. Horn logic is the formal system whose language is the set of
Horn clauses together with the calculus H given by MP, ∧I, and Subst.
Definition 15.3.7. A logic program P entails a query Q with answer substitution
σ, iff there is a H derivation D of Q from P and σ is the combined substitution of
the Subst instances in D.
To gain an intuition for this quite abstract definition let us consider a concrete knowledge base
about cars. Instead of writing down everything we know about cars, we only write down that cars
are motor vehicles with four wheels and that a particular object c has a motor and four wheels. We
can see that the fact that c is a car can be derived from this. Given our definition of a knowledge
base as the deductive closure of the facts and rule explicitly written down, the assertion that c is
a car is in the induced knowledge base, which is what we are after.
In this very simple example car(c) is about the only fact we can derive, but in general, knowledge
bases can be infinite (we will see examples below).
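As an illustration of this deductive closure, here is a tiny propositional Horn-clause version of the cars example in Python (a minimal sketch; the fact and rule names are made up for this illustration and are not from the course materials):

# facts and a single rule: "motor vehicles with four wheels are cars"
facts = {"has_motor(c)", "has_four_wheels(c)"}
rules = [({"has_motor(c)", "has_four_wheels(c)"}, "car(c)")]   # (body, head)

def forward_closure(facts, rules):
    """Compute the deductive closure (the induced knowledge base)."""
    kb = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= kb and head not in kb:   # rule applicable and head new
                kb.add(head)
                changed = True
    return kb

print("car(c)" in forward_closure(facts, rules))   # True: car(c) is derivable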
e.g. greek(sokrates),greek(perikles)
Question: Are there fallible greeks?
Indefinite answer: Yes, Perikles or Sokrates
Warning: how about Sokrates and Perikles?
e.g. greek(sokrates),roman(sokrates):−.
Query: Are there fallible greeks?
Answer: Yes, Sokrates, if he is not a roman
Is this abduction?????
According to an influential view of [PRR97], knowledge appears in layers. Starting with a character
set that defines a set of glyphs, we can add syntax that turns mere strings into data. Adding context
information gives information, and finally, relating the information to other information allows us
to draw conclusions, turning information into knowledge.
Note that we already have aspects of representation and function in the diagram at the top of the
slide. In this, the additional functionality added in the successive layers gives the representations
more and more functions, until we reach the knowledge level, where the function is given by infer-
encing. In the second example, we can see that representations determine possible functions.
Let us now strengthen our intuition about knowledge by contrasting knowledge representations
with “regular” data structures in computation.
As knowledge is such a central notion in artificial intelligence, it is not surprising that there are
multiple approaches to dealing with it. We will only deal with the first one and leave the others
to self-study.
When assessing the relative strengths of the respective approaches, we should evaluate them with
respect to a pre-determined set of criteria.
KR Approaches/Evaluation Criteria
Definition 16.1.1. The evaluation criteria for knowledge representation approaches
are:
Expressive adequacy: What can be represented, what distinctions are supported.
Reasoning efficiency: Can the representation support processing that generates
results in acceptable speed?
Primitives: What are the primitive elements of representation, are they intuitive,
cognitively adequate?
Meta representation: Knowledge about knowledge
Completeness: The problems of reasoning with knowledge that is known to be
incomplete.
Even though the network in ?? is very intuitive (we immediately understand the concepts de-
picted), it is unclear how we (and more importantly a machine that does not associate meaning
with the labels of the nodes and edges) can draw inferences from the “knowledge” represented.
Idea: Links labeled with “isa” and “inst” are special: they propagate properties
encoded by other links.
Definition 16.1.6. We call links labeled by
“isa” an inclusion or isa link (inclusion of concepts)
“inst” instance or inst link (concept membership)
We now make the idea of “propagating properties” rigorous by defining the notion of derived
relations, i.e. the relations that are left implicit in the network, but can be added without changing
its meaning.
[Figure: a semantic network – robin isa bird, Jack inst robin, Mary and John inst Person, bird
has_part wings, plus owner_of and loves links between Mary and John]
Slogan: Get out more knowledge from a semantic network than you put in.
Note that ?? does not quite allow us to derive that Jack is a bird (did you spot that “isa” is not a
relation that can be inferred?), even though we know it is true in the world. This shows us that
inference in semantic networks has to be very carefully defined and may not be “complete”, i.e.
there are things that are true in the real world that our inference procedure does not capture.
Dually, if we are not careful, then the inference procedure might derive properties that are not
true in the real world even if all the properties explicitly put into the network are. We call such
an inference procedure unsound or incorrect.
These are two general phenomena we have to keep an eye on.
Another problem is that semantic networks (e.g. in ??) confuse two kinds of concepts: individuals
(represented by proper names like John and Jack) and concepts (nouns like robin and bird). Even
though the isa and inst link already acknowledge this distinction, the “has_part” and “loves”
relations are at different levels entirely, but not distinguished in the networks.
[Figure: a semantic network split into a TBox (animal can move; amoeba and higher animal isa animal;
higher animal has_part legs and head; tiger and elephant isa higher animal with pattern striped and
color gray, plus eat links) and an ABox with the instances Roy, Rex, and Clyde]
In particular we have objects “Rex”, “Roy”, and “Clyde”, which have (derived) rela-
tions (e.g. Clyde is gray).
But there are severe shortcomings of semantic networks: the suggestive shape and node names
give (humans) a false sense of meaning, and the inference rules are only given in the process model
(the implementation of the semantic network processing system).
This makes it very difficult to assess the strength of the inference system and make assertions
e.g. about completeness.
Example 16.1.12. Consider a robin that has lost its wings in an accident:
[Figure: two networks – bird has_part wings, robin isa bird, jack inst robin; and the same structure
for joe, but with a “cancel” link overriding the has_part wings property]
“Cancel-links” have been proposed, but their status and process model are debatable.
To alleviate the perceived drawbacks of semantic networks, we can contemplate another notation
that is more linear and thus more easily implemented: function/argument notation.
Evaluation:
+ linear notation (equivalent, but better to implement on a computer)
+ easy to give process model by deduction (e.g. in Prolog)
– worse locality properties (networks are associative)
Indeed the function/argument notation is the immediate idea for how one would naturally represent
semantic networks for implementation.
This notation has been also characterized as subject/predicate/object triples, alluding to simple
(English) sentences. This will play a role in the “semantic web” later.
Building on the function/argument notation from above, we can now give a formal semantics for
semantic networks: we translate them into first-order logic and use the semantics of that.
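For instance, one possible such translation (one of several reasonable choices, given here for illustration) maps the links of the network above to the following first-order formulae:

isa(robin, bird) translates to ∀x.robin(x) ⇒ bird(x)
inst(Jack, robin) translates to robin(Jack)
has_part(bird, wings) translates to ∀x.bird(x) ⇒ (∃y.wings(y) ∧ has_part(x, y))
loves(John, Mary) stays as the atomic formula loves(John, Mary)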
Indeed, the semantics induced by the translation to first-order logic gives the intuitive meaning to
the semantic networks. Note that this only holds for the features of semantic networks that
are representable in this way, e.g. the “cancel links” shown above are not (and that is a feature,
not a bug).
But even more importantly, the translation to first-order logic gives a first process model: we
can use first-order inference to compute the set of inferences that can be drawn from a semantic
network.
Before we go on, let us have a look at an important application of knowledge representation
technologies: the semantic web.
Humans understand the text and combine the information to get the answer. Ma-
chines need more than just text ; semantic web technology.
The term “semantic web” was coined by Tim Berners Lee in analogy to semantic networks, only
applied to the world wide web. And as for semantic networks, where we have inference processes
that allow us the recover information that is not explicitly represented from the network (here the
world-wide-web).
To see that problems have to be solved, to arrive at the semantic web, we will now look at a
concrete example about the “semantics” in web pages. Here is one that looks typical enough.
WWW2002
The eleventh International World Wide Web Conference
Sheraton Waikiki Hotel
Honolulu, Hawaii, USA
On the 7th May Honolulu will provide the backdrop of the eleventh
International World Wide Web Conference.
Speakers confirmed
Tim Berners-Lee: Tim is the well known inventor of the Web,
Ian Foster: Ian is the pioneer of the Grid, the next generation internet.
But as for semantic networks, what you as a human can see (“understand” really) is deceptive, so
let us obfuscate the document to confuse your “semantic processor”. This gives an impression of
what the computer “sees”.
R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨
I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙
S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊⇔
I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔↙
Obviously, there is not much the computer understands, and as a consequence, there is not a lot
the computer can support the reader with. So we have to “help” the computer by providing some
meaning. Conventional wisdom is that we add some semantic/functional markup. Here we pick
XML without loss of generality, and characterize some fragments of text e.g. as dates.
ℜ⊔⟩⊔↕⌉⊤WWW∈′′∈
T⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉ℜ∝⊔⟩⊔↕⌉⊤
ℜ√↕⊣⌋⌉⊤S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USAℜ∝√↕⊣⌋⌉⊤
ℜ⌈⊣⊔⌉⊤7↖∞∞M⊣†∈′′∈ℜ∝⌈⊣⊔⌉⊤
parse 7∞∞M⊣†∈′′∈ as the date 7–11 May 2002 and add this to the user’s calendar,
parse S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA as a destination and find flights.
But: do not be deceived by your ability to understand English!
To understand what a machine can understand we have to obfuscate the markup as well, since it
does not carry any intrinsic meaning to the machine either.
<√↕⊣⌋⌉>S⟨⌉∇⊣⊔≀\W⊣⟩∥⟩∥⟩H≀⊔⌉↕H≀\≀↕⊓↕⊓⇔H⊣⊒⊣⟩⟩⇔USA</√↕⊣⌋⌉>
<⌈⊣⊔⌉>7↖∞∞M⊣†∈′′∈</⌈⊣⊔⌉>
<√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >R⌉}⟩∫⊔⌉∇⌉⌈√⊣∇⊔⟩⌋⟩√⊣\⊔∫⌋≀⇕⟩\}{∇≀⇕
A⊓∫⊔∇⊣↕⟩⊣⇔C⊣\⊣⌈⊣⇔C⟨⟩↕⌉D⌉\⇕⊣∇∥⇔F∇⊣\⌋⌉⇔G⌉∇⇕⊣\†⇔G⟨⊣\⊣⇔H≀\}K≀\}⇔I\⌈⟩⊣⇔
I∇⌉↕⊣\⌈⇔I⊔⊣↕†⇔J⊣√⊣\⇔M⊣↕⊔⊣⇔N⌉⊒Z⌉⊣↕⊣\⌈⇔T⟨⌉N⌉⊔⟨⌉∇↕⊣\⌈∫⇔N≀∇⊒⊣†⇔
S⟩\}⊣√≀∇⌉⇔S⊒⟩⊔‡⌉∇↕⊣\⌈⇔⊔⟨⌉U\⟩⊔⌉⌈K⟩\}⌈≀⇕⇔⊔⟨⌉U\⟩⊔⌉⌈S⊔⊣⊔⌉∫⇔V⟩⌉⊔\⊣⇕⇔Z⊣⟩∇⌉
</√⊣∇⊔⟩⌋⟩√⊣\⊔∫ >
<⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>O\⊔⟨⌉7⊔⟨M⊣†H≀\≀↕⊓↕⊓⊒⟩↕↕√∇≀⊑⟩⌈⌉⊔⟨⌉⌊⊣⌋∥⌈∇≀√≀{⊔⟨⌉⌉↕⌉⊑⌉\⊔⟨I\⊔⌉∇\⊣↖
⊔⟩≀\⊣↕W≀∇↕⌈W⟩⌈⌉W⌉⌊C≀\{⌉∇⌉\⌋⌉↙</⟩\⊔∇≀⌈⊓⌋⊔⟩≀\>
<√∇≀}∇⊣⇕>S√⌉⊣∥⌉∇∫⌋≀\{⟩∇⇕⌉⌈
<∫√⌉⊣∥⌉∇>T⟩⇕B⌉∇\⌉∇∫↖L⌉⌉¬T⟩⇕⟩∫⊔⟨⌉⊒⌉↕↕∥\≀⊒\⟩\⊑⌉\⊔≀∇≀{⊔⟨⌉W⌉⌊</∫√⌉⊣∥⌉∇>
<∫√⌉⊣∥⌉∇>I⊣\F≀∫⊔⌉∇¬I⊣\⟩∫⊔⟨⌉√⟩≀\⌉⌉∇≀{⊔⟨⌉G∇⟩⌈⇔⊔⟨⌉\⌉§⊔}⌉\⌉∇⊣⊔⟩≀\⟩\⊔⌉∇\⌉⊔<∫√⌉⊣∥⌉∇>
</√∇≀}∇⊣⇕>
So we have not really gained much with the markup alone; we have to give meaning to the
markup as well, and this is where techniques from the semantic web come into play.
To understand how we can make the web more semantic, let us first take stock of the current status
of (markup on) the web. It is well-known that the world wide web is a hypertext, where multimedia
documents (text, images, videos, etc. and their fragments) are connected by hyperlinks. As we
have seen, all of these are largely opaque (non-understandable), so we end up with the following
situation (from the viewpoint of a machine).
Essentially, to make the web more machine-processable, we need to classify the resources by the
concepts they represent and give the links a meaning in such a way that we can do inference with them.
The ideas presented here gave rise to a set of technologies jointly called the “semantic web”, which
we will now summarize before we return to our logical investigations of knowledge representation
techniques.
Example 16.1.24 (A script: getting your hair cut at a beauty parlor).
tell receptionist you’re here; beautician cuts hair; pay; if happy, give a big tip, if unhappy, a small tip.
props, actors as “script variables”
events in a (generalized) sequence
use script material for anaphora, bridging references
default common ground to fill in missing material into situations
But of course logic-based approaches have big drawbacks as well. The first is that we have to obtain
the symbolic representations of knowledge to do anything – a non-trivial challenge, since most
knowledge does not exist in this form in the wild, to obtain it, some agent has to experience the
world, pass it through its cognitive apparatus, conceptualize the phenomena involved, systematize
them sufficiently to form symbols, and then represent those in the respective formalism at hand.
The second drawback is that the process models induced by logic-based approaches (inference
with calculi) are quite intractable. We will see that all inferences can be reduced to satisfiability
tests in the underlying logical system, which are exponential at best, and undecidable or even
incomplete at worst.
Therefore a major thrust in logic-based knowledge representation is to investigate logical sys-
tems that are expressive enough to be able to represent most knowledge, but still have a decidable
– and maybe even tractable in practice – satisfiability problem. Such logics are called “description
logics”. We will study the basics of such logical systems and their inference procedures in the
following.
L ::= C | ⊤ | ⊥ | ¬L | L ⊓ L | L ⊔ L | L ⊑ L | L ≡ L
The main use of the set-theoretic semantics for PL0 is that we can use it to give meaning to concept
axioms, which we use to describe the “world”.
Concept Axioms
Idea: Use logical axioms to describe the world. (Axioms restrict the class of admissible models.)
[Figure: a Venn diagram over the domain with the concepts child, son, and daughter]
Concept axioms are used to restrict the set of admissible domains to the intended ones. In our
situation, we require them to be true – as usual – which here means that they denote the whole
domain D.
Let us fortify our intuition about concept axioms with a simple example about the sibling relation.
We give four concept axioms and study their effect on the admissible models by looking at the
respective Venn diagrams. In the end we see that in all admissible models, the denotations of the
concepts son and daughter are disjoint, and child is the union of the two – just as intended.
Axioms                Semantics
son ⊑ child           [[son ⊑ child]] = D iff [[¬son]] ∪ [[child]] = D iff [[son]] ⊆ [[child]]
daughter ⊑ child      [[daughter ⊑ child]] = D iff [[¬daughter]] ∪ [[child]] = D iff [[daughter]] ⊆ [[child]]
[Figure: the corresponding Venn diagrams over sons, daughters, and children]
The set-theoretic semantics introduced above is compatible with the regular semantics of proposi-
tional logic, therefore we have the same propositional identities. Their validity can be established
Propositional Identities
Name       for ⊓                                  for ⊔
Idempot.   φ ⊓ φ = φ                              φ ⊔ φ = φ
Identity   φ ⊓ ⊤ = φ                              φ ⊔ ⊥ = φ
Absorpt.   φ ⊔ ⊤ = ⊤                              φ ⊓ ⊥ = ⊥
Commut.    φ ⊓ ψ = ψ ⊓ φ                          φ ⊔ ψ = ψ ⊔ φ
Assoc.     φ ⊓ (ψ ⊓ θ) = (φ ⊓ ψ) ⊓ θ              φ ⊔ (ψ ⊔ θ) = (φ ⊔ ψ) ⊔ θ
Distrib.   φ ⊓ (ψ ⊔ θ) = (φ ⊓ ψ) ⊔ (φ ⊓ θ)        φ ⊔ (ψ ⊓ θ) = (φ ⊔ ψ) ⊓ (φ ⊔ θ)
Absorpt.   φ ⊓ (φ ⊔ θ) = φ                        φ ⊔ (φ ⊓ θ) = φ
Morgan     ¬(φ ⊓ ψ) = ¬φ ⊔ ¬ψ                     ¬(φ ⊔ ψ) = ¬φ ⊓ ¬ψ
dneg       ¬¬φ = φ
There is another way we can approach the set description interpretation of propositional logic: by
translation into a logic that can express knowledge about sets – first-order logic.
Definition                                    Comment
p^fo(x) := p(x)
(¬A)^fo(x) := ¬A^fo(x)
(A ⊓ B)^fo(x) := A^fo(x) ∧ B^fo(x)            ∧ vs. ⊓
(A ⊔ B)^fo(x) := A^fo(x) ∨ B^fo(x)            ∨ vs. ⊔
(A ⊑ B)^fo(x) := A^fo(x) ⇒ B^fo(x)            ⇒ vs. ⊑
(A = B)^fo(x) := A^fo(x) ⇔ B^fo(x)            ⇔ vs. =
A^fo := (∀x.A^fo(x) )                         for formulae
Translation Examples
Example 16.2.8. We translate the concept axioms from ?? to fortify our intuition:
(son ⊑ child)^fo = ∀x.son(x) ⇒ child(x)
(daughter ⊑ child)^fo = ∀x.daughter(x) ⇒ child(x)
(¬(son ⊓ daughter))^fo = ∀x.¬(son(x) ∧ daughter(x))
(child ⊑ son ⊔ daughter)^fo = ∀x.child(x) ⇒ (son(x) ∨ daughter(x))
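The translation ·^fo can be turned into a few lines of code; the following Python sketch (with an assumed representation of concepts as nested tuples, not from the course materials) is only meant to illustrate the table above:

def fo(c, x="x"):
    """Translate a PL0-DL concept into a first-order formula (as a string)."""
    if isinstance(c, str):                      # atomic concept p becomes p(x)
        return f"{c}({x})"
    op = c[0]
    if op == "not":
        return f"¬{fo(c[1], x)}"
    table = {"and": "∧", "or": "∨", "subsumes": "⇒", "equal": "⇔"}
    return f"({fo(c[1], x)} {table[op]} {fo(c[2], x)})"

def axiom(c):
    return f"∀x.{fo(c)}"                        # close the formula universally

print(axiom(("subsumes", "son", "child")))      # ∀x.(son(x) ⇒ child(x))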
As we will see, the situation for PL0DL is typical for formal ontologies (even though it only offers
concepts), so we state the general description logic paradigm for ontologies. The important idea
is that having a formal system as an ontology format allows us to capture, study, and implement
ontological inference.
Idea: Build a whole family of logics for describing sets and their relations. (tailor
their expressivity and computational properties)
Definition 16.2.14. A description logic is a formal system for talking about col-
lections of objects and their relations that is at least as expressive as PL0 with
set-theoretic semantics and offers individuals and relations.
A description logic has the following four components:
a formal language L with logical constants ⊓, ¬ (complement), ⊔, ⊑, and ≡,
a set-theoretic semantics ⟨D, [[·]]⟩,
a translation into first-order logic that is compatible with ⟨D, [[·]]⟩, and
a calculus for L that induces a decision procedure for L-satisfiability.
[Figure: the embeddings PL0 → DL → PL1 – propositional connectives map to ⊓, ⊔, and complement,
propositional variables map to concepts, concepts map to unary predicates; PL1 is undecidable, the
description logic fragment is decidable]
a terminology (or TBox): concepts and roles and a set of concept axioms that
describe them, and
assertions (or ABox): a set of individuals and statements about concept mem-
bership and role relationships for them.
For convenience we add concept definitions as a mechanism for defining new concepts from old
ones. The so-defined concepts inherit the properties from the concepts they are defined from.
As PL0DL does not offer any guidance on this, we will leave the discussion of ABoxes to ?? when
we have introduced our first proper description logic ALC.
Consistency Test
Definition 16.2.24. We call a concept C consistent, iff there is no concept A,
with both C ⊑ A and C ⊑ ¬A.
Or equivalently:
Even though consistency in our example seems trivial, large ontologies can make machine support
necessary. This is even more true for ontologies that change over time. Say that an ontology
initially has the concept definitions woman=person⊓long_hair and man=person⊓bearded, and then
is modernized to a more biologically correct state. In the initial version the concept hermaphrodite
is consistent, but becomes inconsistent after the renovation; the authors of the renovation should
be made aware of this by the system.
The subsumption test determines whether the sets denoted by two concepts are in a subset relation.
The main justification for this is that humans tend to be aware of concept subsumption, and tend
to think in taxonomic hierarchies. To cater to this, the subsumption test is useful.
Subsumption Test
Example 16.2.27. In this case trivial
The good news is that we can reduce the subsumption test to the consistency test, so we can
re-use our existing implementation.
The main user-visible service of the subsumption test is to compute the actual taxonomy induced
by an ontology.
Classification
The subsumption relation among all concepts (subsumption graph)
Visualization of the subsumption graph for inspection (plausibility)
Definition 16.2.29. Classification is the computation of the subsumption graph.
[Figure: a subsumption graph (taxonomy) with top concept object and subconcepts such as person]
Remark: This only works in the presence of concept definitions, not in a purely
descriptive framework like semantic networks:
[Figure: the semantic network from above, again split into a TBox (animal, amoeba, higher animal with
has_part legs/head, tiger with pattern striped, elephant with color gray, plus can-move and eat links)
and an ABox with the instances Roy, Rex, and Clyde]
If we take stock of what we have developed so far, then we can see PL0DL as a rational reconstruction
of semantic networks restricted to the “isa” relation. We relegate the “instance” relation to ??.
This reconstruction can now be used as a basis on which we can extend the expressivity and
inference procedures without running into problems.
Reason: There are no quantifiers in PL0 (existential (∃) and universal (∀))
Idea: Use first-order predicate logic (PL1 )
ALC extends the concept operators of PL0DL with binary relations (called “roles” in ALC). This
gives ALC the expressive power we had for the basic semantic networks from ??.
Syntax of ALC
Definition 16.3.2 (Concepts). (aka. “predicates” in PL1 or “propositional
variables” in PL0DL )
Concepts in DLs represent collections of objects.
Definition 16.3.7 (Grammar). The formulae of ALC are given by the following
grammar: FALC ::= C | ⊤ | ⊥ | ¬FALC | FALC ⊓ FALC | FALC ⊔ FALC | ∃R.FALC | ∀R.FALC
ALC restricts the quantification to range over all individuals reachable as role successors. The
distinction between universal and existential quantifiers clarifies an implicit ambiguity in semantic
networks.
As before we allow concept definitions so that we can express new concepts from old ones, and
obtain more concise descriptions.
Example 16.3.18.
Definition rec?
man = person ⊓ ∃has_chrom.Y_chrom -
woman = person ⊓ ∀has_chrom.¬Y_chrom -
mother = woman ⊓ ∃has_child.person -
father = man ⊓ ∃has_child.person -
grandparent = person ⊓ ∃has_child.(mother ⊔ father) -
german = person ⊓ ∃has_parents.german +
number_list = empty_list ⊔ ∃is_first.number ⊓ ∃is_rest.number_list +
As before, we can normalize a TBox by definition expansion if it is acyclic. With the introduction
of roles and quantification, concept definitions in ALC have a more “interesting” way to be cyclic
as ?? shows.
Now that we have motivated and fixed the syntax of ALC, we will give it a formal semantics.
The semantics of ALC is an extension of the set-theoretic semantics for PL0 , thus the interpretation
[[·]] assigns subsets of the domain of discourse to concepts and binary relations over the domain
of discourse to roles.
Semantics of ALC
ALC semantics is an extension of the set-semantics of propositional logic.
Definition 16.3.25. A model for ALC is a pair ⟨D, [[·]]⟩, where D is a non-empty
set called the domain of discourse and [[·]] a mapping called the interpretation, such
that
Definition 16.3.26. The translation of ALC into PL1 extends the one from ?? by
the following quantifier rules:
(∀R.φ)^fo(x) := (∀y.R(x, y) ⇒ φ^fo(y) )        (∃R.φ)^fo(x) := (∃y.R(x, y) ∧ φ^fo(y) )
We can now use the ALC identities above to establish a useful normal form for ALC. This will
play a role in the inference procedures we study next.
The following identities will be useful later on. They can be proven directly with the settings from
??; we carry this out for one of them below.
ALC Identities
1. ¬(∃R.φ) = ∀R.¬φ                      3. ¬(∀R.φ) = ∃R.¬φ
2. ∀R.(φ ⊓ ψ) = ∀R.φ ⊓ ∀R.ψ             4. ∃R.(φ ⊔ ψ) = ∃R.φ ⊔ ∃R.ψ
Proof of 1
[[¬(∃R.φ)]] = D\[[∃R.φ]] = D\{x ∈ D | ∃y.(⟨x, y⟩ ∈ [[R]]) and (y ∈ [[φ]])}
= {x ∈ D | not ∃y.(⟨x, y⟩ ∈ [[R]]) and (y ∈ [[φ]])}
= {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∉ [[φ]])}
= {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∈ (D\[[φ]]))}
= {x ∈ D | ∀y.if (⟨x, y⟩ ∈ [[R]]) then (y ∈ [[¬φ]])}
= [[∀R.¬φ]]
The form of the identities (interchanging quantification with connectives) is reminiscent of the
corresponding identities for quantifiers in first-order logic.
example                                          by rule
¬(∃R.(∀S.e ⊓ ∀S.¬d))
7→ ∀R.¬(∀S.e ⊓ ∀S.¬d)                            ¬(∃R.φ) 7→ ∀R.¬φ
7→ ∀R.(¬∀S.e ⊔ ¬∀S.¬d)                           ¬(φ ⊓ ψ) 7→ ¬φ ⊔ ¬ψ
7→ ∀R.(∃S.¬e ⊔ ∃S.¬¬d)                           ¬(∀R.φ) 7→ ∃R.¬φ
7→ ∀R.(∃S.¬e ⊔ ∃S.d)                             ¬¬φ 7→ φ
Finally, we extend ALC with an ABox component. This mainly means that we define two new
assertions in ALC and specify their semantics and PL1 translation.
If we take stock of what we have developed so far, then we can see ALC as a rational recon-
struction of semantic networks restricted to the “isa” and “instance” relations – which are the only
ones that can really be given a denotational and operational semantics.
In this subsection we make good on the motivation from ?? that description logics enjoy tractable
inference procedures: We present a tableau calculus for ALC, show that it is a decision procedure,
and study its complexity.
where φ is a normalized ALC concept in negation normal form with the following
rules:
T⊥ : from x:c and x:¬c derive ⊥
T⊓ : from x:φ ⊓ ψ derive x:φ and x:ψ
T⊔ : from x:φ ⊔ ψ branch into x:φ and x:ψ
T∀ : from x:∀R.φ and x R y derive y:φ
T∃ : from x:∃R.φ derive x R y and y:φ for a new individual y
In contrast to the tableau calculi for theorem proving we have studied earlier, TALC is run
in “model generation mode”. Instead of initializing the tableau with the axioms and the negated
conjecture and hoping that all branches will close, we initialize the TALC tableau with axioms and
the “membership conjecture” that a given concept φ is satisfiable – i.e. φ has a member x, and
hope for branches that are open, i.e. that make the conjecture true (and at the same time give a
model).
Let us now work through two very simple examples; one unsatisfiable, and a satisfiable one.
TALC Examples
Example 16.3.34 (Tableau Proofs). We have two similar conjectures about
children.
x:∀has_child.man ⊓ ∃has_child.¬man              (all sons, but a daughter)
x:∀has_child.man ⊓ ∃has_child.¬man              initial
x:∀has_child.man                                T⊓
x:∃has_child.¬man                               T⊓
x has_child y                                   T∃
y:¬man                                          T∃
y:man                                           T∀
⊥                                               T⊥
inconsistent
x:∀has_child.man ⊓ ∃has_child.man               (only sons, and at least one)
Another example: this one is more complex, but the concept is satisfiable.
7 y:ugrad | y:grad      T⊔
8 ⊥       | open
The left branch is closed, the right one represents a model: y is a child of x, y
is a graduate student, x has exactly one child: y.
After we got an intuition about TALC , we can now study the properties of the calculus to determine
that it is a decision procedure for ALC.
The soundness result for TALC is as usual: we start with a model of x:φ and show that a TALC
tableau must have an open branch.
Correctness
Lemma 16.3.36. If φ satisfiable, then TALC terminates on x:φ with open branch.
Proof: Let M := ⟨D, [[·]]⟩ be a model for φ and w ∈ [[φ]].
We use the satisfaction conditions M|=(x:ψ) iff [[x]] ∈ [[ψ]], M|=x R y iff ⟨[[x]], [[y]]⟩ ∈ [[R]], and
M|=S iff M|=c for all c ∈ S.
1. We define [[x]] := w.
2. This gives us M|=(x:φ). (base case)
3. If the branch is satisfiable, then either
no rule is applicable to the leaf, (open branch)
or a rule is applicable and one new branch is satisfiable. (inductive case: next)
4. There must be an open branch. (by termination)
We complete the proof by looking at all the TALC inference rules in turn.
For the completeness result for TALC we have to start with an open tableau branch and construct a
model that satisfies all judgments in the branch. We proceed by building a Herbrand model, whose
domain consists of all the individuals mentioned in the branch and which interprets all concepts
and roles as specified in the branch. Not surprisingly, the model thus constructed satisfies (all
judgments on) the branch.
we define
D : = {x | x:ψ ∈ B or z R x ∈ B}
[ c]] : = {x | x:c ∈ B}
[ R]] : = {⟨x, y⟩ | x R y ∈ B}
We complete the proof by looking at all the TALC inference rules in turn.
case y:ψ = y:∃R.θ then {y R z, z:θ} ⊆ B (z new variable) (T∃ -rules, saturation)
so M|=(z:θ) and M|=y R z, thus M|=(y:∃R.θ). (IH, Definition)
case y:ψ = y:∀R.θ Let ⟨[[y]], v⟩ ∈ [[R]] for some v ∈ D
then v = z for some variable z with y R z ∈ B (construction of [ R]])
So z:θ ∈ B and M|=(z:θ). (T∀ -rule, saturation, Def)
As v was arbitrary we have M|=(y:∀R.θ).
Termination
Theorem 16.3.38. TALC terminates.
To prove termination of a tableau algorithm, find a well-founded measure (function)
that is decreased by all rules
T⊥ : from x:c and x:¬c derive ⊥
T⊓ : from x:φ ⊓ ψ derive x:φ and x:ψ
T⊔ : from x:φ ⊔ ψ branch into x:φ and x:ψ
T∀ : from x:∀R.φ and x R y derive y:φ
T∃ : from x:∃R.φ derive x R y and y:φ for a new individual y
We can turn the termination result into a worst-case complexity result by examining the sizes of
branches.
Complexity of TALC
Idea: Work off tableau branches one after the other. (branch size ≙ space complexity)
Observation 16.3.39. The size of the branches is polynomial in the size of the
input formula:
Proof sketch: Re-examine the termination proof and count: the first summand
comes from ??, the second one from ?? and ??
Theorem 16.3.40. The satisfiability problem for ALC is in PSPACE.
Theorem 16.3.41. The satisfiability problem for ALC is PSPACE-Complete.
In summary, the theoretical complexity of ALC is the same as that for PL0 , but in practice ALC is
much more expressive. So this is a clear win.
But the description of the tableau algorithm TALC is still quite abstract, so we look at an exemplary
implementation in a functional programming language.
consistent(S) =
if {c, ¬c} ⊆ S then false                         (∗ clash ∗)
elif ‘φ ⊓ ψ ′ ∈ S and (‘φ′ ∉ S or ‘ψ ′ ∉ S)
then consistent(S ∪ {φ, ψ})
elif ‘φ ⊔ ψ ′ ∈ S and {φ, ψ} ∩ S = ∅
then consistent(S ∪ {φ}) or consistent(S ∪ {ψ})
elif forall ‘∃R.ψ ′ ∈ S                           (∗ one role successor per existential ∗)
consistent({ψ} ∪ {θ | ‘∀R.θ′ ∈ S})
else true
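The same procedure can be written as a small runnable Python sketch; the representation of NNF concepts as nested tuples and all names below are assumptions made for this illustration, not part of the course materials:

def neg(c):
    # complement of an NNF concept (only used to detect clashes)
    return c[1] if isinstance(c, tuple) and c[0] == "not" else ("not", c)

def consistent(S):
    S = set(S)
    for c in S:                                   # clash: c and ¬c both present
        if neg(c) in S:
            return False
    for c in S:                                   # conjunction rule
        if isinstance(c, tuple) and c[0] == "and" and not {c[1], c[2]} <= S:
            return consistent(S | {c[1], c[2]})
    for c in S:                                   # disjunction rule (branching)
        if isinstance(c, tuple) and c[0] == "or" and not ({c[1], c[2]} & S):
            return consistent(S | {c[1]}) or consistent(S | {c[2]})
    for c in S:                                   # one successor per existential
        if isinstance(c, tuple) and c[0] == "exists":
            R, phi = c[1], c[2]
            succ = {phi} | {d[2] for d in S
                            if isinstance(d, tuple) and d[0] == "forall" and d[1] == R}
            if not consistent(succ):
                return False
    return True

# Example 16.3.34 revisited: ∀has_child.man ⊓ ∃has_child.¬man is inconsistent
S = {("forall", "has_child", "man"), ("exists", "has_child", ("not", "man"))}
print(consistent(S))   # False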
Note that we have (so far) only considered an empty TBox: we have initialized the tableau
with a normalized concept; so we did not need to include the concept definitions. To cover “real”
ontologies, we need to consider the case of concept axioms as well.
We now extend TALC with concept axioms. The key idea here is to realize that the concept axioms
apply to all individuals. As the individuals are generated by the T∃ rule, we can simply extend
that rule to apply all the concept axioms to the newly introduced individual.
Idea: Whenever a new variable y is introduced (by T∃ -rule) add the information
that axioms hold for y.
Initialize tableau with {x:φ} ∪ CA (CA : = set of concept axioms)
New rule for ∃: T∃CA (instead of T∃ ): from x:∃R.φ with CA = {α1 , . . ., αn } derive
x R y, y:φ, y:α1 , . . ., y:αn for a new individual y.
The problem of this approach is that it spoils termination, since we cannot control the number of
rule applications by (fixed) properties of the input formulae. The example shows this very nicely.
We only sketch a path towards a solution.
x:d              start
x:∃R.c           in CA
x R y1           T∃
y1 :c            T∃
y1 :∃R.c         T∃CA
y1 R y2          T∃
y2 :c            T∃
y2 :∃R.c         T∃CA
. . .
Solution: Loop check: instead of a new variable y take an old variable z, if we can
guarantee that whatever holds for y already holds for z. We can only do this, iff the
T∀ -rule has been exhaustively applied.
Theorem 16.3.44. The consistency problem of ALC with concept axioms is decid-
able.
Proof sketch: TALC with a suitable loop check terminates.
If we combine classification with the instance test, then we get the full picture of how concepts
and individuals relate to each other. We see that we get the full expressivity of semantic networks
in ALC.
Realization
Definition 16.3.47. Realization is the computation of all instance relations be-
tween ABox objects and TBox concepts.
Observation: It is sufficient to remember the lowest concepts in the subsumption graph.
[Figure: a subsumption graph with top concept object and subconcepts such as person, with the
ABox objects attached to the lowest concepts they instantiate]
Let us now get an intuition on what kinds of interactions can occur between the various parts of an ontology.
property                                    example
internally inconsistent                     ABox: tony:student, tony:¬student
inconsistent with a TBox                    TBox: ¬(student ⊓ prof); ABox: tony:student, tony:prof
implicit info that is not explicit          ABox: tony:∀has_grad.genius, tony has_grad mary |= mary:genius
information that can be combined            TBox: happy_prof = prof ⊓ ∀has_grad.genius;
with TBox info                              ABox: tony:happy_prof, tony has_grad mary |= mary:genius
Algorithm: Test a:φ for consistency with ABox and TBox. (use our tableau
algorithm)
Necessary changes: (no big deal)
Normalize ABox wrt. TBox. (definition expansion)
Initialize the tableau with ABox in NNF. (so it can be used)
Example 16.3.50.
Idea: Extend to more complex ABox queries. (e.g. give me all instances of φ)
This completes our investigation of inference for ALC. We summarize that ALC is a logic-based on-
tology language where the inference problems are all decidable/computable via TALC . But of course,
while we have reached the expressivity of basic semantic networks, there are still things that we
cannot express in ALC, so we will try to extend ALC without losing decidability/computability.
Note that all these examples have in common that they are about “objects on the Web”, which is
an aspect we will come to now.
“Objects on the Web” are traditionally called “resources”. Rather than defining them by their
intrinsic properties – which would be ambitious and prone to change – we take an external property
to define them: everything that has a URI is a web resource. This has repercussions on the design
of RDF.
Definition 16.4.3. A resource is anything that can have a URI, such as http://www.fau.de.
Definition 16.4.4. A property is a resource that has a name, such as author
or homepage, and a property value is the value of a property, such as Michael
Kohlhase or http://kwarc.info/kohlhase. (a property value can be another
resource)
Definition 16.4.5. An RDF statement s (also known as a triple) consists of a
resource (the subject of s), a property (the predicate of s), and a property value
(the object of s). A set of RDF triples is called an RDF graph.
The crucial observation here is that if we map “subjects” and “objects” to “individuals”, and
“predicates” to “relations”, the RDF triples are just relational ABox statements of description
logics. As a consequence, the techniques we developed apply.
Note: Actually, an RDF graph is technically a labeled multigraph, which allows multiple edges
between any two nodes (the resources) and where nodes and edges are labeled by URIs.
We now come to the concrete syntax of RDF. This is a relatively conventional XML syntax that
combines RDF statements with a common subject into a single “description” of that resource.
Note that XML namespaces play a crucial role in using element names to encode the predicate URIs.
Recall that an element name is a qualified name that consists of a namespace URI and a proper
element name (without a colon character). Concatenating them gives a URI; in our example the
predicate URI induced by the dc:creator element is http://purl.org/dc/elements/1.1/creator.
Note that as URIs go RDF URIs do not have to be URLs, but this one is and it references (is
redirected to) the relevant part of the Dublin Core elements specification [DCM12].
RDF was deliberately designed as a standoff markup format, where URIs are used to annotate
web resources by pointing to them, so that it can be used to give information about web resources
without having to change them. But this also creates maintenance problems, since web resources
may change or be deleted without warning.
RDFa gives authors a way to embed RDF triples into web resources and makes it easier to keep
the RDF statements about them in sync.
Example 16.4.9.
<div xmlns:dc="http://purl.org/dc/elements/1.1/" id="address">
<h2 about="#address" property="dc:title">RDFa as an Inline RDF Markup Format</h2>
<h3 about="#address" property="dc:creator">Michael Kohlhase</h3>
<em about="#address" property="dc:date" datatype="xsd:date"
content="2009−11−11">November 11., 2009</em>
</div>
[Figure: the RDF triples extracted from the RDFa above – the subject is the enclosing document (here
a slide source file on svn.kwarc.info), the predicates are http://purl.org/dc/elements/1.1/title,
.../creator, and .../date, and the objects are “RDFa as an Inline RDF Markup Format”,
“Michael Kohlhase”, and 2009−11−11 (xsd:date)]
In the example above, the about and property attributes are reserved by RDFa and specify the
subject and predicate of the RDF statement. The object consists of the body of the element,
unless otherwise specified, e.g. by the content and datatype attributes for literal content.
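The triples encoded by the RDFa snippet above can also be built programmatically; the following is a hedged sketch using the rdflib Python library (not part of the course materials), with a hypothetical subject URI standing in for the enclosing document:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

DC = Namespace("http://purl.org/dc/elements/1.1/")
doc = URIRef("http://example.org/rdfa-page#address")   # hypothetical subject URI

g = Graph()
g.add((doc, DC.title, Literal("RDFa as an Inline RDF Markup Format")))
g.add((doc, DC.creator, Literal("Michael Kohlhase")))
g.add((doc, DC.date, Literal("2009-11-11", datatype=XSD.date)))

print(g.serialize(format="turtle"))   # the same three ABox-style triples in Turtle syntax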
Let us now come back to the fact that RDF is just an XML syntax for ABox statements.
In this situation, we want a standardized representation language for TBox information; OWL
does just that: it standardizes a set of knowledge representation primitives and specifies a variety
of concrete syntaxes for them. OWL is designed to be compatible with RDF, so that the two
together can form an ontology language for the web.
But there are also other syntaxes in regular use. We show the functional syntax which is inspired
by the mathematical notation of relations.
Example 16.4.13. The semantic network from ?? can be expressed in OWL (in
functional syntax)
We have introduced the ideas behind using description logics as the basis of a “machine-oriented
web of data”. While the first OWL specification (2004) had three sublanguages “OWL Lite”, “OWL
DL” and “OWL Full”, of which only the middle was based on description logics, with the OWL2
Recommendation from 2009, the foundation in description logics was nearly universally accepted.
The semantic web hype is by now nearly over, the technology has reached the “plateau of
productivity” with many applications being pursued in academia and industry. We will not go
into these, but briefly introduce one of the tools that make this work.
SPARQL end-points can be used to build interesting applications, if fed with the appropriate data.
An interesting – and by now paradigmatic – example is the DBPedia project, which builds a large
ontology by analyzing Wikipedia fact boxes. These are in a standard HTML form which can be
analyzed e.g. by regular expressions, and their entries are essentially already in triple form: The
subject is the Wikipedia page they are on, the predicate is the key, and the object is either the
URI of the object value (if it carries a link) or the value itself.
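As a rough illustration (assuming the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql is reachable; the resource chosen is arbitrary), a few such triples can be retrieved programmatically:

import requests

query = """
SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Nuremberg> ?p ?o } LIMIT 5
"""
resp = requests.get("https://dbpedia.org/sparql",
                    params={"query": query,
                            "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:      # standard SPARQL JSON results
    print(row["p"]["value"], row["o"]["value"])     # predicate URI and object value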
We conclude our survey of the semantic web technology stack with the notion of a triple store:
the database component that stores vast collections of ABox triples.
This part covers the AI subfield of “planning”, i.e. search-based problem solving with a structured
representation language for environment state and actions — in planning, the focus is on the latter.
We first introduce the framework of planning (structured representation languages for problems
and actions) and then present algorithms and complexity results. Finally, we lift some of the
simplifying assumptions – deterministic, fully observable environments – we made in the previous
parts of the course.
Chapter 17
Planning I: Framework
Planning
Ambition: Write one program that can solve all classical search problems.
Idea: For CSP, going from “state/action-level search” to “problem-description level
search” did the trick.
Definition 17.0.2. Let Π be a search problem (see ??)
Let us recall the agent-based setting we were using for the inference procedures from ??. We will
elaborate this further in this section.
[Figure: a model-based reflex agent (AIMA Figure 2.12) – sensors feed percepts into an internal state,
which is updated using models of “how the world evolves” and “what my actions do”; the agent then
chooses an action and passes it to the actuators acting on the environment]
Still Unspecified: (up next)
MAKE−PERCEPT−SENTENCE: the effects of percepts.
MAKE−ACTION−QUERY: what is the best next action?
MAKE−ACTION−SENTENCE: the effects of that action.
In particular, we will look at the effect of time/change. (neglected so far)
Now that we have the notion of fluents to represent the percepts at a given time point, let us try
to model how they influence the agent’s world model.
Axioms like these model the agent’s sensors – here that they are totally reliable:
there is a breeze, iff the agent feels a draft.
Definition 17.1.5. We call fluents that describe the agent’s sensors sensor axioms.
You may have noticed that for the sensor axioms we have only used first-order logic. There is a
general story to tell here: If we have finite domains (as we do in the Wumpus cave) we can always
“compile first-order logic into propositional logic”; if domains are infinite, we usually cannot.
We will develop this here before we go on with the Wumpus models.
We now continue to our logic-based agent models: Now we focus on effect axioms to model the
effects of an agent’s actions.
Example 17.1.6. The action of “going forward” at time t is captured by the fluent
forw(t).
Definition 17.1.7. Effect axioms describe how the environment changes under an
agent’s actions.
Example 17.1.8. If the agent is in cell [1, 1] facing east at time 0 and goes forward,
she is in cell [2, 1] and no longer in [1, 1]:
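In the fluent notation used here, such an effect axiom could be written as follows (one possible formalization; the predicate names are chosen for illustration):

At([1, 1], 0) ∧ FacingEast(0) ∧ forw(0) ⇒ At([2, 1], 1) ∧ ¬At([1, 1], 1)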
Unfortunately, the percept fluents, sensor axioms, and effect axioms are not enough, as we will
show in ??. We will see that this is a more general problem – the famous frame problem.
then some special code for action selection, and then (up next)
action := POP(plan)
TELL(KB, MAKE−ACTION−SENTENCE(action,t))
t := t + 1
return action
Note that OK, wumpus, and glitter are fluents, since the Wumpus might have died
or the gold might have been grabbed.
And finally the route planning part of the code. This is essentially just A∗ search.
Evaluation: Even though this works for the Wumpus world, it is not the “universal,
logic-based problem solver” we dreamed of!
Planning tries to solve this with another representation of actions. (up next)
Application: Business Process Templates at SAP
[Figure: a business process template at SAP – Create CQ, Submit CQ, Check CQ Completeness,
Check CQ Consistency, Decide CQ Approval (approval necessary or not necessary), Mark CQ as
Accepted, Create Follow-Up for CQ, Check CQ Approval Status, Archive CQ]
[Figure (repeated over four slides): a network security scenario – an attacker on the Internet, behind
a router and firewall a DMZ with a web server and an application server, and an internal network
with workstations, a DB server, and sensitive users]
Quick: Rapid prototyping: 10s lines of problem description vs. 1000s lines of C++
code. (E.g. language generation)
Flexible: Adapt/maintain the description. (E.g. network security)
Intelligent: Determines automatically how to solve a complex problem efficiently!
(The ultimate goal, no?!)
Efficiency loss: Without any domain-specific knowledge about chess, you don’t
beat Kasparov . . .
Trade-off between “automatic and general” vs. “manual work but efficient”.
Research Question: How to make fully automatic algorithms efficient?
Search Planning
States Lisp data structures Logical sentences
Actions Lisp code Preconditions/outcomes
Goal Lisp code Logical sentence (conjunction)
Plan Sequence from S0 Constraints on actions
Recall: Our heuristic search algorithms (duplicate pruning omitted for simplicity)
function Greedy_Best−First_Search (problem)
returns a solution, or failure
For A∗ :
order frontier by g + h instead of h (line 4)
insert g(n′ ) + h(n′ ) instead of h(n′ ) into frontier (last line)
Observation 17.2.4. State spaces typically are huge even for simple problems.
In other words: Even solving “simple problems” automatically (without help from
a human) requires a form of intelligence.
With blind search, even the largest super computer in the world won’t scale beyond
20 blocks!
Focussing on heuristic search as the solution method, this is the main question
that needs to be answered.
3. The PDDL Language: What do the input files for off-the-shelf planning software
look like?
So you can actually play around with such software. (Exercises!)
4. Planning Complexity: How complex is planning?
The price of generality is complexity, and here’s what that “price” is, exactly.
Prerequisite/Result:
[Figure: the Sussman anomaly – initial state: C on A, A and B on the table; goal state: A on B on C]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
STRIPS Planning
Definition 17.4.1. STRIPS = Stanford Research Institute Problem Solver.
I’ll outline some extensions beyond STRIPS later on, when we discuss PDDL.
Historical note: STRIPS [FN71] was originally a planner (cf. Shakey), whose
language actually wasn’t quite that simple.
We will often give each action a ∈ A a name (a string), and identify a with that
name.
Note: We assume, for simplicity, that every action has cost 1. (Unit costs, cf.
??)
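To make the STRIPS formalization concrete, here is a minimal Python sketch (not part of the course materials; the encoding of facts as strings and the simple breadth-first planner are illustrative assumptions), using the simplified trip task from the following example as data:

from collections import namedtuple

Action = namedtuple("Action", "name pre add delete")

actions = [
    Action("drv(Sy,Ad)", {"at(Sy)"}, {"at(Ad)", "vis(Ad)"}, {"at(Sy)"}),
    Action("drv(Ad,Sy)", {"at(Ad)"}, {"at(Sy)"}, {"at(Ad)"}),
    Action("drv(Sy,Br)", {"at(Sy)"}, {"at(Br)", "vis(Br)"}, {"at(Sy)"}),
    Action("drv(Br,Sy)", {"at(Br)"}, {"at(Sy)"}, {"at(Br)"}),
]
init = frozenset({"at(Sy)", "vis(Sy)"})
goal = {"vis(Sy)", "vis(Ad)", "vis(Br)"}

def successors(state):
    for a in actions:
        if a.pre <= state:                          # action applicable?
            yield a.name, frozenset((state - a.delete) | a.add)

def bfs_plan(state, goal):
    """Breadth-first search over the induced state space."""
    frontier, seen = [(state, [])], {state}
    while frontier:
        s, plan = frontier.pop(0)
        if goal <= s:
            return plan
        for name, t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append((t, plan + [name]))

print(bfs_plan(init, goal))   # e.g. ['drv(Sy,Ad)', 'drv(Ad,Sy)', 'drv(Sy,Br)']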
“TSP” in Australia
Example 17.4.3 (Salesman Travelling in Australia).
[Figure: the state space of the simplified task – states are sets of facts over at(Sy), at(Ad), at(Br),
vis(Sy), vis(Ad), vis(Br); the actions drv(Sy, Ad), drv(Ad, Sy), drv(Sy, Br), and drv(Br, Sy) move
between them]
Answer: Yes, two – plans for TSP− are solutions for ΘTSP− , dashed node ≙ I,
thick nodes ≙ G:
drv(Sy, Br), drv(Br, Sy), drv(Sy, Ad) (upper path)
drv(Sy, Ad), drv(Ad, Sy), drv(Sy, Br). (lower path)
The Blocksworld
Definition 17.4.8. The blocks world is a simple planning domain: a set of wooden
blocks of various shapes and colors sit on a table. The goal is to build one or more
vertical stacks of blocks. Only one block may be moved at a time: it may either be
placed on the table or placed atop another block.
Example 17.4.9.
[Figure: an example blocks world planning task – an initial configuration and a goal configuration of
the blocks A–E]
The next example for a planning task is not obvious at first sight, but has been quite influential,
showing that many industry problems can be specified declaratively by formalizing the domain
and the particular planning tasks in PDDL and then using off-the-shelf planners to solve them.
[KS00] reports that this has significantly reduced labor costs and increased maintainability of the
implementation.
[Figure: passenger types in the elevator (Miconic-10) domain – P: normal passenger, VIP, D, NA: never-alone, AT: attendant.]
[Figure: the blocks world task from above – initial state C on A with B on the table, goal A on B on C.]
Simple planners that split the goal into subgoals on(A, B) and on(B, C) fail:
Before we go into the details, let us try to understand the main ideas of partial order planning.
We now make the ideas discussed above concrete by giving a mathematical formulation. It is
advantageous to cast a partially ordered plan as a labeled DAG rather than a partial ordering
since it draws attention to the difference between actions and steps.
Notation: A causal link S −p→ T can also be denoted by a direct arrow between the effect p of S and the precondition p of T in the STRIPS action notation above. Temporal constraints are shown as dashed arrows.
Planning Process
Definition 17.5.10. Partial order planning is search in the space of partial plans
via the following operations:
add link from an existing action to an open precondition,
add step (an action with links to other steps) to fulfil an open precondition,
order one step wrt. another (by adding temporal constraints) to remove possible
conflicts.
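The following minimal Python sketch shows one way the space of partial plans and the three refinement operations could be represented; the class and method names are illustrative assumptions, not the notes' formalization.

from dataclasses import dataclass, field

@dataclass
class PartialPlan:
    steps: dict = field(default_factory=dict)       # step id -> action (incl. Start/Finish)
    orderings: set = field(default_factory=set)     # temporal constraints: (before, after)
    causal_links: set = field(default_factory=set)  # (producer, fact p, consumer)
    open_preconds: set = field(default_factory=set) # (fact p, step that still needs p)

    # The three refinement operations of Definition 17.5.10:
    def add_link(self, producer, p, consumer):
        self.causal_links.add((producer, p, consumer))
        self.orderings.add((producer, consumer))
        self.open_preconds.discard((p, consumer))

    def add_step(self, step_id, action, p, consumer):
        self.steps[step_id] = action
        for q in action.pre:                        # a new step brings new open preconditions
            self.open_preconds.add((q, step_id))
        self.add_link(step_id, p, consumer)

    def order(self, before, after):                 # resolve a possible conflict by ordering
        self.orderings.add((before, after))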
[Figure: step-by-step construction of a partially ordered plan for the shopping domain – the Start step provides At(Home), Sell(HWS, Drill), Sell(SM, Milk); the steps Go(HWS), Go(SM), and Go(Home) are successively added, with causal links for At(Home), At(HWS), and At(SM), to fulfill the open preconditions of the Finish step.]
Here we show a successful search for a partially ordered plan. We start out by initializing the plan
with the respective start and finish steps. Then we consecutively add steps to fulfill the open
preconditions – marked in red – starting with those of the finish step.
In the end we add three temporal constraints that complete the partially ordered plan.
The search process for the links and steps is relatively plausible and standard in this example, but
we do not have any idea where the temporal constraints should systematically come from. We
look at this next.
[Figure: resolving a threat – the step Go(Home) conflicts with the causal link Go(SM) –At(SM)→ Buy(Milk); demotion ≙ put the conflicting step before the link, promotion ≙ put it after the link.]
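Assuming a partial-plan representation like the sketch above, threat resolution by promotion/demotion can be phrased as a tiny helper that returns the two candidate ordering constraints; this is an illustrative sketch, not the notes' algorithm.

def threat_resolutions(clobberer, producer, consumer):
    """Candidate orderings protecting the causal link producer --p--> consumer
    against a step that deletes p."""
    return [
        (clobberer, producer),   # demotion: put the clobbering step before the link
        (consumer, clobberer),   # promotion: put the clobbering step after the link
    ]
    # A POP planner tries each candidate (plan.order(a, b)) and backtracks if it
    # creates a cycle in the ordering constraints.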
Properties of POP
Nondeterministic algorithm: backtracks at choice points on failure:
Initializing the partial order plan with Start and Finish.
[Figure: successive snapshots of the partial order plan for the blocks world task – Start provides On(C, A), On(A, T), Cl(B), On(B, T), Cl(C); the steps Move(B, C) (preconditions Cl(B), Cl(C); effects ¬Cl(C), On(B, C)) and Move(A, B) (preconditions Cl(A), Cl(B); effects ¬Cl(B), On(A, B)) are added with causal links to fulfill the open preconditions.]
17.7 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26900.
Summary
General problem solving attempts to develop solvers that perform well across a large
class of problems.
Planning, as considered here, is a form of general problem solving dedicated to the
class of classical search problems. (Actually, we also address inaccessible, stochastic,
dynamic, continuous, and multi-agent settings.)
Suggested Reading:
• Chapters 10: Classical Planning and 11: Planning and Acting in the Real World in [RN09].
– Although the book is named “A Modern Approach”, the planning section was written long
before the IPC was even dreamt of, before PDDL was conceived, and several years before
heuristic search hit the scene. As such, what we have right now is the attempt of two outsiders
trying in vain to catch up with the dramatic changes in planning since 1995.
– Chapter 10 is Ok as a background read. Some issues are, imho, misrepresented, and it’s far
from being an up-to-date account. But it’s Ok to get some additional intuitions in words
different from my own.
– Chapter 11 is useful in our context here because we don’t cover any of it. If you’re interested
in extended/alternative planning paradigms, do read it.
• A good source for modern information (some of which we covered in the course) is Jörg
Hoffmann’s Everything You Always Wanted to Know About Planning (But Were Afraid to
Ask) [Hof11] which is available online at http://fai.cs.uni-saarland.de/hoffmann/papers/ki11.pdf
Chapter 18
Planning II: Algorithms
18.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26901.
In planning, this is referred to as forward search, or forward state-space search.
[Figure: heuristic search – a map from init to goal where the expanded states carry cost estimates h.]
Heuristic function h estimates the cost of an optimal path from a state s to the
goal state; search prefers to expand states s with small h(s).
Live Demo vs. Breadth-First Search:
http://qiao.github.io/PathFinding.js/visual/
Exactly like our definition from ??, except that, because we assume unit costs here, we use N instead of R+.
Definition 18.1.2. Let Π be a STRIPS task with states S. The perfect heuristic
h∗ assigns every s ∈ S the length of a shortest path from s to a goal state, or ∞
if no such path exists. A heuristic h for Π is admissible if, for all s ∈ S, we have
h(s) ≤ h∗ (s).
Exactly like our definition from ??, except for path length instead of path cost (cf.
above).
In all cases, we attempt to approximate h∗ (s), the length of an optimal plan for s.
Some algorithms guarantee to lower bound h∗ (s).
The delete relaxation is the most successful method for the automatic generation
of heuristic functions. It is a key ingredient to almost all IPC winners of the last
decade. It relaxes STRIPS tasks by ignoring the delete lists.
The h+ Heuristic: What is the resulting heuristic function?
How to Relax
[Diagram: relaxation – the perfect heuristic h∗P maps P to N ∪ {∞}; the relaxation mapping R maps P into the simpler class P′, whose perfect heuristic h∗P′ is used as the estimate.]
1. You have a class P of problems, whose perfect heuristic h∗P you wish to estimate.
2. You define a class P ′ of simpler problems, whose perfect heuristic h∗P ′ can be
used to estimate h∗P .
3. You define a transformation – the relaxation mapping R – that maps instances
Π ∈ P into instances Π′ ∈ P ′ .
4. Given Π ∈ P, you let Π′ := R(Π), and estimate h∗P(Π) by h∗P′(Π′).
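The recipe translates almost literally into code; a minimal sketch, assuming a relaxation mapping relax (the R above) and a solver solve_optimally for the simpler class P′ are provided, and that tasks support restarting from a given state.

def relaxation_heuristic(task, state, relax, solve_optimally):
    """Estimate h*(s) in the original task by optimally solving the relaxed
    instance: h(s) := h*_{P'}(R(Pi_s))."""
    task_s = task.with_initial_state(state)   # Pi_s: the original task started in s
    relaxed = relax(task_s)                   # step 3: apply the relaxation mapping R
    return solve_optimally(relaxed)           # step 4: perfect heuristic of the simpler class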
Relaxation in Route-Finding
We will start with a very simple relaxation, which could be termed “positive thinking”: we do not consider preconditions of actions and leave out the delete lists as well.
[Figure series: greedy best-first search (tie-breaking: alphabetic) on the transportation example, guided by the “only-adds” relaxation. States are written XY for “truck at X, package at Y” (T ≙ in the truck); the initial state is AC and the goal is AD. At each expansion the real problem (actions with preconditions, add and delete lists) generates successors via drive (drXY), load (loX), and unload (ulX) actions, while the relaxed problem (add lists only) yields the heuristic value, e.g. hR(AC) = 1 via ⟨ulD⟩ and hR(BC) = hR(CC) = hR(DC) = hR(CT) = 2 via ⟨drBA, ulD⟩. The only-adds heuristic values stay between 1 and 2 throughout, so they give the search hardly any guidance.]
[Diagram: the delete relaxation is a native relaxation – P′ ⊆ P, and h∗P′ (with values in N ∪ {∞}) is used to estimate h∗P.]
“When the world changes, its previous state remains true as well.”
[Figure: real world vs. relaxed world, before and after an action – in the relaxed world everything that was true before remains true after the action.]
In other words, the class of simpler problems P ′ is the set of all STRIPS tasks with
empty delete lists, and the relaxation mapping R drops the delete lists.
Definition 18.3.2 (Relaxed Plan). Let Π := ⟨P , A, I , G⟩ be a STRIPS task, and
let s be a state. A relaxed plan for s is a plan for ⟨P , A, s, G⟩+ . A relaxed plan for
I is called a relaxed plan for Π.
A relaxed plan for s is an action sequence that solves s when pretending that all
delete lists are empty.
Also called delete-relaxed plan: “relaxation” is often used to mean delete relaxation
by default.
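Over a list-based STRIPS representation, the relaxation mapping itself is a one-liner; a sketch, assuming actions are given as (name, pre, add, delete) tuples of fact sets.

def delete_relax(task):
    """Return the delete relaxation Pi^+ of a STRIPS task: identical to Pi,
    except that every action's delete list is emptied."""
    facts, actions, init, goal = task
    relaxed_actions = [(name, pre, add, frozenset())
                       for (name, pre, add, dele) in actions]
    return (facts, relaxed_actions, init, goal)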
load(x)+ : “truck(x), pack(x) ⇒ pack(T)”.
unload(x)+ : “truck(x), pack(T) ⇒ pack(x)”.
Relaxed plan:
⟨drive(A, B)+, drive(B, C)+, load(C)+, drive(C, D)+, unload(D)+⟩
We don’t need to drive the truck back, because “it is still at A”.
PlanEx+
Definition 18.3.3 (Relaxed Plan Existence Problem). By PlanEx+ , we denote
the problem of deciding, given a STRIPS task Π := ⟨P , A, I , G⟩, whether or not
there exists a relaxed plan for Π.
Iterations on F :
1. {at(Sy), vis(Sy)}
2. ∪ {at(Ad), vis(Ad), at(Br), vis(Br)}
3. ∪ {at(Da), vis(Da), at(Pe), vis(Pe)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{truck(D), pack(T )}
5. ∪{pack(A), pack(B), pack(D)}
2. ∪{truck(B)}
3. ∪{truck(C)}
4. ∪{pack(T )}
5. ∪{pack(A), pack(B)}
6. ∪∅
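The iterations on F above are exactly the fixpoint that decides PlanEx+ in polynomial time: keep adding the add lists of all actions whose preconditions are already contained in F until nothing changes, and check whether the goal ends up inside F. A sketch in Python, assuming actions are given as (pre, add) pairs of fact sets:

def relaxed_reachable_facts(init, actions):
    """Fixpoint of F := I; F := F plus the add lists of all actions applicable in F."""
    F = set(init)
    changed = True
    while changed:
        changed = False
        for pre, add in actions:
            if pre <= F and not add <= F:
                F |= add
                changed = True
    return F

def plan_ex_plus(init, actions, goal):
    """Decide PlanEx+: does a relaxed plan exist?"""
    return goal <= relaxed_reachable_facts(init, actions)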
h+ is Admissible
Lemma 18.4.3. Let Π := ⟨P, A, I, G⟩ be a STRIPS task, and let s be a state. If ⟨a1, . . ., an⟩ is a plan for Πs := ⟨P, A, {s}, G⟩, then ⟨a1+, . . ., an+⟩ is a plan for Πs+.
If we ignore deletes, the states along the plan can only get bigger.
Theorem 18.4.4. h+ is admissible.
Proof:
1. Let Π := ⟨P, A, I, G⟩ be a STRIPS task with states S, and let s ∈ S.
2. h+(s) is defined as the length of an optimal plan for Πs+.
3. By the lemma above, any plan for Πs also constitutes a plan for Πs+.
4. Thus the optimal plan length in Πs+ can only be shorter than that in Πs, and the claim follows.
[Figure series: greedy best-first search (tie-breaking: alphabetic) on the same transportation example, now guided by h+ (the relaxed problems keep preconditions and add lists, but ignore delete lists). Along the search the h+ values decrease towards the goal: h+ = 5 for the early states AC, BC, CC, DC, then h+(CT) = 4 with relaxed plan e.g. ⟨drCB, drBA, drCD, ulD⟩, and 4, 3, 2, 1, 0 for DT, DD, CD, BD, and the goal state AD along the path loC, drCD, ulD, drDC, drCB, drBA. In contrast to the only-adds relaxation, h+ provides useful guidance here.]
h+ in the Blocksworld
[Figure: blocks world task – initial state: A is held by the arm, B is on D, C and D are on the table; goal: A on B on C.]
Optimal plan: ⟨putdown(A), unstack(B, D), stack(B, C), pickup(A), stack(A, B)⟩.
Optimal relaxed plan: ⟨stack(A, B), unstack(B, D), stack(B, C)⟩.
Observation: What can we say about the “search space surface” at the initial
state here?
The initial state lies on a local minimum under h+ , together with the successor
state s where we stacked A onto B. All other direct neighbors of these two states
have a strictly higher h+ value.
18.5 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26906.
Summary
Heuristic search on classical search problems relies on a function h mapping states
s to an estimate h(s) of their goal state distance. Such functions h are derived by
solving relaxed problems.
In planning, the relaxed problems are generated and solved automatically. There
are four known families of suitable relaxation methods: abstractions, landmarks,
critical paths, and ignoring deletes (aka delete relaxation).
The delete relaxation consists in dropping the deletes from STRIPS tasks. A relaxed
plan is a plan for such a relaxed task. h+ (s) is the length of an optimal relaxed plan
for state s. h+ is NP-hard to compute.
hFF approximates h+ by computing some, not necessarily optimal, relaxed plan.
That is done by a forward pass (building a relaxed planning graph), followed by a
backward pass (extracting a relaxed plan).
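As an illustration of the two passes just described, here is a much-simplified Python sketch of an hFF-style computation: a forward pass builds relaxed reachability with "first achievers", and a backward pass chains back from the goal, counting the selected actions. This is not Hoffmann's actual FF implementation, and the action format (name, pre, add) is an assumption.

def h_ff(state, actions, goal):
    """Approximate h+ by extracting some (not necessarily optimal) relaxed plan."""
    # Forward pass: grow the set of reachable facts, remembering a first achiever.
    achiever, reached = {}, set(state)
    progress = True
    while progress and not goal <= reached:
        progress = False
        for name, pre, add in actions:
            if pre <= reached:
                for p in add - reached:
                    achiever[p] = (name, pre)
                    reached.add(p)
                    progress = True
    if not goal <= reached:
        return float("inf")             # not even a relaxed plan exists
    # Backward pass: chain back from the goal facts through the achievers.
    plan, closed = set(), set()
    agenda = [p for p in goal if p not in state]
    while agenda:
        p = agenda.pop()
        if p in closed:
            continue
        closed.add(p)
        name, pre = achiever[p]
        plan.add(name)
        agenda.extend(q for q in pre if q not in state)
    return len(plan)                    # length of the extracted relaxed plan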
Hand-tailored planning: Automatic planning is the extreme case where the com-
puter is given no domain knowledge other than “physics”. We can instead allow the
user to provide search control knowledge, trading off modeling effort against search
performance.
Numeric planning, temporal planning, planning under uncertainty . . .
Outline
So Far: we made idealizing/simplifying assumptions:
The environment is fully observable and deterministic.
Chapter 19
Searching, Planning, and Acting in the Real World
19.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/26908.
Definition 19.1.4. The qualification problem in planning is that we can never finish
listing all the required preconditions and possible conditional effects of actions.
Root Cause: The environment is partially observable and/or non-deterministic.
Technical Problem: We cannot know the “current state of the world”, but search/planning algorithms are based on this assumption.
We formalize the example in PDDL for simplicity. Note that the :percept scheme is not part of
the official PDDL, but fits in well with the design.
The PDDL problem file has a “free” variable ?c for the (undetermined) joint
color.
(define (problem tc−coloring)
(:domain furniture−objects)
(:objects table chair c1 c2)
(:init (object table) (object chair) (can c1) (can c2) (inview table))
(:goal (color chair ?c) (color table ?c)))
Two action schemata: remove can lid to open and paint with open can
(:action remove−lid
:parameters (?x)
:precondition (can ?x)
:effect (open ?x))
(:action paint
:parameters (?x ?y)
:precondition (and (object ?x) (can ?y) (color ?y ?c) (open ?y))
:effect (color ?x ?c))
has a universal variable ?c for the paint action ⇝ we cannot just give paint a
color argument in a partially observable environment.
Sensorless Plan: Open one can, paint chair and table in its color.
Note: Contingent planning can create better plans, but needs perception
Two percept schemata: color of an object and color in a can
(:percept color
:parameters (?x ?c)
:precondition (and (object ?x) (inview ?x)))
(:percept can−color
:parameters (?x ?c)
:precondition (and (can ?x) (inview ?x) (open ?x)))
To perceive the color of an object, it must be in view; to perceive the color in a can, the can must also be open.
Note: In a fully observable world, the percepts would not have preconditions.
An action schema: looking at an object causes it to come into view.
(:action lookat
:parameters (?x)
:precondition (and (inview ?y) (notequal ?x ?y))
:effect (and (inview ?x) (not (inview ?y))))
Contingent Plan:
1. look at furniture to determine color, if same ; done.
2. else, look at open and look at paint in cans
3. if paint in one can is the same as an object, paint the other with this color
4. else paint both in any color
Conditional Plans
Definition 19.3.1. Conditional plans extend the possible actions in plans by conditional steps that execute sub plans depending on whether K + P ⊨ C holds, where K + P is the current knowledge base plus the percepts.
Definition 19.3.3. If the possible percepts are limited to determining the current
state in a conditional plan, then we speak of a contingency plan.
Note: Need some plan for every possible percept! Compare to
game playing: some response for every opponent move.
backchaining: some rule such that every premise satisfied.
Idea: Use an AND–OR tree search (very similar to backward chaining algorithm)
[Figure: part of the AND-OR search tree for the vacuum world – OR choices Suck and Right, with leaves marked GOAL and LOOP.]
Idea: Use AND-OR trees as data structures for representing problems (or goals) that can be reduced to conjunctions and disjunctions of subproblems (or subgoals).
Definition 19.3.5. An AND-OR graph is a graph whose non-terminal nodes are partitioned into AND nodes and OR nodes. A valuation of an AND-OR graph T is an assignment of T or F to the nodes of T. A valuation of the terminal nodes of T can be extended to all nodes recursively: assign T to an
OR node, iff at least one of its children is T,
AND node, iff all of its children are T.
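A minimal Python sketch of the recursive AND-OR search over such problems, in the spirit of AIMA's AND-OR-GRAPH-SEARCH but simplified; the problem interface (is_goal, actions, results) is an assumption for illustration.

def and_or_search(state, problem, path=()):
    """Return a conditional plan (nested lists/dicts) or None on failure."""
    if problem.is_goal(state):
        return []                                  # empty plan: goal reached
    if state in path:
        return None                                # loop: fail on this branch
    for action in problem.actions(state):
        # AND node: we need a sub plan for *every* possible outcome of the action.
        subplans = {}
        for outcome in problem.results(state, action):
            sub = and_or_search(outcome, problem, path + (state,))
            if sub is None:
                break                              # one outcome unsolved -> try the next action
            subplans[outcome] = sub
        else:
            return [action, subplans]              # OR node: one working action suffices
    return None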
[L1 : left, if AtR then L1 else [if CleanL then ∅ else suck fi] fi] or
[while AtR do [left] done, if CleanL then ∅ else suck fi]
We have an infinite loop, but the plan eventually works unless the action always fails.
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
mix the ideas from the last two. (sensor model + transition relation)
Conformant/Sensorless Planning
Definition 19.5.1. Conformant or sensorless planning tries to find plans that work without any sensing.
Observation 19.5.3. In a sensorless world we do not know the initial state. (or
any state after)
Observation 19.5.4. Sensorless planning must search in the space of belief states
(sets of possible actual states).
Let us see if we can understand the options for T b(a, S) a bit better. The first question is when we want an action a to be applicable to a belief state S (a set of world states), i.e. when T b(a, S) should be non-empty.
In the first case, a^b would be applicable iff a is applicable to some s ∈ S; in the second case, iff a is applicable to all s ∈ S. So we only want to choose the first case if actions are harmless.
The second question we ask ourselves is what the result of applying a to S should be. Again, if actions are harmless, we can just collect the results; otherwise, we need to make sure that all members of the result a^b are reached for all possible states in S.
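The two readings discussed here can be put side by side in code; a hedged sketch, assuming applicable(a, s) and apply(a, s) from the underlying physical transition model, and treating the “applicable to some s” case by leaving the other states unchanged (one possible choice, not the notes' definition).

def tb_some(a, S, applicable, apply):
    """Option 1: a is applicable to the belief state S if it is applicable
    to *some* s in S; states where a is inapplicable stay as they are."""
    if not any(applicable(a, s) for s in S):
        return frozenset()                         # not applicable at all
    return frozenset(apply(a, s) if applicable(a, s) else s for s in S)

def tb_all(a, S, applicable, apply):
    """Option 2: a is applicable only if it is applicable to *all* s in S."""
    if not all(applicable(a, s) for s in S):
        return frozenset()
    return frozenset(apply(a, s) for s in S)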
[Figure 3.3 (AIMA): The state space for the vacuum world. Links denote actions: L = Left, R = Right, S = Suck.]
Problem: Belief states are HUGE; e.g. the initial belief state for the 10 × 10 vacuum world already contains 100 · 2^100 ≈ 10^32 physical states.
In ??, since the environment was observable and deterministic we could just use
offline planning.
In ?? because we chose to.
Note: If the world is nondeterministic or partially observable, then percepts usually provide information, i.e., split up the belief state.
The prediction stage computes the belief state b̂ := PRED(b, a) resulting from the action a.
The update stage determines, for each possible percept, the resulting belief state: UPDATE(b̂, o) := {s | o = PERC(s) and s ∈ b̂}.
The functions PRED and PERC are the main parameters of this model. We define RESULT(b, a) := {UPDATE(PRED(b, a), o) | o ∈ PossPERC(PRED(b, a))}.
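For deterministic actions, the prediction and update stages can be written out directly; a sketch, assuming a physical transition function T(s, a) and a percept function PERC(s) as illustrative parameters.

def PRED(b, a, T):
    """Prediction stage: the belief state after doing a in belief state b."""
    return frozenset(T(s, a) for s in b)

def UPDATE(b_hat, o, PERC):
    """Update stage: keep only the states consistent with the observed percept o."""
    return frozenset(s for s in b_hat if PERC(s) == o)

def RESULT(b, a, T, PERC):
    """All belief states that can result from doing a and then perceiving."""
    b_hat = PRED(b, a, T)
    possible_percepts = {PERC(s) for s in b_hat}
    return {o: UPDATE(b_hat, o, PERC) for o in possible_percepts}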
The action Right is deterministic; sensing disambiguates the resulting belief state into singletons. In the slippery world, Right is nondeterministic.
[Figure 4.14 (AIMA): Two examples of transitions in local-sensing vacuum worlds. (a) In the deterministic world, Right is applied in the initial belief state, resulting in a new predicted belief state with two possible physical states; for those states, the possible percepts are [R, Dirty] and [R, Clean], leading to two belief states, each of which is a singleton. (b) In the slippery world, Right is applied in the initial belief state, giving a new belief state with four physical states; for those states, the possible percepts are [L, Dirty], [R, Dirty], and [R, Clean], leading to three belief states as shown.]
Belief-State Search with Percepts
Observation: The belief-state transition model induces an AND-OR graph.
Idea: Use AND-OR search as in nondeterministic environments.
[Figure 4.15 (AIMA): The first level of the AND-OR search tree for a problem in the local-sensing vacuum world; Suck is the first action in the solution.]
Solution: [Suck, Right, if Bstate = {6} then Suck else [] fi]
Note: Belief-state problem ⇝ the conditional step tests on the belief-state percept. (The plan would not be executable in a partially observable environment otherwise.)
Contingent Planning
Definition 19.6.7. The generation of plan with conditional branching based on
percepts is called contingent planning, solutions are called contingent plans.
Appropriate for partially observable or non-deterministic environments.
Example 19.6.8. Continuing ??.
One of the possible contingent plans is
((lookat table) (lookat chair)
(if (and (color table c) (color chair c)) (noop)
((removelid c1) (lookat c1) (removelid c2) (lookat c2)
(if (and (color table c) (color can c)) ((paint chair can))
(if (and (color chair c) (color can c)) ((paint table can))
((paint chair c1) (paint table c1)))))))
Note: Variables in this plan are existential; e.g. in
line 2: If there is some joint color c of the table and chair ⇝ done.
line 4/5: The condition can be satisfied by [c1/can] or [c2/can] ⇝ instantiate accordingly.
Definition 19.6.9. During plan execution the agent maintains the belief state b,
chooses the branch depending on whether b ⊨ c for the condition c.
Note: The planner must make sure b ⊨ c can always be decided.
Here:
Given an action a and percepts p = p1 ∧ . . . ∧ pn , we have
Idea: Given such a mechanism for generating (exact or approximate) updated belief
states, we can generate contingent plans with an extension of AND-OR search over
belief states.
Extension: This also works for non-deterministic actions: we extend the represen-
tation of effects to disjunctions.
Questions about how ALeA is used, what it is like using ALeA, and questions about demographics.
A token is generated at the end of the survey (SAVE THIS CODE!)
A completed survey counts as a successful prepquiz in AI-1!
Look for Quiz 15 in the usual place (single question)
just submit the token to get full points
The token can also be used to exercise the rights of the GDPR.
The survey has no time limit, is free and anonymous, can be paused and continued later on, and can be cancelled.
https://ddi-survey.cs.fau.de/limesurvey/index.php/667123?lang=en
This URL will also be posted on the forum tonight.
[Figure 4.18 (AIMA): A simple maze problem. The agent starts at S and must reach G but knows nothing of the environment.]
Observation 19.7.5. No online algorithm can avoid dead ends in all state spaces.
Example 19.7.6. Two state spaces that lead an online agent into dead ends:
[Figure 4.19 (AIMA): (a) Two state spaces that might lead an online search agent into a dead end. Any given agent will fail in at least one of these spaces. (b) A two-dimensional environment that can cause an online search agent to follow an arbitrarily inefficient route to the goal. Whichever choice the agent makes, the adversary blocks that route with another long, thin wall, so that the path followed is much longer than the best possible path.]
Any given agent will fail in at least one of the spaces.
Definition 19.7.7. We call ?? an adversary argument.
Example 19.7.8. Forcing an online agent into an arbitrarily inefficient route:
Whichever choice the agent makes, the adversary can block it with a long, thin wall.
Observation: Dead ends are a real problem for robots: ramps, stairs, cliffs, . . .
Definition 19.7.9. A state space is called safely explorable, iff a goal state is reachable from every reachable state.
Idea: Depth first search seems a good fit. (must only travel for backtracking)
Replanning for Plan Repair
Generally: Replanning when the agent’s model of the world is incorrect.
Example 19.8.4 (Plan Repair by Replanning). Given a plan from S to G.
[Figure 11.12 (AIMA): At first, the sequence “whole plan” is expected to get the agent from S to G. The agent executes steps of the plan until it expects to be in state E, but observes that it is actually in O. The agent then replans for the minimal repair plus continuation to reach G.]
The agent executes wholeplan step by step, monitoring the rest of the plan.
After a few steps the agent expects to be in E, but observes state O.
Replanning: by calling the planner recursively
find state P in wholeplan and a plan repair from O to P . (P may be G)
minimize the cost of repair + continuation
Definition 19.8.5. There are three levels of execution monitoring: before executing
an action
action monitoring checks whether all preconditions still hold.
plan monitoring checks that the remaining plan will still succeed.
goal monitoring checks whether there is a better set of goals it could try to
achieve.
Note: ?? was a case of action monitoring leading to replanning.
Idea: On failure, resume planning (e.g. by POP) to achieve open conditions from
current state.
Definition 19.8.6. IPEM (Integrated Planning, Execution, and Monitoring):
Semester Change-Over
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure 2.1 (AIMA): Agents interact with environments through sensors and actuators.]
Simple Reflex Agents
[Figure 2.9 (AIMA): Schematic diagram of a simple reflex agent – sensors report what the world is like now, condition–action rules determine what action to do now, actuators execute it.]
function SIMPLE-REFLEX-AGENT(percept) returns an action
  persistent: rules, a set of condition–action rules
  state ← INTERPRET-INPUT(percept)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action
[Figure 2.10 (AIMA): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.]
Reflex Agents with State
[Figure: model-based reflex agent – the agent maintains an internal state using models of how the world evolves and what its actions do; a goal-based variant additionally asks what the world will be like if it does action A and what action it should do now to achieve its goals.]
function MODEL-BASED-REFLEX-AGENT(percept) returns an action
  state ← UPDATE-STATE(state, action, percept, model)
  rule ← RULE-MATCH(state, rules)
  action ← rule.ACTION
  return action
[Figure 2.12 (AIMA): A model-based reflex agent. It keeps track of the current state of the world, using an internal model. It then chooses an action in the same way as the reflex agent.]
[Figure: learning agent – a performance element selects actions; a critic compares the agent’s behavior against a performance standard and gives feedback; a learning element uses this feedback to make improvements and sets learning goals; a problem generator suggests exploratory actions.]
Rational Agent
Idea: Try to design agents that are successful. (do the right thing)
Definition 20.1.1. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date. This is called the MEU principle.
Note: A rational agent need not be perfect:
it only needs to maximize expected value (rational ≠ omniscient)
need not predict e.g. very unlikely but catastrophic events in the future
percepts may not supply all relevant information (rational ≠ clairvoyant)
if we cannot perceive things we do not need to react to them.
but we may need to try to find out about hidden dangers (exploration)
action outcomes may not be as expected (rational ≠ successful)
but we may need to take action to ensure that they do (more often) (learning)
Rational ⇝ exploration, learning, autonomy
20.2 Administrativa
We will now go through the ground rules for the course. This is a kind of a social contract
between the instructor and the students. Both have to keep their side of the deal to make learning
as efficient and painless as possible. If you have questions please make sure you discuss them
with the instructor, the teaching assistants, or your fellow students. There are three sensible
venues for such discussions: online in the lectures, in the tutorials, which we discuss now, or in
the course forum – see below. Finally, it is always a very good idea to form study groups with
your friends.
Goal 2: Allow you to ask any question you have in a protected environment.
Instructor/Lead TA: Florian Rabe (KWARC Postdoc)
Room: 11.137 @ Händler building, [email protected]
Now we come to a topic that is always interesting to the students: the grading scheme.
Assessment, Grades
Overall (Module) Grade:
It is very well-established experience that without doing the homework assignments (or something
similar) on your own, you will not master the concepts, you will not even be able to ask sensible
questions, and take very little home from the course. Just sitting in the course and nodding is not
enough!
Start early! (many assignments need more than one evening’s work)
Don’t start by sitting at a blank screen (talking & study groups help)
Humans will be trying to understand the text/code/math when grading it.
Go to the tutorials, discuss with your TA! (they are there for you!)
One special case of academic rules that affects students is the question of cheating, which we will
cover next.
There is no need to cheat in this course!! (hard work will usually do)
Note: Cheating prevents you from learning (you are cutting into your own flesh)
We expect you to know what is useful collaboration and what is cheating.
You have to hand in your own original code/text/math for all assignments
You may discuss your homework assignments with others, but if doing so impairs
your ability to write truly original code/text/math, you will be cheating
Copying from peers, books or the Internet is plagiarism unless properly attributed
(even if you change most of the actual words)
I am aware that there may have been different standards about this at your previous
university! (these are the ground rules here)
There are data mining tools that monitor the originality of text/code.
Procedure: If we catch you at cheating. . . (correction: if we suspect cheating)
We will confront you with the allegation and impose a grade sanction.
If you have a reasonable explanation we lift that. (you have to convince us)
Note: Both active (copying from others) and passive cheating (allowing others to
copy) are penalized equally.
We are fully aware that the border between cheating and useful and legitimate collaboration is
difficult to find and will depend on the special case. Therefore it is very difficult to put this into
firm rules. We expect you to develop a firm intuition about behavior with integrity over the course of your stay at FAU. Do use the opportunity to discuss the AI-2 topics with others. After all, one
of the non-trivial skills you want to learn in the course is how to talk about Artificial Intelligence
topics. And that takes practice, practice, and practice. Due to the current AI hype, the course
Artificial Intelligence is very popular and thus many degree programs at FAU have adopted it for
their curricula. Sometimes the course setup that fits for the CS program does not fit the other’s
very well; therefore there are some special conditions, which I want to state here.
In “Wirtschafts-Informatik” you can only take AI-1 and AI-2 together in the “Wahlpflicht-
bereich”.
ECTS credits need to be divisible by five ⇝ 7.5 + 7.5 = 15.
I can only warn of what I am aware, so if your degree program lets you jump through extra hoops,
please tell me and then I can mention them here.
Maybe we can get around the problems of defining “what artificial intelligence is”, by just describing
the necessary components of AI (and how they interact). Let’s have a try to see whether that is
more informative.
Inference
Perception
Language understanding
Emotion
Note that the list of components is controversial as well. Some say that it lumps together cognitive
capacities that should be distinguished or forgets others, . . . . We state it here much more to get
AI-2 students to think about the issues than to make it normative.
in outer space: systems need autonomous control; remote control is impossible due to the time lag.
in artificial limbs: the user controls the prosthesis via existing nerves, and can e.g. grip a sheet of paper.
in household appliances: the iRobot Roomba vacuums, mops, and sweeps in corners, . . . , parks, charges, and discharges; general robotic household help is on the horizon.
in hospitals: in the USA 90% of the prostate operations are carried out by RoboDoc; Paro is a cuddly robot that eases solitude in nursing homes.
The AI Conundrum
Observation: Reserving the term “Artificial Intelligence” has been quite a land
grab!
But: researchers at the Dartmouth Conference (1956) really thought they would
solve/reach AI in two/three decades.
Consequence: AI still asks the big questions. (and still promises answers soon)
Another Consequence: AI as a field is an incubator for many innovative tech-
nologies.
AI Conundrum: Once AI solves a subfield it is called “computer science”.
(becomes a separate subfield of CS)
Example 20.3.4. Functional/Logic Programming, automated theorem proving,
Planning, machine learning, Knowledge Representation, . . .
Still Consequence: AI research was alternatingly flooded with money and cut off
brutally.
All of these phenomena can be seen in the growth of AI as an academic discipline over the course
of its now over 70 year long history.
[Figure: the ups and downs of AI as a field – Turing test, Dartmouth Conference, Lighthill report, AI Winter 1 (1974–1980), AI Winter 2 (1987–1994), the WWW ⇝ data/computing explosion, AI becomes scarily effective and ubiquitous; possibly ahead: excitement fades while some applications profit a lot, AI consequences/biases/regulation, the AI bubble bursts, the next AI winter comes.]
Of course, the future of AI is still unclear, we are currently in a massive hype caused by the advent
of deep neural networks being trained on all the data of the Internet, using the computational
power of huge compute farms owned by an oligopoly of massive technology companies – we are
definitely in an AI summer.
But AI as an academic community and the tech industry also make outrageous promises, and the media pick them up and distort them out of proportion, . . . So public opinion could flip again, sending
AI into the next winter.
interact with it via sensors and actuators. Here, the main method for realizing
intelligent behavior is by learning from the world.
As a consequence, the field of Artificial Intelligence (AI) is an engineering field at the intersection
of computer science (logic, programming, applied statistics), Cognitive Science (psychology, neu-
roscience), philosophy (can machines think, what does that mean?), linguistics (natural language
understanding), and mechatronics (robot hardware, sensors).
Subsymbolic AI and in particular machine learning is currently hyped to such an extent, that
many people take it to be synonymous with “Artificial Intelligence”. It is one of the goals of this
course to show students that this is a very impoverished view.
We combine the topics in this way in this course, not only because this reproduces the histor-
ical development but also as the methods of statistical and subsymbolic AI share a common
basis.
It is important to notice that all approaches to AI have their application domains and strong points.
We will now see that exactly the two areas where symbolic AI and statistical/subsymbolic AI have their respective fortes correspond to natural application areas.
Consumer tasks: consumer grade applications have tasks that must be fully
generic and wide coverage. ( e.g. machine translation like Google Translate)
Producer tasks: producer grade applications must be high-precision, but can be domain-specific.
[Figure: a precision/coverage diagram locating producer tasks near 100% precision.]
General Rule: Subsymbolic AI is well suited for consumer tasks, while symbolic
AI is better suited for producer tasks.
A domain of producer tasks I am interested in: mathematical/technical documents.
An example of a producer task – indeed this is where the name comes from – is the case of a
machine tool manufacturer T , which produces digitally programmed machine tools worth multiple
million Euro and sells them into dozens of countries. Thus T must also provide comprehensive
machine operation manuals, a non-trivial undertaking, since no two machines are identical and
they must be translated into many languages, leading to hundreds of documents. As those manual
share a lot of semantic content, their management should be supported by AI techniques. It is
critical that these methods maintain a high precision, operation errors can easily lead to very
costly machine damage and loss of production. On the other hand, the domain of these manuals is
quite restricted. A machine tool has only a couple of hundred components, which can be described by a couple of thousand attributes.
Indeed companies like T employ high-precision AI techniques like the ones we will cover in this
course successfully; they are just not so much in the public eye as the consumer tasks.
Thus: reasoning components of some form are at the heart of many AI systems.
KWARC Angle: Scaling up (web-coverage) without dumbing down (too much)
Content markup instead of full formalization (too tedious)
User support and quality control instead of “The Truth” (elusive anyway)
use Mathematics as a test tube (Mathematics ≙ Anything Formal)
care more about applications than about philosophy (we cannot help getting
this right anyway as logicians)
The KWARC group was established at Jacobs Univ. in 2004, moved to FAU Erlan-
gen in 2016
One possible objection to this is that the agent and the environment are conceptualized as separate
entities; in particular, that the image suggests that the agent itself is not part of the environment.
Indeed that is intended, since it makes thinking about agents and environments easier and is of
little consequence in practice. In particular, the offending separation is relatively easily fixed if
needed.
[Figure 2.1 (AIMA): Agents interact with environments through sensors and actuators.]
Different agents differ on the contents of the white box in the center.
Rationality
Idea: Try to design agents that are successful! (aka. “do the right thing”)
Problem: What do we mean by “successful”, how do we measure “success”?
Definition 20.3.13. A performance measure is a function that evaluates a sequence of environments.
Example 20.3.14. A performance measure for a vacuum cleaner could
award one point per “square” cleaned up in time T?
award one point per clean “square” per time step, minus one per move?
penalize for > k dirty squares?
An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date.
Critical Observation: We only need to maximize the expected value, not the actual value of the performance measure!
Let us see how the observation that we only need to maximize the expected value, not the actual
value of the performance measure affects the consequences.
For the design of an agent for a specific task – i.e. to choose an agent architecture and design an agent program – we have to take into account the performance measure, the environment, and the characteristics of the agent itself, in particular its actions and sensors.
The PEAS criteria are essentially a laundry list of what an agent design task description should
include.
Environment types
[Figure 2.10 (AIMA): A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept.]
[Figure (AIMA): A model-based reflex agent with internal state; its components are Sensors, "How the world evolves", "What the world is like now", "What my actions do", and Actuators.]
Non-deterministic actions:
Unreliable Sensors
Robot Localization: Suppose we want to support localization using landmarks
to narrow down the area.
Example 20.3.25. If you see the Eiffel tower, then you’re in Paris.
We are now ready to proceed to environments which can only be partially observed and where actions are non-deterministic. Both sources of uncertainty conspire to allow us only partial knowledge about the world, so that we can only optimize the "expected utility" instead of the "actual utility" of our actions.
a belief state that has information about the possible states the world may be
in, and
a sensor model that updates the belief state based on sensor information
a transition model that updates the belief state based on actions.
Idea: The agent environment determines what the world model can be.
That is exactly what we have been doing until now: we have been studying methods that
build on descriptions of the “actual” world, and have been concentrating on the progression from
atomic to factored and ultimately structured representations. Tellingly, we spoke of “world states”
instead of “belief states”; we have now justified this practice in the brave new belief-based world
models by the (re-) definition of “world states” above. To fortify our intuitions, let us recap from
a belief-state-model perspective.
Let us now see what happens when we lift the restrictions of total observability and determin-
ism.
mix the ideas from the last two. (sensor model + transition relation)
Overview: AI2
Basics of probability theory (probability spaces, random variables, conditional
probabilities, independence,...)
Probabilistic reasoning (Bayesian networks,...)
Probabilistic Reasoning over time (Markov chains, Hidden Markov models,...)
⇒ We can update our world model episodically based on observations (i.e. sensor
data)
Decision theory: Making decisions under uncertainty (Preferences, Utilities,
Decision networks, Markov Decision Procedures,...)
⇒ We can choose the right action based on our world model and the likely outcomes
of our actions
This part of the lecture notes addresses inference and agent decision making in partially observable
environments, i.e. where we only know probabilities instead of certainties whether propositions
are true/false. We cover basic probability theory and – based on that – Bayesian Networks and
simple decision making in such environments. Finally we extend this to probabilistic temporal
models and their decision theory.
Chapter 21
Quantifying Uncertainty
Probabilistic Models
Definition 21.1.1 (Mathematically (slightly simplified)). A probability space
(or probability model) is a pair ⟨Ω, P⟩ such that:
Ω is a set of outcomes (the sample space), and
P assigns to every subset A ⊆ Ω a probability P(A) ∈ [0,1], such that P(Ω) = 1 and P(A ∪ B) = P(A) + P(B) for disjoint A and B.
Example 21.1.2 (Dice throws). Assume we throw a (fair) die two times. Then
the sample space is {(i, j) | 1 ≤ i, j ≤ 6}. We define P by letting P({A}) = 1/36 for
every A ∈ Ω.
Since the probability of any outcome is the same, we say P is uniformly distributed.
The definition is simplified in two places: Firstly, we assume that P is defined on the full power
set. This is not always possible, especially if Ω is uncountable. In that case we need an additional
set of “events” instead, and lots of mathematical machinery to make sure that we can safely take
unions, intersections, complements etc. of these events.
Secondly, we would technically only demand that P is additive on countably many disjoint
sets.
In this course we will assume that our sample space is at most countable anyway; usually even
finite.
Random Variables
In practice, we are rarely interested in the specific outcome of an experiment, but
rather in some property of the outcome. This is especially true in the very common
situation where we don’t even know the precise probabilities of the individual outcomes.
Example 21.1.3. The probability that the sum of our two dice throws is 7 is
P({(i, j) ∈ Ω | i + j = 7}) = P({(6, 1), (1, 6), (5, 2), (2, 5), (4, 3), (3, 4)}) = 6/36 = 1/6.
Definition 21.1.5. We say that a random variable X is finite domain, iff its domain
dom(X) is finite and Boolean, iff dom(X) = {T, F}.
For a Boolean random variable, we will simply write P (X) for P (X = T) and
P (¬X) for P (X = F).
Note that a random variable, according to the formal definition, is neither random nor a variable:
It is a function with clearly defined domain and codomain – and what we call the domain of the
“variable” is actually its codomain... are you confused yet? ,
This confusion is a side-effect of the mathematical formalism. In practice, a random variable is
some indeterminate value that results from some statistical experiment – i.e. it is random, because
the result is not predetermined, and it is a variable, because it can take on different values.
It just so happens that if we want to model this scenario mathematically, a function is the most
natural way to do so.
Some Examples
Example 21.1.6. Summing up our two dice throws is a random variable S : Ω → [2,12] with S((i, j)) = i + j. The probability that they sum up to 7 is written as P(S = 7) = 1/6.
Example 21.1.7. The first and second of our two dice throws are random variables
First, Second : Ω → [1,6] with First((i, j)) = i and Second((i, j)) = j.
Propositions
This is nice and all, but in practice we are interested in “compound” probabilities
like:
“What is the probability that the sum of our two dice throws is 7, but neither of the
two dice is a 3?”
Idea: Reuse the syntax of propositional logic and define the logical connectives for
random variables!
Example 21.1.11. We can express the above as: P (¬(First = 3) ∧ ¬(Second =
3) ∧ (S = 7))
Definition 21.1.12. Let X1, X2 be random variables, x1 ∈ dom(X1) and x2 ∈ dom(X2). We define:
1. P(X1 ≠ x1) := P(¬(X1 = x1)) := P({ω ∈ Ω | X1(ω) ≠ x1}) = 1 − P(X1 = x1),
2. P((X1 = x1) ∧ (X2 = x2)) := P({ω ∈ Ω | X1(ω) = x1 and X2(ω) = x2}), and analogously for the other logical connectives.
Events
Definition 21.1.14 (Again slightly simplified). Let ⟨Ω, P ⟩ be a probability space.
An event is a subset of Ω.
Definition 21.1.15 (Convention). We call an event (by extension) anything that
represents a subset of Ω: any statement formed from the logical connectives and values
of random variables, on which P (·) is defined.
Problem 1.1
Remember: We can define A ∨ B := ¬(¬A ∧ ¬B), T := A ∨ ¬A and F := ¬T
– is this compatible with the definition of probabilities on propositional formulae? And
why is P (X1 ̸= x1 ) = 1 − P (X1 = x1 )?
Problem 1.3
Show that P (A) = P (A ∧ B) + P (A ∧ ¬B)
Conditional Probabilities
As we gather new information, our beliefs (should ) change, and thus our probabil-
ities!
Example 21.1.16. Your “probability of missing the connection train” increases
when you are informed that your current train has 30 minutes delay.
Example 21.1.17. The “probability of cavity” increases when the doctor is in-
formed that the patient has a toothache.
Example 21.1.18. The probability that S = 3 is clearly higher if I know that
First = 1 than otherwise – or if I know that First = 6!
Definition 21.1.19 (Conditional probability). For events A and B with P(B) > 0, we define
P(A|B) := P(A ∧ B)/P(B)
We also call P (A) the prior probability of A, and P (A|B) the posterior probability.
Examples
Example 21.1.20. If we assume First = 1, then P(S = 3|First = 1) should be
precisely P(Second = 2) = 1/6. We check: P(S = 3|First = 1) = P((S = 3) ∧ (First = 1))/P(First = 1) = (1/36)/(1/6) = 1/6. ✓
Example 21.1.21. Assume the prior probability P (cavity) is 0.122. The probability
that a patient has both a cavity and a toothache is P (cavity ∧toothache) = 0.067.
The probability that a patient has a toothache is P (toothache) = 0.15.
If the patient complains about a toothache, we can update our estimation by computing the posterior probability:
P(cavity|toothache) = P(cavity ∧ toothache)/P(toothache) = 0.067/0.15 ≈ 0.45
Note: We just computed the probability of some underlying disease based on the
presence of a symptom!
Or more generally: We computed the probability of a cause from observing its effect.
Some Rules
Equations on unconditional probabilities have direct analogues for conditional proba-
bilities.
Problem 1.4
Convince yourself of the following:
P (A|C) = 1 − P (¬A|C).
Bayes' Rule
P(A|B) = P(B|A) · P(A) / P(B)
Proof:
1. P(A|B) = P(A ∧ B)/P(B) = (P(B|A) · P(A))/P(B)
...okay, that was straightforward... what’s the big deal?
(Somewhat Dubious) Claim: Bayes’ Rule is the entire scientific method con-
densed into a single equation!
...if I keep gathering evidence and update, ultimately the impact of the prior belief
will diminish.
“You’re entitled to your own priors, but not your own likelihoods”
Independence
Question: What is the probability that S = 7 and the patient has a toothache?
Or less contrived: What is the probability that the patient has a gingivitis and a
cavity?
(Recall: A and B are independent, iff P(A ∧ B) = P(A) · P(B); equivalently, iff P(A|B) = P(A).)
Proof:
1. ⇒ By definition, P(A|B) = P(A ∧ B)/P(B) = (P(A) · P(B))/P(B) = P(A),
2. ⇐ Assume P(A|B) = P(A). Then P(A ∧ B) = P(A|B) · P(B) = P(A) · P(B).
Note: Independence asserts that two events are “not related” – the probability of
one does not depend on the other.
Mathematically, we can determine independence by checking whether P (A ∧ B) =
P (A) · P (B).
In practice, this is impossible to check. Instead, we assume independence based on
domain knowledge, and then exploit this to compute P (A ∧ B).
Independence (Examples)
Example 21.1.25.
First = 2 and Second = 3 are independent – more generally, First and Second
are independent (The outcome of the first die does not affect the outcome of
the second die)
Quick check: P((First = a) ∧ (Second = b)) = 1/36 = P(First = a) · P(Second = b) ✓
First and S are not independent.
(The outcome of the first die affects the sum of the two dice.) Counterexample:
P((First = 1) ∧ (S = 4)) = 1/36 ≠ P(First = 1) · P(S = 4) = 1/6 · 1/12 = 1/72
Example 21.1.26.
Are cavity and toothache independent?
...since cavities can cause a toothache, that would probably be a bad design
decision...
Are cavity and gingivitis independent? Cavities do not cause gingivitis, and
gingivitis does not cause cavities, so... yes... right? (...as far as I know. I’m
not a dentist.)
Probably not! A patient who has cavities has probably worse dental hygiene
than those who don’t, and is thus more likely to have gingivitis as well.
⇒ cavity may be evidence that raises the probability of gingivitis, even if they are not directly causally related.
Example 21.1.28. Let's assume toothache and catch are conditionally independent given cavity/¬cavity. Then we can finally compute:
P(cavity|toothache ∧ catch) = (P(toothache ∧ catch|cavity) · P(cavity)) / P(toothache ∧ catch)
= (P(toothache|cavity) · P(catch|cavity) · P(cavity)) / (P(toothache|cavity) · P(catch|cavity) · P(cavity) + P(toothache|¬cavity) · P(catch|¬cavity) · P(¬cavity))
= (0.6 · 0.9 · 0.2) / (0.6 · 0.9 · 0.2 + 0.1 · 0.2 · 0.8) ≈ 0.87
Conditional Independence
Lemma 21.1.29. If A and B are conditionally independent given C, then P(A|B ∧ C) = P(A|C).
Proof: (using that conditional independence means P(A ∧ B|C) = P(A|C) · P(B|C))
P(A|B ∧ C) = P(A ∧ B ∧ C)/P(B ∧ C) = (P(A ∧ B|C) · P(C))/P(B ∧ C) = (P(A|C) · P(B|C) · P(C))/P(B ∧ C) = (P(A|C) · P(B ∧ C))/P(B ∧ C) = P(A|C)
Question: If A and B are conditionally independent given C, does this imply that
A and B are independent? No. See previous slides for a counterexample.
Question: If A and B are independent, does this imply that A and B are also
conditionally independent given C? No. For example: First and Second are inde-
pendent, but not conditionally independent given S = 4.
Summary
Probability spaces serve as a mathematical model (and hence justification) for
everything related to probabilities.
The “atoms” of any statement of probability are the random variables. (Important
special cases: Boolean and finite domain)
We can define probabilities on compound (propositional logical) statements, with (outcomes of) random variables as "propositional variables".
Conditional probabilities represent posterior probabilities given some observed out-
comes.
independence and conditional independence are strong assumptions that allow us
to simplify computations of probabilities
Bayes’ Theorem
Pragmatics
Pragmatically, both interpretations amount to the same thing: I should act as if
I’m 30% confident that it will rain tomorrow. (Whether by fiat, or because in 30% of
comparable cases, it rained.)
Objection: Still: why should I? And why should my beliefs follow the seemingly
arbitrary Kolmogorov axioms?
[DF31]: If an agent has a belief that violates the Kolmogorov axioms, then there
exists a combination of “bets” on propositions so that the agent always loses money.
In other words: If your beliefs are not consistent with the mathematics, and you
act in accordance with your beliefs, there is a way to exploit this inconsistency to
your disadvantage.
Instead, we use probability distributions, which are just arrays (of arrays of...) of probabilities. And then we represent those as sparsely as possible, by exploiting independence, conditional independence, ...
Probability Distributions
Definition 21.2.1. The probability distribution for a random variable X, written
P(X), is the vector of probabilities for the (ordered) domain of X.
Note: The values in a probability distribution are all positive and sum to 1.
(Why?)
Example 21.2.2. P(First) = P(Second) = ⟨1/6, 1/6, 1/6, 1/6, 1/6, 1/6⟩. (Both First and Second are uniformly distributed)
Example 21.2.3. The probability distribution P(S) is ⟨1/36, 1/18, 1/12, 1/9, 5/36, 1/6, 5/36, 1/9, 1/12, 1/18, 1/36⟩.
Note the symmetry, with a “peak” at 7 – the random variable is (approximately,
because our domain is discrete rather than continuous) normally distributed (or
gaussian distributed, or follows a bell-curve,...).
Example 21.2.4. Probability distributions for Boolean random variables are natu-
rally pairs (probabilities for T and F), e.g.:
Example 21.2.7. P(cavity, toothache, gingivitis) could look something like this:
toothache ¬toothache
gingivitis ¬gingivitis gingivitis ¬gingivitis
cavity 0.007 0.06 0.005 0.05
¬cavity 0.08 0.003 0.045 0.75
The full joint probability distribution P(First, S):
First \ S    2     3     4     5     6     7     8     9     10    11    12
1          1/36  1/36  1/36  1/36  1/36  1/36   0     0     0     0     0
2           0    1/36  1/36  1/36  1/36  1/36  1/36   0     0     0     0
3           0     0    1/36  1/36  1/36  1/36  1/36  1/36   0     0     0
4           0     0     0    1/36  1/36  1/36  1/36  1/36  1/36   0     0
5           0     0     0     0    1/36  1/36  1/36  1/36  1/36  1/36   0
6           0     0     0     0     0    1/36  1/36  1/36  1/36  1/36  1/36
Note that if we know the value of First, the value of S is completely determined by
the value of Second.
toothache ¬toothache
cavity P (cavity|toothache) = 0.45 P (cavity|¬toothache) = 0.065
¬cavity P (¬cavity|toothache) = 0.55 P (¬cavity|¬toothache) = 0.935
The conditional probability distribution P(First|S):
First \ S    2     3     4     5     6     7     8     9     10    11    12
1           1    1/2   1/3   1/4   1/5   1/6    0     0     0     0     0
2           0    1/2   1/3   1/4   1/5   1/6   1/5    0     0     0     0
3           0     0    1/3   1/4   1/5   1/6   1/5   1/4    0     0     0
4           0     0     0    1/4   1/5   1/6   1/5   1/4   1/3    0     0
5           0     0     0     0    1/5   1/6   1/5   1/4   1/3   1/2    0
6           0     0     0     0     0    1/6   1/5   1/4   1/3   1/2    1
Convention
We now “lift” multiplication and division to the level of whole probability distribu-
tions:
P(X|Y) := P(X,Y)/P(Y) represents the system of equations P(X = x|Y = y) := P((X = x) ∧ (Y = y))/P(Y = y)
Bayes' Theorem: P(X|Y) = P(Y|X) · P(X)/P(Y) represents the system of equations P(X = x|Y = y) = (P(Y = y|X = x) · P(X = x))/P(Y = y)
Example 21.2.14. We can read off the probability P (toothache) from the full
joint probability distribution as 0.007+0.06+0.08+0.003=0.15, and the probability
P (toothache ∧ cavity) as 0.007 + 0.06 = 0.067
We can actually implement this! (They’re just (nested) arrays)
But just as we often don’t have a fully specified probability space to work in, we often
don’t have a full joint probability distribution for our random variables either.
Also: Given random variables X1, ..., Xn, the full joint probability distribution has ∏_{i=1}^{n} |dom(Xi)| entries! (P(First, S) already has 6 · 11 = 66 entries!)
⇒ The rest of this section deals with keeping things small, by computing probabilities
instead of storing them all.
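To make the "nested arrays" remark concrete, here is a small Python sketch (our own illustration, not part of the original slides): it stores the full joint probability distribution of Example 21.2.7 as a nested dictionary and recovers the probabilities of Example 21.2.14 by summing the relevant entries; the helper name p is ours.

# full joint probability distribution P(cavity, toothache, gingivitis) of Example 21.2.7
joint = {
    True:  {True:  {True: 0.007, False: 0.06},
            False: {True: 0.005, False: 0.05}},
    False: {True:  {True: 0.08,  False: 0.003},
            False: {True: 0.045, False: 0.75}},
}   # indexed as joint[cavity][toothache][gingivitis]

def p(event):
    # sum the entries of all worlds (c, t, g) that satisfy the event
    return sum(joint[c][t][g]
               for c in (True, False) for t in (True, False) for g in (True, False)
               if event(c, t, g))

print(p(lambda c, t, g: t))                                # P(toothache) = 0.15
print(p(lambda c, t, g: t and c))                          # P(toothache ∧ cavity) = 0.067
print(p(lambda c, t, g: t and c) / p(lambda c, t, g: t))   # P(cavity|toothache) ≈ 0.45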
Probabilistic Reasoning
Probabilistic reasoning refers to inferring probabilities of events from the proba-
bilities of other events
as opposed to determining the probabilities e.g. empirically, by gathering (sufficient
amounts of representative) data and counting.
Note: In practice, we are primarily interested in, and have access to, conditional
probabilities rather than the unconditional probabilities of conjunctions of events:
We don’t reason in a vacuum: Usually, we have some evidence and want to infer
the posterior probability of some related event. (e.g. infer a plausible cause
given some symptom)
⇒ we are interested in the conditional probability P (hypothesis|observation).
“80% of patients with a cavity complain about a toothache” (i.e. P (toothache|cavity))
is more the kind of data people actually collect and publish than “1.2% of the gen-
eral population have both a cavity and a toothache” (i.e. P (cavity∧toothache)).
Consider the probe catching in a cavity. The probe is a diagnostic tool, which
is usually evaluated in terms of its sensitivity P (catch|cavity) and specificity
P (¬catch|¬cavity). (You have probably heard these words a lot since 2020...)
[Network diagram: Cavity as the common cause of the two effects Toothache and Catch.]
Definition 21.2.15. A naive Bayes model (or, less accurately, Bayesian classifier, or,
derogatorily, idiot Bayes model) consists of:
1. random variables C, E 1 , . . ., E n such that all the E 1 , . . ., E n are conditionally inde-
pendent given C,
2. the probability distribution P(C), and
3. the conditional probability distributions P(Ei|C) for 1 ≤ i ≤ n.
Can we compute the full joint probability distribution P(cavity, toothache, catch)
from this information?
We can generalize this to more than two variables, by repeatedly applying the prod-
uct rule:
Lemma 21.2.17 (Chain rule). For any sequence of random variables X1, ..., Xn:
P(X1, ..., Xn) = P(Xn|Xn−1, ..., X1) · P(Xn−1|Xn−2, ..., X1) · ... · P(X1).
Hence:
Theorem 21.2.18. Given a naive Bayes model with effects E1, ..., En and cause C, we have
P(C, E1, ..., En) = P(C) · ∏_{i=1}^{n} P(Ei|C).
Marginalization
Great, so now we can compute P(C|E1, ..., En) = P(C, E1, ..., En)/P(E1, ..., En) ...
...except that we don't know P(E1, ..., En) :-/
...except that we can compute the full joint probability distribution, so we can recover it:
Lemma 21.2.19 (Marginalization). Given random variables X1, ..., Xn and Y1, ..., Ym, we have
P(X1, ..., Xn) = Σ_{y1∈dom(Y1), ..., ym∈dom(Ym)} P(X1, ..., Xn, Y1 = y1, ..., Ym = ym).
(This is just a fancy way of saying “we can add the relevant entries of the full joint
probability distribution”)
Example 21.2.20. Say we observed toothache = T and catch = T. Using marginalization, we can compute P(toothache = T, catch = T) = Σ_{c∈dom(cavity)} P(toothache = T, catch = T, cavity = c).
Unknowns
What if we don't know catch? (I'm not a dentist, I don't have a probe...)
We split our effects into {E1, ..., En} = {O1, ..., OnO} ∪ {U1, ..., UnU} – the observed and unknown random variables.
Let DU := dom(U1) × ... × dom(UnU). Then
P(C|O1, ..., OnO) = P(C, O1, ..., OnO)/P(O1, ..., OnO)
= (Σ_{u∈DU} P(C, O1, ..., OnO, U1 = u1, ..., UnU = unU)) / (Σ_{c∈dom(C)} Σ_{u∈DU} P(O1, ..., OnO, C = c, U1 = u1, ..., UnU = unU))
= (Σ_{u∈DU} P(C) · (∏_{i=1}^{nO} P(Oi|C)) · (∏_{j=1}^{nU} P(Uj = uj|C))) / (Σ_{c∈dom(C)} Σ_{u∈DU} P(C = c) · (∏_{i=1}^{nO} P(Oi|C = c)) · (∏_{j=1}^{nU} P(Uj = uj|C = c)))
= (P(C) · (∏_{i=1}^{nO} P(Oi|C)) · (Σ_{u∈DU} ∏_{j=1}^{nU} P(Uj = uj|C))) / (Σ_{c∈dom(C)} P(C = c) · (∏_{i=1}^{nO} P(Oi|C = c)) · (Σ_{u∈DU} ∏_{j=1}^{nU} P(Uj = uj|C = c)))
...oof...
Unknowns
P(C|O1, ..., OnO) = (P(C) · (∏_{i=1}^{nO} P(Oi|C)) · (Σ_{u∈DU} ∏_{j=1}^{nU} P(Uj = uj|C))) / (Σ_{c∈dom(C)} P(C = c) · (∏_{i=1}^{nO} P(Oi|C = c)) · (Σ_{u∈DU} ∏_{j=1}^{nU} P(Uj = uj|C = c)))
First, note that Σ_{u∈DU} ∏_{j=1}^{nU} P(Uj = uj|C = c) = 1. (We're summing over all possible events on the (conditionally independent) U1, ..., UnU given C = c)
Hence:
P(C|O1, ..., OnO) = (P(C) · ∏_{i=1}^{nO} P(Oi|C)) / (Σ_{c∈dom(C)} P(C = c) · ∏_{i=1}^{nO} P(Oi|C = c))
That is: The denominator only serves to scale what is almost already the distribution
P(C|O1 , . . ., OnO ) to sum up to 1.
Normalization
Definition 21.2.21 (Normalization). Given a vector w := ⟨w1, ..., wk⟩ of numbers in [0,1] where Σ_{i=1}^{k} wi ≤ 1, the normalized vector α(w) is defined (component-wise) as
(α(w))i := wi / Σ_{j=1}^{k} wj.
Note that Σ_{i=1}^{k} (α(w))i = 1, i.e. α(w) is a probability distribution.
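As a small illustration (ours, not from the slides) of how normalization and the naive Bayes simplification above combine, here is a Python sketch using the numbers assumed in Example 21.1.28 (P(cavity) = 0.2, P(toothache|cavity) = 0.6, P(catch|cavity) = 0.9, P(toothache|¬cavity) = 0.1, P(catch|¬cavity) = 0.2):

def normalize(w):
    # α(w): scale a vector of non-negative numbers so that it sums to 1
    z = sum(w)
    return [wi / z for wi in w]

p_cavity    = {True: 0.2, False: 0.8}
p_toothache = {True: 0.6, False: 0.1}   # P(toothache | cavity = c)
p_catch     = {True: 0.9, False: 0.2}   # P(catch     | cavity = c)

# unnormalized P(cavity = c) * P(toothache|c) * P(catch|c) for c = T, F
scores = [p_cavity[c] * p_toothache[c] * p_catch[c] for c in (True, False)]
print(normalize(scores))                # ≈ [0.87, 0.13], i.e. P(cavity|toothache ∧ catch)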
Dentistry Example
Putting things together for the dentistry example, we compute P(cavity|toothache, catch) = α(P(cavity) · P(toothache|cavity) · P(catch|cavity)), as in the sketch above.
The same recipe works for, e.g., text classification: given a new article, we just count the occurrences ki of the words in it and compute
P(category|word1 = k1, ..., wordn = kn) = α(P(category) · ∏_{i=1}^{n} P(wordi = ki|category))
Inference by Enumeration
The rules we established for naive Bayes models, i.e. Bayes’s theorem, the prod-
uct rule and chain rule, marginalization and normalization, are general techniques for
probabilistic reasoning, and their usefulness is not limited to the naive Bayes models.
More generally:
Theorem 21.2.23. Let Q, E1, ..., EnE, U1, ..., UnU be random variables and D := dom(U1) × ... × dom(UnU). Then
P(Q|E1 = e1, ..., EnE = enE) = α(Σ_{u∈D} P(Q, E1 = e1, ..., EnE = enE, U1 = u1, ..., UnU = unU)).
We call Q the query variable, E1, ..., EnE the evidence, and U1, ..., UnU the unknown (or hidden) variables; computing a conditional probability this way is called inference by enumeration.
Note that this is just a “mathy” way of saying we
1. sum over all relevant entries of the full joint probability distribution of the variables,
and
2. normalize the result to yield a probability distribution.
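The following Python sketch (again ours; the function name enumerate_query and the variable order are arbitrary) implements Theorem 21.2.23 literally on the full joint probability distribution of Example 21.2.7: sum out the hidden variables, then normalize.

import itertools

domains = {"cavity": (True, False), "toothache": (True, False), "gingivitis": (True, False)}
order = ("cavity", "toothache", "gingivitis")     # index order of the tuples below
joint = {
    (True,  True,  True):  0.007, (True,  True,  False): 0.06,
    (True,  False, True):  0.005, (True,  False, False): 0.05,
    (False, True,  True):  0.08,  (False, True,  False): 0.003,
    (False, False, True):  0.045, (False, False, False): 0.75,
}

def enumerate_query(query, evidence):
    # P(query | evidence): sum over all values of the hidden variables, then normalize
    hidden = [v for v in order if v != query and v not in evidence]
    scores = []
    for q in domains[query]:
        total = 0.0
        for hvals in itertools.product(*(domains[h] for h in hidden)):
            world = {query: q, **evidence, **dict(zip(hidden, hvals))}
            total += joint[tuple(world[v] for v in order)]
        scores.append(total)
    z = sum(scores)
    return [s / z for s in scores]

print(enumerate_query("cavity", {"toothache": True}))    # ≈ [0.45, 0.55]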
We will fortify our intuition about naive Bayes models with a variant of the Wumpus world we
looked at ?? to understand whether logic was up to the job of guiding an agent in the Wumpus
cave.
We use the Boolean random variables
Pi,j for i, j ∈ {1, 2, 3, 4}, stating there is a pit at square [i, j], and
Bi,j for (i, j) ∈ {(1, 1), (1, 2), (2, 1)}, stating there is a breeze at square [i, j].
⇒ let’s apply our machinery!
Wumpus Continued
Problem: We only know Pi,j for three fields. If we want to compute e.g. P1,3 via enumeration, that leaves 2^(4²−4) = 4096 terms to sum over!
Let’s do better.
Let b := ¬B 1,1 ∧ B 1,2 ∧ B 2,1 (All the breezes we know
about)
Let p := ¬P 1,1 ∧ ¬P 1,2 ∧ ¬P 2,1 . (All the pits we know
about)
Let F := {P3,1 ∧ P2,2, ¬P3,1 ∧ P2,2, P3,1 ∧ ¬P2,2, ¬P3,1 ∧ ¬P2,2} (the current "frontier")
Let O be (the set of assignments for) all the other variables
P i,j . (i.e. except p, F and our query P 1,3 )
Optimized Wumpus
P(P1,3|p, b) = α(Σ_{o∈O, f∈F} P(P1,3, b, p, f, o)) = α(Σ_{o∈O, f∈F} P(b|P1,3, p, o, f) · P(P1,3, p, f, o))
= α(Σ_{f∈F} Σ_{o∈O} P(b|P1,3, p, f) · P(P1,3, p, f, o)) = α(Σ_{f∈F} P(b|P1,3, p, f) · (Σ_{o∈O} P(P1,3, p, f, o)))
= α(Σ_{f∈F} P(b|P1,3, p, f) · (Σ_{o∈O} P(P1,3) · P(p) · P(f) · P(o)))
= α(P(P1,3) · P(p) · (Σ_{f∈F} P(b|P1,3, p, f) · P(f) · (Σ_{o∈O} P(o))))
where P(b|P1,3, p, f) ∈ {0, 1} and Σ_{o∈O} P(o) = 1.
Cooking Recipe
In general, when you want to reason probabilistically, a good heuristic is:
1. Try to frame the full joint probability distribution in terms of the probabilities you
know. Exploit product rule/chain rule, independence, conditional independence,
marginalization and domain knowledge (as e.g. P(b|p, f ) ∈ {0, 1})
2. Frame the probability you actually want to compute in terms of the full joint probability distribution (e.g. via enumeration).
3. Substitute by the result of 1., and again, exploit all of our machinery
4. Implement the resulting (system of) equation(s)
5. ???
6. Profit
Summary
Probability distributions and conditional probability distributions allow us to represent random variables as convenient data structures in an implementation. (Assuming they are finite domain...)
The full joint probability distribution allows us to compute all probabilities of statements about the random variables it contains. (But possibly inefficiently)
Marginalization and normalization are the specific techniques for extracting the
specific probabilities we are interested in from the full joint probability distribution.
The product and chain rule, exploiting (conditional) independence, Bayes’ Theorem,
and of course domain specific knowledge allow us to do so much more efficiently.
Naive Bayes models are one example where all these techniques come together.
Chapter 22
Probabilistic Reasoning: Bayesian Networks
22.1 Introduction
John, Mary, and My Brand-New Alarm
Example 22.1.1 (From Russell/Norvig).
I got very valuable stuff at home. So I bought an alarm. Unfortunately, the alarm
just rings at home, doesn’t call me on my mobile.
I’ve got two neighbors, Mary and John, who’ll call me if they hear the alarm.
The problem is that, sometimes, the alarm is caused by an earthquake.
Also, John might confuse the alarm with his telephone, and Mary might miss the
alarm altogether because she typically listens to loud music.
⇒ Random variables: Burglary, Earthquake, Alarm, John, Mary.
Given that both John and Mary call me, what is the probability of a burglary?
⇒ This is almost a naive Bayes model, but with multiple causes (Burglary and
Earthquake) for the Alarm, which in turn may cause John and/or Mary.
We assume:
We (should) know P(Alarm|Burglary, Earthquake), P(John|Alarm), and P(Mary|Alarm).
Burglary and Earthquake are independent.
John and Mary are conditionally independent given Alarm.
[Network: Burglary → Alarm ← Earthquake, with Alarm → John and Alarm → Mary.]
Some Applications
A ubiquitous problem: Observe “symptoms”, need to infer “causes”.
Medical diagnosis, face recognition, ...
Note: size(B) := the total number of entries in the conditional probability distributions of B.
Note: Smaller BN ⇝ we need to assess fewer probabilities and get more efficient inference.
Observation 22.2.2. The explicit full joint probability distribution has size ∏_{i=1}^{n} |Di|.
Observation 22.2.3. If |Parents(Xi)| ≤ k for every Xi, and Dmax is the largest random variable domain, then size(B) ≤ n · |Dmax|^{k+1}.
Example 22.2.4. For |Dmax| = 2, n = 20, k = 4 we have 2^20 = 1048576 probabilities, but a Bayesian network of size ≤ 20 · 2^5 = 640!
In the worst case, size(B) = n · ∏_{i=1}^{n} |Di|, namely if every variable depends on all its predecessors in the chosen variable ordering.
Intuition: BNs are compact – i.e. of small size – if each variable is directly
influenced only by few of its predecessor variables.
Thus: The size of the resulting BN depends on the chosen variable ordering
X 1 , . . ., X n .
In Particular: The size of a Bayesian network is not a fixed property of the domain.
It depends on the skill of the designer.
Note: For ?? we try to determine whether – given different value assignments to potential parents
– the probability of Xi being true differs? If yes, we include these parents. In the particular case:
1. M to J yes because the common cause may be the alarm.
Again: Given different value assignments to potential parents, does the probability of Xi being
true differ? If yes, include these parents.
1. M to J as before.
2. M, J to E as probability of E is higher if M/J is true.
3. Same for B; E to B because, given M and J are true, if E is true as well then prob of B is
lower than if E is false.
4. M /J/B/E to A because if M /J/B/E is true (even when changing the value of just one of
these) then probability of A is higher.
Example 22.2.9. The sum of two dice throws S is entirely determined by the values of the two dice First and Second.
Example 22.2.10. In the Wumpus example, the breezes are entirely determined by
the pits
If we model Fever as a noisy disjunction node, then the general rule P(Xi|Parents(Xi)) = ∏_{{j | Xj = T}} qj for the CPT gives the following table:
Let’s do better!
P(b|j, m) = α(Σ_{ba,be∈{T,F}} P(j|a = ba) · P(m|a = ba) · P(a = ba|e = be, b) · P(e = be) · P(b))
Let's "optimize":
P(b|j, m) = α(P(b) · (Σ_{be∈{T,F}} P(e = be) · (Σ_{ba∈{T,F}} P(a = ba|e = be, b) · P(j|a = ba) · P(m|a = ba))))
Enumeration: Example
Variable order: b, e, a, j, m
P0 := P(b) · ( P(e) · ( P(a|b, e) · P(j|a) · P(m|a) · 1.0 + P(¬a|b, e) · P(j|¬a) · P(m|¬a) · 1.0 )
            + P(¬e) · ( P(a|b, ¬e) · P(j|a) · P(m|a) · 1.0 + P(¬a|b, ¬e) · P(j|¬a) · P(m|¬a) · 1.0 ) )
P1 := P(¬b) · ( P(e) · ( P(a|¬b, e) · P(j|a) · P(m|a) · 1.0 + P(¬a|¬b, e) · P(j|¬a) · P(m|¬a) · 1.0 )
             + P(¬e) · ( P(a|¬b, ¬e) · P(j|a) · P(m|a) · 1.0 + P(¬a|¬b, ¬e) · P(j|¬a) · P(m|¬a) · 1.0 ) )
⇒ P(B|j, m) = ⟨P0/(P0 + P1), P1/(P0 + P1)⟩
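As a sanity check, the nested sum above can be typed in directly. The CPT numbers are not given in these notes, so the Python sketch below (ours) uses the values of the Russell/Norvig alarm example as an assumption; with them, the enumeration yields P(Burglary|j, m) ≈ ⟨0.284, 0.716⟩.

P_b = 0.001                       # P(Burglary)        -- assumed, as in AIMA
P_e = 0.002                       # P(Earthquake)      -- assumed
P_a = {(True, True): 0.95, (True, False): 0.94,       # P(Alarm | Burglary, Earthquake)
       (False, True): 0.29, (False, False): 0.001}
P_j = {True: 0.90, False: 0.05}   # P(John | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(Mary | Alarm)

def score(b):
    # unnormalized P(Burglary = b, j, m): sum over Earthquake and Alarm
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_a[(b, e)] if a else 1 - P_a[(b, e)]
            total += ((P_b if b else 1 - P_b) * (P_e if e else 1 - P_e)
                      * p_a * P_j[a] * P_m[a])
    return total

p0, p1 = score(True), score(False)
print(p0 / (p0 + p1), p1 / (p0 + p1))    # ≈ 0.284 0.716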
Variable Elimination
P(b|j, m) = α(P(b) · (Σ_{be∈{T,F}} P(e = be) · (Σ_{ba∈{T,F}} P(a = ba|e = be, b) · P(j|a = ba) · P(m|a = ba))))
The last two factors P (j|a = ba ), P (m|a = ba ) only depend on a, but are “trapped”
behind the summation over e, hence computed twice in two distinct recursive calls to
EnumAll
Idea: Instead of left-to-right (top-down DFS), operate right-to-left (bottom-up) and store intermediate results (factors), so that repeated subcomputations are evaluated only once.
⇒ can speed things up by a factor of 1000! (or more, depending on the order of
variables!)
So?: Life goes on . . . In the hard cases, if need be we can throw exactitude to
the winds and approximate.
Example 22.3.7. Sampling techniques as in MCTS.
22.4 Conclusion
A Video Nugget covering this section can be found at https://fau.tv/clip/id/29228.
Summary
Bayesian networks (BN) are a wide-spread tool to model uncertainty, and to reason
about it. A BN represents conditional independence relations between random vari-
ables. It consists of a graph encoding the variable dependencies, and of conditional
probability tables (CPTs).
Given a variable ordering, the BN is small if every variable depends on only a few
of its predecessors.
Probabilistic inference requires to compute the probability distribution of a set
of query variables, given a set of evidence variables whose values we know. The
remaining variables are hidden.
Inference by enumeration takes a BN as input, then applies Normalization+Marginalization,
the chain rule, and exploits conditional independence. This can be viewed as a tree
search that branches over all values of the hidden variables.
Reading:
• Chapter 14: Probabilistic Reasoning of [RN03].
– Section 14.1 roughly corresponds to my “What is a Bayesian Network?”.
– Section 14.2 roughly corresponds to my “What is the Meaning of a Bayesian Network?” and
“Constructing Bayesian Networks”. The main change I made here is to define the semantics
of the BN in terms of the conditional independence relations, which I find clearer than RN’s
definition that uses the reconstructed full joint probability distribution instead.
– Section 14.4 roughly corresponds to my “Inference in Bayesian Networks”. RN give full details
on variable elimination, which makes for nice ongoing reading.
– Section 14.3 discusses how CPTs are specified in practice.
– Section 14.5 covers approximate sampling-based inference.
– Section 14.6 briefly discusses relational and first-order BNs.
– Section 14.7 briefly discusses other approaches to reasoning about uncertainty.
Chapter 23
Making Simple Decisions Rationally
23.1 Introduction
A Video Nugget covering this section can be found at https://fau.tv/clip/id/30338.
Overview
We now know how to update our world model, represented as (a set of) random
variables, given observations. Now we need to act.
For that we need to answer two questions:
Questions:
Given a world model and a set of actions, what will the likely consequences of each
action be?
How “good” are these consequences?
Idea:
Represent actions as “special random variables”:
Given disjoint actions a1 , . . ., an , introduce a random variable A with domain {a1 , . . ., an }.
Then we can model/query P(X|A = ai ).
Assign numerical values to the possible outcomes of actions (i.e. a function
u : dom(X) → R) indicating their desirability.
Definition 23.1.1. Decision theory investigates decision problems, i.e. how a model-
based agent a deals with choosing among actions based on the desirability of their
outcomes given by a real-valued utility function u on states s ∈ S: i.e. u : S → R.
Decision Theory
If our states are random variables, then we obtain a random variable for the utility
function:
Observation: Let Xi : Ω → Di be random variables on a probability model ⟨Ω, P⟩ and f : D1 × ... × Dn → E. Then F(ω) := f(X1(ω), ..., Xn(ω)) is a random variable Ω → E.
Definition 23.1.2. Given a probability model ⟨Ω, P⟩ and a random variable X : Ω → D with D ⊆ R, then E(X) := Σ_{x∈D} P(X = x) · x is called the expected value (or expectation) of X. (Assuming the sum/series is actually defined!)
Analogously, let e1, ..., en be a sequence of events. Then the expected value of X given e1, ..., en is defined as E(X|e1, ..., en) := Σ_{x∈D} P(X = x|e1, ..., en) · x.
Putting things together:
Definition 23.1.3. Let A : Ω → D be a random variable (where D is a set of actions), Xi : Ω → Di random variables (the state), and u : D1 × ... × Dn → R a utility function. Then the expected utility of the action a ∈ D is the expected value of u (interpreted as a random variable) given A = a; i.e.
EU(a) := Σ_{⟨x1,...,xn⟩∈D1×...×Dn} P(X1 = x1, ..., Xn = xn|A = a) · u(x1, ..., xn)
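A minimal sketch (ours) of what an MEU agent does with this definition: compute EU(a) for every action and pick the maximizer. The world model and utility below are made-up toy values, purely for illustration.

def expected_utility(action, outcome_dist, utility):
    # EU(a) = sum over outcomes x of P(x | A = a) * u(x)
    return sum(p * utility[x] for x, p in outcome_dist[action].items())

outcome_dist = {                    # toy P(X | A = a) for two hypothetical actions
    "take_umbrella":  {"dry": 0.95, "wet": 0.05},
    "leave_umbrella": {"dry": 0.70, "wet": 0.30},
}
utility = {"dry": 10, "wet": -20}   # toy utility function on outcomes

best = max(outcome_dist, key=lambda a: expected_utility(a, outcome_dist, utility))
print(best, expected_utility(best, outcome_dist, utility))   # take_umbrella 8.5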
Utility-based Agents
Definition 23.1.4. A utility-based agent uses a world model along with a utility
function that models its preferences among the states of that world. It chooses the
action that leads to the best expected utility.
Agent Schema:
[Figure 2.14 (AIMA): A model-based, utility-based agent. It uses a model of the world, along with a utility function that measures its preferences among states of the world. Then it chooses the action that leads to the best expected utility, where expected utility is computed by averaging over all possible outcome states, weighted by the probability of the outcome.]
Decision networks
Definition 23.2.1. A decision network is a Bayesian network with two additional kinds of nodes: action nodes, representing the agent's choice among its actions, and a utility node, representing the agent's utility function on states.
Note the sheer number of summands in the sum above in the general case! (⇒ We will simplify where possible later)
Rational Preferences
Note: Preferences of a rational agent must obey certain constraints – An agent with
rational preferences can be described as an MEU-agent.
Definition 23.3.6. We call a set ≻ of preferences rational, iff the following constraints
hold:
Orderability A≻B ∨ B≻A ∨ A∼B
Transitivity A≻B ∧ B≻C ⇒ A≻C
Continuity A≻B≻C ⇒ (∃p.[p,A;1−p,C]∼B)
Substitutability A∼B ⇒ [p,A;1−p,C]∼[p,B;1−p,C]
Monotonicity A≻B ⇒ ((p > q) ⇔ [p,A;1−p,B]≻[q,A;1−q,B])
Decomposability [p,A;1−p,[q,B;1−q,C]]∼[p,A ; ((1 − p)q),B ; ((1 − p)(1 − q)),C]
[Diagram: the compound lottery [p,A; 1−p,[q,B; 1−q,C]] is equivalent to the flat lottery with outcomes A, B, C and probabilities p, (1−p)q, and (1−p)(1−q).]
23.4 Utilities
Ramsey's Theorem and Value Functions
Theorem 23.4.1 (Ramsey, 1931; von Neumann and Morgenstern, 1944). Given a rational set of preferences, there exists a real-valued function U such that U(A) ≥ U(B) iff A ⪰ B, and U([p1,S1; ...; pn,Sn]) = Σ_i pi · U(Si).
Observation: With deterministic prizes only (no lottery choices), only a total
ordering on prizes can be determined.
Definition 23.4.2. We call a total ordering on states a value function or ordinal
utility function. (If we don’t need to care about relative utilities of states, e.g. to
compute non-trivial expected utilities, that’s all we need anyway!)
Utilities
Intuition: Utilities map states to real numbers.
Question: Which numbers exactly?
Definition 23.4.3 (Standard approach to assessment of human utilities).
Compare a given state A to a standard lottery Lp that has
“best possible prize” u⊤ with probability p
“worst possible catastrophe” u⊥ with probability 1 − p
adjust lottery probability p until A∼Lp . Then U (A) = p.
Comparing Utilities
Problem: What is the monetary value of a micromort?
Just ask people: What would you pay to avoid playing Russian roulette with a million-
barrelled revolver? (Usually: quite a lot!)
People appear to be willing to pay about 10,000€ more for a safer car that halves the risk of death. (⇝ 25€ per micromort)
This figure has been confirmed across many individuals and risk types.
Of course, this argument holds only for small risks. Most people won’t agree to kill
themselves for 25M€. (Also: People are pretty bad at estimating and comparing
risks, especially if they are small.) (Various cognitive biases and heuristics are at work
here!)
Given a lottery L with expected monetary value EMV(L), usually U (L) < U (EMV(L)),
i.e., people are risk averse.
Utility curve: For what probability p am I indifferent between a prize x and a lottery [p,M$; 1−p,0$] for large numbers M?
Typical empirical data, extrapolated with risk-prone behavior for debtors:
Strict Dominance
First Assumption: U is often monotone in each argument. (wlog. growing)
Definition 23.5.3. (Informally) An action B strictly dominates an action A, iff every
possible outcome of B is at least as good as every possible outcome of A,
Observation: Strict dominance seldom holds in practice (life is difficult) but is useful
for narrowing down the field of contenders.
Stochastic Dominance
Definition 23.5.4. Let X1 , X2 distributions with domains ⊆ R.
X1 stochastically dominates X2 iff for all t ∈ R, we have P (X1 ≥ t) ≥ P (X2 ≥ t),
and for some t, we have P (X1 ≥ t) > P (X2 ≥ t).
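For finite-domain distributions the condition of Definition 23.5.4 is easy to check mechanically; here is a small Python sketch (ours, with made-up example distributions over negated costs so that "larger is better"):

def stochastically_dominates(p1, p2):
    # p1, p2: dicts value -> probability; check P1(X >= t) >= P2(X >= t) for all t,
    # with strict inequality for at least one t
    thresholds = sorted(set(p1) | set(p2))
    tail = lambda p, t: sum(prob for v, prob in p.items() if v >= t)
    diffs = [tail(p1, t) - tail(p2, t) for t in thresholds]
    return all(d >= -1e-12 for d in diffs) and any(d > 1e-12 for d in diffs)

site1 = {-3: 0.2, -4: 0.6, -5: 0.2}    # negated cost distribution of airport site S1 (toy)
site2 = {-4: 0.2, -5: 0.6, -6: 0.2}    # negated cost distribution of airport site S2 (toy)
print(stochastically_dominates(site1, site2))   # True: S1 dominates S2 on (negated) cost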
Observation 23.5.5. If U is monotone in X1 , and P(X1 |a) stochastically dominates
P(X1 |b) for actions a, b, then a is always the better choice than b, with all other
attributes Xi being equal.
⇒ If some action P(Xi |a) stochastically dominates P(Xi |b) for all attributes Xi ,
we can ignore b.
Observation: Stochastic dominance can often be determined without exact distribu-
tions using qualitative reasoning.
Example 23.5.6 (Construction cost increases with distance). If airport location
S 1 is closer to the city than S 2 ; S 1 stochastically dominates S 2 on cost.q
We have seen how we can do inference with attribute-based utility functions, let us consider the
computational implications. We observe that we have just replaced one evil – exponentially many
states (in terms of the attributes) – by another – exponentially many parameters of the utility
functions.
So we do what we always do in AI-2: we look for structure in the domain, and do more theory to be able to turn such structures into computationally improved representations.
...that there is an additive value function: V(S) = Σ_i Vi(Xi(S)), where Vi is a value function referencing just one variable Xi.
Hence assess n single-attribute functions. (often a good approximation)
Example 23.5.11. The value function for the airport decision might be
U = Σ_{{X1,...,Xk}⊆X} ∏_{i=1}^{k} Ui(Xi = xi)
So far we have tacitly been concentrating on actions that directly affect the environment. We
will now come to a type of action we have hypothesized in the beginning of the course, but have
completely ignored up to now: information gathering actions.
Definition 23.6.2. Information value theory is concerned with agents making rational decisions about information gathering.
Solution: Compute the expected value of the best action given the information, minus
the expected value of the best action without information.
Example 23.6.4 (Oil Drilling Rights contd.).
The survey may say "oil in block 3" with probability 1/n; we then buy block 3 for k/n € and make a profit of (k − k/n)€.
So, we should pay up to k/n € for the information. (as much as block 3 is worth!)
Intuition: The VPI is the expected gain from knowing the value of F relative to
the current expected utility, and considering the relative probabilities of the possible
outcomes of F .
Properties of VPI
Observation 23.6.6 (VPI is Non-negative). VPI_E(F) ≥ 0 for all F and E (in expectation, not post hoc)
Observation 23.6.7 (VPI is Non-additive). VPI_E(F, G) ≠ VPI_E(F) + VPI_E(G) in general (consider, e.g., obtaining F twice)
Observation 23.6.8 (VPI is Order-independent). VPI_E(F, G) = VPI_E(F) + VPI_{E,F}(G) = VPI_E(G) + VPI_{E,G}(F)
We will now use information value theory to specialize our utility-based agent from above.
Summary
An MEU agent maximizes expected utility.
Decision theory provides a framework for rational decision making.
Decision networks augment Bayesian networks with action nodes and a utility node.
rational preferences allow us to obtain a utility function (orderability, transitivity,
continuity, substitutability, monotonicity, decomposability)
multi-attribute utility functions can usually be “destructured” to allow for better
inference and representation (can be monotone, attributes may dominate others,
actions may dominate others, may be multiplicative,...)
information value theory tells us when to explore rather than exploit, using
VPI (value of perfect information) to determine how much to “pay” for information.
Chapter 24
Temporal Probability Models
Stochastic Processes
The world changes in stochastically predictable ways.
Example 24.1.1.
The weather changes, but the weather tomorrow is somewhat predictable given
today’s weather and other factors, (which in turn (somewhat) depends on
yesterday’s weather, which in turn...)
the stock market changes, but the stock price tomorrow is probably related to
today’s price,
A patient’s blood sugar changes, but their blood sugar is related to their blood
sugar 10 minutes ago (in particular if they didn’t eat anything in between)
Markov Processes
Idea: Construct a Bayesian network from these variables (parents?)
...without everything exploding in size...?
Definition 24.1.6. Let (X t )t∈S a stochastic process. X has the (nth order) Markov
property iff X t only depends on a bounded subset of X0:t−1 – i.e. for all t ∈ S we have
P(X t |X 0 , . . .X t−1 ) = P(X t |X t−n , . . .X t−1 ) for some n ∈ S.
A stochastic process with the Markov property for some n is called a (nth order)
Markov process.
Important special cases:
Definition 24.1.7. A Markov process of order 1 is called a Markov chain; it satisfies P(Xt|X0, ..., Xt−1) = P(Xt|Xt−1). Analogously, a second-order Markov process satisfies P(Xt|X0, ..., Xt−1) = P(Xt|Xt−2, Xt−1).
Problem: This network does not actually have the First-order Markov property...
Possible fixes: We have two ways to fix this:
1. Increase the order of the Markov process. (more dependencies ⇒ more complex
inference)
2. Add more state variables, e.g., Tempt , Pressuret . (more information sources)
[Diagram: state variables Xt−1, Xt, Xt+1 in a chain, each emitting an evidence variable Zt−1, Zt, Zt+1.]
Example 24.1.10 (Battery Powered Robot). If the robot has a battery, the Markov
property is violated!
Battery exhaustion has a systematic effect on the change in velocity.
This depends on how much power was used by all previous manoeuvres.
[Diagram: the extended model with battery variables Bt−1, Bt, Bt+1 and velocity variables Vt−1, Vt, Vt+1 added to the state chain Xt−1, Xt, Xt+1 with observations Zt−1, Zt, Zt+1.]
Definition 24.1.15. We say that a sensor model has the sensor Markov property, iff
P(E t |X0:t , E1:t−1 ) = P(E t |X t ) – i.e., the sensor model depends only on the current
state.
Assumptions on Sensor Models: We usually assume the sensor Markov property and
make it stationary as well: P(E t |X t ) is fixed for all t.
Definition 24.1.16 (Note).
If a Markov chain X is stationary and discrete, we can represent the transition
model as a matrix Tij := P (X t = j|X t−1 = i).
If a sensor model has the sensor Markov property, we can represent each observation
E t = et at time t as the diagonal matrix Ot with Otii := P (E t = et |X t = i).
A pair ⟨X, E⟩ where X is a (stationary) Markov chain, Ei only depends on Xi, and E has the sensor Markov property is called a (stationary) hidden Markov model (HMM). (X and E are single variables)
Inference tasks
Definition 24.2.1. Given a Markov process with state variables Xt and evidence variables Et, we are interested in the following Markov inference tasks:
Filtering (or monitoring) P(Xt|E1:t = e1:t): Given the sequence of observations up until time t, compute the likely state of the world at current time t.
Prediction (or state estimation) P(Xt+k|E1:t = e1:t) for k > 0: Given the sequence of observations up until time t, compute the likely future state of the world at time t + k.
Smoothing (or hindsight) P(Xt−k|E1:t = e1:t) for 0 < k < t: Given the sequence of observations up until time t, compute the likely past state of the world at time t − k.
Most likely explanation argmax_{x1:t} P(X1:t = x1:t|E1:t = e1:t): Given the sequence of observations up until time t, compute the most likely sequence of states that led to these observations.
Note: The most likely sequence of states is not (necessarily) the sequence of most likely states ;-)
In this section, we assume X and E to represent multiple variables, where X jointly
forms a Markov chain and the E jointly have the sensor Markov property.
In the case where X and E are stationary single variables, we have a stationary
hidden Markov model and can use the matrix forms.
Using the full joint probability distribution, we can compute any conditional prob-
ability we want, but not necessarily efficiently.
We want to use filtering to update our "world model" P(Xt) based on a new observation Et = et and our previous world model P(Xt−1).
Spoiler: F(et, P(Xt−1|E1:t−1 = e1:t−1)) = α(Ot · Tᵀ · P(Xt−1|E1:t−1 = e1:t−1))
Filtering Derivation
P(Xt|E1:t = e1:t) = P(Xt|Et = et, E1:t−1 = e1:t−1)   (dividing up evidence)
= α(P(Et = et|Xt, E1:t−1 = e1:t−1) · P(Xt|E1:t−1 = e1:t−1))   (using Bayes' rule)
= α(P(Et = et|Xt) · P(Xt|E1:t−1 = e1:t−1))   (sensor Markov property)
= α(P(Et = et|Xt) · (Σ_{x∈dom(X)} P(Xt|Xt−1 = x, E1:t−1 = e1:t−1) · P(Xt−1 = x|E1:t−1 = e1:t−1)))   (marginalization)
= α(P(Et = et|Xt) · (Σ_{x∈dom(X)} P(Xt|Xt−1 = x) · P(Xt−1 = x|E1:t−1 = e1:t−1)))   (conditional independence)
where the three factors are the sensor model, the transition model, and the recursive call, respectively.
Definition 24.2.2. We call the inner part of the above expression the forward algorithm, i.e. P(Xt|E1:t = e1:t) = α(FORWARD(et, P(Xt−1|E1:t−1 = e1:t−1))) =: f1:t.
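In code, one filtering step is only a few lines. The Python sketch below (ours) uses the matrix form f1:t = α(Ot · Tᵀ · f1:t−1) with the umbrella numbers that appear in Example 24.2.9 later in this section:

T = [[0.6, 0.4],      # T[i][j] = P(X_t = j | X_{t-1} = i); state 0 = rain, state 1 = no rain
     [0.2, 0.8]]

def O(umbrella):      # diagonal of the sensor matrix O_t for the given observation
    return [0.9, 0.15] if umbrella else [0.1, 0.85]

def forward(f_prev, umbrella):
    # one step of the forward algorithm: f_{1:t} = α(O_t · Tᵀ · f_{1:t-1})
    predicted = [sum(T[i][j] * f_prev[i] for i in range(2)) for j in range(2)]
    unnorm = [O(umbrella)[j] * predicted[j] for j in range(2)]
    z = sum(unnorm)
    return [x / z for x in unnorm]

f = [0.5, 0.5]                        # P(R_0)
for e in (True, True, False):         # umbrella on days 1 and 2, not on day 3
    f = forward(f, e)
    print(f)                          # ⟨0.8, 0.2⟩, ⟨0.87, 0.13⟩, ⟨0.12, 0.88⟩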
Observation 24.2.6. As k → ∞, P(Xt+k|E1:t = e1:t) converges towards a fixed point called the stationary distribution of the Markov chain. (which we can compute from the equation S = Tᵀ · S)
⇒ the impact of the evidence vanishes.
⇒ The stationary distribution only depends on the transition model.
⇒ There is a small window of time (depending on the transition model) where the evidence has enough impact to allow for prediction beyond the mere stationary distribution.
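The fixed point S = Tᵀ · S can be found, e.g., by power iteration. A small sketch (ours), continuing the filtering example above; for its transition matrix the stationary distribution works out to ⟨1/3, 2/3⟩:

T = [[0.6, 0.4],
     [0.2, 0.8]]
S = [0.5, 0.5]
for _ in range(100):                  # iterate S ← Tᵀ · S until (numerical) convergence
    S = [sum(T[i][j] * S[i] for i in range(2)) for j in range(2)]
print(S)                              # ≈ [0.333, 0.667]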
Smoothing
Smoothing: P(Xt−k|E1:t = e1:t) for k > 0.
Intuition: Use filtering to compute P(Xt−k|E1:t−k = e1:t−k), then recurse backwards from t until t − k.
P(Xt−k|E1:t = e1:t) = P(Xt−k|Et−(k−1):t = et−(k−1):t, E1:t−k = e1:t−k)   (Divide the evidence)
= α(P(Et−(k−1):t = et−(k−1):t|Xt−k, E1:t−k = e1:t−k) · P(Xt−k|E1:t−k = e1:t−k))   (Bayes' rule)
= α(P(Et−(k−1):t = et−(k−1):t|Xt−k) · P(Xt−k|E1:t−k = e1:t−k))   (cond. independence)
= α(f1:t−k × bt−(k−1):t)
where f1:t−k := P(Xt−k|E1:t−k = e1:t−k) and bt−(k−1):t := P(Et−(k−1):t = et−(k−1):t|Xt−k).
Smoothing (continued)
Definition 24.2.7 (Backward message).
bt−k:t = P(Et−k:t = et−k:t|Xt−(k+1))
= Σ_{x∈dom(X)} P(Et−k:t = et−k:t|Xt−k = x, Xt−(k+1)) · P(Xt−k = x|Xt−(k+1))
= Σ_{x∈dom(X)} P(Et−k:t = et−k:t|Xt−k = x) · P(Xt−k = x|Xt−(k+1))
= Σ_{x∈dom(X)} P(Et−k = et−k, Et−(k−1):t = et−(k−1):t|Xt−k = x) · P(Xt−k = x|Xt−(k+1))
= Σ_{x∈dom(X)} P(Et−k = et−k|Xt−k = x) · P(Et−(k−1):t = et−(k−1):t|Xt−k = x) · P(Xt−k = x|Xt−(k+1))
where the factors are the sensor model, the recursive backward message bt−(k−1):t, and the transition model, respectively.
Note: in a stationary hidden Markov model, we get the matrix formulation bt−k:t = T · Ot−k · bt−(k−1):t.
Definition 24.2.8. We call the associated algorithm the backward algorithm, i.e.
P(Xt−k|E1:t = e1:t) = α(FORWARD(et−k, f1:t−(k+1)) × BACKWARD(et−(k−1), bt−(k−2):t)),
where FORWARD(et−k, f1:t−(k+1)) = f1:t−k and BACKWARD(et−(k−1), bt−(k−2):t) = bt−(k−1):t.
As a starting point for the recursion, we let bt+1:t be the uniform vector with 1 in every component.
Smoothing example
Example 24.2.9 (Smoothing Umbrellas). Reminder: We assumed P(R0) = ⟨0.5, 0.5⟩, P(Rt+1|Rt) = 0.6, P(¬Rt+1|¬Rt) = 0.8, P(Ut|Rt) = 0.9, P(¬Ut|¬Rt) = 0.85
⇒ T = ((0.6, 0.4), (0.2, 0.8)), O1 = O2 = diag(0.9, 0.15), and O3 = diag(0.1, 0.85).
(The director carries an umbrella on days 1 and 2, and not on day 3)
Filtering gives f1:1 = ⟨0.8, 0.2⟩, f1:2 = ⟨0.87, 0.13⟩ and f1:3 = ⟨0.12, 0.88⟩.
Let's compute the smoothed estimates from these forward messages and the backward messages; a small computation is sketched below.
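Here is a self-contained Python sketch (ours) of the forward-backward computation for exactly this example; it reproduces the forward messages quoted above and then computes the smoothed estimates P(Rk|e1:3):

T = [[0.6, 0.4], [0.2, 0.8]]                      # transition model (rain, no rain)
def O(umbrella):                                  # diagonal of the sensor matrix
    return [0.9, 0.15] if umbrella else [0.1, 0.85]

def forward(f_prev, e):                           # f_{1:t} = α(O_t · Tᵀ · f_{1:t-1})
    v = [O(e)[j] * sum(T[i][j] * f_prev[i] for i in range(2)) for j in range(2)]
    z = sum(v)
    return [x / z for x in v]

def backward(b_next, e):                          # b = T · O_t · b'
    ob = [O(e)[j] * b_next[j] for j in range(2)]
    return [sum(T[i][j] * ob[j] for j in range(2)) for i in range(2)]

evidence = [True, True, False]
fs, f = [], [0.5, 0.5]
for e in evidence:                                # forward pass
    f = forward(f, e)
    fs.append(f)
b, smoothed = [1.0, 1.0], [None] * 3
for k in range(2, -1, -1):                        # backward pass, combining f and b
    v = [fs[k][i] * b[i] for i in range(2)]
    z = sum(v)
    smoothed[k] = [x / z for x in v]
    b = backward(b, evidence[k])
print(fs)         # [0.8, 0.2], [0.87, 0.13], [0.12, 0.88]
print(smoothed)   # the smoothed estimates P(R_k | e_{1:3})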
A space-optimized variant computes both f1:i and bt−i:t in a single (second) for-loop, storing only one copy of each at a time ⇒ constant space.
But: This requires that both matrices are invertible, i.e. every observation must be possible in every state. (Possible hack: increase the probabilities of 0 to "negligibly small")
m1:t := max_{x1,...,xt−1} P(X1:t−1 = x1:t−1, Xt|E1:t = e1:t)
m1:t(i) gives the maximal probability that the most likely path up to t leads to state Xt = i.
Note that we can leave out the α, since we're only interested in the maximum.
Example 24.2.12. For the sequence [T, T, F, T, T]:
[Figure 15.5 (AIMA): (a) Possible state sequences for Raint can be viewed as paths through a graph of the possible states at each time step. (b) Operation of the Viterbi algorithm for the umbrella observation sequence [true, true, false, true, true]. For each t, the values of the message m1:t give the probability of the best sequence reaching each state at time t; bold arrows mark each state's best predecessor as measured by "best preceding-sequence probability × transition probability". Following the bold arrows back from the most likely state in m1:5 gives the most likely sequence.]
The Viterbi Algorithm
Definition 24.2.13. The Viterbi algorithm now proceeds as follows:
function Viterbi(⟨e1, ..., et⟩, P(X0))
  m := P(X0)                          /* m1:i */
  prev := ⟨⟩                          /* the most likely predecessor of each possible xi */
  for i = 1, ..., t do
    m′ := max_{xi−1} (P(Ei = ei|Xi) · P(Xi|Xi−1 = xi−1) · m_{xi−1})
    prev_i := argmax_{xi−1} (P(Ei = ei|Xi) · P(Xi|Xi−1 = xi−1) · m_{xi−1})
    m ←− m′
  P := ⟨0, 0, ..., argmax_{x∈dom(X)} m_x⟩
  for i = t − 1, ..., 0 do
    P_i := prev_{i,P_{i+1}}
  return P
Observation 24.2.14. Viterbi has linear time complexity and linear space complexity (needs to keep the most likely sequence leading to each state).
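A direct Python transcription of the Viterbi pseudo code (ours), run on the umbrella model of Example 24.2.9 with the observation sequence [T, T, F, T, T] of Example 24.2.12:

states = ("rain", "no_rain")
prior  = {"rain": 0.5, "no_rain": 0.5}                  # P(X_0)
trans  = {"rain":    {"rain": 0.6, "no_rain": 0.4},     # P(X_i | X_{i-1})
          "no_rain": {"rain": 0.2, "no_rain": 0.8}}
sensor = {True:  {"rain": 0.9, "no_rain": 0.15},        # P(E_i = e | X_i)
          False: {"rain": 0.1, "no_rain": 0.85}}

def viterbi(evidence):
    m, prevs = dict(prior), []
    for e in evidence:
        prev, m_new = {}, {}
        for x in states:
            best = max(states, key=lambda xp: trans[xp][x] * m[xp])  # best predecessor of x
            prev[x] = best
            m_new[x] = sensor[e][x] * trans[best][x] * m[best]
        prevs.append(prev)
        m = m_new
    path = [max(states, key=lambda x: m[x])]             # most likely final state
    for prev in reversed(prevs[1:]):                     # follow best predecessors backwards
        path.insert(0, prev[path[0]])
    return path

print(viterbi([True, True, False, True, True]))
# ['rain', 'rain', 'no_rain', 'rain', 'rain'] with these parameters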
Remark 24.3.2. This only works for perfect sensors. (else no impossible states)
What if our sensors are imperfect?
The transition matrix for the move action (T has 42² = 1764 entries):
P(Xt+1 = j|Xt = i) = Tij = 1/|N(i)| if j ∈ N(i), and 0 otherwise.
We do not know where the robot starts: P(X0 = i) = 1/n (here n = 42).
Evidence variable Et: four bits for presence/absence of obstacles in N, S, W, E. Let dit be the number of wrong bits and ϵ the error rate of the sensor. Then
P(Et = et|Xt = i) = Otii = (1 − ϵ)^{4−dit} · ϵ^{dit}
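A small sketch (ours) of how these two models can be built in Python for a toy maze; the maze layout, coordinate conventions, and helper names are made up for illustration:

import itertools

free = [(x, y) for x, y in itertools.product(range(7), range(3))]   # toy 7×3 maze, all squares free
index = {sq: i for i, sq in enumerate(free)}
n = len(free)

def neighbors(sq):
    x, y = sq
    return [p for p in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)) if p in index]

# transition model: move to a uniformly chosen neighboring free square
T = [[0.0] * n for _ in range(n)]
for sq, i in index.items():
    for nb in neighbors(sq):
        T[i][index[nb]] = 1.0 / len(neighbors(sq))

eps = 0.2                                                # per-bit sensor error rate ϵ
def O_diag(reading):
    # diagonal of O_t for a 4-bit NSWE obstacle reading, e.g. {"N": True, "S": False, ...}
    diag = []
    for sq in free:
        truth = {"N": (sq[0], sq[1]+1) not in index, "S": (sq[0], sq[1]-1) not in index,
                 "W": (sq[0]-1, sq[1]) not in index, "E": (sq[0]+1, sq[1]) not in index}
        d = sum(reading[k] != truth[k] for k in "NSWE")  # number of wrong bits d_it
        diag.append((1 - eps) ** (4 - d) * eps ** d)
    return diag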
[Figure 15.7 (AIMA): Posterior distribution over robot location: (a) after one observation E1 = NSW; (b) after a second observation E2 = NS. The size of each disk corresponds to the probability that the robot is at that location. The sensor error rate is ϵ = 0.2.]
Still the same locations as in the "perfect sensing" case, but now other locations have non-zero probability.
HMM Example: Further Inference Applications
Idea: We can use smoothing (bk+1:t = T · Ok+1 · bk+2:t) to find out where the robot started, and the Viterbi algorithm to find the most likely path it took.
Example 24.3.5. Performance of HMM localization vs. observation length (various error rates ϵ):
[Figure 15.8 (AIMA): localization error and Viterbi path accuracy as a function of the length of the observation sequence, for per-bit sensor error rates ϵ ∈ {0.00, 0.02, 0.05, 0.10, 0.20}. Even when ϵ is 20% – which means that the overall sensor reading is wrong 59% of the time – the robot is usually able to work out its location within two squares after 25 observations; when ϵ is 10%, the performance after a half-dozen observations is hard to distinguish from the performance with perfect sensing.]
24.4 Dynamic Bayesian Networks
A Video Nugget covering this section can be found at https://fau.tv/clip/id/30355.
Dynamic Bayesian networks
Definition 24.4.1. A Bayesian network D is called dynamic (a DBN), iff its random variables are indexed by a time structure. We assume that D is
time sliced, i.e. that the time slices Dt – the subgraphs of t-indexed random
variables and the edges between them – are isomorphic.
a stationary Markov chain, i.e. that variables Xt can only have parents in Dt
and Dt−1 .
[Figure: the first slice of the umbrella DBN – nodes Rain0 → Rain1 → Umbrella1 with CPTs P(R0) = 0.7; P(R1|R0): T ↦ 0.7, F ↦ 0.3; P(U1|R1): T ↦ 0.9, F ↦ 0.2.]
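As an illustration of filtering in this model, here is a minimal sketch of the forward recursion for the umbrella DBN slice above; the CPT numbers are the ones shown in the figure, everything else (names, observation sequence) is illustrative.

import numpy as np

# transition P(R_t | R_{t-1}) and sensor P(U_t | R_t); state order: (rain, no rain)
T = np.array([[0.7, 0.3],            # from rain:    P(rain) = 0.7
              [0.3, 0.7]])           # from no rain: P(rain) = 0.3
O_umbrella    = np.diag([0.9, 0.2])  # P(U = true  | R)
O_no_umbrella = np.diag([0.1, 0.8])  # P(U = false | R)

def forward(belief, observed_umbrella):
    """One filtering step: predict with T, weight by the sensor model, normalize."""
    O = O_umbrella if observed_umbrella else O_no_umbrella
    new_belief = O @ T.T @ belief
    return new_belief / new_belief.sum()

belief = np.array([0.7, 0.3])        # prior P(R_0) = 0.7
for u in [True, True, False]:        # a short (made-up) observation sequence
    belief = forward(belief, u)
    print(belief)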
Summary
Temporal probability models use state and evidence variables replicated over time.
Markov property and stationarity assumption, so we need both
a transition model P(Xt |Xt−1 ) and
a sensor model P(Et |Xt ).
Tasks are filtering, prediction, smoothing, most likely sequence; (all done
recursively with constant cost per time step)
Hidden Markov models have a single discrete state variable; (used for speech
recognition)
DBNs subsume HMMs, exact update intractable.
We will now pick up the thread from ?? but using temporal models instead of simply probabilistic ones. We will first look at sequential decision theory in the special case where the environment is stochastic but fully observable (Markov decision processes), and then lift that to obtain POMDPs and present an agent design based on them.
Outline
We will now combine the idea of a stochastic process with that of acting based on maximizing expected utility:
[Diagram: search – explicit actions and subgoals – combined with uncertainty and utility.]
We will fortify our intuition by an example. It is specifically chosen to be very simple, but
to exhibit all the peculiarities of Markov decision problems, which we will generalize from this
example.
Perhaps more interesting than the components of an MDP is what is not a component: a belief and/or sensor model. Recall that MDPs are for fully observable environments.
Idea: We use the rewards as a utility function: The goal is to choose actions such that the expected cumulative reward for the "foreseeable future" is maximized
⇒ need to take future actions and future states into account
Solving MDPs
In MDPs, the aim is to find an optimal policy π(s), which tells us the best action
for every possible state s. (because we can’t predict where we might end up, we
need to consider all states)
Note: When you run against a wall, you stay in your square.
[Figure: optimal policies for the 4×3 world for four different ranges of the reward R(s) for nonterminal states: R(s) < −1.6284; −0.4278 < R(s) < −0.0850; −0.0221 < R(s) < 0; R(s) > 0.]
Question: Explain what you see in a qualitative manner!
Answer: reserved for the plenary sessions ; be there!
Utility of States
Remember: Given a sequence of states S = s0 , s1 , s2 , . . ., and a discount factor
0 ≤ γ < 1, the utility of the sequence is
u(S) = ∑_{t=0}^{∞} γ^t · R(s_t)
Definition 25.2.3. Given a policy π and a starting state s0 , let S^π_{s0} be the random variable giving the sequence of states resulting from executing π at every state starting at s0 . (Since the environment is stochastic, we don't know the exact sequence.)
Then the expected utility obtained by executing π starting in s0 is given by U^π(s0) := EU(S^π_{s0}).
⇒ given the “true” utilities, we can compute the optimal policy and vice versa.
Question: Why do we go left in (3, 1) and not up? (follow the utility)
expected sum of rewards = current reward + γ · exp. reward sum after best action
Definition 25.3.3. The value iteration algorithm for utility functions is given by
function VALUE−ITERATION (mdp,ϵ) returns a utility fn.
inputs: mdp, an MDP with states S, actions A(s), transition model P (s′ |s, a),
rewards R(s), and discount γ
ϵ, the maximum error allowed in the utility of any state
local variables: U , U ′ , vectors of utilities for states in S, initially zero
δ, the maximum change in the utility of any state in an iteration
repeat
U := U ′ ; δ := 0
for each state s in S do
U ′ [s] := R(s) + γ · max_{a∈A(s)} (∑_{s′} U [s′ ] · P (s′ |s, a))
if |U ′ [s] − U [s]| > δ then δ := |U ′ [s] − U [s]|
until δ < ϵ(1 − γ)/γ
return U
Remark: Retrieve the optimal policy with π[s] := argmax_{a∈A(s)} (∑_{s′} U [s′ ] · P (s′ |s, a))
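As a concrete illustration of the algorithm and the remark above, here is a minimal sketch in Python; the MDP encoding via functions A, P, R is illustrative, not part of the lecture materials.

def value_iteration(states, A, P, R, gamma, eps):
    """Value iteration as in VALUE-ITERATION above.
    states: list of states; A(s): available actions (assumed non-empty);
    P(s1, s, a): transition probability P(s'|s,a); R(s): reward; gamma < 1."""
    U1 = {s: 0.0 for s in states}
    while True:
        U, delta = dict(U1), 0.0
        for s in states:
            U1[s] = R(s) + gamma * max(
                sum(P(s1, s, a) * U[s1] for s1 in states) for a in A(s))
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < eps * (1 - gamma) / gamma:
            return U1

def best_policy(states, A, P, U):
    """Retrieve the optimal policy as in the remark above."""
    return {s: max(A(s), key=lambda a: sum(P(s1, s, a) * U[s1] for s1 in states))
            for s in states}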
[Figure: (a) the utility estimates for selected states ((4,3), (3,3), (1,1), (3,1), (4,1)) as a function of the number of value iteration steps; (b) the number of iterations required to guarantee an error of at most ε = c · Rmax, as a function of the discount factor, for c = 0.0001, 0.001, 0.01, 0.1.]
Convergence
where the update is assumed to be applied simultaneously to all the states at each iteration. If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium (see Section 17.2.3), in which case the final utility values must be solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy (obtained using Equation (17.4)) is optimal. The algorithm, called VALUE-ITERATION, is shown in Figure 17.4. We can apply value iteration to the 4×3 world in Figure 17.1(a). Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a).

Definition 25.3.5. The maximum norm is defined as ∥U∥ = max_s |U(s)|, so ∥U − V∥ = maximum difference between U and V.

Let U^t and U^{t+1} be successive approximations to the true utility U during value iteration.
Theorem 25.3.6. For any two approximations U^t and V^t:
∥U^{t+1} − V^{t+1}∥ ≤ γ · ∥U^t − V^t∥
I.e., any distinct approximations get closer to each other over time
In particular, any approximation gets closer to the true U over time
⇒ value iteration converges to a unique, stable, optimal solution.
Theorem 25.3.7. If ∥U^{t+1} − U^t∥ < ϵ, then ∥U^{t+1} − U∥ < 2ϵγ/(1 − γ)
(once the change in U t becomes small, we are almost done.)
Remark: The policy resulting from U^t may be optimal long before the utilities converge!
So we see that iteration with Bellman updates will always converge towards the utility of a state,
even without knowing the optimal policy. That gives us a first way of dealing with sequential
decision problems: we compute utility functions based on states and then use the standard MEU
machinery. We have seen above that optimal policies and state utilities are essentially inter-
changeable: we can compute one from the other. This leads to another approach to computing
state utilities: policy iteration, which we will discuss now.
Policy Iteration
Recap: Value iteration computes utilities ; optimal policy by MEU.
This even works if the utility estimate is inaccurate. (⇝ policy loss small)
Idea: Search for optimal policy and utility values simultaneously [How60]: Iterate
policy evaluation: given policy πi , calculate Ui = U πi , the utility of each state
were πi to be executed.
policy improvement: calculate a new MEU policy πi+1 using one-step lookahead
Terminate if policy improvement yields no change in computed utilities.
Observation 25.3.8. Upon termination Ui is a fixpoint of Bellman update
; Solution to Bellman equation ; πi is an optimal policy.
Observation 25.3.9. Policy improvement improves policy and policy space is finite
; termination.
Policy Evaluation
Problem: How to implement the POLICY−EVALUATION algorithm?
Solution: To compute utilities given a fixed π: For all s we have
U (s) = R(s) + γ · (∑_{s′} U (s′ ) · P (s′ |s, π(s)))
(i.e. Bellman equation with the maximum replaced by the current policy π)
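A minimal sketch that computes U for a fixed policy by solving this linear system directly; the state encoding and names are illustrative, and we assume γ < 1 (or proper terminal handling) so that the system is nonsingular.

import numpy as np

def policy_evaluation(states, pi, P, R, gamma):
    """Solve U = R + gamma * P_pi U, i.e. (I - gamma*P_pi) U = R, for a fixed policy pi."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))
    for s in states:
        for s1 in states:
            P_pi[idx[s], idx[s1]] = P(s1, s, pi[s])   # P(s' | s, pi(s))
    R_vec = np.array([R(s) for s in states])
    U = np.linalg.solve(np.eye(n) - gamma * P_pi, R_vec)
    return {s: U[idx[s]] for s in states}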
Example 25.3.11 (Simplified Bellman Equations for π).
Partial Observability
Definition 25.4.1. A partially observable MDP (a POMDP for short) is an MDP
together with an observation model O that has the sensor Markov property and is
stationary: O(s, e) = P (e|s).
Example 25.4.2 (Noisy 4x3 World).
Problem: Agent does not know which state it is in ; makes no sense to talk
about policy π(s)!
Theorem 25.4.3 (Astrom 1965). The optimal policy in a POMDP is a function
π(b) where b is the belief state (probability distribution over states).
Idea: Convert a POMDP into an MDP in belief state space, where T (b, a, b′ ) is
the probability that the new belief state is b′ given that the current belief state is b
and the agent does a. I.e., essentially a filtering update step.
For POMDPs, we also need to consider actions. (but the effect is the same)
If b is the previous belief state and agent does action A = a and then perceives
E = e, then the new belief state is
b′(s′) = α · (P(E = e|s′ ) · (∑_s P(s′ |S = s, A = a) · b(s)))
Consequence: The optimal policy can be written as a function π ∗ (b) from belief
states to actions.
Definition 25.4.4. The POMDP decision cycle is to iterate over
1. Given the current belief state b, execute the action a = π ∗ (b)
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
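A minimal sketch of such a FORWARD update; the function and argument names are illustrative, with P and Psense standing for the transition and sensor models.

def forward(belief, action, percept, states, P, Psense):
    """b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) * b(s)  -- cf. the update above."""
    new_belief = {}
    for s1 in states:
        new_belief[s1] = Psense(percept, s1) * sum(
            P(s1, s, action) * belief[s] for s in states)
    alpha = 1.0 / sum(new_belief.values())
    return {s1: alpha * p for s1, p in new_belief.items()}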
Intuition: POMDP decision cycle is search in belief state space.
Example 25.4.6. The belief state of the 4x3 world is an 11-dimensional continuous space. (11 states)
Theorem 25.4.7. Solving POMDPs is very hard! (actually, PSPACE hard)
Write the probability of reaching b′ from b, given action a, as P (b′ |b, a); then
P (b′ |b, a) = ∑_e P (b′ |e, a, b) · P (e|a, b) = ∑_e P (b′ |e, a, b) · (∑_{s′} P (e|s′ ) · (∑_s P (s′ |s, a) · b(s)))
Observation: This equation defines a transition model for belief state space!
Idea: We can also define a reward function for belief states:
ρ(b) := ∑_s b(s) · R(s)
i.e., the expected reward for the actual states the agent might be in.
Together, P (b′ |b, a) and ρ(b) define an (observable) MDP on the space of belief
states.
Theorem 25.4.8. An optimal policy π ∗ (b) for this MDP is also an optimal policy for the original POMDP.
Upshot: Solving a POMDP on a physical state space can be reduced to solving
an MDP on the corresponding belief state space.
π ∗ will choose to execute the conditional plan with highest expected utility
Observation 3 (combined): The utility function U (b) on belief states, being the maximum of a collection of hyperplanes, is piecewise linear and convex.
Consider the one-step plans [Stay] and [Go] and their direct utilities:
[Plot: the utilities of the one-step plans [Stay] and [Go] as functions of the probability of state 1.]
The maximum represents the utility function for the finite-horizon problem that allows just one action; in each "piece" the optimal action is the first action of the corresponding plan. Here the optimal one-step policy is to "Stay" when b(1) > 0.5 and "Go" otherwise.
[Figure 17.8: (a) utility of the two one-step plans as a function of the initial belief state b(1) for the two-state world, with the corresponding utility function shown in bold; (b) utilities for 8 distinct two-step plans; (c) utilities for four undominated two-step plans.]
There are four undominated plans, each optimal in their region.
The elimination of dominated plans is essential for reducing this doubly exponential
growth (but they are already constructed)
Hopelessly inefficient in practice – even the 3x4 POMDP is too hard!
Decisions are made in DDN by projecting forward possible action sequences and
choosing the best one.
DDNs – like the DBNs they are based on – are factored representations
; typically exponential complexity advantages!
Figure 17.10 shows the generic structure of a dynamic decision network. Variables with known values are shaded. The current time is t and the agent must decide what to do – that is, choose a value for At . The network has been unrolled into the future for three steps and represents future rewards, as well as the utility of the state at the look-ahead horizon.

The POMDP state St becomes a set of random variables Xt , and there may be multiple evidence variables Et . The action at time t is denoted by At , so the transition model becomes P(Xt+1 |Xt , At ) and the sensor model P(Et |Xt ). Reward functions Rt and utility Ut of the state St . Variables with known values are gray; there are reward nodes for t = 0, . . ., t + 2, but a utility node for t + 3 (=b the discounted sum of the remaining rewards).

Problem: How do we compute with that?
Answer: All POMDP algorithms can be adapted to DDNs! (only need CPTs)

17.4.3 Online agents for POMDPs
In this section, we outline a simple approach to agent design for partially observable, stochastic environments. The basic elements of the design are already familiar:
• The transition and sensor models are represented by a dynamic Bayesian network (DBN), as described in Chapter 15.
• The dynamic Bayesian network is extended with decision and utility nodes, as used in decision networks in Chapter 16. The resulting model is called a dynamic decision network, or DDN.
• A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation.
• Decisions are made by projecting forward possible action sequences and choosing the best one.
DBNs are factored representations in the terminology of Chapter 2; they typically have an exponential complexity advantage over atomic representations and can model quite substantial real-world problems. The agent design is therefore a practical implementation of the utility-based agent sketched in Chapter 2.
In the DBN, the single state St becomes a set of state variables Xt , and there may be multiple evidence variables Et . We will use At to refer to the action at time t, so the transition model becomes P(Xt+1 |Xt , At ) and the sensor model becomes P(Et |Xt ). We will use Rt to refer to the reward received at time t and Ut to refer to the utility of the state at time t. (Both of these are random variables.) With this notation, a dynamic decision network looks like the one shown in Figure 17.10.
Dynamic decision networks can be used as inputs for any POMDP algorithm, including those for value and policy iteration methods. In this section, we focus on look-ahead methods that project action sequences forward from the current belief state in much the same way as do the game-playing algorithms of Chapter 5. The network in Figure 17.10 has been projected three steps into the future; the current and future decisions A and the future observations E and rewards R are all unknown. Notice that the network includes nodes for the rewards for Xt+1 and Xt+2 , but the utility for Xt+3 . This is because the agent must maximize the (discounted) sum of all future rewards, and U (Xt+3 ) represents the reward for Xt+3 and all subsequent rewards. As in Chapter 5, we assume that U is available only in some approximate form: if exact utility values were available, look-ahead beyond depth 1 would be unnecessary.

Lookahead: Searching over the Possible Action Sequences

Idea: Search over the tree of possible action sequences (like in game-play)
Part of the lookahead solution of the DDN above (three steps lookahead):
[Figure 17.11: part of the look-ahead solution of the DDN in Figure 17.10; each decision will be taken in the belief state indicated.]
circle =b chance nodes (the environment decides)
triangle =b belief state (each action decision is taken there)

Figure 17.11 shows part of the search tree corresponding to the three-step look-ahead DDN in Figure 17.10. Each of the triangular nodes is a belief state in which the agent makes a decision At+i for i = 0, 1, 2, . . .. The round (chance) nodes correspond to choices by the environment, namely, what evidence Et+i arrives. Notice that there are no chance nodes corresponding to the action outcomes; this is because the belief-state update for an action is deterministic regardless of the outcome. The belief state at each triangular node can be computed by applying a filtering algorithm to the sequence of percepts and actions leading to it. In this way, the algorithm takes into account the fact that, for decision At+i , the agent will have available percepts Et+1 , . . ., Et+i , even though at time t it does not know what those percepts will be. In this way, a decision-theoretic agent automatically takes into account the value of information and will execute information-gathering actions where appropriate.

Designing Online Agents for POMDPs

The belief state at each triangle is computed at time t by filtering with the actions/percepts leading to it; the decision At+i will use the percepts Et+1:t+i (even if their values are unknown at time t).
; a POMDP agent automatically takes into account the value of information and executes information gathering actions where appropriate.
A decision can be extracted from the search tree by backing up the utility values from the leaves, taking an average at the chance nodes and taking the maximum at the decision nodes.
Summary
Machine Learning
This part introduces the foundations of machine learning methods in AI. We discuss the prob-
lem learning from observations in general, study inference-based techniques, and then go into
elementary statistical methods for learning.
The current hype topics of deep learning, reinforcement learning, and large language models
are only very superficially covered, leaving them to specialized courses.
Chapter 26
Learning from Observations
In this chapter we introduce the concepts, methods, and limitations of inductive learning, i.e.
learning from a set of given examples.
Outline
Learning agents
Inductive learning
Neural Networks
Support Vector Machines
i.e., expose the agent to reality rather than trying to write it down
Learning modifies the agent’s decision mechanisms to improve performance.
Definition 26.1.4. Learning element may use knowledge already acquired in the
performance element.
Definition 26.1.5. Learning may require experimentation – actions an agent might not normally consider, such as dropping rocks from the Tower of Pisa.
Ways of Learning
Supervised learning: There’s an unknown function f : A → B called the target
function. We do know a set of pairs T := {⟨ai , f (ai )⟩} of examples. The goal is to
find a hypothesis h ∈ H ⊆ A → B based on T , that is “approximately” equal to f .
(Most of the techniques we will consider)
Unsupervised learning: Given a set of data A, find a pattern in the data; i.e. a
function f : A → B for some predetermined B. (Primarily
clustering /dimensionality reduction)
Reinforcement learning: The agent receives a reward for each action performed. The goal is to iteratively adapt the action function to maximize the total reward.
(Useful in e.g. game play)
a set of examples T ⊆ A × B called the training set, such that for every a ∈ A,
there is at most one b ∈ B with ⟨a, b⟩ ∈ T , (⇒ T is a function on some subset of
A)
We assume there is an unknown function f : A → B called the target function with
T ⊆ f.
Definition 26.2.2. Inductive learning algorithms solve inductive learning problems by
finding a hypothesis h ∈ H such that h ∼ f (for some notion of similarity).
Definition 26.2.3. We call a supervised learning problem with target function A → B
a classification problem if B is finite, and call the members of B classes.
We call it a regression problem if B = R.
Training Set
Linear Hypothesis
partially, approximatively
consistent
Quadratic Hypothesis
partially consistent
Degree-4 Hypothesis
consistent
High-degree Hypothesis
consistent
Intuition: This only works, if the training set is “representative” for the underlying
process.
Idea: We think of examples (seen and unseen) as a sequence, and express the
“representativeness” as a stationarity assumption for the probability distribution.
Attribute-based Representations
Definition 26.3.1. In attribute-based representations, examples are described by
attributes: (simple) functions on input samples, (think pre classifiers on
examples)
their value, and (classify by attributes)
classifications. (Boolean, discrete, continuous, etc.)
Example 26.3.2 (In a Restaurant). Situations where I will/won’t wait for a table:
Attributes Target
Example Alt Bar F ri Hun P at P rice Rain Res T ype Est WillWait
X1 T F F T Some $$$ F T French 0–10 T
X2 T F F T Full $ F F Thai 30–60 F
X3 F T F F Some $ F F Burger 0–10 T
X4 T F T T Full $ F F Thai 10–30 T
X5 T F T F Full $$$ F T French >60 F
X6 F T F T Some $$ T T Italian 0–10 T
X7 F T F F None $ T F Burger 0–10 F
X8 F F F T Some $$ T T Thai 0–10 T
X9 F T T F Full $ T F Burger >60 F
X 10 T T T T Full $$$ F T Italian 10–30 F
X 11 F F F F None $ F F Thai 0–10 F
X 12 T T T T Full $ F F Burger 30–60 T
Decision Trees
Decision trees are one possible representation for hypotheses.
Example 26.3.4 (Restaurant continued). Here is the “true” tree for deciding
whether to wait:
We evaluate the tree by going down the tree from the top, and always take the branch whose
attribute matches the situation; we will eventually end up with a Boolean value; the result. Using
the attribute values from X3 in ?? to descend through the tree in ?? we indeed end up with the
result “true”. Note that
Expressiveness
Decision trees can express any function of the input attributes ⇒ H = A1 ×. . .×An
Example 26.3.7. For Boolean functions, a path from the root to a leaf corresponds
to a row in a truth table:
Trivially, for any training set there is a consistent hypothesis with one path to a
leaf for each example, but it probably won’t generalize to new examples.
Solution: Prefer to find more compact decision trees.
Choosing an Attribute
Idea: A good attribute splits the examples into subsets that are (ideally) “all
positive” or “all negative”.
Example 26.3.9.
Attribute "Patrons?" is a better choice: it gives information about the classification.
Can we make this more formal? ; Use information theory! (up next)
Information Entropy
Intuition: Information answers questions – the less I know initially, the more information is contained in an answer.
Definition 26.4.1. Let ⟨p1 , . . ., pn ⟩ be the distribution of a random variable P . The information (also called entropy) of P is
I(⟨p1 , . . ., pn ⟩) := ∑_{i=1}^{n} −pi · log2 (pi )
Treating attributes also as random variables, we can compute how much information
is needed after knowing the value for one attribute:
Example 26.4.4. If we know Pat = Full, we only need I(P(WillWait|Pat = Full)) = I(⟨4/6, 2/6⟩) ≊ 0.9 bits of information.
Note: The expected number of bits needed after an attribute test on A is ∑_a P (A = a) · I(P(C|A = a))
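A minimal sketch of these two quantities (the entropy of a distribution and the expected remaining information after an attribute test); the helper names are illustrative, and attribute and classify are assumed accessor functions on examples.

from math import log2

def entropy(dist):
    """I(<p1,...,pn>) = sum_i -pi * log2(pi)."""
    return sum(-p * log2(p) for p in dist if p > 0)

def remaining_information(examples, attribute, classify):
    """Expected bits needed after testing `attribute`: sum_a P(A=a) * I(P(C|A=a))."""
    total = len(examples)
    rest = 0.0
    for a in set(attribute(e) for e in examples):
        subset = [e for e in examples if attribute(e) == a]
        classes = [classify(e) for e in subset]
        dist = [classes.count(c) / len(subset) for c in set(classes)]
        rest += len(subset) / total * entropy(dist)
    return rest

# e.g. entropy([4/6, 2/6]) is about 0.918 bits, as in Example 26.4.4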
Result: Substantially simpler than “true” tree – a more complex hypothesis isn’t
justified by small amount of data.
Performance measurement
Question: How do we know that h≊f ? (Hume’s Problem of Induction)
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples. (use same distribution over example space
as training set)
Question: How big should the information gain be to split (; keep) a node?
Idea: Use a statistical significance test.
Definition 26.5.6. A result has statistical significance if the probability that it could arise from the null hypothesis (i.e. the assumption that there is no underlying pattern) is very low (usually 5%).
Compute the probability that the example distribution (p positive, n negative) for
a terminal node deviates from the expected distribution under the null hypothesis.
For an attribute A with d values, compare the actual numbers pk and nk in each
subset sk with the expected numbers (expected if A is irrelevant)
p̂k = p · (pk + nk )/(p + n)  and  n̂k = n · (pk + nk )/(p + n).
∆ = ∑_{k=1}^{d} ((pk − p̂k )²/p̂k + (nk − n̂k )²/n̂k )
Caveat: A low error rate on the training set does not mean that a hypothesis
generalizes well.
Idea: Do not use homework questions in the exam.
Definition 26.5.11. The practice of splitting the data available for learning into
1. a training set from which the learning algorithm produces a hypothesis h and
2. a test set, which is used for evaluating h
Model Selection
Definition 26.5.14. The model selection problem is to determine – given data –
a good hypothesis space.
Example 26.5.15. What is the best polynomial degree to fit the data
Concrete Problem: Find the “size” that best balances overfitting and underfitting
to optimize test set accuracy.
[Plot: training set error and validation set error (error rate) as a function of tree size.]
Stop when the training set error rate converges; choose the tree size that is optimal on the validation set.
Generalization Loss
Note: L(y, y) = 0. (no loss if you are exactly correct)
Empirical Loss
Regularization
Idea: Directly use empirical loss to solve model selection. (finding a good H)
Minimize the weighted sum of empirical loss and hypothesis complexity. (to avoid
overfitting).
Definition 26.5.25. Let λ ∈ R, h ∈ H, and E a set of examples, then we call
Remark: In regularization, empirical loss and hypothesis complexity are not mea-
sured in the same scale ; λ mediates between scales.
Idea: Measure both in the same scale ; use information content, i.e. in bits.
The minimum description length or MDL hypothesis minimizes the total number of
bits required.
This works well in the limit, but for smaller problems there is a difficulty in that the
choice of encoding for the program affects the outcome.
e.g., how best to encode a decision tree as a bit string?
In recent years there has been more emphasis on large-scale learning. (millions of
examples)
Generalization error is dominated by limits of computation
there is enough data and a rich enough model that we could find an h that
is very close to the true f ,
but the computation to find it is too complex, so we settle for a sub-optimal
approximation.
Hardware advances (GPU farms, Amazon EC2, Google Data Centers, . . . ) help.
PAC Learning
Basic idea of Computational Learning Theory:
Any hypothesis h that is seriously wrong will almost certainly be “found out”
with high probability after a small number of examples, because it will make an
incorrect prediction.
Thus, any h that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong.
; h is probably approximately correct.
Definition 26.6.1. Any learning algorithm that returns hypotheses that are prob-
ably approximately correct is called a PAC learning algorithm.
Derive performance bounds for PAC learning algorithms in general, using the
Stationarity Assumption (again): We assume that the set E of possible examples
is IID ; we have a fixed distribution P(E) = P(X, Y ) on examples.
PAC Learning
Start with PAC theorems for Boolean functions, for which L0/1 is appropriate.
Definition 26.6.2. The error rate error(h) of a hypothesis h is the probability that
Sample Complexity
Let's compute the probability that hb ∈ Hb is consistent with the first N examples.
We know error(hb ) > ϵ
; P (hb agrees with N examples) ≤ (1 − ϵ)^N. (independence)
; P (Hb contains a consistent hypothesis) ≤ |Hb | · (1 − ϵ)^N ≤ |H| · (1 − ϵ)^N. (Hb ⊆ H)
; to bound this by a small δ, it suffices to show the algorithm N ≥ (1/ϵ) · (log2 (1/δ) + log2 (|H|)) examples.
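For instance (an illustrative plug-in, not from the lecture): with ϵ = 0.1, δ = 0.05, and |H| = 2^10, the bound gives N ≥ 10 · (log2 (20) + 10) ≈ 144 examples.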
Definition 26.6.4. The number of required examples as a function of ϵ and δ is
called the sample complexity of H.
Example 26.6.5. If H is the set of n-ary Boolean functions, then |H| = 2^(2^n).
; sample complexity grows with O(log2 (2^(2^n))) = O(2^n).
There are 2^n possible examples,
; PAC learning for Boolean functions needs to see (nearly) all examples.
H contains enough hypotheses to classify any given set of examples in all possible
ways.
In particular, for any set of N examples, the set of hypotheses consistent with
those examples contains equal numbers of hypotheses that predict xN +1 to be
positive and hypotheses that predict xN +1 to be negative.
[Decision list: if Patrons(x, Some) then Yes; else if Patrons(x, Full) ∧ Fri/Sat(x) then Yes; else No.]
Lemma 26.6.8. Given arbitrary size conditions, decision lists can represent arbi-
trary Boolean functions.
Plug this into the equation for the sample complexity N ≥ (1/ϵ) · (log2 (1/δ) + log2 (|H|)) to obtain
N ≥ (1/ϵ) · (log2 (1/δ) + O(n^k · log2 (n^k)))
Intuitively: Any algorithm that returns a consistent decision list will PAC learn a
k−DL function in a reasonable number of examples, for small k.
[Plot: learning curves for decision tree and decision list learning as a function of training set size.]
Recall: A mapping f between vector spaces is called linear, iff it preserves plus
and scalar multiplication, i.e. f (α · v1 + v2 ) = α · f (v1 ) + f (v2 ).
Observation 26.7.2. A univariate, linear function f : R → R is of the form f (x) =
w1 x + w0 for some wi ∈ R.
Idea: Minimize squared error loss over {(xi ,yi ) | i ≤ N } (used already by Gauss)
Loss(hw ) = ∑_{j=1}^{N} L2 (yj , hw (xj )) = ∑_{j=1}^{N} (yj − hw (xj ))² = ∑_{j=1}^{N} (yj − (w1 xj + w0 ))²
Remark: Closed-form solutions only exist for linear regression, for other (dif-
ferentiable) hypothesis spaces use gradient descent methods for adjusting/learning
weights.
Definition 26.7.6. The weight space of a parametric model is the space of all
possible combinations of parameters (called the weights). Loss minimization in a
weight space is called weight fitting.
[Plot: the squared error loss as a function of the weights (w0 , w1 ).] Note: it is convex.
Observation 26.7.7. The squared error loss function is convex for any linear
regression problem ; there are no local minima.
The parameter α is called the learning rate. It can be a fixed constant or it can
decay as learning proceeds.
These updates constitute the batch gradient descent learning rule for univariate
linear regression.
Convergence to the unique global loss minimum is guaranteed (as long as we pick
α small enough) but may be very slow.
Doing batch gradient descent on random subsets of the examples of fixed batch
size n is called stochastic gradient descent (SGD). (More computationally efficient
than updating for every example)
Gradient descent will reach the (unique) minimum of the loss function; the update equation for each weight wi is
wi ←− wi + α · (∑_j xj,i · (yj − hw (⃗xj )))
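A minimal sketch of this batch update for univariate linear regression; the learning rate, iteration count, and data are illustrative.

def batch_gradient_descent(xs, ys, alpha=0.01, steps=1000):
    """Fit h_w(x) = w1*x + w0 by repeatedly applying the batch update rule above."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        errors = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w0 += alpha * sum(errors)                        # x_{j,0} = 1 (bias input)
        w1 += alpha * sum(e * x for e, x in zip(errors, xs))
        # i.e. w_i <- w_i + alpha * sum_j x_{j,i} * (y_j - h_w(x_j))
    return w0, w1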
[Plot: the data as points (x1 , x2 ): earthquakes (white), underground explosions (black). Also: hw∗ as a decision boundary x2 = 1.7·x1 − 4.9.]
wi ←− wi + α · (y − hw (x)) · xi
Logistic Regression
For an example (x,y) we compute the partial derivatives: (via chain rule)
∂/∂wi L2 (w) = ∂/∂wi ((y − hw (x))²)
= 2 · (y − hw (x)) · ∂/∂wi (y − hw (x))
= −2 · (y − hw (x)) · l′ (w·x) · ∂/∂wi (w·x)
= −2 · (y − hw (x)) · l′ (w·x) · xi
The derivative of the logistic function satisfies l′ (z) = l(z)(1 − l(z)), thus
Definition 26.7.21. The rule for logistic update (weight update for minimizing the
loss) is
wi ←− wi + α · (y − hw (x)) · hw (x) · (1 − hw (x)) · xi
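A minimal sketch of this update rule for a single example; the names are illustrative, and x[0] = 1 serves as the bias input.

from math import exp

def logistic(z):
    return 1.0 / (1.0 + exp(-z))

def logistic_update(w, x, y, alpha):
    """w_i <- w_i + alpha * (y - h_w(x)) * h_w(x) * (1 - h_w(x)) * x_i."""
    h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    return [wi + alpha * (y - h) * h * (1 - h) * xi for wi, xi in zip(w, x)]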
The goal is to find a hyperplane in Rp that maximally separates the two classes
(i.e. y i = −1 from y i = 1)
Remember A hyperplane can be represented as the set {x | (w·x) + b = 0} for some
vector w and scalar b. (w is orthogonal to the plane, b determines the offset from
the origin)
Theorem 26.8.4 (SVM equation). Let α = argmax_α (∑_j αj − ½ · ∑_{j,k} αj αk y^j y^k (x^j ·x^k ))
under the constraints αj ≥ 0 and ∑_j αj y^j = 0.
The maximum margin separator is given by w = ∑_j αj x^j and b = w·x^i − y^i for any x^i where αi ̸= 0.
Proof sketch: By the duality principle for optimization problems
Important Properties:
The weights αj associated with each data point are zero except at the support vectors.
[Figure: (a) a two-dimensional data set in (x1 , x2 ) coordinates; (b) the same data in the three-dimensional feature space (x1², x2², √2·x1·x2), where it becomes linearly separable.]
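A minimal sketch of the feature map suggested by the axis labels in the figure (an assumption on our part): mapping (x1, x2) to (x1², x2², √2·x1·x2) makes circularly separated data linearly separable, and dot products in that space equal the quadratic kernel (x·x′)² in input space.

from math import sqrt

def feature_map(x1, x2):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2); then phi(x).phi(x') = (x.x')^2."""
    return (x1 * x1, x2 * x2, sqrt(2) * x1 * x2)

# a point inside the unit circle and one outside become separable by the plane z1 + z2 = 1
print(feature_map(0.3, 0.4), feature_map(1.0, 1.2))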
Neural networks
Perceptrons
Multilayer perceptrons
Applications of neural networks
Brains
Axiom 26.9.1 (Neuroscience Hypothesis). Mental activity consists primarily of electrochemical activity in networks of brain cells called neurons.
One approach to Artificial Intelligence is to model and simulate brains. (and hope
that AI comes along naturally)
Definition 26.9.3. The AI subfield of neural networks (also called connectionism,
parallel distributed processing, and neural computation) studies computing systems
inspired by the biological neural networks that constitute brains.
Neural networks are attractive computational devices, since they perform important
AI tasks – most importantly learning and distributed, noise-tolerant computation –
naturally and efficiently.
[Figure: a unit computes inj = ∑_i wi,j · ai (including a bias weight w0,j with fixed input a0 = 1) and outputs aj = g(inj ); units with suitable weights and thresholds implement the Boolean functions AND, OR, and NOT.]
Recurrent neural networks follow largely the same principles as feed-forward networks,
so we will not go into details here.
Single-layer Perceptrons
Definition 26.9.10. A perceptron network is a feed-forward network of perceptron
units. A single layer perceptron network is called a perceptron.
Example 26.9.11. [Figure: a single layer perceptron network – an input layer connected by weights wi,j directly to an output layer – and the output of a two-input unit with logistic activation, plotted as a function of its two inputs.]
a5 = g(w3,5 · a3 + w4,5 · a4 )
= g(w3,5 · g(w1,3 · a1 + w2,3 a2 ) + w4,5 · g(w1,4 · a1 + w2,4 a2 ))
Expressiveness of Perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR (and thus no adders)
Represents a linear separator in input space:
∑_j wj · xj > 0 or W·x > 0
[Figure: the input space over two Boolean inputs – (a) x1 and x2 and (b) x1 or x2 are linearly separable, (c) x1 xor x2 is not.]
Minsky & Papert (1969) pricked the first neural network balloon!
Perceptron Learning
For learning, we update the weights using gradient descent based on the generaliza-
tion loss function.
Let e.g. L(w) = (y − hw (x))2 (the squared error loss).
We compute the gradient:
∂L(w)/∂wj,k = 2 · (yk − hw (x)k ) · ∂(yk − hw (x)k )/∂wj,k = 2 · (yk − hw (x)k ) · ∂/∂wj,k (yk − g(∑_{j=0}^{n} wj,k · xj ))
Multilayer perceptrons
Definition 26.9.13. In multilayer perceptrons (MLPs), layers are usually fully connected; the numbers of hidden units are typically chosen by hand.
Output Layer ai
wi,j
Hidden Layer aj
wi,j
Input Layer ak
Definition 26.9.14. Some MLPs have residual connections, i.e. connections that
skip layers.
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers.
∂L(w)k /∂wi,j = −2 · (yk − hw (x)k ) · g′ (ink ) · ∂ink /∂wi,j   (as before; abbreviate ∆k := (yk − hw (x)k ) · g′ (ink ))
= −2 · ∆k · ∂(∑_ℓ wℓ,k aℓ )/∂wi,j = −2 · ∆k · wj,k · ∂aj /∂wi,j = −2 · ∆k · wj,k · ∂g(inj )/∂wi,j
= −2 · ∆k · wj,k · g′ (inj ) · ai   (abbreviate ∆j,k := ∆k · wj,k · g′ (inj ))
Idea: The total “error” of the hidden node j is the sum of all the connected nodes k
in the next layer
Definition 26.9.15. The back-propagation rule for hidden nodes of a multilayer perceptron is
∆j ← g′ (inj ) · (∑_i wj,i ∆i )
and the update rule for weights in a hidden layer is
wk,j ← wk,j + α · ak · ∆j
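To illustrate the ∆-rules just derived, here is a minimal sketch of one training step for a network with a single hidden layer and logistic activations; the array shapes and names are illustrative.

import numpy as np

def g(z):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W_in, W_out, alpha):
    """One forward/backward pass: W_in maps inputs to hidden, W_out hidden to outputs."""
    in_hidden = W_in @ x;         a_hidden = g(in_hidden)
    in_out    = W_out @ a_hidden; a_out    = g(in_out)
    delta_out    = (y - a_out) * a_out * (1 - a_out)                   # Delta_k = (y_k - a_k) g'(in_k)
    delta_hidden = a_hidden * (1 - a_hidden) * (W_out.T @ delta_out)   # Delta_j = g'(in_j) sum_k w_{j,k} Delta_k
    W_out += alpha * np.outer(delta_out, a_hidden)                     # w_{j,k} += alpha * a_j * Delta_k
    W_in  += alpha * np.outer(delta_hidden, x)                         # w_{i,j} += alpha * a_i * Delta_j
    return W_in, W_out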
Back-Propagation – Properties
Sum gradient updates for all examples in some “batch” and apply gradient descent.
[Plot: prediction accuracy as a function of training set size.]
Experience shows: MLPs are quite good for complex pattern recognition tasks,
but resulting hypotheses cannot be understood easily.
This makes MLPs ineligible for some tasks, such as credit card and loan approvals,
where law requires clear unbiased criteria.
Summary
neural networks can be extremely powerful (hypothesis space intractably complex)
Perceptrons (one-layer networks) insufficiently expressive for most applications
Engineering, cognitive modelling, and neural system modelling subfields have largely
diverged
Drawbacks: take long to converge, require large amounts of data, and are difficult
to interpret (Why is the output what it is?)
For supervised learning, the aim is to find a simple hypothesis that is approximately
consistent with training examples
Decision tree learning using information gain.
Learning performance = prediction accuracy measured on test set
Statistical Learning
What kind of bag is it? What flavour will the next candy be?
Note: Every hypothesis is itself a probability distribution over the random variable
“flavour”.
[Plot: the posterior probabilities P(h1 | d), . . ., P(h5 | d) as a function of the number of observations in d.]
if the observations are IID, i.e. P (d|hi ) = ∏_j P (dj |hi ), and the hypothesis prior is as advertised. (e.g. P (d|h3 ) = 0.5^10 ≈ 0.1%)
The posterior probabilities start with the hypothesis priors, change with data.
[Plot: the Bayesian prediction probability that the next candy is lime, as a function of the number of observations in d.]
; we compute the expected value of the probability of the next candy being lime
over all hypotheses (i.e. distributions).
; “meta-distribution”
where P (d|hi ) is called the likelihood (of the data under each hypothesis) and
P (hi ) the hypothesis prior.
Bayesian predictions use a likelihood-weighted average over the hypotheses:
P(X|d) = ∑_i P(X|d, hi ) · P (hi |d) = ∑_i P(X|hi ) · P (hi |d)
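As an illustration, a minimal sketch of this computation for the candy example; the hypothesis priors and lime proportions below are the commonly used AIMA values and are assumed here, since they are not restated at this point.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # hypothesis prior P(h_i) (assumed)
lime   = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(next candy = lime | h_i) (assumed)

def posterior(num_limes):
    """P(h_i | d) after observing num_limes lime candies in a row (IID likelihood)."""
    unnorm = [p * (l ** num_limes) for p, l in zip(priors, lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(num_limes):
    """Bayesian prediction P(next = lime | d) = sum_i P(lime | h_i) * P(h_i | d)."""
    return sum(l * p for l, p in zip(lime, posterior(num_limes)))

print(posterior(3), predict_lime(3))     # after three limes the prediction is about 0.8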
Definition 27.2.1. For maximum a posteriori learning (MAP learning) choose the
MAP hypothesis hMAP that maximizes P (hi |d).
I.e., maximize P (d|hi ) · P (hi ) or (even better) log2 (P (d|hi )) + log2 (P (hi )).
Predictions made according to a MAP hypothesis hMAP are approximately Bayesian
to the extent that P(X|d) ≈ P(X|hMAP ).
Example 27.2.2. In our candy example, hMAP = h5 after three limes in a row
a MAP learner then predicts that candy 4 is lime with probability 1.
compare with Bayesian prediction of 0.8. (see prediction curves above)
As more data arrive, the MAP and Bayesian predictions become closer, because the
competitors to the MAP hypothesis become less and less probable.
For deterministic hypotheses, P (d|hi ) is 1 if consistent, 0 otherwise
; MAP = simplest consistent hypothesis. (cf. science)
Remark: Finding MAP hypotheses is often much easier than Bayesian learning,
because it requires solving an optimization problem instead of a large summation
(or integration) problem.
Indeed if hypothesis predicts the data exactly – e.g. h5 in candy example – then
log2 (1) = 0 ; preferred hypothesis.
This is more directly modeled by the following approximation to Bayesian learning:
Observation: For large data sets, the prior becomes irrelevant. (we might not
trust it anyways)
Idea: Use this to simplify learning.
Definition 27.2.4. Maximum likelihood learning (ML learning): choose the ML
hypothesis hML maximizing P (d|hi ). (simply get the best fit to the data)
Remark: ML learning = b MAP learning for a uniform prior. (reasonable if all
hypotheses are of the same complexity)
ML learning is the “standard” (non Bayesian) statistical learning method.
[Bayes net: a single node Flavor with parameter P (F = cherry) = θ.]
These are IID observations, so the likelihood is
P (d|hθ ) = ∏_{j=1}^{N} P (dj |hθ ) = θ^c · (1 − θ)^ℓ
The log likelihood is
L(d|hθ ) = log2 (P (d|hθ )) = ∑_{j=1}^{N} log2 (P (dj |hθ )) = c·log2 (θ) + ℓ·log2 (1 − θ)
1. Write down an expression for the likelihood of the data as a function of the
parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero
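Carrying out these three steps for the candy likelihood above gives (a short worked computation):

∂L(d|hθ )/∂θ = c/(θ·ln 2) − ℓ/((1 − θ)·ln 2) = 0   ⟹   c·(1 − θ) = ℓ·θ   ⟹   θ = c/(c + ℓ)

i.e. the maximum likelihood hypothesis sets θ to the observed fraction of cherry candies.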
[Bayes net: Flavor with parameter P (F = cherry) = θ, and a child Wrapper with P (W = red | F = cherry) = θ1 and P (W = red | F = lime) = θ2 .]
[Figure: a linear Gaussian model – data points (x, y) together with the conditional density P(y|x).]
Maximum likelihood learning assumes uniform prior, OK for large data sets:
1. Choose a parameterized family of models to describe the data.
; requires substantial insight and sometimes new models.
2. Write down the likelihood of the data as a function of the parameters.
; may require summing over hidden variables, i.e., inference.
3. Write down the derivative of the log likelihood w.r.t. each parameter.
4. Find the parameter values such that the derivatives are zero.
; may be hard/impossible; modern optimization techniques help.
Naive Bayes models as a fall-back solution for machine learning:
Reinforcement Learning
Unsupervised Learning
So far: We have studied “learning from examples”. (functions, logical theories,
probability models)
Now: How can agents learn “what to do” in the absence of labeled examples of
“what to do”. We call this problem unsupervised learning.
Example 28.1.1 (Playing Chess). Learn transition models for own moves and
maybe predict opponent’s moves.
Problem: The agent needs to have some feedback about what is good/bad
; cannot decide “what to do” otherwise. (recall: external performance standard
for learning agents)
Example 28.1.2. The ultimate feedback in chess is whether you win, lose, or draw.
Definition 28.1.3. We call a learning situation where there are no labeled examples
unsupervised learning and the feedback involved a reward or reinforcement.
Example 28.1.4. In soccer, there are intermediate reinforcements in the shape of
goals, penalties, . . .
In MDPs, the agent has total knowledge about the environment and the reward
function, in reinforcement learning we do not assume this. (;
POMDPs+reward-learning)
Example 28.1.6. You play a game without knowing the rules, and at some time
the opponent shouts you lose!
Passive Learning
Definition 28.2.1 (To keep things simple). Agent uses a state-based represen-
tation in a fully observable environment:
In passive learning, the agent’s policy π is fixed: in state s, it always executes
the action π(s).
Its goal is simply to learn how good the policy is – that is, to learn the utility
function U π (s).
The passive learning task is similar to the policy evaluation task (part of the policy
iteration algorithm) but the agent does not know
the transition model P (s′ |s, a), which specifies the probability of reaching state
s′ from state s after doing action a,
the reward function R(s), which specifies the reward for each state.
Remember that π∗_s is a policy, so it recommends an action for every state; its connection with s in particular is that it's an optimal policy when s is the starting state. A remarkable consequence of using discounted utilities with infinite horizons is that the optimal policy is independent of the starting state. (Of course, the action sequence won't be independent; remember that a policy is a function specifying an action for each state.) This fact seems intuitively obvious: if policy π∗_a is optimal starting in a and policy π∗_b is optimal starting in b, then, when they reach a third state c, there's no good reason for them to disagree with each other, or with π∗_c, about what to do next. So we can simply write π∗ for an optimal policy.
Given this definition, the true utility of a state is just U^{π∗}(s) – that is, the expected sum of discounted rewards if the agent executes an optimal policy. We write this as U (s), matching the notation used in Chapter 16 for the utility of an outcome. Notice that U (s) and R(s) are quite different quantities; R(s) is the "short term" reward for being in s, whereas U (s) is the "long term" total reward from s onward. Figure 17.3 shows the utilities for the 4 × 3 world. Notice that the utilities are higher for states closer to the +1 exit, because fewer steps are required to reach the exit.

Passive Learning by Example

Example 28.2.2 (Passive Learning). We use the 4 × 3 world introduced above.
[Figure: (a) an optimal policy π for the 4 × 3 world; (b) the utilities of the states given π (e.g. 0.812, 0.868, 0.918 in the top row), calculated with γ = 1 and R(s) = −0.04 for nonterminal states.]
The agent executes a set of trials in the environment using its policy π. In each trial, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward received in that state.

Example 28.2.3. Typical trials might look like this:
1. (1,1)−0.04 ⇝ (1,2)−0.04 ⇝ (1,3)−0.04 ⇝ (1,2)−0.04 ⇝ (1,3)−0.04 ⇝ (2,3)−0.04 ⇝ (3,3)−0.04 ⇝ (4,3)+1
2. (1,1)−0.04 ⇝ (1,2)−0.04 ⇝ (1,3)−0.04 ⇝ (2,3)−0.04 ⇝ (3,3)−0.04 ⇝ (3,2)−0.04 ⇝ (3,3)−0.04 ⇝ (4,3)+1
3. (1,1)−0.04 ⇝ (2,1)−0.04 ⇝ (3,1)−0.04 ⇝ (3,2)−0.04 ⇝ (4,2)−1 .

Definition 28.2.4. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed:

U^π(s) := E[∑_{t=0}^{∞} γ^t R(S_t)]

where R(s) is the reward for a state, S_t (a random variable) is the state reached at time t when executing policy π, and S_0 = s. (for 4 × 3 we take the discount factor γ = 1)
Idea: Each trial provides a sample of the reward to go for each state visited.
Example 28.2.6. The first trial in ?? provides a sample total reward of 0.72 for
state (1,1), two samples of 0.76 and 0.84 for (1,2), two samples of 0.80 and 0.88
for (1,3), . . .
Definition 28.2.7. The direct utility estimation algorithm cycles over trials, cal-
culates the reward to go for each state, and updates the estimated utility for that
state by keeping the running average for that for each state in a table.
Observation 28.2.8. In the limit, the sample average will converge to the true
expectation (utility) from ??.
Remark 28.2.9. Direct utility estimation is just supervised learning, where each
example has the state as input and the observed reward to go as output.
Upshot: We have reduced reinforcement learning to an inductive learning problem.
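A minimal sketch of direct utility estimation (Definition 28.2.7) over a set of recorded trials; each trial is assumed to be a list of (state, reward) pairs as in Example 28.2.3, and all names are illustrative.

def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward to go per state over all trials."""
    totals, counts = {}, {}
    for trial in trials:
        rewards = [r for (_, r) in trial]
        for i, (s, _) in enumerate(trial):
            # reward to go from position i: discounted sum of the remaining rewards
            togo = sum(gamma ** k * r for k, r in enumerate(rewards[i:]))
            totals[s] = totals.get(s, 0.0) + togo
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}

# e.g. the first trial in Example 28.2.3 yields the sample 0.72 for state (1,1)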
The utility of each state equals its own reward plus the expected utility of its
successor states.
So: The utility values obey a Bellman equation for a fixed policy π.
U^π(s) = R(s) + γ · (∑_{s′} P (s′ |s, π(s)) · U^π(s′ ))
But direct utility estimation learns nothing until the end of the trial.
Intuition: Direct utility estimation searches for U in a hypothesis space that is too large ⇝ it contains many functions that violate the Bellman equations.
As above: These equations are linear (no maximization involved), so they can be solved with any linear algebra package.
Observation 28.2.12. Learning the model itself is easy, because the environment
is fully observable.
U := POLICY−EVALUATION(π,mdp)
if s′ .TERMINAL? then s, a := null else s, a := s′ , π[s′ ]
return a
POLICY−EVALUATION computes U^π(s) := E[∑_{t=0}^{∞} γ^t R(s_t)] in an MDP.
Note the large changes occurring around the 78th trial – this is the first time that
the agent falls into the -1 terminal state at (4,2).
Observation 28.2.17. The ADP agent is limited only by its ability to learn the
transition model. (intractable for large state spaces)
The agent follows the optimal policy for the learned model at each step.
It does not learn the true utilities or the true optimal policy!
instead, in the 39th trial, it finds a policy that reaches the +1 reward along the
lower route via (2,1), (3,1), (3,2), and (3,3).
After experimenting with minor variations, from the 276th trial onward it sticks
to that policy, never learning the utilities of the other states and never finding
the optimal route via (1,2), (1,3), and (2,3).
Idea: Actions do more than provide rewards according to the learned model
they also contribute to learning the true model by affecting the percepts received.
By improving the model, the agent may reap greater rewards in the future.
Pure exploitation risks getting stuck in a rut. Pure exploration to improve one’s
knowledge is of no use if one never puts that knowledge into practice.
Compare with the information gathering agent from ??.
Knowledge in Learning
The classification is given by the goal predicate WillWait, in this case WillWait(X 1 )
or ¬WillWait(X 1 ).
can be represented as
Method: Construct a disjunction of all the paths from the root to the positive
leaves interpreted as conjunctions of the attributes on the path.
Note: The equivalence takes care of positive and negative examples.
Cumulative Development
Example 29.1.6. Learning from very few examples using background knowledge:
1. Caveman Zog and the fish on a stick:
[Diagram: Prior Knowledge and Observations feed into logic-based inductive learning, which produces Hypotheses and Predictions.]
Example 29.2.1. Inferring disease D from the symptoms is not enough to explain
the prescription of medicine M .
Need a new general rule: M is effective against D (induction from example)
Definition 29.2.2. Knowledge based inductive learning (KBIL) replaces the expla-
nation constraint by the KBIL constraint:
Offers complete algorithms for inducing general, first-order theories from examples.
29.2.1 An Example
A Video Nugget covering this subsection can be found at https://fau.tv/clip/id/30396.
ILP: An example
General knowledge-based induction problem
[Figure: family tree (George, Mum, and relatives) used in the example.]
Example
Descriptions include facts like
Background knowledge
Observation: A little bit of background knowledge helps a lot.
Example 29.2.6. If the background knowledge contains
FOIL
function Foil(examples,target) returns a set of Horn clauses
inputs: examples, set of examples
target, a literal for the goal predicate
local variables: clauses, set of clauses, initially empty
while examples contains positive examples do
clause := New−Clause(examples,target)
remove examples covered by clause from examples
add clause to clauses
return clauses
FOIL
function New−Clause(examples,target) returns a Horn clause
local variables: clause, a clause with target as head and an empty body
l, a literal to be added to the clause
extendedExamples, a set of examples with values for new variables
extendedExamples := examples
while extendedExamples contains negative examples do
l := Choose−Literal(New−Literals(clause),extendedExamples)
append l to the body of clause
extendedExamples := map Extend−Example over extendedExamples
return clause
function Extend−Example(example,literal) returns a new example
if example satisfies literal
then return the set of examples created by extending example with each
possible constant value for each new variable in literal
else return the empty set
function New−Literals(clause) returns a set of possibly ‘‘useful’’ literals
function Choose−Literal(literals) returns the ‘‘best’’ literal from literals
father(x, z) ⇒ grandfather(x, y)
Add literals using predicates
Negated or unnegated
Use any existing predicate (including the goal)
Arguments must be variables
Each literal must include at least one variable from an earlier literal or from the
head of the clause
Valid: Mother(z, u), Married(z, z), grandfather(v, x)
Invalid: Married(u, v)
Inverse Resolution
Definition 29.2.9. Inverse resolution in a nutshell
Classifications follow from Background ∧ Hypothesis ∧ Descriptions.
This can be proven by resolution.
Run the proof backwards to find hypothesis.
Problem: How to run the resolution proof backwards?
Recap: In ordinary resolution we take two clauses C1 = L ∨ R1 and C2 = ¬L ∨ R2
and resolve them to produce the resolvent C = R1 ∨ R2 .
[Figure: an inverse resolution proof tree with the substitutions [George/x], [Elisabeth/z], and [Anne/y], ending in the empty clause {}.]
Can inverse resolution infer the law of gravity from examples of falling bodies?
Yes, given suitable background mathematics!
Monkey and typewriter problem: How to overcome the large branching factor and
the lack of structure in the search space?
Applications of ILP
ILP systems have outperformed knowledge-free methods in a number of domains.
Molecular biology: the GOLEM system has been able to generate high-quality
predictions of protein structures and the therapeutic efficacy of various drugs.
Natural Language
In other words: the language you use all day long, e.g. English, German, . . .
Why should we care about natural language?
Even more so than thinking, language is a skill that only humans have.
It is a miracle that we can express complex thoughts in a sentence in a matter
of seconds.
It is no less miraculous that a child can learn tens of thousands of words and
complex syntax in a matter of a few years.
Language Technology
Language Assistance:
written language: Spell/grammar/style-checking,
spoken language: dictation systems and screen readers,
multilingual text: machine-supported text and dialog translation, eLearning.
Information management:
Psychology/Cognition: Semantics ≙ “what is in our brains” (⇝ mental models)
Mathematics has driven much of modern logic in the quest for foundations.
Logic as “foundation of mathematics” solved as far as possible
In daily practice syntax and semantics are not differentiated (much).
A good probe into the issues involved in natural language understanding is to look at translations
between natural language utterances – a task that arguably involves understanding the utterances
first.
Example 30.2.2. Wirf der Kuh das Heu über den Zaun. ̸⇝ Throw the cow the
hay over the fence. (differing grammar; Google Translate)
Example 30.2.3. Grammar is not the only problem:
Der Geist ist willig, aber das Fleisch ist schwach! (The spirit is willing, but the flesh is weak!)
Der Schnaps ist gut, aber der Braten ist verkocht! (The schnapps is good, but the roast is overcooked!)
Observation 30.2.4. We have to understand the meaning for high-quality trans-
lation!
If it is indeed the meaning of natural language that we are after, we should look further into how the form of the
utterances and their meaning interact.
For questions/answers, it would be very useful to find out what words (sentences/texts)
mean.
Definition 30.2.6. Interpretation of natural language utterances: three problems
[Diagram: the three problems in interpreting natural language utterances.]
Let us support the last claim with a couple of initial examples. We will come back to these phenomena
again and again over the course of the lecture and study them in detail.
But there are other phenomena that we need to take into account when computing the meaning of
NL utterances.
[Diagram: Utterance → (Grammar, Lexicon) → meaning of utterance → (Inference, World knowledge) → relevant information of utterance.]
We will look at another example that shows that the situation with semantic/pragmatic analysis
is even more complex than we thought. Understanding this is one of the prime objectives of the
AI-2 lecture.
[Diagram: Utterance → (Grammar, Lexicon) → utterance-specific semantic potential → (Inference, World knowledge) → relevant information/meaning of utterance.]
?? is also a very good example for the claim ?? that even for high-quality (machine) translation
we need semantics.
Logical analysis vs. conceptual analysis: These examples – mostly borrowed from Davidson
[Dav67] – help us to see the difference between “logical analysis” and “conceptual analysis”.
We observed that from This is a big diamond. we cannot conclude This is big. Now consider the
sentence Jane is a beautiful dancer. Similarly, it does not follow from this that Jane is beautiful,
but only that she dances beautifully. Now, what it is to be beautiful or to be a beautiful dancer
is a complicated matter. To say what these things are is a problem of conceptual analysis. The
job of semantics is to uncover the logical form of these sentences. Semantics should tell us that
the two sentences have the same logical forms; and ensure that these logical forms make the right
predictions about the entailments and truth conditions of the sentences, specifically, that they
don’t entail that the object is big or that Jane is beautiful. But our semantics should provide a
distinct logical form for sentences of the type: This is a fake diamond. From which it follows that
the thing is fake, but not that it is a diamond.
One way to think about the examples of ambiguity on the previous slide is that they illustrate a
certain kind of indeterminacy in sentence meaning. But really what is indeterminate here is what
sentence is represented by the physical realization (the written sentence or the phonetic string).
The symbol duck just happens to be associated with two different things, the noun and the verb.
Figuring out how to interpret the sentence is a matter of deciding which item to select. Similarly
for the syntactic ambiguity represented by PP attachment. Once you, as interpreter, have selected
one of the options, the interpretation is actually fixed. (This doesn’t mean, by the way, that as
an interpreter you necessarily do select a particular one of the options, just that you can.) A
brief digression: Notice that this discussion is in part a discussion about compositionality, and
gives us an idea of what a non-compositional account of meaning could look like. The Radical
Pragmatic View is a non-compositional view: it allows the information content of a sentence to
be fixed by something that has no linguistic reflex.
To help clarify what is meant by compositionality, let me just mention a couple of other ways
in which a semantic account could fail to be compositional.
• Suppose your syntactic theory tells you that S has the structure [a[bc]] but your semantics
computes the meaning of S by first combining the meanings of a and b and then combining the
result with the meaning of c. This is non-compositional.
• Recall the difference between:
1. Jane knows that George was late.
2. Jane believes that George was late.
Sentence 1. entails that George was late; sentence 2. doesn’t. We might try to account for
this by saying that in the environment of the verb believe, a clause doesn’t mean what it
usually means, but something else instead. Then the clause that George was late is assumed
to contribute different things to the informational content of different sentences. This is a
non-compositional account.
Example 30.3.4. Every man loves a woman. (Keira Knightley or his mother!)
Example 30.3.5. Every car has a radio. (only one reading!)
Example 30.3.6. Some student in every course sleeps in every class at least
some of the time. (how many readings?)
Example 30.3.7. The president of the US is having an affair with an intern.
(2002 or 2000?)
Example 30.3.8. Everyone is here. (who is everyone?)
Observation: If we look at the first sentence, then we see that it has two readings:
1. there is one woman who is loved by every man.
2. for each man there is one woman whom that man loves.
These correspond to distinct situations (or possible worlds) that make the sentence true.
Observation: For the second example we only get one reading: the analogue of 2. The reason
for this lies not in the logical structure of the sentence, but in concepts involved. We interpret
the meaning of the word has as the relation “has as physical part”, which in our world carries a
certain uniqueness condition: If a is a physical part of b, then it cannot be a physical part of c,
unless b is a physical part of c or vice versa. This makes the structurally possible analogue to 1.
impossible in our world and we discard it.
Observation: In the examples above, we have seen that (in the worst case) we can have one
reading for every ordering of the quantificational phrases in the sentence. So in the third example,
where we have four of them, we would get 4! = 24 readings. It should be clear from introspection that
we (humans) do not entertain 24 readings when we understand and process this sentence. Our
models should account for such effects as well.
Context and Interpretation: It appears that the last two sentences have different informational
content on different occasions of use. Suppose I say Everyone is here. at the beginning of class.
Then I mean that everyone who is meant to be in the class is here. Suppose I say it later in the
day at a meeting; then I mean that everyone who is meant to be at the meeting is here. What
shall we say about this? Here are three different kinds of solution:
Radical Semantic View On every occasion of use, the sentence literally means that everyone
in the world is here, and so is strictly speaking false. An interpreter recognizes that the speaker
has said something false, and uses general principles to figure out what the speaker actually
meant.
Radical Pragmatic View What the semantics provides is in some sense incomplete. What the
sentence means is determined in part by the context of utterance and the speaker’s intentions.
The differences in meaning are entirely due to extra-linguistic facts which have no linguistic
reflex.
The Intermediate View The logical form of sentences with the quantifier every contains a slot
for information which is contributed by the context. So extra-linguistic information is required
to fix the meaning; but the contribution of this information is mediated by linguistic form.
We now come to a phenomenon of natural language that is a paradigmatic challenge for pragmatic
analysis: anaphora – the practice of replacing a (complex) reference with a mere pronoun.
Anaphora challenge pragmatic analysis, since they can only be resolved from the
context using world knowledge.
Anaphora are also interesting for pragmatic analysis, since they introduce (often initially massive
amounts of) ambiguity that needs to be taken care of in the language understanding process.
We now come to another challenge to pragmatic analysis: presuppositions. Instead of just being
subject to the context of the readers/hearers like anaphora, they even have the potential to change
the context itself or even affect their world knowledge.
Remark 30.4.2. Natural languages like English, German, or Spanish are not.
Example 30.4.3. Let us look at concrete examples
Not to be invited is sad! (definitely English)
To not be invited is sad! (controversial)
Definition 30.4.5. A text corpus (or simply corpus; plural corpora) is a large and
structured collection of natural language texts called documents.
Definition 30.4.6. In corpus linguistics, corpora are used to do statistical analysis
and hypothesis testing, checking occurrences or validating linguistic rules within a
specific natural language.
Thus, a trigram model for a language with 100 characters, P(c_i | c_{i−2:i−1}), has
100^3 = 1,000,000 entries. It can be estimated from a corpus with 10^7 characters.
ℓ* = argmax_ℓ P(ℓ | c_{1:N})
   = argmax_ℓ P(ℓ) · P(c_{1:N} | ℓ)
   = argmax_ℓ P(ℓ) · ∏_{i=1}^{N} P(c_i | c_{i−2:i−1}, ℓ)
The prior probability P(ℓ) can be estimated; it is not a critical factor, since the
trigram language models are extremely sensitive.
Remark 30.4.13. While many features help make this classification, counts of
punctuation and other character n-gram features go a long way [KNS97].
Definition 30.4.14. Named entity recognition (NER) is the task of finding names
of things in a document and deciding what class they belong to.
Example 30.4.15. In Mr. Sopersteen was prescribed aciphex. NER should
recognize that Mr. Sopersteen is the name of a person and aciphex is the name of
a drug.
Remark 30.4.16. Character-level language models are good for this task because
they can associate the character sequence ex with a drug name and steen with a
person name, and thereby identify words that they have never seen before.
Remark 30.4.18. OOV words are usually content words such as names and locations
which contain information crucial to the success of NLP tasks.
Idea: Model OOV words by
1. adding a new word token, e.g. <UNK> to the vocabulary,
2. in the training corpus, replacing the respective first occurrence of a previously
unknown word by <UNK>,
3. counting n-grams as usual, treating <UNK> as a regular word.
This trick can be refined if we have a word classifier, then use a new token per class,
e.g. <EMAIL> or <NUM>.
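As an illustration of step 2, the following small Python sketch (the helper name replace_first_occurrences is ours) replaces the first occurrence of every previously unseen word by <UNK>:

# Sketch of the <UNK> trick: replace the first occurrence of each previously
# unseen word in the training corpus by <UNK>, then count n-grams as usual.
def replace_first_occurrences(tokens, unk="<UNK>"):
    seen, out = set(), []
    for w in tokens:
        if w in seen:
            out.append(w)
        else:
            seen.add(w)
            out.append(unk)            # first occurrence becomes the unknown-word token
    return out

corpus = "the cat sat on the mat the cat".split()
print(replace_first_occurrences(corpus))
# ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat']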
Example 30.4.19 (Test n-grams). Build unigram, bigram, and trigram language
models over the words of [RN03], then randomly sample sequences from the models.
1. Unigram: logical are as are confusion a may right tries agent goal the was . . .
2. Bigram: systems are very similar computational approach would be represented . . .
3. Trigram: planning and scheduling are integrated the success of naive bayes model . . .
Clearly there are differences; how can we measure them to evaluate the models?
Definition 30.4.20. The perplexity of a sequence c1:N is defined as
Perplexity(c_{1:N}) := P(c_{1:N})^{−(1/N)}
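To make the definition concrete, here is a small Python sketch that computes the perplexity of a character sequence; for simplicity it uses a unigram character model with add-one smoothing rather than the trigram models discussed above, and the corpus is a stand-in:

# Sketch: perplexity of a character sequence, Perplexity(c_{1:N}) = P(c_{1:N})^(-1/N),
# under a simple unigram character model estimated from a toy corpus.
from collections import Counter
import math

corpus = "the quick brown fox jumps over the lazy dog"
counts = Counter(corpus)
total = sum(counts.values())

def p_char(c):                          # unigram probability with add-one smoothing
    return (counts[c] + 1) / (total + len(counts) + 1)

def perplexity(seq):
    log_p = sum(math.log(p_char(c)) for c in seq)
    return math.exp(-log_p / len(seq))

print(perplexity("the lazy fox"))       # lower perplexity = better fit to the model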
However: Native speakers will tell you that a black cat matches a familiar
pattern: article-adjective-noun, while cat black a does not!
Example 30.5.1. Consider the fulvous kitten: a native speaker reasons that it
follows the determiner-adjective-noun pattern; fulvous (≙ brownish yellow) ends in ous ⇝ adjective.
So by generalization this is (probably) correct English.
Observation: The order of syntactical categories of words plays a role in English!
Problem: How can we compute them? (up next)
Part-of-Speech Tagging
Definition 30.5.2. Part-of-speech tagging (also POS tagging, POST, or grammatical tagging) is the process of marking up a word in a corpus with tags (called POS tags) corresponding to its part of speech.
Example 30.5.4. In text-to-speech synthesis, a POS tag of “noun” for record helps
determine the correct pronunciation (as opposed to the tag “verb”)
the HMM does not consider context other than the current state (Markov
property)
it does not have any idea what the sentence is trying to convey
Idea: Use the Viterbi algorithm to find the most probable sequence of hidden
states (POS tags)
POS taggers based on the Viterbi algorithm can reach an F1 score of up to 97%.
For the sensor model P (Wt = would|Ct = M D) = 0.1 means that if we choose a
modal verb, we will choose would 10% of the time.
These numbers also come from the corpus with appropriate smoothing.
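The following Python sketch shows the Viterbi computation on a toy HMM; the tag set, transition, and sensor probabilities are made-up illustrative values in the spirit of the example above:

# Viterbi sketch for HMM POS tagging.  Tags, words, and all probabilities below
# are toy values (cf. P(would|MD) = 0.1 in the example).
tags = ["MD", "VB", "PRP"]                       # modal, verb, pronoun
trans = {("PRP","MD"): 0.4, ("MD","VB"): 0.8, ("PRP","VB"): 0.3,
         ("MD","MD"): 0.05, ("VB","VB"): 0.1, ("VB","MD"): 0.1,
         ("PRP","PRP"): 0.05, ("MD","PRP"): 0.1, ("VB","PRP"): 0.4}
sensor = {("I","PRP"): 0.3, ("would","MD"): 0.1, ("go","VB"): 0.05}
prior = {"PRP": 0.4, "MD": 0.3, "VB": 0.3}

def viterbi(words):
    V = [{t: prior[t] * sensor.get((words[0], t), 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: V[-1][s] * trans.get((s, t), 1e-6))
            col[t] = V[-1][best] * trans.get((best, t), 1e-6) * sensor.get((w, t), 1e-6)
            ptr[t] = best
        V.append(col); back.append(ptr)
    seq = [max(tags, key=lambda t: V[-1][t])]    # best final tag, then follow back-pointers
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

print(viterbi(["I", "would", "go"]))             # expected: ['PRP', 'MD', 'VB']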
Limitations: HMM models only know about the transition and sensor models
In particular, we cannot take into account that e.g. words ending in ous are likely
adjectives.
We will see methods based on neural networks later.
Spam Detection
Definition 30.6.5. Spam detection – classifying an email message as spam or ham
(i.e. non-spam)
General Idea: Use NLP/machine learning techniques to learn the categories.
where P (c) is estimated just by counting the total number of spam and ham mes-
sages.
This approach works well for spam detection, just as it did for language identifi-
cation.
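A minimal Python sketch of such a Naive Bayes spam classifier – with a made-up four-message training set, estimating P(c) from class counts and P(w|c) from smoothed word counts – might look like this:

# Minimal Naive Bayes spam/ham sketch: P(c) from class counts, P(w|c) from word
# counts with add-one smoothing.  The tiny training set is made up.
from collections import Counter
import math

train = [("buy cheap pills now", "spam"), ("cheap money now", "spam"),
         ("meeting agenda for tomorrow", "ham"), ("lunch tomorrow?", "ham")]

class_counts = Counter(c for _, c in train)
word_counts = {c: Counter() for c in class_counts}
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for wc in word_counts.values() for w in wc}

def classify(text):
    scores = {}
    for c in class_counts:
        log_p = math.log(class_counts[c] / len(train))          # prior P(c)
        total = sum(word_counts[c].values())
        for w in text.split():
            log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(classify("cheap pills tomorrow"))   # 'spam' on this toy data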
Information Retrieval
Definition 30.7.3. Information retrieval (IR) deals with the representation, orga-
nization, storage, and maintenance of information objects that provide users with
easy access to the relevant information and satisfy their various information needs.
Observation (Hjørland 1997): Information need is closely related to relevance:
If something is relevant for a person in relation to a given task, we might say that
the person needs the information for that task.
Definition 30.7.4. Relevance denotes how well an information object meets the
information need of the user. Relevance may include concerns such as timeliness,
authority or novelty of the object.
Idea: Query and document are similar, iff the angle between their word frequency
vectors is small.
[Figure: document vectors D1 = (t_{1,1}, t_{1,2}, t_{1,3}) and D2 = (t_{2,1}, t_{2,2}, t_{2,3}) in the space spanned by term 1, term 2, and term 3.]
Lemma 30.7.10 (Euclidean Dot Product Formula). A·B = ∥A∥2 ∥B∥2 cos θ,
where θ is the angle between A and B.
Definition 30.7.11. The cosine similarity of A and B is cos θ = (A·B)/(∥A∥2 ∥B∥2).
Idea: Use the tfidf-vector with cosine similarity for information retrieval instead.
Definition 30.7.16. Let D be a document collection with vocabulary V =
{t1 , . . ., t|V | }, then the tfidf-vector tfidf(d, D) ∈ N|V | is defined by tfidf(d, D)i :=
tfidf(ti , d, D).
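Putting the last two definitions together, the following Python sketch builds tfidf vectors for a toy document collection and ranks the documents by cosine similarity to a query (the documents are made up):

# Sketch of tf-idf vectors and cosine similarity over a toy document collection,
# following tfidf(t,d,D) = tf(t,d) * log10(|D| / |{d in D | t in d}|).
import math
import numpy as np

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
D = [d.split() for d in docs]
vocab = sorted({t for d in D for t in d})

def tfidf_vector(d):
    vec = []
    for t in vocab:
        tf = d.count(t)
        df = sum(1 for doc in D if t in doc)      # document frequency of term t
        vec.append(tf * math.log10(len(D) / df) if df else 0.0)
    return np.array(vec)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = "the cat".split()
sims = [cosine(tfidf_vector(query), tfidf_vector(d)) for d in D]
print(sims)   # the first two documents should score higher than the third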
TF-IDF Example
Once an answer set has been determined, the results have to be sorted, so that they can be
presented to the user. As the user has a limited attention span – users will look at most at three
to eight results before refining a query – it is important to rank the results, so that the hits that
contain information relevant to the user's information need come early. This is a very difficult problem,
as it involves guessing the intentions and information context of users, to which the search engine
has no access.
Problem: There are many hits, and we need to sort them (e.g. by importance).
Idea: A web site is important, . . . if many other web sites hyperlink to it.
Definition 30.7.17. Let A be a web page that is hyperlinked from web pages
S1, . . ., Sn; then the page rank PR of A is defined as
PR(A) = (1 − d) + d · (PR(S1)/C(S1) + · · · + PR(Sn)/C(Sn))
where C(Si) is the number of links going out of Si and d is a damping factor.
Getting the ranking right is a determining factor for the success of a search engine. In fact, the early
success of Google was based on the pagerank algorithm discussed above (and the fact that they figured
out a revenue stream using text ads to monetize searches).
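The page rank equation can be solved by simple iteration; the following Python sketch does this for a small made-up link graph, using the customary damping factor d = 0.85 (an assumption, the text does not fix d):

# Sketch: iterating PR(A) = (1-d) + d * sum(PR(S)/C(S)) over a small made-up
# link graph until it converges.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}   # page -> outgoing links
d = 0.85
pr = {p: 1.0 for p in links}

for _ in range(50):
    new = {}
    for page in links:
        incoming = [q for q in links if page in links[q]]
        new[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    pr = new

print({p: round(v, 3) for p, v in sorted(pr.items())})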
Example 30.8.2. Extracting instances of addresses from web pages, with attributes
for street, city, state, and zip code;
Example 30.8.3. Extracting instances of storms from weather reports, with at-
tributes for temperature, wind speed, and precipitation.
Example 30.8.8. For List price $99.00, special sale price $78.00, shipping $3.00,
take the lowest price that is within 50% of the highest price ⇝ $78.00.
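This heuristic is easy to express with a regular expression; the following Python sketch implements it for the example sentence:

# Sketch of the price-extraction heuristic: find all dollar amounts and take the
# lowest price that is within 50% of the highest one.
import re

text = "List price $99.00, special sale price $78.00, shipping $3.00."
prices = [float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", text)]
candidates = [p for p in prices if p >= 0.5 * max(prices)]
print(min(candidates))   # 78.0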
Course Intent: Groom students for bachelor/master theses and as KWARC re-
search assistants.
In this chapter, we explore this idea, using – and extending – the methods from ??.
Overview:
1. Word embeddings
2. Recurrent neural networks for NLP
3. Sequence-to-sequence models
4. Transformer Architecture
5. Pretraining and transfer learning.
Word Embeddings
Problem: For ML methods in NLP, we need numerical data. (not words)
Idea: Embed words or word sequences into real vector spaces.
One hot word embeddings are rarely used for actual tasks, but often used as a
starting point for better word embeddings.
Example 31.1.3 (Vector Space Methods in Information Retrieval).
Word frequency vectors are induced by adding up one hot word embeddings.
Example 31.1.4. Given a corpus D – the context – the tf-idf word embedding
is given by tfidf(t, d, D) := tf(t, d) · log10(|D| / |{d ∈ D | t ∈ d}|), where tf(t, d) is the term
frequency of word t in document d.
Intuition behind these two: Words that occur in similar documents are similar.
Word2Vec
Idea: Use feature extraction to map words to vectors in RN :
Train a neural network on a “dummy task”, throw away the output layer, use the
previous layer’s output (of size N ) as the word embedding
First Attempt: Dimensionality Reduction: Train to predict the original one hot
vector:
For a vocabulary size V , train a network with a single hidden layer; i.e. three layers
of sizes (V, N, V ). The first two layers will compute our embeddings.
Feed the one hot encoded input word into the network, and train it on the one hot
vector itself, using a softmax activation function at the output layer. (softmax
normalizes a vector into a probability distribution)
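To illustrate the shapes involved in this “dummy task”, here is a Python/numpy sketch of the (V, N, V) network's forward pass with random weights; no training loop is shown, and the sizes are made up:

# Sketch of the (V, N, V) "dummy task" network: one-hot input, hidden layer of
# size N (the future embedding), softmax output.  Weights are random; this only
# illustrates the shapes involved.
import numpy as np

V, N = 10, 4                          # vocabulary size, embedding size
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))          # rows of W1 become the word embeddings
W2 = rng.normal(size=(N, V))

def forward(word_index):
    x = np.zeros(V); x[word_index] = 1.0      # one-hot input
    h = x @ W1                                # hidden layer = embedding of the word
    y = np.exp(h @ W2); y /= y.sum()          # softmax over the vocabulary
    return h, y

embedding, probs = forward(3)
print(embedding.shape, probs.shape, probs.sum())   # (4,) (10,) 1.0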
Properties
Vector embeddings like CBOW have interesting properties:
Similarity: Using e.g. cosine similarity (cos θ = (A·B)/(∥A∥∥B∥)) to compare vectors, we can
find words with similar meanings.
Semantic and syntactic relationships emerge as arithmetic relations (e.g. king − man + woman ≈ queen):
Word2vec: the original system that established the concept (see above)
GloVe (Global Vectors)
fastText (embeddings for 157 languages)
But we can also train our own word embedding (together with main task) (up
next)
a past participle
an adjective
a noun.
If a nearby temporal adverb refers to the past ⇝ this occurrence may be a past tense
verb.
Note: CBOW treats all context words identically regardless of order, but in POS
tagging the exact positions of the words matter.
POS/Embedding Network
Idea: Start with a random (or pretrained) embedding of the words in the corpus and
just concatenate them over some context window size
[Figure 24.3: Feedforward part-of-speech tagging model. The model takes a 5-word window as input and predicts the tag of the word in the middle – here, cut. It is able to account for word position because each of the 5 input embeddings is multiplied by a different part of the first hidden layer; the parameter values for the word embeddings and for the three layers are all learned simultaneously during training.]
Layer 1 has (in this case) 5 · N inputs; the output layer is one hot over the POS classes.
We look up the embedding for each word and concatenate the embedding vectors. The result is a
real-valued input vector of length 5 · N. Even though a given word will have the
same embedding vector whether it occurs in the first position, the last, or
somewhere in between, each embedding will be multiplied by a different part of the first hidden layer.
The embedding layers treat all words the same, but the first hidden layer will treat
them differently depending on the position.
The embeddings will be finetuned for the POS task during training.
Note: Better positional encoding techniques exist (e.g. sinusoidal), but for fixed small
context window sizes, this works well.
Example 31.2.1. In the sentence Eduardo told me that Miguel was very sick so
I took him to the hospital, the pronoun him refers to Miguel and not Eduardo.
(14 words of context)
Observation: Language models with n-grams or n-word feed-forward networks
have problems:
Either the context is too small or the model has too many parameters! (or both)
Observation: Feed-forward networks N also have the problem of asymmetry:
whatever N learns about a word w at position n, it has to relearn about w at
position m ̸= n.
Idea: What about recurrent neural networks – nets with cycles? (up next)
Intuition: RNNs are a bit like HMMs and dynamic Bayesian networks: they make a Markov assumption – the hidden state z suffices to capture the input from all previous inputs.

[Figure: (a) schematic diagram of an RNN whose hidden layer has recurrent connections; each input is the word embedding vector of the next word in the sentence, and each output is the output for that time step. (b) The same network unrolled over three timesteps to create a feedforward network; the weights are shared across all timesteps.]

The hidden layer has access to both the current input word and the previous hidden state, which means that information about any word in the input can be copied over (or modified as appropriate) from one time step to the next indefinitely. Of course, there is a limited amount of storage in z, so it cannot remember everything about all the previous words.

Training RNNs for NLP
Problem: The weight matrices Wx,z, Wz,z, and Wz,y are shared over all time steps.
Definition 31.2.4. The back-propagation through time algorithm carefully maintains the identity of the shared weights while back-propagating through the unrolled network.
For a language model we are interested in doing multiclass classification: the classes are the words of the vocabulary, so the output will be a softmax probability distribution over the possible values of the next word in the sentence.

The RNN architecture solves the problem of too many parameters: the number of parameters in the weight matrices Wx,z, Wz,z, and Wz,y stays constant, regardless of the number of words; this is in contrast to feedforward networks and n-gram models, whose parameter count grows with the context size and the size of the vocabulary. The RNN architecture also solves the problem of asymmetry, because the weights are the same for every word position.

For tagging or coreference tasks the only difference is that the training data will require labels – part of speech tags or reference indications. That makes it much harder to collect the data than for the case of a language model, where unlabelled text is all we need. Moreover, in a language model we want to predict the n-th word given the previous words, but for classification there is no reason we should limit ourselves to looking at only the previous words: it can be very helpful to look ahead in the sentence. In our coreference example, the referent him would be different if the sentence concluded “to see Miguel” rather than “to the hospital”, so looking ahead is crucial. We know from eye-tracking experiments that human readers do not go strictly left-to-right.

Bidirectional RNN for more Context

Observation: RNNs only take left context – i.e. words before – into account, but we may also need right context – the words after.

Example 31.2.5. For Eduardo told me that Miguel was very sick so I took him to the hospital the pronoun him resolves to Miguel with high probability. If the sentence ended with to see Miguel, then it should be Eduardo.
Example 31.2.7. Bidirectional RNNs can be used for POS tagging, extending the network from ??.
[Figure 24.5: a bidirectional RNN for POS tagging.]
LSTM: Idea
Introduce a memory vector c in addition to the recurrent (short-term memory) vector
z
c is essentially copied from the previous time step, but can be modified by the forget
gate f , the input gate i, and the output gate o.
the forget gate f decides which components of c to retain or discard
the input gate i decides which components of the current input to add to c
(additive, not multiplicative ; no vanishing gradients)
the output gate o decides which components of c to output as z
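The gating structure can be illustrated by a single LSTM time step in Python/numpy; the weights are random and the exact parametrization is an assumption (real LSTMs add bias terms and learned weights):

# Sketch of one LSTM time step: the forget gate f, input gate i, and output
# gate o modify the memory vector c and produce the short-term state z.
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))

d = 4                                          # hidden size
rng = np.random.default_rng(1)
Wf, Wi, Wo, Wc = (rng.normal(size=(d, 2 * d)) for _ in range(4))

def lstm_step(x, z_prev, c_prev):
    h = np.concatenate([x, z_prev])            # current input and previous state
    f = sigmoid(Wf @ h)                        # which parts of c to keep
    i = sigmoid(Wi @ h)                        # which parts of the input to add
    o = sigmoid(Wo @ h)                        # which parts of c to output
    c = f * c_prev + i * np.tanh(Wc @ h)       # additive memory update
    z = o * np.tanh(c)
    return z, c

z, c = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d))
print(z.shape, c.shape)                        # (4,) (4,)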
the three Spanish words caballo de mar translate to the English seahorse and
the two Spanish words perro grande translate to English as big dog.
in English, the subject is usually first and in Fijian last.
Idea: For MT, generate one word at a time, but keep track of the context, so that
Sequence-To-Sequence Models
Idea: Use two coupled RNNs, one for the source, and one for the target. The
input for the target is the output of the last hidden layer of the source RNN.
Definition 31.3.1. A sequence-to-sequence (seq2seq) model is a neural model for
translating an input sequence x into an output sequence y by an encoder followed
by a decoder that generates y.
[Figure: a seq2seq model – the Encoder reads the input, its final hidden state h is passed to the Decoder, which produces the output.]
Example 31.3.2. A simple seq2seq model (without embedding and output layers): each block represents one LSTM time step; the inputs are fed in successively, followed by the token <start> to start the decoder.

This neural network architecture is called a basic sequence-to-sequence model. Sequence-to-sequence models are most commonly used for machine translation, but can also be used for a number of other tasks, like automatically generating a text caption from an image, or summarization: rewriting a long text into a shorter one.

[Figure: Basic sequence-to-sequence model. Each block represents one LSTM timestep; for simplicity, the embedding and output layers are not shown. On successive steps we feed the network the words of the source sentence “The man is tall”, followed by the <start> tag to indicate that the network should start producing the target sentence. The final hidden state at the end of the source sentence is used as the hidden state for the start of the target sentence. After that, each target sentence word at time t is used as input at time t + 1, until the network produces the <end> tag to indicate that sentence generation is finished.]
Seq2Seq Evaluation
Remark: Seq2seq models were a major breakthrough in NLP and MT. But they
have three major shortcomings:
nearby context bias: RNNs remember with their hidden state, which has more
information about a word in – say – step 56 than in step 5. BUT long-distance
context can also be important.
fixed context size: the entire information about the source sentence must be
compressed into the fixed-dimensional – typically 1024 – vector. Larger vectors
; slow training and overfitting.
Idea: Concatenate all source RNN hidden vectors to use all of them to mitigate
the nearby context bias.
Attention
Bad Idea: Concatenate all source RNN hidden vectors to use all of them to
mitigate the nearby context bias.
Better Idea: The decoder generates the target sequence one word at a time. ;
Only a small part of the source is actually relevant.
the decoder must focus on different parts of the source for every word.
Idea: We need a neural component that does context-free summarization.
Definition 31.3.3. An attentional seq2seq model is a seq2seq that passes along a
context vector ci in the decoder. If hi = RN N (hi−1 , xi ) is the standard decoder,
then the decoder with attention is given by hi = RN N (hi−1 , xi + ci ), where xi + ci
is the concatenation of the input xi and context vectors ci with
c_i a weighted summary of the source: the dot product of the current target RNN state (the vector that is going to be used for predicting the word at timestep i) with each source RNN output vector gives a raw “attention score” for that source word. These scores are then normalized into a probability using a softmax over all source words. Finally, these probabilities are used to generate c_i as a weighted average of the source RNN vectors (another d-dimensional vector, where d is the hidden size).

[Figure: an attentional seq2seq model – Encoder and Decoder, where the decoder input at each step is x_i + c_i.]

An example of an attentional sequence-to-sequence model is given in Figure 24.7(a). There are a few important details to understand. First, the attention component itself has no learned weights and supports variable-length sequences on both the source and target side. Second, like most of the other neural network modeling techniques we have learned about, attention is entirely latent: the programmer does not dictate what information gets used when; the model learns what to use. Attention can also be combined with multilayer RNNs; typically attention is applied at each layer in that case.

Attention: English to Spanish Translation
Definition 31.3.6. Always selecting the highest probability word is called greedy
decoding.
Problem: This may not always maximize the probability of the whole sequence
Example 31.3.7. Let’s use a greedy decoder on The front door is red.
[Figure: beam search with a given beam size. The score of each word is the log-probability generated by the target RNN softmax, and the score of each hypothesis is the sum of the word scores. At timestep 3, the highest-scoring hypothesis La entrada can only generate low-probability continuations, so it “falls off the beam”.]

Word scores are log-probabilities generated by the decoder softmax; the hypothesis score is the sum of the word scores.

At time step 3, the highest scoring hypothesis La entrada can only generate low-probability continuations, so it “falls off the beam”. (as intended)
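The following Python sketch shows generic beam search over a toy next-word distribution (the probabilities are made up); word scores are log-probabilities and hypothesis scores are their sums, as described above:

# Generic beam-search sketch: only the `beam_size` best hypotheses survive each step.
import math

def next_word_probs(prefix):
    # stand-in for the decoder softmax; returns {word: probability}
    table = {(): {"la": 0.6, "el": 0.4},
             ("la",): {"puerta": 0.7, "entrada": 0.3},
             ("el",): {"frente": 1.0},
             ("la", "puerta"): {"<end>": 1.0},
             ("la", "entrada"): {"<end>": 0.1, "es": 0.9},
             ("el", "frente"): {"<end>": 1.0}}
    return table.get(tuple(prefix), {"<end>": 1.0})

def beam_search(beam_size=2, max_len=4):
    beam = [([], 0.0)]                                     # (hypothesis, score)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp and hyp[-1] == "<end>":
                candidates.append((hyp, score))            # finished hypotheses are kept
                continue
            for w, p in next_word_probs(hyp).items():
                candidates.append((hyp + [w], score + math.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam

print(beam_search())

On this toy data the hypothesis starting with la entrada is generated but pruned in favor of higher-scoring alternatives, mirroring the “falls off the beam” behavior described above.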
Self-Attention
Idea: “Attention is all you need!” (see [Vas+17])
So far, attention was used from the encoder to the decoder.
Self-attention extends this so that each hidden state sequence also attends to itself.
(*coder to *coder)
Idea: Just use the dot product of the input vectors.
Problem: The dot product of a vector with itself is always high, so each hidden state will be biased towards
attending to itself.
Self-attention solves this by first projecting the input into three different represen-
tations using three different weight matrices:
the query vector q_i = W_q x_i ≙ standard attention
key vector k_i = W_k x_i ≙ the source in seq2seq
value vector v_i = W_v x_i is the context being generated
r_ij = (q_i · k_j) / √d
a_ij = e^{r_ij} / (∑_k e^{r_ik})
c_i = ∑_j a_ij · v_j
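These equations translate directly into a few lines of Python/numpy; the projection matrices here are random and the sizes are made up:

# Sketch of the self-attention equations: project inputs to queries, keys, and
# values, compute scaled dot-product scores, softmax them, and average the values.
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8                                   # sequence length, hidden size
X = rng.normal(size=(n, d))                   # input vectors x_1..x_n
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv              # q_i, k_i, v_i for all positions
R = Q @ K.T / np.sqrt(d)                      # r_ij = (q_i . k_j) / sqrt(d)
A = np.exp(R) / np.exp(R).sum(axis=1, keepdims=True)   # a_ij via row-wise softmax
C = A @ V                                     # c_i = sum_j a_ij v_j
print(C.shape)                                # (5, 8): one context vector per input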
Positional embedding
The transformer architecture does not explicitly capture the order of words in the sequence, since context is modeled only through self-attention, which is agnostic to word order. To capture the ordering of the words, the transformer uses a technique called positional embedding.
Figure 24.10 illustrates the transformer architecture for POS tagging, applied to the same sentence used in Figure 24.3. At the bottom, the word embedding and the positional embeddings are summed to form the input for a three-layer transformer.
Idea: Take a pretrained neural network, replace the last layer(s), and then train
those on your own corpus.
Observation: Simple but surprisingly efficient!
Repeat until ℓ = N .
⇝ we obtain a one-hot encoding of tokens of size N, where the most common sequences
of bytes are represented by a single token. By retaining BPE(⟨b⟩) = b, we avoid OOV
problems.
⇝ We can then train a word embedding on the resulting tokens.
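For illustration, here is a Python sketch of the core byte pair encoding idea – repeatedly merging the most frequent adjacent token pair until the vocabulary reaches the target size N; real BPE tokenizers work on bytes and large corpora, so this is only a toy version:

# Toy BPE sketch: start from single characters and repeatedly merge the most
# frequent adjacent token pair until the vocabulary reaches size N.
from collections import Counter

def bpe(text, N):
    tokens = list(text)
    vocab = set(tokens)
    while len(vocab) < N:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):                 # merge every occurrence of (a, b)
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b); i += 2
            else:
                merged.append(tokens[i]); i += 1
        tokens = merged
        vocab.add(a + b)
    return tokens, vocab

tokens, vocab = bpe("low lower lowest", N=12)
print(tokens)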
Tokenization - Example
https://huggingface.co/spaces/Xenova/the-tokenizer-playground
Positional encodings
Definition 31.5.5. Let ⟨w1 , . . . , wn ⟩ be a sequence of tokens. A positional encoding
PEi (wi ) is a vector that retains the position of wi in the sequence alongside the word
embedding of wi .
We want positional encodings to satisfy the following properties:
Masked Token Prediction: Given a sentence (e.g. “The river rose five feet”), ran-
domly replace tokens by a special mask token (e.g. “The river [MASK] five feet”).
The LLM should predict the masked tokens (e.g. “rose”). (BERT et al; well suited
for generic tasks)
Discrimination: Train a small masked token prediction model M. Given a masked
sentence, let M generate possible completions. Train the actual model to distinguish
between tokens generated by M and the original tokens. (Google Electra et
al; well suited for generic tasks)
Next Token Prediction: Given the (beginning of) a sentence, predict the next token
in the sequence. (GPT et al; well suited for generative tasks)
DL4NLP methods do very well, but only after processing orders of magnitude more
data than humans do for learning language.
This suggests that there is scope for new insights from all areas.
Planning
Planning Frameworks
Planning Algorithms
Planning and Acting in the real world
[Figure 2.1: Agents interact with environments through sensors and actuators.]
Simple Reflex Agents

[AIMA Section 2.4 (The Structure of Agents), excerpt: Mathematically speaking, an agent's behavior is described by the agent function that maps any given percept sequence to an action; the agent function is an abstract mathematical description, while the agent program is a concrete implementation running within some physical system. The vacuum-cleaner world with the two locations A and B illustrates this: the vacuum agent perceives which square it is in and whether there is dirt in the square, and can move left, move right, suck up the dirt, or do nothing.]

[Figure 2.9: Schematic diagram of a simple reflex agent – Sensors yield “what the world is like now”, condition-action rules determine “what action I should do now”, which the Actuators execute.]

function SIMPLE−REFLEX−AGENT(percept) returns an action
persistent: rules, a set of condition–action rules
state ← INTERPRET−INPUT(percept)
rule ← RULE−MATCH(state, rules)
action ← rule.ACTION
return action

[Figure 2.10: A simple reflex agent. It acts according to a rule whose condition matches the current state, as defined by the percept; INTERPRET−INPUT generates an abstracted description of the current state from the percept.]

Reflex Agents with State
[Figure: a model-based, goal-based agent – an internal State together with knowledge of “how the world evolves” and “what my actions do” determines “what the world is like now” and “what it will be like if I do action A”; combined with the agent's Goals this determines “what action I should do now”, which the Actuators execute.]

state ← UPDATE−STATE(state, action, percept, model)
rule ← RULE−MATCH(state, rules)
action ← rule.ACTION
return action

[Figure 2.12: A model-based reflex agent. It keeps track of the current state of the world, using an internal model. It then chooses an action in the same way as the reflex agent.]
[Figure 2.15: A learning agent – a Critic compares percepts against a performance standard and gives feedback to the Learning element; the Learning element makes changes to the Performance element (which selects the external actions) and sets learning goals for a Problem generator; the agent interacts with the Environment through Sensors and Actuators.]
[AIMA excerpt on learning agents: Turing estimates how much work programming intelligent behavior by hand might take and concludes “Some more expeditious method seems desirable”; the method he proposes is to build learning machines and then to teach them. In many areas of AI, this is now the preferred method for creating state-of-the-art systems. Learning also allows the agent to operate in initially unknown environments and to become more competent than its initial knowledge alone might allow. A learning agent (Figure 2.15) can be divided into four conceptual components: the learning element, which is responsible for making improvements; the performance element, which takes in percepts and decides on actions (what we previously considered to be the entire agent); the critic, which tells the learning element how well the agent is doing with respect to a fixed performance standard – necessary because the percepts themselves provide no indication of the agent's success; and the problem generator. The design of the learning element depends very much on the design of the performance element: the first question is not “How am I going to get it to learn this?” but “What kind of performance element will my agent need to do this once it has learned how?”]

Rational Agents

Idea: Try to design agents that are successful. (do the right thing)

Definition 32.0.1. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date. This is called the MEU principle.

Note: A rational agent need not be perfect:
it only needs to maximize expected value (rational ≠ omniscient)
need not predict e.g. very unlikely but catastrophic events in the future
percepts may not supply all relevant information (rational ≠ clairvoyant)
if we cannot perceive things we do not need to react to them,
but we may need to try to find out about hidden dangers. (exploration)
action outcomes may not be as expected (rational ≠ successful)
but we may need to take action to ensure that they do. (more often) (learning)

Rational ⇝ exploration, learning, autonomy
[Bac00] Fahiem Bacchus. Subset of PDDL for the AIPS2000 Planning Competition. The AIPS-
00 Planning Competition Comitee. 2000.
[BF95] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Proceedings of the 14th International Joint Conference on Artificial Intelligence
(IJCAI). Ed. by Chris S. Mellish. Montreal, Canada: Morgan Kaufmann, San Mateo,
CA, 1995, pp. 1636–1642.
[BF97] Avrim L. Blum and Merrick L. Furst. “Fast planning through planning graph analysis”.
In: Artificial Intelligence 90.1-2 (1997), pp. 279–298.
[BG01] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search”. In: Artificial Intelli-
gence 129.1–2 (2001), pp. 5–33.
[BG99] Blai Bonet and Héctor Geffner. “Planning as Heuristic Search: New Results”. In:
Proceedings of the 5th European Conference on Planning (ECP’99). Ed. by S. Biundo
and M. Fox. Springer-Verlag, 1999, pp. 60–72.
[BKS04] Paul Beame, Henry A. Kautz, and Ashish Sabharwal. “Towards Understanding and
Harnessing the Potential of Clause Learning”. In: Journal of Artificial Intelligence
Research 22 (2004), pp. 319–351.
[Bon+12] Blai Bonet et al., eds. Proceedings of the 22nd International Conference on Automated
Planning and Scheduling (ICAPS’12). AAAI Press, 2012.
[Bro90] Rodney Brooks. “Elephants Don't Play Chess”. In: Robotics and Autonomous Systems 6.1–2 (1990), pp. 3–15. doi:
10.1016/S0921-8890(05)80025-9.
[Cho65] Noam Chomsky. Syntactic structures. Den Haag: Mouton, 1965.
[CKT91] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. “Where the Really Hard
Problems Are”. In: Proceedings of the 12th International Joint Conference on Artificial
Intelligence (IJCAI). Ed. by John Mylopoulos and Ray Reiter. Sydney, Australia:
Morgan Kaufmann, San Mateo, CA, 1991, pp. 331–337.
[CM85] Eugene Charniak and Drew McDermott. Introduction to Artificial Intelligence. Ad-
dison Wesley, 1985.
[CQ69] Allan M. Collins and M. Ross Quillian. “Retrieval time from semantic memory”. In:
Journal of verbal learning and verbal behavior 8.2 (1969), pp. 240–247. doi: 10.1016/
S0022-5371(69)80069-1.
[Dav67] Donald Davidson. “Truth and Meaning”. In: Synthese 17 (1967).
[DCM12] DCMI Usage Board. DCMI Metadata Terms. DCMI Recommendation. Dublin Core
Metadata Initiative, June 14, 2012. url: http : / / dublincore . org / documents /
2012/06/14/dcmi-terms/.
[DF31] B. De Finetti. “Sul significato soggettivo della probabilita”. In: Fundamenta Mathe-
maticae 17 (1931), pp. 298–329.
[DHK15] Carmel Domshlak, Jörg Hoffmann, and Michael Katz. “Red-Black Planning: A New
Systematic Approach to Partial Delete Relaxation”. In: Artificial Intelligence 221
(2015), pp. 73–114.
[Ede01] Stefan Edelkamp. “Planning with Pattern Databases”. In: Proceedings of the 6th Eu-
ropean Conference on Planning (ECP’01). Ed. by A. Cesta and D. Borrajo. Springer-
Verlag, 2001, pp. 13–24.
[FD14] Zohar Feldman and Carmel Domshlak. “Simple Regret Optimization in Online Plan-
ning for Markov Decision Processes”. In: Journal of Artificial Intelligence Research
51 (2014), pp. 165–205.
[Fis] John R. Fisher. prolog :- tutorial. url: https : / / saksagan . ceng . metu . edu .
tr/courses/ceng242/documents/prolog/jrfisher/contents.html (visited on
10/29/2024).
[FL03] Maria Fox and Derek Long. “PDDL2.1: An Extension to PDDL for Expressing Tem-
poral Planning Domains”. In: Journal of Artificial Intelligence Research 20 (2003),
pp. 61–124.
[Fla94] Peter Flach. Simply Logical: Intelligent Reasoning by Example. Wiley, 1994. isbn: 0471 94152 2. url: https://github.com/simply-
logical/simply-logical/releases/download/v1.0/SL.pdf.
[FN71] Richard E. Fikes and Nils Nilsson. “STRIPS: A New Approach to the Application of
Theorem Proving to Problem Solving”. In: Artificial Intelligence 2 (1971), pp. 189–
208.
[Gen34] Gerhard Gentzen. “Untersuchungen über das logische Schließen I”. In: Mathematische
Zeitschrift 39.2 (1934), pp. 176–210.
[Ger+09] Alfonso Gerevini et al. “Deterministic planning in the fifth international planning
competition: PDDL3 and experimental evaluation of the planners”. In: Artificial In-
telligence 173.5-6 (2009), pp. 619–668.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability—A Guide to
the Theory of NP-Completeness. BN book: Freeman, 1979.
[Glo] Grundlagen der Logik in der Informatik. Course notes at https://www8.cs.fau.de/
_media/ws16:gloin:skript.pdf. url: https://www8.cs.fau.de/_media/ws16:
gloin:skript.pdf (visited on 10/13/2017).
[GNT04] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated Planning: Theory and
Practice. Morgan Kaufmann, 2004.
[GS05] Carla Gomes and Bart Selman. “Can get satisfaction”. In: Nature 435 (2005), pp. 751–
752.
[GSS03] Alfonso Gerevini, Alessandro Saetti, and Ivan Serina. “Planning through Stochas-
tic Local Search and Temporal Action Graphs”. In: Journal of Artificial Intelligence
Research 20 (2003), pp. 239–290.
[Hau85] John Haugeland. Artificial intelligence: the very idea. Massachusetts Institute of Tech-
nology, 1985.
[HD09] Malte Helmert and Carmel Domshlak. “Landmarks, Critical Paths and Abstractions:
What’s the Difference Anyway?” In: Proceedings of the 19th International Conference
on Automated Planning and Scheduling (ICAPS’09). Ed. by Alfonso Gerevini et al.
AAAI Press, 2009, pp. 162–169.
[HE05] Jörg Hoffmann and Stefan Edelkamp. “The Deterministic Part of IPC-4: An Overview”.
In: Journal of Artificial Intelligence Research 24 (2005), pp. 519–579.
[Hel06] Malte Helmert. “The Fast Downward Planning System”. In: Journal of Artificial In-
telligence Research 26 (2006), pp. 191–246.
[Her+13a] Ivan Herman et al. RDF 1.1 Primer (Second Edition). Rich Structured Data Markup
for Web Documents. W3C Working Group Note. World Wide Web Consortium (W3C),
2013. url: http://www.w3.org/TR/rdfa-primer.
[Her+13b] Ivan Herman et al. RDFa 1.1 Primer – Second Edition. Rich Structured Data Markup
for Web Documents. W3C Working Goup Note. World Wide Web Consortium (W3C),
Apr. 19, 2013. url: http://www.w3.org/TR/xhtml-rdfa-primer/.
[HG00] Patrik Haslum and Hector Geffner. “Admissible Heuristics for Optimal Planning”. In:
Proceedings of the 5th International Conference on Artificial Intelligence Planning
Systems (AIPS’00). Ed. by S. Chien, R. Kambhampati, and C. Knoblock. Brecken-
ridge, CO: AAAI Press, Menlo Park, 2000, pp. 140–149.
[HG08] Malte Helmert and Hector Geffner. “Unifying the Causal Graph and Additive Heuris-
tics”. In: Proceedings of the 18th International Conference on Automated Planning
and Scheduling (ICAPS’08). Ed. by Jussi Rintanen et al. AAAI Press, 2008, pp. 140–
147.
[HHH07] Malte Helmert, Patrik Haslum, and Jörg Hoffmann. “Flexible Abstraction Heuristics
for Optimal Sequential Planning”. In: Proceedings of the 17th International Conference
on Automated Planning and Scheduling (ICAPS’07). Ed. by Mark Boddy, Maria
Fox, and Sylvie Thiebaux. Providence, Rhode Island, USA: Morgan Kaufmann, 2007,
pp. 176–183.
[Hit+12] Pascal Hitzler et al. OWL 2 Web Ontology Language Primer (Second Edition). W3C
Recommendation. World Wide Web Consortium (W3C), 2012. url: http://www.
w3.org/TR/owl-primer.
[HN01] Jörg Hoffmann and Bernhard Nebel. “The FF Planning System: Fast Plan Generation
Through Heuristic Search”. In: Journal of Artificial Intelligence Research 14 (2001),
pp. 253–302.
[Hof11] Jörg Hoffmann. “Everything You Always Wanted to Know about Planning (But
Were Afraid to Ask)”. In: Proceedings of the 34th Annual German Conference on
Artificial Intelligence (KI’11). Ed. by Joscha Bach and Stefan Edelkamp. Vol. 7006.
Lecture Notes in Computer Science. Springer, 2011, pp. 1–13. url: http://fai.cs.
uni-saarland.de/hoffmann/papers/ki11.pdf.
[How60] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[ILD] 7. Constraints: Interpreting Line Drawings. url: https://www.youtube.com/watch?
v=l-tzjenXrvI&t=2037s (visited on 11/19/2019).
[JN33] J. Neyman and E. S. Pearson. “IX. On the problem of the most efficient tests of statis-
tical hypotheses”. In: Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences 231.694-706 (1933), pp. 289–337.
doi: 10.1098/rsta.1933.0009.
[KC04] Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Con-
cepts and Abstract Syntax. W3C Recommendation. World Wide Web Consortium
(W3C), Feb. 10, 2004. url: http://www.w3.org/TR/2004/REC- rdf- concepts-
20040210/.
[KD09] Erez Karpas and Carmel Domshlak. “Cost-Optimal Planning with Landmarks”. In:
Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJ-
CAI’09). Ed. by C. Boutilier. Pasadena, California, USA: Morgan Kaufmann, July
2009, pp. 1728–1733.
[Kee74] R. L. Keeney. “Multiplicative utility functions”. In: Operations Research 22 (1974),
pp. 22–34.
[KHD13] Michael Katz, Jörg Hoffmann, and Carmel Domshlak. “Who Said We Need to Relax
all Variables?” In: Proceedings of the 23rd International Conference on Automated
Planning and Scheduling (ICAPS’13). Ed. by Daniel Borrajo et al. Rome, Italy: AAAI
Press, 2013, pp. 126–134.
[KHH12a] Michael Katz, Jörg Hoffmann, and Malte Helmert. “How to Relax a Bisimulation?”
In: Proceedings of the 22nd International Conference on Automated Planning and
Scheduling (ICAPS’12). Ed. by Blai Bonet et al. AAAI Press, 2012, pp. 101–109.
[KHH12b] Emil Keyder, Jörg Hoffmann, and Patrik Haslum. “Semi-Relaxed Plan Heuristics”.
In: Proceedings of the 22nd International Conference on Automated Planning and
Scheduling (ICAPS’12). Ed. by Blai Bonet et al. AAAI Press, 2012, pp. 128–136.
[KNS97] B. Kessler, G. Nunberg, and H. Schütze. “Automatic detection of text genre”. In:
CoRR cmp-lg/9707002 (1997).
[Koe+97] Jana Koehler et al. “Extending Planning Graphs to an ADL Subset”. In: Proceedings
of the 4th European Conference on Planning (ECP’97). Ed. by S. Steel and R. Alami.
Springer-Verlag, 1997, pp. 273–285. url: ftp://ftp.informatik.uni- freiburg.
de/papers/ki/koehler-etal-ecp-97.ps.gz.
[Koh08] Michael Kohlhase. “Using LATEX as a Semantic Markup Format”. In: Mathematics in
Computer Science 2.2 (2008), pp. 279–304. url: https://kwarc.info/kohlhase/
papers/mcs08-stex.pdf.
[Kow97] Robert Kowalski. “Algorithm = Logic + Control”. In: Communications of the Asso-
ciation for Computing Machinery 22 (1997), pp. 424–436.
[KS00] Jana Köhler and Kilian Schuster. “Elevator Control as a Planning Problem”. In: AIPS
2000 Proceedings. AAAI, 2000, pp. 331–338. url: https://www.aaai.org/Papers/
AIPS/2000/AIPS00-036.pdf.
[KS06] Levente Kocsis and Csaba Szepesvári. “Bandit Based Monte-Carlo Planning”. In:
Proceedings of the 17th European Conference on Machine Learning (ECML 2006). Ed.
by Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou. Vol. 4212. LNCS.
Springer-Verlag, 2006, pp. 282–293.
[KS92] Henry A. Kautz and Bart Selman. “Planning as Satisfiability”. In: Proceedings of the
10th European Conference on Artificial Intelligence (ECAI’92). Ed. by B. Neumann.
Vienna, Austria: Wiley, Aug. 1992, pp. 359–363.
[KS98] Henry A. Kautz and Bart Selman. “Pushing the Envelope: Planning, Propositional
Logic, and Stochastic Search”. In: Proceedings of the Thirteenth National Conference
on Artificial Intelligence AAAI-96. MIT Press, 1998, pp. 1194–1201.
[Kur90] Ray Kurzweil. The Age of Intelligent Machines. MIT Press, 1990. isbn: 0-262-11121-7.
[LPN] Learn Prolog Now! url: http://lpn.swi-prolog.org/ (visited on 10/10/2019).
[LS93] George F. Luger and William A. Stubblefield. Artificial Intelligence: Structures and
Strategies for Complex Problem Solving. World Student Series. The Benjamin/Cum-
mings, 1993. isbn: 9780805347852.
[Luc96] Peter Lucas. “Knowledge Acquisition for Decision-theoretic Expert Systems”. In:
AISB Quarterly 94 (1996), pp. 23–33. url: https : / / www . researchgate . net /
publication/2460438_Knowledge_Acquisition_for_Decision-theoretic_Expert_
Systems.
[McD+98] Drew McDermott et al. The PDDL Planning Domain Definition Language. The AIPS-
98 Planning Competition Comitee. 1998.
[Met+53] N. Metropolis et al. “Equations of state calculations by fast computing machines”. In:
Journal of Chemical Physics 21 (1953), pp. 1087–1091.
[Min] Minion - Constraint Modelling. System Web page at http://constraintmodelling.
org/minion/. url: http://constraintmodelling.org/minion/.
[MSL92] David Mitchell, Bart Selman, and Hector J. Levesque. “Hard and Easy Distributions
of SAT Problems”. In: Proceedings of the 10th National Conference of the American
Association for Artificial Intelligence (AAAI’92). San Jose, CA: MIT Press, 1992,
pp. 459–465.
[NHH11] Raz Nissim, Jörg Hoffmann, and Malte Helmert. “Computing Perfect Heuristics in
Polynomial Time: On Bisimulation and Merge-and-Shrink Abstraction in Optimal
Planning”. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence (IJCAI’11). Ed. by Toby Walsh. AAAI Press/IJCAI, 2011, pp. 1983–
1990.
[Nor+18a] Emily Nordmann et al. Lecture capture: Practical recommendations for students and
lecturers. 2018. url: https://osf.io/huydx/download.
[Nor+18b] Emily Nordmann et al. Vorlesungsaufzeichnungen nutzen: Eine Anleitung für Studierende.
2018. url: https://osf.io/e6r7a/download.
[NS63] Allen Newell and Herbert Simon. “GPS, a program that simulates human thought”.
In: Computers and Thought. Ed. by E. Feigenbaum and J. Feldman. McGraw-Hill,
1963, pp. 279–293.
[NS76] Allen Newell and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols
and Search”. In: Communications of the ACM 19.3 (1976), pp. 113–126. doi: 10.
1145/360018.360022.
[OWL09] OWL Working Group. OWL 2 Web Ontology Language: Document Overview. W3C
Recommendation. World Wide Web Consortium (W3C), Oct. 27, 2009. url: http:
//www.w3.org/TR/2009/REC-owl2-overview-20091027/.
[PD09] Knot Pipatsrisawat and Adnan Darwiche. “On the Power of Clause-Learning SAT
Solvers with Restarts”. In: Proceedings of the 15th International Conference on Princi-
ples and Practice of Constraint Programming (CP’09). Ed. by Ian P. Gent. Vol. 5732.
Lecture Notes in Computer Science. Springer, 2009, pp. 654–668.
[Pól73] George Pólya. How to Solve it. A New Aspect of Mathematical Method. Princeton
University Press, 1973.
[Pra+94] Malcolm Pradhan et al. “Knowledge Engineering for Large Belief Networks”. In:
Proceedings of the Tenth International Conference on Uncertainty in Artificial In-
telligence. UAI’94. Seattle, WA: Morgan Kaufmann Publishers Inc., 1994, pp. 484–
490. isbn: 1-55860-332-8. url: http://dl.acm.org/citation.cfm?id=2074394.
2074456.
[Pro] Protégé. Project Home page at http : / / protege . stanford . edu. url: http : / /
protege.stanford.edu.
[PRR97] G. Probst, St. Raub, and Kai Romhardt. Wissen managen. 4 (2003). Gabler Verlag,
1997.
[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C
Recommendation. World Wide Web Consortium (W3C), Jan. 15, 2008. url: http:
//www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
[PW92] J. Scott Penberthy and Daniel S. Weld. “UCPOP: A Sound, Complete, Partial Order
Planner for ADL”. In: Principles of Knowledge Representation and Reasoning: Pro-
ceedings of the 3rd International Conference (KR-92). Ed. by B. Nebel, W. Swartout,
and C. Rich. Cambridge, MA: Morgan Kaufmann, Oct. 1992, pp. 103–114. url: ftp:
//ftp.cs.washington.edu/pub/ai/ucpop-kr92.ps.Z.
[Ran17] Aarne Ranta. Automatic Translation for Consumers and Producers. Presentation
given at the Chalmers Initiative Seminar. 2017. url: https://www.grammaticalframework.
org/~aarne/mt-digitalization-2017.pdf.
[RHN06] Jussi Rintanen, Keijo Heljanko, and Ilkka Niemelä. “Planning as satisfiability: parallel
plans and algorithms for plan search”. In: Artificial Intelligence 170.12-13 (2006),
pp. 1031–1080.
[Rin10] Jussi Rintanen. “Heuristics for Planning with SAT”. In: Proceeedings of the 16th In-
ternational Conference on Principles and Practice of Constraint Programming. 2010,
pp. 414–428.
[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2nd ed.
Pearson Education, 2003. isbn: 0137903952.
[RN09] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd.
Prentice Hall Press, 2009. isbn: 0136042597, 9780136042594.
[RN95] Stuart J. Russell and Peter Norvig. Artificial Intelligence — A Modern Approach.
Upper Saddle River, NJ: Prentice Hall, 1995.
[RW10] Silvia Richter and Matthias Westphal. “The LAMA Planner: Guiding Cost-Based
Anytime Planning with Landmarks”. In: Journal of Artificial Intelligence Research
39 (2010), pp. 127–177.
[RW91] S. J. Russell and E. Wefald. Do the Right Thing — Studies in limited Rationality.
MIT Press, 1991.
[She24] Esther Shein. The Impact of AI on Computer Science Education. 2024. url: https://cacm.acm.org/news/the-impact-of-ai-on-computer-science-education/.
[Sil+16] David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree
Search”. In: Nature 529 (2016), pp. 484–503. url: http://www.nature.com/nature/
journal/v529/n7587/full/nature16961.html.
[Smu63] Raymond M. Smullyan. “A Unifying Principle for Quantification Theory”. In: Proc.
Nat. Acad Sciences 49 (1963), pp. 828–832.
[SR14] Guus Schreiber and Yves Raimond. RDF 1.1 Primer. W3C Working Group Note.
World Wide Web Consortium (W3C), 2014. url: http://www.w3.org/TR/rdf-
primer.
[sTeX] sTeX: A semantic Extension of TeX/LaTeX. url: https://github.com/sLaTeX/
sTeX (visited on 05/11/2020).
[SWI] SWI Prolog Reference Manual. url: https://www.swi-prolog.org/pldoc/refman/
(visited on 10/10/2019).
[Tur50] Alan Turing. “Computing Machinery and Intelligence”. In: Mind 59 (1950), pp. 433–
460.
[Vas+17] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural Infor-
mation Processing Systems. Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc.,
2017. url: https://proceedings.neurips.cc/paper_files/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[Wal75] David Waltz. “Understanding Line Drawings of Scenes with Shadows”. In: The Psy-
chology of Computer Vision. Ed. by P. H. Winston. McGraw-Hill, 1975, pp. 1–19.
[WHI] Human intelligence — Wikipedia The Free Encyclopedia. url: https://en.wikipedia.
org/w/index.php?title=Human_intelligence (visited on 04/09/2018).
Part VIII
Excursions
As this course is predominantly an overview of the topics of Artificial Intelligence, and not
about their theoretical underpinnings, we provide the discussion of these underpinnings as a
“suggested readings” part here.
Appendix A
Completeness of Calculi for Propositional Logic
The next step is to analyze the two calculi for completeness. For that we will first give ourselves
a very powerful tool: the “model existence theorem” (??), which encapsulates the model-theoretic
part of completeness theorems. With that, completeness proofs – which are quite tedious otherwise
– become a breeze.
Corollary: C is complete.
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated, calculus-dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus-independent
parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and to interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness of calculi;
they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition A.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if
there is a refutation, i.e. a derivation of a contradiction from Φ. The act of finding
a refutation for Φ is called refuting Φ.
Definition A.1.2. We call a pair of formulae A and ¬A a contradiction.
So a set Φ is C-refutable, if C can derive a contradiction from it.
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where ⟨L, K, ⊨⟩ is the current logical system.
Even the word “contradiction” has a syntactical flavor to it; it translates to “saying against
each other” from its Latin root.
A.1. ABSTRACT CONSISTENCY AND MODEL EXISTENCE 713
Abstract Consistency
Definition A.1.6. Let ∇ be a collection of sets. We call ∇ closed under subsets,
iff for each Φ ∈ ∇, all subsets Ψ ⊆ Φ are elements of ∇.
Definition A.1.7 (Notation). We will use Φ∗A for Φ ∪ {A}.
Definition A.1.8. A collection ∇ of sets of propositional formulae is called an
abstract consistency class, iff it is closed under subsets, and for each Φ ∈ ∇
∇c ) P ̸∈ Φ or ¬P ̸∈ Φ for P ∈ V0
∇¬ ) ¬¬A ∈ Φ implies Φ∗A ∈ ∇
∇∨ ) A ∨ B ∈ Φ implies Φ∗A ∈ ∇ or Φ∗B ∈ ∇
∇∧ ) ¬(A ∨ B) ∈ Φ implies Φ ∪ {¬A, ¬B} ∈ ∇
Example A.1.9. The empty set is an abstract consistency class.
Example A.1.10. The set {∅, {Q}, {P ∨Q}, {P ∨Q, Q}} is an abstract consistency
class.
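To see the closure conditions in action, here is a small Python sketch (our own ad-hoc formula encoding and helper names, not part of the course material) that checks closure under subsets and the four conditions for a finite family of finite formula sets, and confirms Example A.1.10.

from itertools import combinations

# Formula encoding (ours): ('var', name), ('not', A), ('or', A, B)
def Not(a): return ('not', a)
def Or(a, b): return ('or', a, b)
P, Q = ('var', 'P'), ('var', 'Q')

def subsets(phi):
    xs = list(phi)
    return [frozenset(c) for r in range(len(xs) + 1) for c in combinations(xs, r)]

def is_abstract_consistency_class(family):
    nabla = {frozenset(phi) for phi in family}
    for phi in nabla:
        if any(psi not in nabla for psi in subsets(phi)):     # closed under subsets
            return False
        for a in phi:
            if a[0] == 'var' and Not(a) in phi:               # nabla_c: no P together with ¬P
                return False
            if a[0] == 'not' and a[1][0] == 'not':            # nabla_not: ¬¬A in Φ implies Φ*A in nabla
                if phi | {a[1][1]} not in nabla:
                    return False
            if a[0] == 'or':                                  # nabla_or: A∨B implies Φ*A or Φ*B in nabla
                if phi | {a[1]} not in nabla and phi | {a[2]} not in nabla:
                    return False
            if a[0] == 'not' and a[1][0] == 'or':             # nabla_and: ¬(A∨B) implies Φ ∪ {¬A, ¬B} in nabla
                if phi | {Not(a[1][1]), Not(a[1][2])} not in nabla:
                    return False
    return True

# Example A.1.10: {∅, {Q}, {P ∨ Q}, {P ∨ Q, Q}}
print(is_abstract_consistency_class([set(), {Q}, {Or(P, Q)}, {Or(P, Q), Q}]))  # True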
So a family of sets (we call it a family, so that we do not have to say “set of sets” and we can
distinguish the levels) is an abstract consistency class, iff it fulfills five simple conditions, of which
the last three are closure conditions.
Think of an abstract consistency class as a family of “consistent” sets (e.g. C-consistent for some
calculus C); then the properties make perfect sense: they are naturally closed under subsets — if
we cannot derive a contradiction from a large set, we certainly cannot derive one from a subset. Furthermore,
∇c ) If both P ∈ Φ and ¬P ∈ Φ, then Φ cannot be “consistent”.
∇¬ ) If we cannot derive a contradiction from Φ with ¬¬A ∈ Φ then we cannot from Φ∗A, since
they are logically equivalent.
The other two conditions are motivated similarly. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition A.1.12. We call a collection ∇ of sets compact, iff for any set Φ we
have
Φ ∈ ∇, iff Ψ ∈ ∇ for every finite subset Ψ of Φ.
Lemma A.1.13. If ∇ is compact, then ∇ is closed under subsets.
Proof:
1. Suppose S ⊆ T and T ∈ ∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A ∈ ∇.
4. Thus S ∈ ∇.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction, anyway) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a collection ∇ by testing
all their finite subsets (which is much simpler).
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition A.1.14. Let ∇ be an abstract consistency class, then we call a set H ∈ ∇ a
∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A ∈ ∇ we already have A ∈ H.
Theorem A.1.15 (Hintikka Properties). Let ∇ be an abstract consistency class and H be a
∇-Hintikka set, then
Hc ) A ̸∈ H or ¬A ̸∈ H
H¬ ) ¬¬A ∈ H implies A ∈ H
and the analogous closure properties hold for the other connectives.
Proof:
We prove the properties in turn
1. Hc by induction on the structure of A
1.1. A ∈ V0 Then A ̸∈ H or ¬A ̸∈ H by ∇c .
1.2. A = ¬B
1.2.1. Let us assume that ¬B ∈ H and ¬¬B ∈ H,
1.2.2. then H∗B ∈ ∇ by ∇¬ , and therefore B ∈ H by maximality.
1.2.3. So both B and ¬B are in H, which contradicts the induction hy-
pothesis.
1.3. A = B ∨ C similar to the previous case
2. We prove H¬ by maximality of H in ∇.
2.1. If ¬¬A ∈ H, then H∗A ∈ ∇ by ∇¬ .
2.2. The maximality of H now gives us that A ∈ H.
Proof sketch: other H∗ are similar
The following theorem is one of the main results in the “abstract consistency”/“model existence”
method. For any abstract consistent set Φ it allows us to construct a Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem A.1.17. If ∇ is an abstract consistency class and Φ ∈ ∇, then there is
a ∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. we assume that ∇ is compact (otherwise pass to compact extension)
2. We choose an enumeration A1 , . . . of the set wff0 (V0 )
3. and construct a sequence of sets Hi with H0 := Φ and
   Hn+1 := Hn        if Hn ∗An ̸∈ ∇
   Hn+1 := Hn ∗An    if Hn ∗An ∈ ∇
4. Note that all Hi ∈ ∇, choose H := ⋃i∈N Hi
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of wff0 (V0 ). If we pick a
different enumeration, we will end up with a different H. Say A and ¬A are both ∇-consistent
with Φ; then, depending on which one comes first in the enumeration, H will contain that one, with all
the consequences for the subsequent choices in the construction process.
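The order-dependence just discussed can be illustrated with a small Python sketch of the greedy extension over a finite enumeration; the consistency oracle in_nabla is a placeholder we assume (here instantiated by a toy test on literal strings), not something defined in the notes.

def extend_to_hintikka(phi, enumeration, in_nabla):
    """Greedy analogue of the H_n construction over a finite enumeration.

    phi         -- initial formula set (assumed to be in the class)
    enumeration -- the formulae A_1, A_2, ... (finite here)
    in_nabla    -- oracle deciding membership in the abstract consistency class
    """
    h = frozenset(phi)
    for a in enumeration:          # H_{n+1} := H_n * A_n  whenever that set is still in nabla
        if in_nabla(h | {a}):
            h = h | {a}
    return h                       # H := the union of all H_n

def toy_oracle(s):                 # toy consistency: no literal together with its negation
    return not any(('~' + l) in s for l in s if not l.startswith('~'))

print(sorted(extend_to_hintikka(set(), ['p', '~p', 'q'], toy_oracle)))   # ['p', 'q']
print(sorted(extend_to_hintikka(set(), ['~p', 'p', 'q'], toy_oracle)))   # ['q', '~p']

As the two calls show, enumerating p before ~p (or vice versa) yields different maximal sets, exactly as described above.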
Valuation
Definition A.1.18. A function ν : wff0 (V0 ) → D0 is called a (propositional) valuation, iff
ν(¬A) = T, iff ν(A) = F
ν(A ∧ B) = T, iff ν(A) = T and ν(B) = T
Lemma A.1.19. If ν : wff0 (V0 ) → D0 is a valuation and Φ ⊆ wff0 (V0 ) with ν(Φ) = {T}, then Φ is satisfiable.
Proof sketch: ν|V0 : V0 → D0 is a satisfying variable assignment.
Lemma A.1.20. If φ : V0 → D0 is a variable assignment, then Iφ : wff0 (V0 ) → D0 is a valuation.
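As an illustration of Lemma A.1.20, the following Python sketch (ours, with an ad-hoc formula encoding) computes the evaluation function Iφ induced by a variable assignment φ and spot-checks the first valuation condition of Definition A.1.18.

# Formula encoding (ours): ('var', name), ('not', A), ('and', A, B), ('or', A, B)
def evaluate(formula, phi):
    """I_phi: evaluate a propositional formula under the variable assignment phi (dict name -> bool)."""
    op = formula[0]
    if op == 'var':
        return phi[formula[1]]
    if op == 'not':
        return not evaluate(formula[1], phi)
    if op == 'and':
        return evaluate(formula[1], phi) and evaluate(formula[2], phi)
    if op == 'or':
        return evaluate(formula[1], phi) or evaluate(formula[2], phi)
    raise ValueError(f"unknown connective {op!r}")

phi = {'P': True, 'Q': False}
A = ('and', ('var', 'P'), ('not', ('var', 'Q')))
print(evaluate(A, phi))                                        # True
# the negation condition of Definition A.1.18:
print(evaluate(('not', A), phi) == (not evaluate(A, phi)))     # True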
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Lemma A.1.21 (Hintikka-Lemma). If ∇ is an abstract consistency class and H
a ∇-Hintikka set, then H is satisfiable.
Proof:
1. We define ν(A) := T, iff A ∈ H
2. then ν is a valuation by the Hintikka properties
3. and thus ν|V0 is a satisfying assignment.
Theorem (Model Existence). If ∇ is an abstract consistency class and Φ ∈ ∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H. (extension theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
Observation: If we look at the completeness proof below, we see that the lemma above is the
only place where we had to deal with specific properties of T0 .
So if we want to prove completeness of any other calculus with respect to propositional logic,
then we only need to prove an analogue of this lemma and can use the rest of the machinery we
have already established “off the shelf”.
This is one great advantage of the “abstract consistency method”; the other is that the method
can be extended transparently to other logics.
Completeness of T0
Corollary A.2.2. T0 is complete.
Proof: by contradiction
1. We assume that A ∈ wff0 (V0 ) is valid, but there is no closed tableau for AF .
2. We have {¬A} ∈ ∇ as ¬AT = AF .
3. so ¬A is satisfiable by the model existence theorem (which is applicable as ∇
is an abstract consistency class by our Lemma above)
4. this contradicts our assumption that A is valid.
Appendix B
Conflict Driven Clause Learning
This Section. We will capture the “what went wrong” in terms of graphs
over literals set during the search, and their dependencies.
What can we learn from that information?
Intuition: The initial vertices are the choice literals and unit clauses of ∆.
Example (constructing the implication graph along a DPLL run; the graph drawings are omitted here):
1. UP Rule: R 7→ T. Implied literal RT .
2. Splitting Rule: 2a. P 7→ F. Choice literal P F .
3a. UP Rule: Q 7→ T. Implied literal QT , with edges (RT ,QT ) and (P F ,QT ).
Finally a conflict vertex 2P T ∨QF is reached, with edges (P F ,2P T ∨QF ) and (QT ,2P T ∨QF ).
[The accompanying figure also lists the clauses P T ∨ QT ; P F ∨ QF ; P T ∨ QF .]
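The following Python sketch (our own data layout, not the course's reference implementation) runs unit propagation while recording, for every implied literal, the clause that forced it; the implication graph edges described above can then be read off these "reasons". The concrete clause set in the usage part is our own choice, picked so that the run reproduces the edges from the example.

def unit_propagate(clauses, assignment, reasons):
    """Run unit propagation to a fixpoint.

    clauses    -- list of clauses; a clause is a list of literals (var, value), value in {True, False}
    assignment -- dict var -> value (choice literals and already implied literals)
    reasons    -- dict var -> the clause that implied it (choice literals have no entry)
    Returns a conflicting clause if one is found, otherwise None.
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(v) == b for (v, b) in clause):   # clause already satisfied
                continue
            unassigned = [(v, b) for (v, b) in clause if v not in assignment]
            if not unassigned:                                     # all literals false: conflict vertex
                return clause
            if len(unassigned) == 1:                               # unit: imply the remaining literal
                v, b = unassigned[0]
                assignment[v] = b
                reasons[v] = clause                                # graph edges run from the other
                changed = True                                     # literals of this clause to (v, b)
    return None

def implication_edges(reasons, assignment):
    """The edges (l_i, l') of the implication graph, read off the recorded reasons."""
    return [((u, assignment[u]), (v, assignment[v]))
            for v, clause in reasons.items() for (u, _) in clause if u != v]

# A clause set chosen (by us) to reproduce the edges described above:
clauses = [[('R', True)],                                # unit clause R^T
           [('R', False), ('P', True), ('Q', True)],     # R^F ∨ P^T ∨ Q^T
           [('P', True), ('Q', False)]]                  # P^T ∨ Q^F
assignment, reasons = {}, {}
unit_propagate(clauses, assignment, reasons)             # implies R^T
assignment['P'] = False                                  # choice literal P^F
conflict = unit_propagate(clauses, assignment, reasons)  # implies Q^T, then P^T ∨ Q^F conflicts
print(conflict)                                          # [('P', True), ('Q', False)]
print(implication_edges(reasons, assignment))            # edges into Q^T from R^T and P^F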
[Figure: DPLL search tree splitting on P , X1 , . . ., Xn , Q, with every branch ending in R ; 2.]
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
DPLL on ∆ ; Θ ; Φ with Φ := QF ∨ S T ; QF ∨ S F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT , S T .
[Figure: implication graph with choice literals P T , QT and X 1 T , . . ., X n T , implied literals RT and S T , and a conflict vertex.]
[Figure: the two possible conflict graphs for this conflict (“Option 1” and “Option 2”), both with conflict vertex 2P F ∨QF .]
Conflict Graphs
A conflict graph captures “what went wrong” in a failed node.
Definition B.1.9 (Conflict Graph). Let ∆ be a clause set, and let Gimpl β be the implication
graph for some search branch β of DPLL on ∆. A subgraph C of Gimpl β is a conflict graph if:
(i) C contains exactly one conflict vertex 2C .
(ii) If l′ is a vertex in C, then all parents of l′ , i.e. vertices li with an edge (li ,l′ ),
are vertices in C as well.
(iii) All vertices in C have a path to 2C .
Conflict graph ≙ starting at a conflict vertex, backchain through the implication
graph until reaching choice literals.
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
DPLL on ∆ ; Θ ; Φ with Φ := QF ∨ S T ; QF ∨ S F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT .
[Figure: the implication graph for this branch and the conflict graph obtained by backchaining from the conflict vertex to the choice literals.]
Clause Learning
Observation: Conflict graphs encode the entailment relation.
Definition B.2.1. Let ∆ be a clause set, C be a conflict graph at some time
point during a run of DPLL on ∆, and L be the choice literals in C, then we call
c := ⋁l∈L l̄ , i.e. the disjunction of the complements of the choice literals in C, the learned clause for C.
Theorem B.2.2. Let ∆, C, and c be as in ??; then ∆ ⊨ c.
Idea: We can add learned clauses to DPLL derivations at any time without losing
soundness. (maybe this helps, if we have a good notion of learned clauses)
Definition B.2.3. Clause learning is the process of adding learned clauses to DPLL
clause sets at specific points. (details coming up)
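A possible Python sketch of Definition B.2.1 (again with our own naming, matching the "reasons" recorded during unit propagation above): backchain from the conflict vertex through the implication graph, collect the choice literals reached, and return the disjunction of their complements as the learned clause.

def learned_clause(conflict_clause, reasons, assignment):
    """Learned clause for the conflict graph rooted at conflict_clause (Definition B.2.1).

    reasons    -- dict var -> clause that implied it (implied literals and unit clauses of ∆)
    assignment -- dict var -> currently assigned value
    Backchains from the conflict vertex; returns the disjunction (as a list) of the
    complements of the choice literals that are reached.
    """
    stack = [v for (v, _) in conflict_clause]      # parents of the conflict vertex
    seen, choices = set(), set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in reasons:                           # implied literal: keep backchaining
            stack.extend(u for (u, _) in reasons[v] if u != v)
        else:                                      # choice literal: an initial vertex
            choices.add(v)
    return [(v, not assignment[v]) for v in sorted(choices)]

# Continuing the small example from the unit propagation sketch above:
assignment = {'R': True, 'P': False, 'Q': True}
reasons = {'R': [('R', True)],                          # R^T comes from a unit clause of ∆
           'Q': [('R', False), ('P', True), ('Q', True)]}
conflict = [('P', True), ('Q', False)]                  # the clause P^T ∨ Q^F is violated
print(learned_clause(conflict, reasons, assignment))    # [('P', True)], i.e. the clause P^T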
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
Choice literals: P T , (X 1 T ), . . ., (X n T ), QT . Implied literals: RT .
[Figure: conflict graph with choice literals P T and QT (and the X i T ), implied literal RT , and the conflict vertex.]
Learned clause: P F ∨ QF
Example B.2.5. l1 = P , C = P F ∨ QF , l′ = Q.
Observation: Given the earlier choices l1 , . . . , lk , after we learned the new clause
C = l1 ∨ . . . ∨ lk ∨ l′ , the value of l′ is now set by UP!
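A tiny self-contained Python sketch of this observation, on the reading that the learned clause consists of the complements of the earlier choices and of l′ (the encoding is the ad-hoc one from the sketches above): once the assignment is erased back to the earlier choices, the learned clause is unit and UP sets the complement of l′.

# Learned clause P^F ∨ Q^F; the earlier choice P^T is kept, the choice of Q is undone.
learned = [('P', False), ('Q', False)]
assignment = {'P': True}                          # only the earlier choice remains

unassigned = [(v, b) for (v, b) in learned if v not in assignment]
satisfied = any(assignment.get(v) == b for (v, b) in learned)
if not satisfied and len(unassigned) == 1:        # the learned clause has become unit ...
    v, b = unassigned[0]
    assignment[v] = b                             # ... so UP sets Q to F
print(assignment)                                 # {'P': True, 'Q': False}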
So we can continue:
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
Θ := X 1 T ∨ . . . ∨ X 100 T ; X 1 F ∨ . . . ∨ X 100 F
DPLL on ∆ ; Θ ; Φ with Φ := P F ∨ QF
Choice literals: P T , (X 1 T ), . . ., (X 100 T ). Implied literals: QF , RT .
[Figure: implication graph with choice literal P T (and the X i T ), implied literals QF and RT , and a conflict vertex.]
Learned clause: P F
∆ := P F ∨ QF ∨ RT ; P F ∨ QF ∨ RF ; P F ∨ QT ∨ RT ; P F ∨ QT ∨ RF
DPLL on ∆ ; Θ with Θ := X 1 T ∨ . . . ∨ X n T ; X 1 F ∨ . . . ∨ X n F
[Figure: DPLL search tree with clause learning: after the decisions P T , X 1 T , . . ., X n T , QT the branch ends in R ; 2 and the clause P F ∨ QF is learned; QF is then set by UP, the branch again ends in R ; 2, and the clause P F is learned.]
Note: Here, the problem could be avoided by splitting over different variables.
Problem: This is not so in general! (see next slide)
B.2. CLAUSE LEARNING 727
Definition B.2.8 (Just for the record). (not relevant for the exam or the exercises)
One could run “DPLL + Clause Learning” by always backtracking to the maximal-level
choice variable contained in the learned clause.
The actual algorithm is called Conflict Driven Clause Learning (CDCL), and
differs from DPLL more radically:
let L := 0; I := ∅
repeat
    execute UP
    if a conflict was reached then    /∗ learned clause C = l1 ∨ . . . ∨ lk ∨ l′ ∗/
        if L = 0 then return UNSAT
        L := max i∈{1,...,k} level(li ); erase I below L
        add C into ∆; add l′ to I at level L
    else
        if I is a total interpretation then return I
        choose a new decision literal l; add l to I at level L
        L := L + 1
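A small Python sketch (ours; it assumes the simple decision-literal-based learned clauses from above, so that exactly one literal of the clause sits at the deepest level) of the backjumping step in this loop: compute the level to jump back to, erase the assignment above it, and assert the remaining literal of the now-unit clause.

def backjump(learned, level, assignment):
    """Backjumping step after a clause has been learned.

    learned    -- the learned clause as a list of (var, value) pairs
    level      -- dict var -> decision level at which the variable was set
    assignment -- dict var -> value (the current interpretation I)
    Returns (L, asserted): the backjump level and the literal asserted by the now-unit clause.
    """
    deepest = max(level[v] for (v, _) in learned)
    older = [level[v] for (v, _) in learned if level[v] < deepest]
    L = max(older) if older else 0
    for v in [v for v in assignment if level[v] > L]:
        del assignment[v]                          # erase the assignment above level L
    v, b = next((v, b) for (v, b) in learned if v not in assignment)
    assignment[v] = b                              # the learned clause has become unit: assert its literal
    return L, (v, b)

# Continuing the running example: decisions P^T (level 1), X1^T (level 2), Q^T (level 3);
# the implied literal R^T was also set at level 3.  Learned clause: P^F ∨ Q^F.
assignment = {'P': True, 'X1': True, 'Q': True, 'R': True}
level = {'P': 1, 'X1': 2, 'Q': 3, 'R': 3}
print(backjump([('P', False), ('Q', False)], level, assignment))   # (1, ('Q', False))
print(assignment)                                                  # {'P': True, 'Q': False}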
Remarks
Which clause(s) to learn?:
While we only selected choice literals above, much more can be done.
For any cut through the conflict graph, with the choice literals on the left-hand
side of the cut and the conflict literals on the right-hand side, the literals on the
left border of the cut yield a learnable clause.
One must take care not to learn too many clauses . . .
Modern SAT solvers successfully tackle practical instances where n > 1,000,000.
The most successful works are empirical. (Interesting theory is mainly concerned
with hand-crafted formulas, like the Pigeon Hole Problem.)
[CKT91] confirmed this for Graph Coloring and Hamiltonian Circuits. Later work
confirmed it for SAT (see previous slides), and for numerous other NP-complete
problems.
Appendix C
Completeness of Calculi for First-Order Logic
We will now analyze the first-order calculi for completeness. Just as in the case of the propositional
calculi, we prove a model existence theorem for the first-order model theory and then use that
for the completeness proofs. The proof of the first-order model existence theorem is completely
analogous to the propositional one; indeed, apart from the model construction itself, it is just an
extension by a treatment for the first-order quantifiers.
The proof of the model existence theorem goes via the notion of a Hintikka set, a set of
formulae with very strong syntactic closure properties, which allow us to read off models. Jaakko
Hintikka’s original idea for completeness proofs was that for every complete calculus C and every
C-consistent set one can induce a Hintikka set, from which a model can be constructed. This can
be considered as a first model existence theorem. However, the process of obtaining a Hintikka set
for a C-consistent set Φ of sentences usually involves complicated, calculus-dependent constructions.
In this situation, Raymond Smullyan was able to formulate the sufficient conditions for the
existence of Hintikka sets in the form of “abstract consistency properties” by isolating the calculus-independent
parts of the Hintikka set construction. His technique allows us to reformulate Hintikka
sets as maximal elements of abstract consistency classes and to interpret the Hintikka set construction
as a maximizing limit process.
To carry out the “model-existence”/“abstract consistency” method, we will first have to look at
the notion of consistency.
Consistency and refutability are very important notions when studying the completeness of calculi;
they form syntactic counterparts of satisfiability.
Consistency
Let C be a calculus,. . .
Definition C.1.1. Let C be a calculus, then a formula set Φ is called C-refutable, if
there is a refutation, i.e. a derivation of a contradiction from Φ. The act of finding
a refutation for Φ is called refuting Φ.
It is very important to distinguish the syntactic C-refutability and C-consistency from satisfiability,
which is a property of formulae that is at the heart of semantics. Note that the former have the
calculus (a syntactic device) as a parameter, while the latter does not. In fact we should actually
say S-satisfiability, where ⟨L, K, ⊨⟩ is the current logical system.
Even the word “contradiction” has a syntactical flavor to it, it translates to “saying against
each other” from its Latin root.
The notion of an “abstract consistency class” provides a calculus-independent notion of consistency:
a set Φ of sentences is considered “consistent in an abstract sense”, iff it is a member of
an abstract consistency class ∇.
Abstract Consistency
Definition C.1.6. Let ∇ be a collection of sets. We call ∇ closed under subsets,
iff for each Φ ∈ ∇, all subsets Ψ ⊆ Φ are elements of ∇.
The conditions are very natural: take for instance ∇c ; it would be foolish to call a set Φ of
sentences “consistent under a complete calculus” if it contains an elementary contradiction. The
next condition ∇¬ says that if a set Φ that contains a sentence ¬¬A is “consistent”, then we should
be able to extend it by A without losing this property; in other words, a complete calculus should
be able to recognize A and ¬¬A to be equivalent. We will carry out the proof here, since it
gives us practice in dealing with the abstract consistency properties.
The main result here is that abstract consistency classes can be extended to compact ones. The
proof is quite tedious, but relatively straightforward. It allows us to assume that all abstract
consistency classes are compact in the first place (otherwise we pass to the compact extension).
Actually we are after abstract consistency classes that have an even stronger property than just
being closed under subsets. This will allow us to carry out a limit construction in the Hintikka
set extension argument later.
Compact Collections
Definition C.1.8. We call a collection ∇ of sets compact, iff for any set Φ we have
Φ ∈ ∇, iff Ψ ∈ ∇ for every finite subset Ψ of Φ.
Lemma C.1.9. If ∇ is compact, then ∇ is closed under subsets.
Proof:
1. Suppose S ⊆ T and T ∈ ∇.
2. Every finite subset A of S is a finite subset of T .
3. As ∇ is compact, we know that A ∈ ∇.
4. Thus S ∈ ∇.
The property of being closed under subsets is a “downwards-oriented” property: we go from large
sets to small sets. Compactness (in the interesting direction, anyway) is an “upwards-oriented”
property: we can go from small (finite) sets to large (infinite) sets. The main application for the
compactness condition will be to show that infinite sets of formulae are in a collection ∇ by testing
all their finite subsets (which is much simpler).
Hintikka sets are sets of sentences with very strong analytic closure conditions. These are motivated
as maximally consistent sets i.e. sets that already contain everything that can be consistently
added to them.
∇-Hintikka Set
Definition C.1.11. Let ∇ be an abstract consistency class, then we call a set
H ∈ ∇ a ∇-Hintikka set, iff H is maximal in ∇, i.e. for all A with H∗A ∈ ∇ we
already have A ∈ H.
Theorem C.1.12 (Hintikka Properties). Let ∇ be an abstract consistency class
and H be a ∇-Hintikka set, then
The following theorem is one of the main results in the “abstract consistency”/“model existence”
method. For any abstract consistent set Φ it allows us to construct a Hintikka set H with Φ ⊆ H.
Extension Theorem
Theorem C.1.13. If ∇ is an abstract consistency class and Φ ∈ ∇ finite, then
there is a ∇-Hintikka set H with Φ ⊆ H.
Proof:
1. Wlog. assume that ∇ compact (else use compact extension)
2. Choose an enumeration A1 , . . . of cwff o (Σι ) and c1 , . . . of Σsk0 .
3. and construct a sequence of sets Hi with H0 := Φ and
   Hn+1 := Hn                            if Hn ∗An ̸∈ ∇
   Hn+1 := Hn ∪ {An , ¬([cn /X](B))}     if Hn ∗An ∈ ∇ and An = ¬(∀X.B)
   Hn+1 := Hn ∗An                        otherwise
4. Note that all Hi ∈ ∇, choose H := ⋃i∈N Hi
5. Ψ ⊆ H finite implies there is a j ∈ N such that Ψ ⊆ Hj ,
6. so Ψ ∈ ∇ as ∇ closed under subsets and H ∈ ∇ as ∇ is compact.
7. Let H∗B ∈ ∇, then there is a j ∈ N with B = Aj , so that B ∈ Hj+1 and
Hj+1 ⊆ H
8. Thus H is ∇-maximal
Note that the construction in the proof above is non-trivial in two respects. First, the limit
construction for H is not executed in our original abstract consistency class ∇, but in a suitably
extended one to make it compact — the original would not have contained H in general. Second,
the set H is not unique for Φ, but depends on the choice of the enumeration of cwff o (Σι ). If
we pick a different enumeration, we will end up with a different H. Say A and ¬A are both
∇-consistent with Φ; then, depending on which one comes first in the enumeration, H will contain
that one, with all the consequences for the subsequent choices in the construction process.
Valuations
Definition C.1.14. A function ν : cwff o (Σι )→D0 is called a (first-order) valuation,
iff ν is a propositional valuation and
ν(∀X.A) = T, iff ν([B/X](A)) = T for all closed terms B.
Thus a valuation is a weaker notion of evaluation in first-order logic; the other direction is also
true, even though the proof of this result is much more involved: the existence of a first-order
valuation that makes a set of sentences true entails the existence of a model that satisfies it.
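To make the closed-term condition concrete, here is a small Python sketch (ours; tiny ad-hoc term and formula encoding, and a finite, truncated set of closed terms standing in for cwff o (Σι )) that checks ν(∀X.A) by substituting every closed term, exactly as in Definition C.1.14.

# Encoding (ours): terms are constants (strings) or variables ('var', name);
# formulas are ('atom', pred, [terms]), ('not', A), ('and', A, B), ('forall', name, A)

def substitute(formula, var, by):
    """[by/var](formula) for a closed term `by`."""
    op = formula[0]
    if op == 'atom':
        return ('atom', formula[1], [by if t == ('var', var) else t for t in formula[2]])
    if op == 'not':
        return ('not', substitute(formula[1], var, by))
    if op == 'and':
        return ('and', substitute(formula[1], var, by), substitute(formula[2], var, by))
    if op == 'forall':
        return formula if formula[1] == var else ('forall', formula[1], substitute(formula[2], var, by))

def nu(formula, true_ground_atoms, closed_terms):
    """A valuation: propositional on ¬ and ∧, plus the closed-term condition for ∀."""
    op = formula[0]
    if op == 'atom':
        return (formula[1], tuple(formula[2])) in true_ground_atoms
    if op == 'not':
        return not nu(formula[1], true_ground_atoms, closed_terms)
    if op == 'and':
        return nu(formula[1], true_ground_atoms, closed_terms) and nu(formula[2], true_ground_atoms, closed_terms)
    if op == 'forall':
        return all(nu(substitute(formula[2], formula[1], b), true_ground_atoms, closed_terms)
                   for b in closed_terms)

closed_terms = ['a', 'b']
true_ground_atoms = {('p', ('a',)), ('p', ('b',))}           # p(a) and p(b) hold
A = ('forall', 'X', ('atom', 'p', [('var', 'X')]))
print(nu(A, true_ground_atoms, closed_terms))                 # True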
Now, we only have to put the pieces together to obtain the model existence theorem we are after.
Model Existence
Theorem C.1.17 (Hintikka-Lemma). If ∇ is an abstract consistency class and
H a ∇-Hintikka set, then H is satisfiable.
Proof:
1. we define ν(A):=T, iff A ∈ H,
2. then ν is a valuation by the Hintikka set properties.
3. We have ν(H) = {T}, so H is satisfiable.
Theorem C.1.18 (Model Existence). If ∇ is an abstract consistency class and
Φ ∈ ∇, then Φ is satisfiable.
Proof:
1. There is a ∇-Hintikka set H with Φ ⊆ H (Extension Theorem)
2. We know that H is satisfiable. (Hintikka-Lemma)
3. In particular, Φ ⊆ H is satisfiable.
This directly yields two important results that we will use for the completeness analysis.
Henkin’s Theorem
Corollary C.2.2 (Henkin’s Theorem). Every ND1 -consistent set of sentences has
a model.
Proof:
1. Let Φ be a ND1 -consistent set of sentences.
2. The class of sets of ND1 -consistent propositions constitutes an abstract consistency
class.
3. Thus the model existence theorem guarantees a model for Φ.
Corollary C.2.3 (Löwenheim & Skolem Theorem). Every satisfiable set Φ of first-order
sentences has a countable model.
Proof sketch: The model we constructed is countable, since the set of ground terms
is.
Now, the completeness result for first-order natural deduction is just a simple argument away.
We also get a compactness theorem (almost) for free: logical systems with a complete calculus are
always compact.
Soundness of T1f
Lemma C.3.1. Tableau rules transform satisfiable tableaux into satisfiable ones.
Proof:
we examine the tableau rules in turn
1. propositional rules as in propositional tableaux
2. T1f ∃ by ??
3. T1f⊥ by ?? (substitution value lemma)
4. T1f ∀
4.1. I φ (∀X.A) = T, iff I φ,[a/X] (A) = T for all a ∈ Dι
4.2. so in particular for some a ∈ Dι ̸= ∅.
Corollary C.3.2. T1f is correct.
The only interesting steps are the cut rule, which can be directly handled by the substitution
value lemma, and the rule for the existential quantifier, which we do in a separate lemma.
Soundness of T1f ∃
5. So ([f (X 1 , . . ., X k )/X](A))F is satisfiable in M′
This proof is paradigmatic for soundness proofs for calculi with Skolemization. We use the axiom
of choice at the meta-level to choose a meaning for the Skolem constant. Armed with the Model
Existence Theorem for first-order logic (??), the completeness of first-order tableaux is similarly
straightforward. We just have to show that the collection of tableau-irrefutable sentences is an
abstract consistency class, which is a simple proof-transformation exercise in all but the universal
quantifier case, which we postpone to its own Lemma (??).
Completeness of (T1f )
[Two tableaux side by side: both consist of ΨT above (∀X.A)F ; the left one continues with ([c/X](A))F and Rest, the right one with ([f (X 1 , . . ., X k )/X](A))F and [f (X 1 , . . ., X k )/c](Rest).]
So we only have to treat the case for the universal quantifier. This is what we usually call a
“lifting argument”, since we have to transform (“lift”) a proof for a formula θ(A) to one for A. In
the case of tableaux we do that by an induction on the tableau refutation for θ(A) which creates
a tableau-isomorphism to a tableau refutation for A.
Tableau-Lifting
Theorem C.3.5. If Tθ is a closed tableau for a set θ(Φ) of formulae, then there is
a closed tableau T for Φ.
Proof: by induction over the structure of Tθ we build an isomorphic tableau T , and
a tableau-isomorphism ω : T → Tθ , such that ω(A) = θ(A).
Only the tableau-substitution rule is interesting.
1. Let (θ(Ai ))T and (θ(Bi ))F be cut formulae in the branch Θiθ of Tθ
2. there is a joint unifier σ of (θ(A1 ))=?(θ(B1 )) ∧ . . . ∧ (θ(An ))=?(θ(Bn ))
3. thus σ ◦ θ is a unifier of A and B
4. hence there is a most general unifier ρ of A1=?B1 ∧ . . . ∧ An=?Bn
5. so Θ is closed.
Again, the “lifting lemma for tableaux” is paradigmatic for lifting lemmata for other refutation
calculi.
Correctness (CNF)
Lemma C.4.1. A set Φ of sentences is satisfiable, iff CNF1 (Φ) is.
Proof:
1. Let (∀X.A)F be satisfiable in M := ⟨D, I⟩ and free(A) = {X 1 , . . ., X k }
2. I φ (∀X.A) = F, so there is an a ∈ D with I φ,[a/X] (A) = F (this only depends on φ|free(A) )
3. let g : Dk → D be defined by g(a1 , . . ., ak ):=a, iff φ(X i ) = ai .
4. choose M′ := ⟨D, I ′ ⟩ with I ′ (f ) := g, then I ′φ ([f (X 1 , . . ., X k )/X](A)) = F
5. Thus ([f (X 1 , . . ., X k )/X](A))F is satisfiable in M′
Resolution (Correctness)
Definition C.4.2. A clause is called satisfiable, iff I φ (A) = α for one of its literals
Aα .
Completeness (R1 )
Theorem C.4.6. R1 is refutation complete.
Proof: ∇ := {Φ | ΦT has no closed tableau} is an abstract consistency class
1. as for propositional case.
2. by the lifting lemma below
3. Let T be a closed tableau for ¬(∀X.A) ∈ Φ and ΦT ∗([c/X](A))F ∈ ∇.
4. CNF1 (ΦT ) = CNF1 (ΨT ) ∪ CNF1 (([f (X 1 , . . ., X k )/X](A))F )
5. ([f (X 1 , . . ., X k )/c](CNF1 (ΦT )))∗([c/X](A))F = CNF1 (ΦT )
6. so R1 : CNF1 (ΦT )⊢D′ 2, where D′ = [f (X 1 , . . ., X k )/c](D).
Lifting for R1
Theorem C.4.10. If R1 : (θ(Φ))⊢Dθ 2 for a set θ(Φ) of formulae, then there is a
R1 -refutation for Φ.
Proof: by induction over Dθ we construct an R1 -derivation R1 : Φ⊢D C and a
θ-compatible clause set isomorphism Ω : D → Dθ
1. If Dθ ends in a resolution step res with premises (θ(A))T ∨ (θ(C)) (from Dθ′ ) and
(θ(B))F ∨ (θ(D)) (from Dθ′′ ) and conclusion (σ(θ(C))) ∨ (σ(θ(D))), then we have (IH)
clause isomorphisms ω′ : AT ∨ C → (θ(A))T ∨ (θ(C)) and ω′′ : BF ∨ D → (θ(B))F ∨ (θ(D)).
2. Thus we can end D in a resolution step Res with premises AT ∨ C and BF ∨ D and
conclusion (ρ(C)) ∨ (ρ(D)), where ρ = mgu(A, B) (it exists, as σ ◦ θ is a unifier).